Re: Using Tika/Nutch to analyze a website

Jukka Zitting Mon, 16 Apr 2007 00:53:27 -0700

Hi,

On 4/16/07, Ian Holsman <[EMAIL PROTECTED]> wrote:

> I was planning on using Nutch and UIMA to perform entity
> extraction, and noticed that you mention that Tika would be designed
> to do this.
>
> I was wondering how things were going with Tika, as it doesn't seem
> like there is any code or design plans checked in (except for the
> proposal).

Thanks for the interest! As you noticed, we're just getting started
and haven't yet achieved much.

> So I would like to spark the discussion.
>
> I would like to:
> - use Nutch to fetch the pages (HTML) from the site
> - use UIMA to analyze them and extract interesting information
> - use MySQL, or possibly HBase, to store versioned/historical output of
>   this analysis, for possible further reporting (stats and page
>   timelines)
>
> Is Tika going to be able to do this for me?


Certainly not all of it. In this scheme, Tika would most naturally fit
as a component used by UIMA to parse the HTML pages. The main benefit
of using Tika instead of a native HTML parser in this case would be
that you could easily extend the application to also analyze other
types of documents, such as PDFs.
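
To make the "pluggable parser" idea concrete, here is a minimal sketch
of how an analysis stage could ask for text by media type without
knowing which format it is handling. Everything in it is hypothetical:
the names TextExtractor, ExtractorRegistry and stripHtml are invented
for illustration and are not Tika's API (there is no Tika code yet),
and the HTML handling is a deliberately naive stand-in for a real
parser.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration only: none of these names come from Tika. */
public class ParserDispatchSketch {

    /** Turns raw document bytes into plain text for the analysis stage. */
    interface TextExtractor {
        String extractText(byte[] content);
    }

    /** Maps a media type such as "text/html" to the extractor that handles it. */
    static final class ExtractorRegistry {
        private final Map<String, TextExtractor> byMediaType = new LinkedHashMap<>();

        void register(String mediaType, TextExtractor extractor) {
            byMediaType.put(mediaType, extractor);
        }

        String extract(String mediaType, byte[] content) {
            TextExtractor extractor = byMediaType.get(mediaType);
            if (extractor == null) {
                throw new IllegalArgumentException("No extractor for " + mediaType);
            }
            return extractor.extractText(content);
        }
    }

    /** Naive HTML-to-text conversion: drops tags and collapses whitespace. */
    static String stripHtml(byte[] content) {
        String html = new String(content, StandardCharsets.UTF_8);
        return html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    /** Registry with two formats; a PDF extractor would be one more register() call. */
    static ExtractorRegistry defaultRegistry() {
        ExtractorRegistry registry = new ExtractorRegistry();
        registry.register("text/html", ParserDispatchSketch::stripHtml);
        registry.register("text/plain",
                content -> new String(content, StandardCharsets.UTF_8).trim());
        return registry;
    }

    public static void main(String[] args) {
        ExtractorRegistry registry = defaultRegistry();
        byte[] page = "<html><body><h1>Hello</h1><p>Tika</p></body></html>"
                .getBytes(StandardCharsets.UTF_8);
        // The caller only names the media type; the format-specific work is hidden.
        System.out.println(registry.extract("text/html", page));
    }
}
```

The point of the design is that the analysis code (UIMA in this
scheme) only ever sees plain text, so supporting another document
type means registering one more extractor rather than changing the
analysis itself.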

BR,

Jukka Zitting
