Hi all, 

I'm working on a project to configure Nutch to crawl a few image-rich sites.

Ideally, the approach would be to crawl the sites by going through the
following steps:

1. Inject the crawldb with URLs pointing to specific categories on the sites,
and set a limit on the crawl depth to focus the crawl on a few sections.
2. Crawl and extract text & outlinks from HTML pages
3. Fetch outlink contents and determine the content type of the retrieved
data
4. If it's a JPG, extract metadata
5. Combine the text extracted from the HTML with the image metadata and
analyse the information
6. Index the results from the analysis. 
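Steps 4 and 5 above amount to a join keyed on the image URL: page text collected during HTML parsing on one side, JPG metadata on the other. Here is a minimal sketch of that combine step using plain Java maps with made-up example data; the map names and the `combine` helper are hypothetical and stand in for however the parse output is actually stored, not for any real Nutch API:

```java
import java.util.HashMap;
import java.util.Map;

public class CombineExample {

    // Hypothetical combine step: for each image URL, pair the JPG metadata
    // with the text of the HTML page that linked to it.
    public static Map<String, String> combine(Map<String, String> htmlTextByImageUrl,
                                              Map<String, String> jpgMetaByImageUrl) {
        Map<String, String> combined = new HashMap<>();
        for (Map.Entry<String, String> e : jpgMetaByImageUrl.entrySet()) {
            // Fall back to an empty string when no page text was captured.
            String pageText = htmlTextByImageUrl.getOrDefault(e.getKey(), "");
            combined.put(e.getKey(), pageText + " | " + e.getValue());
        }
        return combined;
    }

    public static void main(String[] args) {
        // Example data only -- URLs and metadata are invented for illustration.
        Map<String, String> htmlText = new HashMap<>();
        htmlText.put("http://example.com/a.jpg", "caption text from the linking page");

        Map<String, String> jpgMeta = new HashMap<>();
        jpgMeta.put("http://example.com/a.jpg", "Camera=Nikon");

        System.out.println(combine(htmlText, jpgMeta));
    }
}
```

In Nutch itself the two sides of this join would live in parse metadata rather than in-memory maps, so the records only meet at indexing time; the sketch just shows the keying logic.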

I have managed to integrate a JPG parser, but I can't see how I can retain
the text extracted from the HTML in memory and then combine it with the
image metadata before sending it to the analyser. Does anyone have any ideas?

Regards
Max
