Hi all, 

I'm working on a project to configure Nutch to crawl a few image-rich sites.

Ideally, the approach would be to crawl the sites by going through the
following steps:

1. Inject the crawldb with URLs pointing to specific categories on the sites,
and set a limit on the crawl depth to focus the crawl on a few sections.
2. Crawl and extract text & outlinks from HTML pages
3. Fetch outlink contents and determine the content type of the retrieved
data
4. If it's a JPG, extract metadata
5. Combine the text extracted from the HTML with the image metadata and
analyse the information
6. Index the results from the analysis. 
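Steps 4 and 5 above amount to a join keyed on the image URL: page text collected during HTML parsing on one side, JPG metadata on the other. Here is a minimal sketch of that combine step using plain Java maps with made-up example data; the map names and the `combine` helper are hypothetical and stand in for however the parse output is actually stored, not for any real Nutch API:

```java
import java.util.HashMap;
import java.util.Map;

public class CombineExample {

    // Hypothetical combine step: for each image URL, pair the JPG metadata
    // with the text of the HTML page that linked to it.
    public static Map<String, String> combine(Map<String, String> htmlTextByImageUrl,
                                              Map<String, String> jpgMetaByImageUrl) {
        Map<String, String> combined = new HashMap<>();
        for (Map.Entry<String, String> e : jpgMetaByImageUrl.entrySet()) {
            // Fall back to an empty string when no page text was captured.
            String pageText = htmlTextByImageUrl.getOrDefault(e.getKey(), "");
            combined.put(e.getKey(), pageText + " | " + e.getValue());
        }
        return combined;
    }

    public static void main(String[] args) {
        // Example data only -- URLs and metadata are invented for illustration.
        Map<String, String> htmlText = new HashMap<>();
        htmlText.put("http://example.com/a.jpg", "caption text from the linking page");

        Map<String, String> jpgMeta = new HashMap<>();
        jpgMeta.put("http://example.com/a.jpg", "Camera=Nikon");

        System.out.println(combine(htmlText, jpgMeta));
    }
}
```

In Nutch itself the two sides of this join would live in parse metadata rather than in-memory maps, so the records only meet at indexing time; the sketch just shows the keying logic.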

I have managed to integrate a JPG parser, but I can't see how I can retain
the text extracted from the HTML in memory and then combine it with the
image metadata before sending it to the analyser. Does anyone have any ideas?

Regards
Max
