Hi all, I'm working on a project to configure Nutch to crawl a few image-rich sites.
Ideally, the approach would be to crawl each site through the following steps:

1. Inject the crawldb with URLs pointing to specific categories on the sites, and set a limit on crawl depth to focus the crawl on a few sections.
2. Crawl and extract text and outlinks from the HTML pages.
3. Fetch the outlink contents and determine the content type of the retrieved data.
4. If it is a JPG, extract the image metadata.
5. Combine the text extracted from the HTML with the image metadata and analyse the combined information.
6. Index the results of the analysis.

I have managed to integrate a JPG parser, but I can't see how to retain the text extracted from the HTML in memory and then combine it with the image metadata before sending both to the analyser. Does anyone have any ideas?

Regards,
Max
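P.S. To make the question concrete, here is a rough sketch in plain Java of the join I'm after (all names are made up for illustration; this is not a real Nutch API). Since the HTML page and the JPG it links to are fetched and parsed as separate documents, the idea is to record the page's extracted text keyed by each image outlink URL during the HTML parse, and then merge it with the image metadata once the JPG has been parsed:

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Illustrative sketch only: join text extracted from an HTML page with
 * metadata extracted later from a JPG it links to. In a real Nutch setup
 * this state would have to be persisted between parse rounds (e.g. in
 * parse metadata or an external store), not held in a plain map.
 */
public class PageImageJoiner {

    // Page text keyed by the outlink (image) URL, recorded while
    // parsing the HTML page (step 2 above).
    private final Map<String, String> pageTextByImageUrl = new HashMap<>();

    // During HTML parsing: remember the page's extracted text under
    // each image outlink found on the page.
    public void recordPageText(String imageUrl, String pageText) {
        pageTextByImageUrl.put(imageUrl, pageText);
    }

    // After the JPG has been fetched and its metadata extracted
    // (steps 3-4): merge the metadata with the stored page text into
    // one document ready for analysis/indexing (steps 5-6).
    public Map<String, String> combine(String imageUrl,
                                       Map<String, String> imageMetadata) {
        Map<String, String> doc = new HashMap<>(imageMetadata);
        doc.put("imageUrl", imageUrl);
        doc.put("pageText",
                pageTextByImageUrl.getOrDefault(imageUrl, ""));
        return doc;
    }
}
```

What I can't work out is where this map (or its equivalent) should live inside the Nutch pipeline so that both sides of the join are available at the same time.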
