Getting images into the content cache and image links into the search results

Wade Dugas Mon, 02 Aug 2010 08:03:46 -0700

I am not able to get Nutch 1.2 to crawl JPEG images (or images of any type). 
The parse-tika plugin is supposed to be able to parse them, but what needs to 
be done to have them fetched and indexed?

I have updated regex-urlfilter.txt and suffix-urlfilter.txt to not skip jpeg 
images:
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|bmp|BMP)$


### prohibit these
# pictures
# removed .jpg
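
One way to sanity-check the edited skip rule outside of Nutch is to run the 
same pattern against a few URLs. The sketch below is only an illustration: it 
assumes the rule is applied as an unanchored search over the whole URL, and the 
example.com URLs and the is_skipped helper are made up for the test.

```python
import re

# Skip rule copied from the regex-urlfilter.txt excerpt above,
# without the leading "-" that marks it as an exclusion rule.
SKIP_SUFFIXES = re.compile(
    r"\.(gif|GIF|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz"
    r"|mov|MOV|exe|bmp|BMP)$"
)


def is_skipped(url: str) -> bool:
    """Return True if the skip rule would reject this URL.

    Assumption: the filter searches the URL for the pattern rather than
    requiring a full match.
    """
    return SKIP_SUFFIXES.search(url) is not None


# Hypothetical sample URLs, not taken from a real crawl.
samples = [
    "http://example.com/images/photo.jpg",
    "http://example.com/images/logo.gif",
    "http://example.com/images/banner.png",
]

for url in samples:
    print(url, "-> skipped" if is_skipped(url) else "-> allowed")
```

If the .jpg URL shows up as allowed here but is still never fetched, the cause 
is more likely somewhere other than this particular rule.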

I have also tried adding the MIME type to parse-plugins.xml. Although parse-tika 
should automatically parse the image/jpeg MIME type, the fetcher doesn't seem to 
pick them up.

  <mimeType name="image/jpeg">
    <plugin id="parse-tika" />
  </mimeType>

I would like to parse the images and store them in the content cache and I want 
them returned in the search results.

I really want to know if I have a bad config or the functionality is not in 
Nutch. It seems I may need to write my own plugin, which I can then contribute 
back to the project.

Perhaps there is documentation I am not finding, but if it is not there, I can 
write some once I gain a better understanding of Nutch.

Any help would be most appreciated.

Thanks,
Wade
