I am not able to get Nutch 1.2 to crawl JPEG images (or images of any type).
The parse-tika plugin is supposed to be able to parse them, but what needs to
be done to have them fetched and indexed?
I have updated regex-urlfilter.txt and suffix-urlfilter.txt so that they no
longer skip JPEG images:
regex-urlfilter.txt:
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|bmp|BMP)$

suffix-urlfilter.txt:
### prohibit these
# pictures
# removed .jpg
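
As a sanity check outside Nutch, the edited regex rule can be run against a
few sample URLs. This is only a sketch: the class name and the example.com URLs
are made up for illustration, and it assumes the regex filter applies the
expression after the leading "-" with java.util.regex "find" semantics, which I
have not confirmed against the plugin source.

```java
import java.util.regex.Pattern;

public class UrlFilterCheck {
    // Expression copied from the "-" rule in regex-urlfilter.txt above,
    // without the leading "-" (assumption: "-" only marks the rule as an exclusion).
    private static final Pattern SKIP = Pattern.compile(
        "\\.(gif|GIF|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|bmp|BMP)$");

    // Returns true if this one rule would exclude the URL.
    public static boolean isSkipped(String url) {
        return SKIP.matcher(url).find();
    }

    public static void main(String[] args) {
        // Hypothetical sample URLs, not taken from my crawl.
        String[] urls = {
            "http://example.com/photo.jpg",
            "http://example.com/photo.jpeg",
            "http://example.com/icon.gif"
        };
        for (String url : urls) {
            System.out.println(url + " -> " + (isSkipped(url) ? "skipped" : "allowed"));
        }
    }
}
```

If the .jpg and .jpeg URLs come back as allowed here, this particular rule is
not what is keeping the images out of the fetch list, although other rules in
the same file could still be excluding them.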
I have also tried adding the MIME type to parse-plugins.xml, although parse-tika
should handle the image/jpeg MIME type automatically. Either way, the fetcher
doesn't seem to pick the images up.
<mimeType name="image/jpeg">
  <plugin id="parse-tika" />
</mimeType>
I would like to parse the images, store them in the content cache, and have
them returned in the search results.
What I really want to know is whether I have a bad config or whether the
functionality is simply not in Nutch. It seems I may need to write my own
plugin, which I could then contribute back to the project.
Perhaps there is documentation I am not finding. If there is none, I can write
some once I have a better understanding of Nutch.
Any help would be much appreciated.
Thanks,
Wade