Bradford Stephens wrote:
Greetings,

IIRC, Lucene (which Nutch uses for document indexing) actually indexes data
types via plugins. So if you have a plugin for PDF parsing (I believe there
is one), then you would be able to do what you wish for it.

Cheers,
Bradford

On Thu, Feb 26, 2009 at 11:40 AM, Robert Edmiston <robert.edmis...@gmail.com> wrote:

I have been tasked by my boss with finding out whether Nutch indexes content in
an image in a PDF document via OCR and then recognizes it as text. In other
words, if someone uploads a PDF document to our site, and the PDF document
is an image that was saved as a PDF, will Nutch search the text within the
image and then catalog the text as part of that PDF document?

Please ask this type of question on the nutch-user list. nutch-agent is primarily for discussing the behavior of Nutch-based robots.

To answer your question: Nutch can extract plain text from PDFs that contain plain text. PDFs that contain only images (i.e. text as bitmap pictures) cannot be indexed without some sort of OCR. It's possible to integrate OCR into the Nutch workflow, but this is not yet implemented.
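
As a rough illustration of the distinction above (this is not Nutch code; the function names, the `ocr` callable and the 20-character threshold are all assumptions made up for this sketch), a parsing step could check whether a PDF yielded any usable text layer and only then fall back to an OCR hook, if one were ever wired in:

```python
def needs_ocr(extracted_text, min_chars=20):
    """Return True when the text layer extracted from a PDF looks empty.

    Hypothetical helper: min_chars is an arbitrary threshold, not a Nutch setting.
    """
    # Count only letters and digits so whitespace or form feeds
    # left behind by a PDF parser do not count as content.
    meaningful = sum(1 for ch in extracted_text if ch.isalnum())
    return meaningful < min_chars


def text_to_index(extracted_text, ocr=None):
    """Choose the text to hand to the indexer.

    ocr is a hypothetical zero-argument callable returning recognized text;
    no such hook exists in Nutch according to the reply above.
    """
    if not needs_ocr(extracted_text):
        return extracted_text  # the PDF carries real text: index it directly
    if ocr is None:
        return ""  # image-only PDF and no OCR available: nothing to index
    return ocr()  # image-only PDF: fall back to OCR output
```

Without an `ocr` callable, an image-only PDF simply produces no indexable text, which matches the current behavior described above.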

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
