It is possible to come up with better parsing algorithms than simply
calling Stripper.getText, which is what Nutch does right now.  I am not
recommending switching away from PDFBox.  I think the most important
thing is that the algorithm used on each page does the best possible job
of preserving the flow of text.  If the text doesn't flow correctly,
search results may be altered; if Nutch is about search, it must be able
to parse PDFs correctly.  Ben Litchfield, the developer of PDFBox, has
noted that he has developed some better text-extraction technology and
hopes to share it with us soon.
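
For reference, here is a minimal sketch of the kind of call the parse-pdf
plugin makes today.  It is written against the newer PDFBox 2.x API for
readability; the plugin's actual code is older and differs in detail:

    import java.io.File;
    import java.io.IOException;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;

    public class SimplePdfText {
        public static String extract(File pdf) throws IOException {
            try (PDDocument doc = PDDocument.load(pdf)) {
                // PDFTextStripper emits text roughly in content-stream order,
                // which can scramble reading order on multi-column or
                // tabular pages such as tax forms.
                return new PDFTextStripper().getText(doc);
            }
        }
    }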

Another thing to consider: if the PDF is "tagged", it carries XML-like
markup that describes the flow of text, designed for accessibility under
Section 508.  I think Ben also noted that PDFBox does not support PDF
tags.
http://www.planetpdf.com/enterprise/article.asp?ContentID=6067
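
As a rough, untested illustration (again assuming the PDFBox 2.x API),
detecting whether a document is tagged could look something like this:

    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDDocumentCatalog;
    import org.apache.pdfbox.pdmodel.documentinterchange.markedcontent.PDMarkInfo;

    public class TaggedPdfCheck {
        // A tagged PDF sets /Marked true in its /MarkInfo dictionary and
        // carries a structure tree describing the logical reading order.
        public static boolean isTagged(PDDocument doc) {
            PDDocumentCatalog catalog = doc.getDocumentCatalog();
            PDMarkInfo markInfo = catalog.getMarkInfo();
            return markInfo != null
                    && markInfo.isMarked()
                    && catalog.getStructureTreeRoot() != null;
        }
    }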

A better parsing strategy might follow this pseudocode (a rough code
sketch in Java follows it):

Determine whether the PDF contains tagged content.

        If so,
                parse the tagged content so that the returned text flows
                correctly.

        If not,

                determine whether the PDF contains bounding boxes
                indicating that content is laid out in tabular format.

                If not,
                        parse using stripper.getText.

                If so,
                        apply an algorithm that extracts text from the
                        PDF while preserving the flow of text.
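
A hedged sketch of that decision flow, again against the PDFBox 2.x API.
The class name and the helper methods are hypothetical placeholders, not
existing PDFBox or Nutch code; the two hard pieces (the structure-tree
walk and the layout-preserving extraction) are left as stubs that fall
back to the default stripper:

    import java.io.File;
    import java.io.IOException;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDDocumentCatalog;
    import org.apache.pdfbox.pdmodel.documentinterchange.markedcontent.PDMarkInfo;
    import org.apache.pdfbox.text.PDFTextStripper;

    public class PdfFlowExtractor {

        public static String extract(File pdf) throws IOException {
            try (PDDocument doc = PDDocument.load(pdf)) {
                if (isTagged(doc)) {
                    // Tagged PDF: the structure tree records logical
                    // reading order, so walk it instead of the raw
                    // content stream.
                    return extractFromStructureTree(doc);
                }
                if (looksTabular(doc)) {
                    // Untagged but table-like layout: extract while
                    // preserving the spatial arrangement of text runs.
                    return extractPreservingLayout(doc);
                }
                // Plain running text: the default stripper is usually fine.
                return new PDFTextStripper().getText(doc);
            }
        }

        static boolean isTagged(PDDocument doc) {
            PDDocumentCatalog catalog = doc.getDocumentCatalog();
            PDMarkInfo markInfo = catalog.getMarkInfo();
            return markInfo != null && markInfo.isMarked()
                    && catalog.getStructureTreeRoot() != null;
        }

        // Placeholder: a real check would collect glyph bounding boxes
        // (e.g. via a PDFTextStripper subclass) and look for grid-like
        // alignment of text runs.
        static boolean looksTabular(PDDocument doc) throws IOException {
            return false;
        }

        // Placeholder: a real implementation would walk the structure
        // tree (catalog.getStructureTreeRoot()) and emit text in logical
        // order.
        static String extractFromStructureTree(PDDocument doc) throws IOException {
            return new PDFTextStripper().getText(doc);
        }

        // Placeholder: a real implementation would sort text runs into
        // rows and columns before concatenating them.
        static String extractPreservingLayout(PDDocument doc) throws IOException {
            return new PDFTextStripper().getText(doc);
        }
    }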


An additional feature might be saving the PDF as HTML as Nutch crawls
the web.


Examples of such algorithms may be found at:
www.tamirhassan.com/final.pdf
http://www.chilisoftware.net/Private/Christian/ideas_for_extracting_data_from_unstructured_documents.pdf


This is something Google does very well, and something Nutch must match
in order to compete.

-----Original Message-----
From: John X [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, March 01, 2006 2:12 AM
To: nutch-dev@lucene.apache.org; [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Subject: Re: Nutch Parsing PDFs, and general PDF extraction


On Tue, Feb 28, 2006 at 09:55:18AM -0500, Richard Braman wrote:
> thanks for the help.  I don't know what happened, but it is working
> now.  Did any other contributors read what I sent about parsing PDFs?
> I don't think Nutch is capable of this, based on the text stripper
> code in parse-pdf
>  
> http://64.233.179.104/search?q=cache:QOwcLFXNw5oJ:www.irs.gov/pub/irs-pdf/f1040.pdf+irs+1040+pdf&hl=en&gl=us&ct=clnk&cd=1
>  
>  
> It's time to implement some real PDF parsing technology.
> Any other takers?

Nutch is about search, and it relies on third-party libraries
to extract text from various MIME types, including application/pdf.
Whether Nutch can correctly extract text from a PDF file largely depends
on the PDF parsing library it uses, currently PDFBox. It wouldn't be very
difficult to switch to other libraries. However, it seems hard to find a
free/open implementation that can parse every PDF file in the wild.
There is an alternative: use Nutch's parse-ext with a command-line PDF
parser/converter, which can simply be an executable.

John
