Hi Ben

> but that the cost of converting PDF to text is already resource
> intensive and some users may not want to pay the additional cost to
> analyze each page.

Agreed. For Nutch it could be a simple config parameter to turn that on
or off. PDF parsing is already optional; maybe there could be
alternative parsing strategies to choose from when parsing is turned on
(simple, complex1, complex2, etc.).
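As a rough sketch, such a toggle could live in nutch-site.xml like any
other Nutch property (the property name parser.pdf.strategy and its
values here are hypothetical, not an existing Nutch option):

```xml
<!-- Hypothetical property: selects the PDF text-extraction strategy.
     "simple" would be the current PDFTextStripper behavior; other
     values would dispatch to more expensive layout-aware parsers. -->
<property>
  <name>parser.pdf.strategy</name>
  <value>simple</value>
  <description>PDF parsing strategy: simple, complex1, complex2, ...
  More complex strategies cost more CPU per page.</description>
</property>
```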

> While PDFs are unstructured, most documents give pretty good results
> with the default text extraction.  Usually the extracted text is
> already in reading order.

Except when the text is laid out in columns; then it goes haywire. For
example, parsing tax instructions always fails, because the content is
always laid out in columns.  Many newspaper articles have the same
problem.

> An extremely small percent of PDFs actually include tagged information

Agreed, but that may change with Section 508, at least for government,
which still produces the largest volume of PDFs on the net.
Is this hard to support with PDFBox?

> Overall, the easiest thing to do would be to implement good PDF->HTML
> conversion capabilities to PDFBox, then Nutch just uses that
> resulting HTML for indexing and for preview mode.  Until that is done
> there is not much the Nutch developers can do.

Agreed. I want the Nutch devs to know what's going on, because I do
think this functionality is important for Nutch's future. Maybe they
have some insights into parsing methods, as many of these devs are
experts with ontologies.
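For what it's worth, the strategy selection from the pseudocode quoted
further down could be sketched in Java roughly like this. The two
boolean checks are hypothetical placeholders for real document
analysis, not existing PDFBox APIs:

```java
// Sketch of the parsing-strategy dispatch from the pseudocode below.
// hasTags: does the PDF carry Section 508 structure tags?
// hasTableBoxes: do bounding boxes suggest tabular/columnar layout?
public class PdfStrategySelector {

    enum Strategy { TAGGED, TABLE_AWARE, SIMPLE_STRIPPER }

    static Strategy choose(boolean hasTags, boolean hasTableBoxes) {
        if (hasTags) {
            // Parse tagged content so the returned text flows correctly.
            return Strategy.TAGGED;
        }
        if (hasTableBoxes) {
            // Layout-preserving extraction for tabular content.
            return Strategy.TABLE_AWARE;
        }
        // Fall back to the current stripper.getText() path.
        return Strategy.SIMPLE_STRIPPER;
    }

    public static void main(String[] args) {
        System.out.println(choose(true, false));   // TAGGED
        System.out.println(choose(false, true));   // TABLE_AWARE
        System.out.println(choose(false, false));  // SIMPLE_STRIPPER
    }
}
```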

Ben, maybe we should move this to the PDFBox dev list, where anyone who
is interested (Nutch developers or not) can get in on it.  I would think
Nutch should assign this to someone on their team given the importance
of the functionality.

Rich


-----Original Message-----
From: Ben Litchfield [mailto:[EMAIL PROTECTED] 
Sent: Thursday, March 02, 2006 4:46 PM
To: Richard Braman
Cc: [email protected]; [EMAIL PROTECTED]
Subject: RE: Nutch Parsing PDFs, and general PDF extraction



To chime in and give my comments.

It is true that better search engine results could be obtained by first
analysing each PDF page and converting it to some other structure
(XML/HTML) before the indexing process.  But the cost of converting PDF
to text is already resource intensive, and some users may not want to
pay the additional cost to analyze each page.

While PDFs are unstructured, most documents give pretty good results
with the default text extraction.  Usually the extracted text is already
in reading order.

An extremely small percent of PDFs actually include tagged information.

Converting a PDF to HTML is something that needs to get implemented in
PDFBox, then it is trivial for Nutch to include it.

Overall, the easiest thing to do would be to implement good PDF->HTML
conversion capabilities to PDFBox, then Nutch just uses that resulting
HTML for indexing and for preview mode.  Until that is done there is not
much the Nutch developers can do.

Ben


On Thu, 2 Mar 2006, Richard Braman wrote:

> It is possible to come up with some better parsing algorithms than
> simply doing a stripper.getText(), which is what Nutch does right now.
>
> I am not recommending switching from PDFBox.  I think most important
> is that the algorithm used in the page does the best job possible in
> preserving the flow of text.  If the text doesn't flow correctly,
> search results may be altered, which is why if Nutch is about search
> it must be able to parse PDF correctly.  Ben Litchfield, the developer
> of PDFBox, has noted that he has developed some better parsing
> technology, and hopes to share those with us soon.
>
> Another thing to consider: if the PDF is "tagged" then it carries
> XML markup that describes the flow of text, which was designed to be
> used for accessibility under Section 508.  I think Ben also noted that
> PDFBox did not support PDF tags.
> http://www.planetpdf.com/enterprise/article.asp?ContentID=6067
>
> A better parsing strategy may involve the following pseudocode:
>
>     Determine whether the PDF contains tagged content
>
>     If so:
>         parse tagged content so that the returned text flows correctly
>
>     If not:
>         Determine whether the PDF contains bounding boxes that
>         indicate content is contained in tabular format.
>
>         If not:
>             parse using stripper.getText()
>
>         If so:
>             implement an algorithm to extract text from the PDF,
>             preserving the flow of text
>
> An additional feature may include saving the PDF as HTML as Nutch
> crawls the web.
>
>
> Examples of such algorithms may be found at:
> www.tamirhassan.com/final.pdf
> http://www.chilisoftware.net/Private/Christian/ideas_for_extracting_data_from_unstructured_documents.pdf
>
>
> This is something Google does very well, and something Nutch must
> match to compete.
>
> -----Original Message-----
> From: John X [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, March 01, 2006 2:12 AM
> To: [email protected]; [EMAIL PROTECTED]
> Cc: [EMAIL PROTECTED]
> Subject: Re: Nutch Parsing PDFs, and general PDF extraction
>
>
> On Tue, Feb 28, 2006 at 09:55:18AM -0500, Richard Braman wrote:
> > thanks for the help.  I don't know what happened, but it is working
> > now. Did any other contributors read what I sent about parsing PDFs?
> > I don't think nutch is capable of this based on the text stripper
> > code in parse-pdf
> >
> > http://64.233.179.104/search?q=cache:QOwcLFXNw5oJ:www.irs.gov/pub/irs-pdf/f1040.pdf+irs+1040+pdf&hl=en&gl=us&ct=clnk&cd=1
> >
> >
> > It's time to implement some real PDF parsing technology.
> > Any other takers?
>
> Nutch is about search, and it relies on 3rd party libraries
> to extract text from various mimetypes, including application/pdf.
> Whether Nutch can correctly extract text from a PDF file largely
> depends on the PDF parsing library it uses, currently PDFBox. It won't
> be very difficult to switch to other libraries. However, it seems hard
> to find a free/open implementation that can parse every PDF file in
> the wild. There is an alternative: use Nutch's parse-ext with a
> command-line PDF parser/converter, which can just be an executable.
>
> John
>



_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
