I noticed that Nutch seems to have some problems parsing PDFs.
 
060226 131210 fetch okay, but can't parse
http://www.irs.gov/pub/irs-pdf/p1828.pdf, reason: failed(2,203):
Content-Type not text/html: application/pdf
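As an aside, that particular failure usually just means the PDF parser plugin is not enabled. A sketch of the relevant nutch-site.xml entry, assuming a stock configuration (the plugin id parse-pdf and the property plugin.includes follow Nutch's default configuration conventions; double-check against the nutch-default.xml that ships with your version):

```xml
<!-- nutch-site.xml: add parse-pdf to the enabled plugins so fetched PDFs
     get parsed. The value below is a sketch based on Nutch's default
     plugin list; verify it against your version's nutch-default.xml. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|pdf)|index-basic|query-(basic|site|url)</value>
</property>
```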
 
I am actually working on PDF parsing technology, and have posted the
following message to two open-source PDF projects (PDFBox and iText). If
there is interest from Nutch developers in the responses I have
received, and in how a collaborative solution might be reached, let me know.
 
-----Original Message-----
From: Richard Braman [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, February 21, 2006 10:36 AM
To: 'itext-questions@lists.sourceforge.net'; '[EMAIL PROTECTED]';
'[EMAIL PROTECTED]'
Cc: '[EMAIL PROTECTED]'
Subject: Good reading/research on PDF text extraction



In 2003, Tamir Hassan wrote an open-source program
(http://www.tamirhassan.com/) to extract text from PDF tables and
columns and put it into HTML as part of a university research project.
His algorithms were quite sophisticated and are well documented in
http://www.tamirhassan.dsl.pipex.com/final.pdf.

The results were quite impressive: he managed to deal with columns and
similar layouts using what he referred to as an intelligent
text-extraction algorithm, which uses positional information to preserve
text flow. He used JPedal as his underlying PDF library.
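To make the position-based idea concrete, here is a minimal, hypothetical sketch in Java (JPedal's language). This is my own illustration of the general technique, not Hassan's actual algorithm: it groups text fragments reported with page coordinates into lines by y position, then orders each line left to right.

```java
import java.util.*;

// Hypothetical illustration of position-based text-flow reconstruction:
// group positioned text fragments into lines by y-coordinate, then order
// each line by x, so reading order survives layout-driven fragment order.
public class TextFlow {
    // A text fragment with its page position, as a PDF library might report it.
    record Fragment(String text, double x, double y) {}

    // Tolerance within which two fragments count as being on the same line.
    static final double LINE_TOL = 2.0;

    static List<String> toLines(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        // Sort top-to-bottom first (larger y is higher on a PDF page),
        // then left-to-right within equal heights.
        sorted.sort(Comparator.comparingDouble((Fragment f) -> -f.y)
                              .thenComparingDouble(f -> f.x));
        List<String> lines = new ArrayList<>();
        StringBuilder line = new StringBuilder();
        double lastY = Double.NaN;
        for (Fragment f : sorted) {
            // Start a new output line when the vertical gap exceeds the tolerance.
            if (!Double.isNaN(lastY) && Math.abs(f.y - lastY) > LINE_TOL) {
                lines.add(line.toString());
                line.setLength(0);
            }
            if (line.length() > 0) line.append(' ');
            line.append(f.text);
            lastY = f.y;
        }
        if (line.length() > 0) lines.add(line.toString());
        return lines;
    }

    public static void main(String[] args) {
        List<Fragment> frags = List.of(
            new Fragment("world", 60, 700),
            new Fragment("Hello", 10, 700.5),
            new Fragment("second line", 10, 685));
        System.out.println(toLines(frags)); // [Hello world, second line]
    }
}
```

Real layout analysis is of course much harder: detecting column boundaries requires clustering the horizontal gaps as well, which is exactly where Hassan's algorithms come in.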

Unfortunately, his program was written against an old version of JPedal
and does not run with the new one. This is because the
PDFGenericGrouping class he used was renamed to PDFGroupingAlgorithms
and moved to the non-GPL JPedal. The new class also changed some of the
old classes' members from public to private and deleted others, which
would make rewriting his application necessary.

Fast forward to 2005: Christian Leinberger, a colleague of Tamir's,
wrote a paper entitled "Ideas for extracting data from an unstructured
document"
(http://www.chilisoftware.net/Private/Christian/ideas_for_extracting_data_from_unstructured_documents.pdf).
Christian indicated that he is using the open-source, BSD-licensed
PDFBox as his library for experimenting with algorithms that can
reliably extract text from unstructured PDFs.

I have contacted these guys and hopefully they will be willing to share
their developments with the PDF community.

As more and more content gets "pushed" into PDF, it loses its meaning to
anyone other than a human reader or a printer. Machines cannot read and
parse it reliably in a generic context; it takes sophisticated AI
algorithms based on ontologies, or other big words, to get it out. If
you're lucky, you can hack through it and get what you need. Something
to think about the next time you push content into a PDF, or even HTML.
PDF is a great way to present content for printing, but it
[EMAIL PROTECTED], pardon my French, as a primary mechanism for
presenting data that may need to be used by a machine somewhere
downstream.

Getting it out has turned into big business for companies that have
developed technology to get into the PDF, pull the important data out,
and put it into another format, usually XML. This is a growing space,
and I hope there are more developers interested in solving the problem
created by PDF-crazy folks who have managed to shove valuable data into
PDF while failing to maintain that same data in another, more usable
format (e.g. XML, or at least tagged PDF). It is best that this is done
in an open format, because the value of such technology is very high, it
is complicated to produce, and it is very useful to the general public.

Richard Braman
mailto:[EMAIL PROTECTED]
561.748.4002 (voice) 

http://www.taxcodesoftware.org
Free Open Source Tax Software

 
