I noticed that Nutch seems to have some problems parsing PDFs:

060226 131210 fetch okay, but can't parse http://www.irs.gov/pub/irs-pdf/p1828.pdf, reason: failed(2,203): Content-Type not text/html: application/pdf

I am actually working on PDF parsing technology, and have posted the following message to two open-source PDF projects (PDFBox and iText). If there is interest from Nutch developers in the responses I have received, and in how a collaborative solution might be reached, let me know.

-----Original Message-----
From: Richard Braman [mailto:[EMAIL PROTECTED]
Sent: Tuesday, February 21, 2006 10:36 AM
To: 'itext-questions@lists.sourceforge.net'; '[EMAIL PROTECTED]'; '[EMAIL PROTECTED]'
Cc: '[EMAIL PROTECTED]'
Subject: Good reading/research on PDF text extraction
In 2003, Tamir Hassan wrote an open-source program (http://www.tamirhassan.com/) to extract text out of PDF tables and columns and put it into HTML, as part of a university research project. His algorithms were quite sophisticated and well documented in http://www.tamirhassan.dsl.pipex.com/final.pdf. The results were impressive: he managed to deal with columns and the like using what he referred to as an intelligent text extraction algorithm, which uses positions to preserve text flow. He used Jpedal as his underlying PDF library. Unfortunately, his program was written against an old version of Jpedal and does not run with the new one. The PDFGenericGrouping class he used was renamed to PDFGroupingAlgorithms and moved to the non-GPL Jpedal; the new class also changed some of the old class's members from public to private and deleted others, which would make rewriting his app necessary.

Fast forward to 2005: Christian Leinberger, a colleague of Tamir's, wrote a paper entitled "Ideas for extracting data from an unstructured document" (http://www.chilisoftware.net/Private/Christian/ideas_for_extracting_data_from_unstructured_documents.pdf). Christian indicated that he is using the open-source, BSD-licensed PDFBox as his library for experimenting with algorithms that can be used to extract text reliably out of unstructured PDFs. I have contacted these guys and hopefully they will be willing to share their developments with the PDF community.

As more and more content gets "pushed" into PDF, it loses its meaning to anyone other than a human reader or a printer. Machines do not have the ability to read and parse it reliably in a generic context, and it requires sophisticated AI algorithms based on ontologies, or other big words, to get it out. If you're lucky, you can hack through it and get what you need. Something to think about the next time you push content into a PDF, or even HTML.
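To give a rough idea of the position-based approach Hassan describes, here is a toy sketch (not his actual code, and not the Jpedal or PDFBox API — the Fragment structure and tolerance value are my own illustration): PDF content streams emit text fragments in arbitrary order with page coordinates, so you can cluster fragments into lines by their y-position and sort each line left to right to recover reading order.

```java
import java.util.*;

public class TextFlow {
    // A text fragment with its page position. This is a hypothetical
    // structure; real PDF libraries expose similar per-fragment positions.
    record Fragment(double x, double y, String text) {}

    // Group fragments into lines by y-position (within a tolerance),
    // then sort each line left-to-right so reading order is preserved.
    static List<String> toLines(List<Fragment> frags, double yTol) {
        List<Fragment> sorted = new ArrayList<>(frags);
        sorted.sort(Comparator.comparingDouble(Fragment::y));
        List<String> lines = new ArrayList<>();
        List<Fragment> current = new ArrayList<>();
        double lineY = Double.NaN;
        for (Fragment f : sorted) {
            // A jump in y beyond the tolerance starts a new line.
            if (!current.isEmpty() && Math.abs(f.y() - lineY) > yTol) {
                lines.add(joinLine(current));
                current.clear();
            }
            if (current.isEmpty()) lineY = f.y();
            current.add(f);
        }
        if (!current.isEmpty()) lines.add(joinLine(current));
        return lines;
    }

    static String joinLine(List<Fragment> line) {
        line.sort(Comparator.comparingDouble(Fragment::x));
        StringBuilder sb = new StringBuilder();
        for (Fragment f : line) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(f.text());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Fragments arrive in arbitrary (content-stream) order.
        List<Fragment> frags = List.of(
            new Fragment(200, 100, "world"),
            new Fragment(100, 100, "Hello"),
            new Fragment(100, 120, "Second line"));
        System.out.println(toLines(frags, 2.0));
        // prints [Hello world, Second line]
    }
}
```

Columns make this much harder, of course — a fragment at the same y may belong to a different column — which is exactly where Hassan's more sophisticated grouping comes in.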
PDF is a great way to present content for printing, but it [EMAIL PROTECTED], pardon my French, as a primary mechanism for presenting data that may need to be used by a machine somewhere downstream. Getting it out has turned into big business for companies who have developed technology to get inside the PDF and pull important data out of it and into another format, usually XML. This is a growing space, and I hope there are more developers interested in solving the problem created by PDF-crazy folks who have managed to shove valuable data into PDF while failing to maintain that same data in another, more usable format (e.g. XML, or at least tagged PDF). It is best that this is done in an open fashion, because the value of such technology is very high, it is complicated to produce, and it is very useful to the general public.

Richard Braman
mailto:[EMAIL PROTECTED]
561.748.4002 (voice)
http://www.taxcodesoftware.org
Free Open Source Tax Software