Tamir,

That looks very interesting. I would be interested in seeing the results of your work, and I am interested in information extraction from PDFs in general. I've downloaded your paper and will read it.
FWIW - I created an enhanced text extractor for PDFBox that takes a simpler approach. It extends and enhances the existing PDFTextStripper class (which is itself based on PDFStreamEngine). The class tries to take a more refined approach to identifying text chunks as paragraphs, and it is more fully instrumented than the parent class, so it supports more flexible demarcation. For example, I have a subclass of it that I use to convert a PDF into an XML format, using a simple hierarchy of document/page/article/paragraph to organize the textual content of the PDF.

The PDFTextStripper2 class is posted to JIRA here:

https://issues.apache.org/jira/browse/PDFBOX-521

It is designed as a parallel / drop-in class that works similarly to the existing PDFTextStripper class, except that it has a few additional bells and whistles.

-mel

-----Original Message-----
From: Tamir Hassan [mailto:[email protected]]
Sent: Tuesday, November 17, 2009 2:57 AM
To: [email protected]
Subject: Contributing text grouping/segmentation algorithms to PDFBox?

Hi,

Back in 2006, two PDFBox developers (Richard Braman, Ben Litchfield) asked me if I was willing to collaborate in the development of text segmentation/grouping algorithms. At that time, I was working on an industrial project and this was not possible because of copyright issues.

Since 2008, I have been working on another university project, and have got approval to publish the work documented in the following research paper under an open-source licence:

Hassan, T.: Object-Level Document Analysis of PDF Files
2009 ACM Symposium on Document Engineering
http://www.dbai.tuwien.ac.at/staff/hassan/files/p47-hassan.pdf

This paper describes algorithms for text segmentation as well as grouping of vector graphics into objects. My current code makes use of a class named PDFObjectExtractor, which extends PDFStreamEngine, and obtains the text segments, bitmap and vector graphics as a list of objects.
I don't know if PDFBox has any such functionality yet, but I would be more than happy to work on integrating these algorithms into PDFBox. Please would you let me know what would be the best way to go about this.

Best regards,
Tamir Hassan
[email protected]

Database and Artificial Intelligence Group
Technische Universität Wien
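
For readers who want a concrete picture of the document/page/article/paragraph hierarchy mentioned in the reply, here is a minimal Java sketch of how such output might be serialized. It is only an illustration under stated assumptions: the class name HierarchyXmlSketch, its methods, and the nested-list input are hypothetical, it does not use PDFBox at all, and it is not the PDFTextStripper2 code posted to PDFBOX-521. It assumes the text has already been grouped into pages, articles and paragraphs by some extractor.

```java
import java.util.List;

// Hypothetical sketch: NOT the PDFTextStripper2 implementation from PDFBOX-521.
// It only shows one way to write already-grouped text as nested XML elements.
final class HierarchyXmlSketch {

    // Escape the characters that must not appear raw in XML text content.
    static String escape(String text) {
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;");
    }

    // Input nesting (an assumption): pages -> articles -> paragraph strings.
    static String toXml(List<List<List<String>>> pages) {
        StringBuilder out = new StringBuilder("<document>\n");
        for (List<List<String>> page : pages) {
            out.append("  <page>\n");
            for (List<String> article : page) {
                out.append("    <article>\n");
                for (String paragraph : article) {
                    out.append("      <paragraph>")
                       .append(escape(paragraph))
                       .append("</paragraph>\n");
                }
                out.append("    </article>\n");
            }
            out.append("  </page>\n");
        }
        return out.append("</document>\n").toString();
    }

    public static void main(String[] args) {
        // Two pages, one article each; the sample text is made up.
        List<List<List<String>>> pages = List.of(
            List.of(List.of("First paragraph.", "Second paragraph with A & B.")),
            List.of(List.of("Text on page two.")));
        System.out.print(toXml(pages));
    }
}
```

In a real extractor the nested lists would be replaced by whatever grouping the text stripper produces, and a proper XML writer would likely be preferable to string concatenation.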
