Tamir,

That looks very interesting.  I would be interested in seeing the results of 
your work.  I am interested in information extraction from PDFs in general.  
I've downloaded and will read your paper.

FWIW - I created an enhanced text extractor for PDFBox that takes a simpler 
approach.  It extends and enhances the existing PDFTextStripper class (which 
is itself based on the PDFStreamEngine).  The new class tries to take a more 
refined approach to identifying text chunks as paragraphs, and it is more 
fully instrumented than the parent class, so it supports more flexible 
demarcation.  For example, I have a subclass of it that I use to convert the 
PDF into an XML format, using a simple hierarchy of 
document/page/article/paragraph to organize the textual content of the PDF.
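
To make that hierarchy concrete, here is a rough sketch of the kind of output 
I mean.  This is not the PDFTextStripper2 code and it does not touch PDFBox at 
all: the class name, the method names and the exact element names are 
assumptions made up for illustration, and the paragraphs are supplied as plain 
strings that are already grouped by page and article.

```java
import java.util.List;

// Hypothetical illustration only: serializes already-grouped text into a
// document/page/article/paragraph hierarchy. Element names are assumed.
public class XmlHierarchySketch {

    // Escape the characters that are special in XML text content.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Outer list = pages, middle list = articles, inner list = paragraphs.
    static String toXml(List<List<List<String>>> pages) {
        StringBuilder sb = new StringBuilder("<document>");
        for (List<List<String>> page : pages) {
            sb.append("<page>");
            for (List<String> article : page) {
                sb.append("<article>");
                for (String paragraph : article) {
                    sb.append("<paragraph>")
                      .append(escape(paragraph))
                      .append("</paragraph>");
                }
                sb.append("</article>");
            }
            sb.append("</page>");
        }
        return sb.append("</document>").toString();
    }

    public static void main(String[] args) {
        // One page holding one article with two paragraphs.
        List<List<List<String>>> pages =
            List.of(List.of(List.of("First paragraph.", "A < B & C")));
        System.out.println(toXml(pages));
    }
}
```

In a real extractor the nested lists would be filled in by whatever logic 
detects page, article and paragraph boundaries; only the nesting itself is the 
point here.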

The PDFTextStripper2 class is posted to JIRA here:

https://issues.apache.org/jira/browse/PDFBOX-521

This is designed as a parallel, drop-in replacement that works similarly to 
the existing PDFTextStripper class, except that it has a few additional bells 
and whistles.

-mel

-----Original Message-----
From: Tamir Hassan [mailto:[email protected]] 
Sent: Tuesday, November 17, 2009 2:57 AM
To: [email protected]
Subject: Contributing text grouping/segmentation algorithms to PDFBox?

Hi,

Back in 2006, two PDFBox developers (Richard Braman, Ben Litchfield) asked 
me if I was willing to collaborate in the development of text 
segmentation/grouping algorithms.  At that time, I was working on an 
industrial project and this was not possible because of copyright issues.

Since 2008, I have been working on another university project, and have 
got approval to publish the work documented in the following research 
paper under an open-source licence:

Hassan, T.: Object-Level Document Analysis of PDF Files
2009 ACM Symposium on Document Engineering
http://www.dbai.tuwien.ac.at/staff/hassan/files/p47-hassan.pdf

This paper describes algorithms for text segmentation as well as grouping 
of vector graphics into objects.

My current code makes use of a class named PDFObjectExtractor, which 
extends PDFStreamEngine, and obtains the text segments, bitmap and vector 
graphics as a list of objects.

I don't know if PDFBox has any such functionality yet, but I would be more 
than happy to work on integrating these algorithms into PDFBox.

Please would you let me know what would be the best way to go about this.

Best regards,

Tamir Hassan
[email protected]

Database and Artificial Intelligence Group
Technische Universität Wien
