Very nice work. As a user of pdfbox rather than a developer, I would think that this would be a very useful addition.
On Mon, Nov 16, 2009 at 11:57 PM, Tamir Hassan <[email protected]> wrote:

> Hi,
>
> Back in 2006, two PDFBox developers (Richard Braman, Ben Lichfield) asked
> me if I was willing to collaborate in the development of text
> segmentation/grouping algorithms. At that time, I was working on an
> industrial project and this was not possible because of copyright issues.
>
> Since 2008, I have been working on another university project, and have got
> approval to publish the work documented in the following research paper
> under an open-source licence:
>
> Hassan, T.: Object-Level Document Analysis of PDF Files
> 2009 ACM Symposium on Document Engineering
> http://www.dbai.tuwien.ac.at/staff/hassan/files/p47-hassan.pdf
>
> This paper describes algorithms for text segmentation as well as grouping
> of vector graphics into objects.
>
> My current code makes use of a class named PDFObjectExtractor, which
> extends PDFStreamEngine, and obtains the text segments, bitmap and vector
> graphics as a list of objects.
>
> I don't know if PDFBox has any such functionality yet, but I would be more
> than happy to work on integrating these algorithms into PDFBox.
>
> Please would you let me know what would be the best way to go about this.
>
> Best regards,
>
> Tamir Hassan
> [email protected]
>
> Database and Artificial Intelligence Group
> Technische Universität Wien

--
Ted Dunning, CTO
DeepDyve
