Very nice work.  As a user of pdfbox rather than a developer, I think
this would be a very useful addition.

On Mon, Nov 16, 2009 at 11:57 PM, Tamir Hassan <[email protected]> wrote:

> Hi,
>
> Back in 2006, two PDFBox developers (Richard Braman, Ben Litchfield) asked
> me if I was willing to collaborate in the development of text
> segmentation/grouping algorithms.  At that time, I was working on an
> industrial project and this was not possible because of copyright issues.
>
> Since 2008, I have been working on another university project, and have got
> approval to publish the work documented in the following research paper
> under an open-source licence:
>
> Hassan, T.: Object-Level Document Analysis of PDF Files
> 2009 ACM Symposium on Document Engineering
> http://www.dbai.tuwien.ac.at/staff/hassan/files/p47-hassan.pdf
>
> This paper describes algorithms for text segmentation as well as grouping
> of vector graphics into objects.
>
> My current code makes use of a class named PDFObjectExtractor, which
> extends PDFStreamEngine, and obtains the text segments, bitmap and vector
> graphics as a list of objects.
>
> I don't know if PDFBox has any such functionality yet, but I would be more
> than happy to work on integrating these algorithms into PDFBox.
>
> Please would you let me know what would be the best way to go about this.
>
> Best regards,
>
> Tamir Hassan
> [email protected]
>
> Database and Artificial Intelligence Group
> Technische Universität Wien




-- 
Ted Dunning, CTO
DeepDyve
