Andreas Meier created PDFBOX-2998:
-------------------------------------

             Summary: Document layout analysis tools needed
                 Key: PDFBOX-2998
                 URL: https://issues.apache.org/jira/browse/PDFBOX-2998
             Project: PDFBox
          Issue Type: New Feature
          Components: Text extraction
    Affects Versions: 2.0.0
            Reporter: Andreas Meier
            Priority: Blocker


PDFBox needs document layout analysis tools to extract text correctly.
At the moment, the text of a document is extracted using the positions of
individual characters.
This may lead to wrong results, depending on the format of the file.
For a good extraction, layout analysis and segmentation have to be done in a
preceding step.

lapdftext (https://code.google.com/p/lapdftext) would be a good solution for
layout analysis. Unfortunately, it relies heavily on other libraries and needs
Java 1.6 to run.

The layout analysis tool should segment the file and return a list or set
of rectangles.
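
To illustrate the kind of interface meant here, below is a minimal sketch, not
an existing PDFBox API. The class LayoutSegmenter, the method segment and the
parameter maxGap are hypothetical names. It assumes the character bounding
boxes have already been collected and simply merges boxes that lie within
maxGap of each other, returning one rectangle per block. A real layout
analysis would need something more robust than this greedy merge.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: groups glyph boxes into blocks.
final class LayoutSegmenter {

    private LayoutSegmenter() {
    }

    /**
     * Greedily merges boxes that are closer than maxGap to each other
     * and returns the bounding rectangle of each resulting block.
     */
    static List<Rectangle2D> segment(List<Rectangle2D> glyphs, double maxGap) {
        List<Rectangle2D> blocks = new ArrayList<Rectangle2D>();
        for (Rectangle2D g : glyphs) {
            blocks.add((Rectangle2D) g.clone());
        }
        boolean merged = true;
        while (merged) {
            merged = false;
            search:
            for (int i = 0; i < blocks.size(); i++) {
                for (int j = i + 1; j < blocks.size(); j++) {
                    if (near(blocks.get(i), blocks.get(j), maxGap)) {
                        // Replace the pair by its union and rescan from the start.
                        Rectangle2D union = blocks.get(i).createUnion(blocks.get(j));
                        blocks.remove(j);
                        blocks.set(i, union);
                        merged = true;
                        break search;
                    }
                }
            }
        }
        return blocks;
    }

    // True if b overlaps a after a has been grown by gap on every side.
    private static boolean near(Rectangle2D a, Rectangle2D b, double gap) {
        Rectangle2D grown = new Rectangle2D.Double(
                a.getX() - gap, a.getY() - gap,
                a.getWidth() + 2 * gap, a.getHeight() + 2 * gap);
        return grown.intersects(b);
    }
}
```

Text extraction could then be run per returned rectangle instead of over the
whole page, so that characters from separate blocks are not mixed.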



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
