[jira] [Comment Edited] (PDFBOX-2998) Document layout analysis tools needed

John Hewson (JIRA) Mon, 28 Sep 2015 13:28:41 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-2998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14933921#comment-14933921
 ]


John Hewson edited comment on PDFBOX-2998 at 9/28/15 8:27 PM:
--------------------------------------------------------------

I've worked extensively with layout analysis and I don't think it's a good fit 
for PDFBox. There's no good general-purpose solution and the field is the topic 
of active academic research. Rule-based extractors such as the one linked to 
aren't particularly good solutions (indeed, that one works only for academic 
texts). Methods such as maximum-likelihood line finding and Voronoi diagrams 
are more promising, though very complex to implement and costly in terms of CPU 
time.

Unless you're planning to do, or have already done, a PhD in this area, it's 
probably not something you want to spend too much time thinking about. The 
simple approaches don't work very well, and the complex approaches, e.g. 
Docstrum, are, well, very complex.

Many of these layout analysis tools are still experimental academic projects, 
as listed by Maruan above. Thomas Breuel works on Google Books and the 
open-source OCRopus system (now [ocropy|https://github.com/tmbdev/ocropy]), and 
Tamir Hassan's [PDF Extraction Toolkit|http://www.tamirhassan.com/pdfxtk.html] 
is actually an open-source project built on top of PDFBox.



> Document layout analysis tools needed
> -------------------------------------
>
>                 Key: PDFBOX-2998
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2998
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: Text extraction
>    Affects Versions: 2.0.0
>            Reporter: Andreas Meier
>
> PDFBox will need some document layout analysis tools to extract text 
> correctly.
> At the moment the text of a document is extracted using the position of 
> single characters.
> This may lead to wrong results, due to the format of the file.
> For a good extraction, layout analysis and segmentation have to be done in a 
> previous step.
> https://code.google.com/p/lapdftext
> would be a good solution for a layout analysis tool; unfortunately, it 
> heavily relies on other libraries and needs Java 1.6 to run.
> The layout analysis tool should segment the file and return a list or set 
> of rectangles.
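
To make the request concrete, here is a minimal sketch of the kind of simple, rule-based segmentation the issue describes: it takes character bounding boxes and returns one rectangle per text line. This is not PDFBox code and not any of the tools mentioned in the thread; the class name `NaiveLineSegmenter`, the method `groupIntoLines`, and the `yTolerance` parameter are all hypothetical, and the character boxes are assumed to have been obtained elsewhere in a coordinate system where y grows downwards.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, for illustration only; not part of PDFBox.
final class NaiveLineSegmenter {

    /**
     * Groups character bounding boxes into line rectangles.
     * A box joins the current line when its vertical centre falls inside
     * the line's vertical extent, widened by yTolerance on both sides.
     */
    static List<Rectangle2D> groupIntoLines(List<Rectangle2D> chars, double yTolerance) {
        // Sort top-to-bottom, then left-to-right (assumes y grows downwards).
        List<Rectangle2D> sorted = new ArrayList<>(chars);
        sorted.sort(Comparator.comparingDouble(Rectangle2D::getMinY)
                .thenComparingDouble(Rectangle2D::getMinX));

        List<Rectangle2D> lines = new ArrayList<>();
        Rectangle2D current = null;
        for (Rectangle2D c : sorted) {
            boolean sameLine = current != null
                    && c.getCenterY() >= current.getMinY() - yTolerance
                    && c.getCenterY() <= current.getMaxY() + yTolerance;
            if (sameLine) {
                // Grow the line rectangle to cover the new character.
                current = current.createUnion(c);
            } else {
                if (current != null) {
                    lines.add(current);
                }
                // Start a new line with a copy of this character's box.
                current = c.getBounds2D();
            }
        }
        if (current != null) {
            lines.add(current);
        }
        return lines;
    }
}
```

Note that this sketch looks only at vertical position, so on a two-column page it would merge the left and right columns into a single wide "line". That is exactly the kind of failure that makes the simple approaches unreliable on real documents.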


