[ https://issues.apache.org/jira/browse/PDFBOX-2998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14933921#comment-14933921 ]
John Hewson edited comment on PDFBOX-2998 at 9/28/15 8:25 PM:
--------------------------------------------------------------

I've worked extensively with layout analysis and I don't think it's a good fit for PDFBox. There's no good general-purpose solution, and the field is still a topic of active academic research. Rule-based extractors such as the one linked to aren't particularly good solutions (indeed, that one works only for academic texts). Methods such as maximum-likelihood line finding and Voronoi diagrams are more promising, though very complex to implement and costly in terms of CPU time.

Unless you're planning to do, or have done, a PhD in this area, it's probably not something you want to spend too much time thinking about. The simple approaches don't work very well, and the complex approaches are, well, very complex, e.g. Docstrum.

Many of these layout analysis tools are still experimental academic projects, as listed by Maruan above. Thomas Breuel works on Google Books and the open-source OCRopus system, and Tamir Hassan's PDF Extraction Toolkit is actually an open-source project built on top of PDFBox.

> Document layout analysis tools needed
> -------------------------------------
>
>                 Key: PDFBOX-2998
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2998
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: Text extraction
>    Affects Versions: 2.0.0
>            Reporter: Andreas Meier
>
> PDFBox will need some document layout analysis tools to extract text correctly.
> At the moment, the text of a document is extracted using the positions of single characters.
> This may lead to wrong results, depending on the format of the file.
> For a good extraction, layout analysis and segmentation have to be done in a previous step.
> https://code.google.com/p/lapdftext
> would be a good solution for a layout analysis tool; unfortunately, it relies heavily on other libraries and needs Java 1.6 to run.
> The layout analysis tool should segment the file and return a list or set of rectangles.
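
To make the "list or set of rectangles" request concrete, here is a minimal sketch of the simplest kind of rule-based segmentation: merging character bounding boxes that overlap vertically into one rectangle per text line. Everything in it is an assumption for illustration. `NaiveLineSegmenter` and `segmentLines` are hypothetical names, not PDFBox API, and the glyph boxes are assumed to be supplied by the caller. It is also an example of the simple approaches the comment above warns about: it will happily merge lines across columns in a multi-column page.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper (not part of PDFBox): groups glyph boxes into line rectangles. */
public class NaiveLineSegmenter {

    /**
     * Merges glyph bounding boxes whose vertical extents overlap into one
     * rectangle per text line. Horizontal gaps are ignored entirely, so
     * side-by-side columns end up in the same "line".
     */
    public static List<Rectangle2D> segmentLines(List<Rectangle2D> glyphs) {
        // Work on a copy so the caller's list is not reordered.
        List<Rectangle2D> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble(Rectangle2D::getMinY));

        List<Rectangle2D> lines = new ArrayList<>();
        Rectangle2D current = null;
        for (Rectangle2D glyph : sorted) {
            if (current != null && glyph.getMinY() < current.getMaxY()) {
                // Glyph starts before the current line ends: same line.
                current = current.createUnion(glyph);
            } else {
                // No vertical overlap: close the current line and start a new one.
                if (current != null) {
                    lines.add(current);
                }
                current = (Rectangle2D) glyph.clone();
            }
        }
        if (current != null) {
            lines.add(current);
        }
        return lines;
    }
}
```

Calling `NaiveLineSegmenter.segmentLines(...)` with two boxes that overlap vertically and a third well below them yields two rectangles: the union of the first two, and the third on its own. Anything beyond single-column text needs the far more involved methods mentioned in the comment.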