[ https://issues.apache.org/jira/browse/PDFBOX-2998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15042460#comment-15042460 ]
John Logan commented on PDFBOX-2998:
------------------------------------

I'm not sure whether this is the best place for this comment, but I was looking at the text extraction bugs and saw the general discussion.

One area that could be improved with little effort is parameter selection for paragraph detection. I put together a POC of this, as solving this problem helps me out a lot.

What I did was create an analyzer, based on the PDFTextStripper code, that stores the collection of drops and indents for a page and then applies a crude heuristic to determine reasonable threshold values. It appears to function pretty well for the test cases I have where the embedded default values are too low.

From an implementation standpoint the solution is wanting because it's not very DRY. I originally implemented the solution directly in PDFTextStripper using a two-pass scan, but that makes an already complicated method even more so.

> Enhance the text extraction capabilities
> ----------------------------------------
>
>                 Key: PDFBOX-2998
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2998
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction
>    Affects Versions: 2.0.0
>            Reporter: Andreas Meier
>         Attachments: DropCapExample1.pdf, DropCapExample2.pdf,
>                      DropCapExample3.pdf, DropCapExample4.pdf,
>                      DropCapExample5.pdf, DropCapSegmentation.jpg,
>                      TextBehindText.pdf
>
> PDFBox will need some -document layout analysis tools- enhancement to the
> current text extraction to extract text correctly.
> At the moment the text of a document is extracted using the position of
> single characters.
> This may lead to wrong results, depending on the format of the file.
> There are good tools such as https://code.google.com/p/lapdftext which we
> could use to compare our current output.
> Possible enhancements are
> - enhance matching of text to a certain line, i.e. don't mix up text from
>   different lines
> - better handling of rotated text
> - handling of vertical text
> - ability to get additional text properties such as font, font size ...
> Some of these are already logged as individual tickets

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
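
The threshold selection described in the comment above (collect the drops and indents for a page, then derive thresholds from them) could be sketched roughly as follows. This is not the POC itself: the class name ThresholdEstimator, the "widest gap above the median" rule, and the fallback values are all assumptions made for illustration, and the samples are assumed to be already normalized to the units the stripper's thresholds use. A result would typically be handed to PDFTextStripper.setDropThreshold(float) or setIndentThreshold(float).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper for illustration; not part of PDFBox. */
public class ThresholdEstimator {

    /**
     * Picks a cut-off between "ordinary" and "unusually large" samples.
     * Assumed heuristic: sort the samples, look only at the upper half
     * (from the median upward), find the widest gap between neighbours,
     * and return the midpoint of that gap. Returns the fallback when
     * there are fewer than three samples or no gap at all.
     */
    public static float estimate(List<Float> samples, float fallback) {
        if (samples == null || samples.size() < 3) {
            return fallback;
        }
        List<Float> sorted = new ArrayList<>(samples);
        Collections.sort(sorted);

        float widestGap = 0f;
        float cut = fallback;
        for (int i = sorted.size() / 2; i < sorted.size() - 1; i++) {
            float gap = sorted.get(i + 1) - sorted.get(i);
            if (gap > widestGap) {
                widestGap = gap;
                cut = (sorted.get(i) + sorted.get(i + 1)) / 2f;
            }
        }
        return cut;
    }

    public static void main(String[] args) {
        // Made-up per-page samples: most lines share one drop and indent,
        // and two lines stand out as likely paragraph starts.
        List<Float> drops = Arrays.asList(1.2f, 1.2f, 1.2f, 1.2f, 1.2f, 1.2f, 3.0f, 3.0f);
        List<Float> indents = Arrays.asList(0f, 0f, 0f, 0f, 0f, 4f, 0f, 4f);

        System.out.println("drop threshold:   " + estimate(drops, 2.5f));
        System.out.println("indent threshold: " + estimate(indents, 2.0f));
    }
}
```

Keeping the estimate separate from the stripper means the samples have to be gathered in a first pass over the page, which is the duplication the comment calls "not very DRY". The rule also assumes that paragraph starts are the minority of lines on a page; where that does not hold, the median falls inside the larger cluster and the estimate is unreliable.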