[ https://issues.apache.org/jira/browse/PDFBOX-895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966086#action_12966086 ]
Martijn Brinkers commented on PDFBOX-895:
-----------------------------------------

If you disable SuppressDuplicateOverlappingText (i.e., set it to false), text extraction only takes a few seconds.

I guess trying to remove duplicate text takes such a long time because the background characters used are only from a small set of characters (d, r, l, u). The algorithm to detect overlap therefore takes a very long time. The PDF format is actually not optimal for text extraction and therefore trying to detect whether a character overlaps or not can be time consuming in cases like this.

In this particular situation I think it's better to disable overlap detection.

> Infinite recursion when trying to extract text from specific types of PDFs
> --------------------------------------------------------------------------
>
>                 Key: PDFBOX-895
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-895
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.3.1
>            Reporter: Panayiotis Vlissidis
>            Priority: Critical
>         Attachments: test.pdf
>
>
> Hello and thanks for PDFBox.
> We just started using PDFBox for text extraction (through Tika)
> and it fails to finish text extraction falling in an infinite loop
> and never returning the text.
> Please note that this happens only for a specific type of PDF
> documents (used for hand writing recognition) such as the one attached.
> Not sure if this is a bug of PDFBox or due to the nature of the PDFs,
> but I think that PDFBox should at least break out if extraction is not
> possible.
> I wish I could give you more information but I know nothing about PDF format,
> parsing, etc.
> Please let me know if you need any information or my help in any way.
> Thanks a lot for your time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
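
To illustrate the guess in the comment above, here is a minimal, self-contained Java sketch of why duplicate suppression can get slow when a page uses only a handful of distinct characters. It is a simplified model and not PDFBox's actual implementation: the `OverlapSketch` class, the `Glyph` type, the per-character bucketing, and the tolerance check are all assumptions made for illustration. In PDFBox itself, the switch the comment refers to should be `PDFTextStripper.setSuppressDuplicateOverlappingText(false)`.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class OverlapSketch {

    /** Hypothetical stand-in for a positioned character; not a PDFBox type. */
    static final class Glyph {
        final String ch;
        final float x;
        final float y;

        Glyph(String ch, float x, float y) {
            this.ch = ch;
            this.x = x;
            this.y = y;
        }
    }

    /** Number of position comparisons made by the most recent call. */
    static long comparisons;

    /**
     * Drops a glyph if an identical character was already kept within
     * `tolerance` of the same position. Each glyph is compared against
     * every previously kept glyph with the same character (assumed model).
     */
    static List<Glyph> suppressDuplicates(List<Glyph> glyphs, float tolerance) {
        comparisons = 0;
        Map<String, List<Glyph>> seenByChar = new HashMap<>();
        List<Glyph> kept = new ArrayList<>();
        for (Glyph g : glyphs) {
            List<Glyph> bucket = seenByChar.computeIfAbsent(g.ch, k -> new ArrayList<>());
            boolean duplicate = false;
            for (Glyph other : bucket) {
                comparisons++;
                if (Math.abs(other.x - g.x) < tolerance
                        && Math.abs(other.y - g.y) < tolerance) {
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate) {
                bucket.add(g);
                kept.add(g);
            }
        }
        return kept;
    }

    /** Builds n non-overlapping glyphs that cycle through the given alphabet. */
    static List<Glyph> page(int n, String alphabet) {
        List<Glyph> glyphs = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            String ch = String.valueOf(alphabet.charAt(i % alphabet.length()));
            glyphs.add(new Glyph(ch, i * 10f, 0f));
        }
        return glyphs;
    }

    public static void main(String[] args) {
        suppressDuplicates(page(4000, "drlu"), 1f);
        System.out.println("4 distinct chars:  " + comparisons + " comparisons");

        suppressDuplicates(page(4000, "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMN"), 1f);
        System.out.println("40 distinct chars: " + comparisons + " comparisons");
    }
}
```

In this model the work per character bucket grows quadratically with the bucket size, so the same number of glyphs spread over only four characters costs roughly ten times as many comparisons as when spread over forty. A real page with tens of thousands of background characters would amplify that effect accordingly.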