[ 
https://issues.apache.org/jira/browse/PDFBOX-895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966086#action_12966086
 ] 

Martijn Brinkers commented on PDFBOX-895:
-----------------------------------------

If you disable SuppressDuplicateOverlappingText (i.e., set it to false), text 
extraction takes only a few seconds. My guess is that removing duplicate text 
is so slow here because the background text is drawn from a very small set of 
characters (d, r, l, u), so the overlap check presumably has to compare each 
character against a large number of earlier occurrences of the same character. 
The PDF format is not designed with text extraction in mind, so detecting 
whether one character overlaps another can be time consuming in cases like 
this. For this particular document I think it is better to disable overlap 
detection.
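
To illustrate why a small character set hurts, here is a rough sketch of the 
kind of per-character overlap check described above. It is a simplified model 
written for this comment, not PDFBox's actual code: the Glyph class, the 
countComparisons and page helpers, and the tolerance value are all made up for 
the example. It only counts how many position comparisons are needed when every 
new character is checked against earlier occurrences of the same character.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class OverlapCostSketch {
    // Hypothetical stand-in for a positioned character; not a PDFBox type.
    static final class Glyph {
        final char c;
        final float x;
        final float y;

        Glyph(char c, float x, float y) {
            this.c = c;
            this.x = x;
            this.y = y;
        }
    }

    // Counts the position comparisons made while dropping glyphs that sit
    // within `tolerance` of an earlier glyph showing the same character.
    static long countComparisons(List<Glyph> glyphs, float tolerance) {
        Map<Character, List<Glyph>> seen = new HashMap<>();
        long comparisons = 0;
        for (Glyph g : glyphs) {
            List<Glyph> same = seen.computeIfAbsent(g.c, k -> new ArrayList<>());
            boolean duplicate = false;
            for (Glyph prev : same) {
                comparisons++;
                if (Math.abs(prev.x - g.x) < tolerance
                        && Math.abs(prev.y - g.y) < tolerance) {
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate) {
                same.add(g); // kept glyphs make later checks more expensive
            }
        }
        return comparisons;
    }

    // Builds n non-overlapping glyphs that cycle through the given alphabet.
    static List<Glyph> page(int n, String alphabet) {
        List<Glyph> out = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            out.add(new Glyph(alphabet.charAt(i % alphabet.length()), i * 10f, 0f));
        }
        return out;
    }

    public static void main(String[] args) {
        String wide = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMN"; // 40 distinct characters
        System.out.println("4 chars:  " + countComparisons(page(4000, "drlu"), 1f));
        System.out.println("40 chars: " + countComparisons(page(4000, wide), 1f));
    }
}
```

In this model the work per character grows with the square of the number of 
occurrences of that character, so the same amount of text spread over only four 
characters needs roughly ten times as many comparisons as text spread over 
forty. In PDFBox itself the check is switched off with 
PDFTextStripper.setSuppressDuplicateOverlappingText(false).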

> Infinite recursion when trying to extract text from specific types of PDFs
> --------------------------------------------------------------------------
>
>                 Key: PDFBOX-895
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-895
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.3.1
>            Reporter: Panayiotis Vlissidis
>            Priority: Critical
>         Attachments: test.pdf
>
>
> Hello and thanks for PDFBox.
> We just started using PDFBox for text extraction (through Tika) 
> and it fails to finish text extraction, falling into an infinite loop
> and never returning the text.
> Please note that this happens only for a specific type of PDF
> documents (used for handwriting recognition) such as the one attached. 
> Not sure if this is a bug of PDFBox or due to the nature of the PDFs,
> but I think that PDFBox should at least break out if extraction is not 
> possible.
> I wish I could give you more information but I know nothing about PDF format, 
> parsing, etc. 
> Please let me know if you need any information or my help in any way.
> Thanks a lot for your time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
