[ 
https://issues.apache.org/jira/browse/PDFBOX-4550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16844633#comment-16844633
 ] 

Michael Klink commented on PDFBOX-4550:
---------------------------------------

{quote}
The attached file (from PDFBOX-3442) now fails text extraction because of
{panel}
1 beginbfrange
<0000> <ffff> <0000>
endbfrange
{panel}
{quote}

Strictly speaking, failing to extract text is _correct_ if that map is the only 
information source available for mapping to Unicode.

If one wants to use the data in spite of its being invalid, one has to accept 
performance issues. (Current policy in PDFBox)

If one wants to prevent performance issues, one has to skip invalid data. (Wish 
of the OP)
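
To illustrate the second option, here is a minimal sketch of a check that would reject an entry like the one quoted above. It is not PDFBox code: the class {{BfRangeCheck}} and the method {{isPlausibleBfRange}} are hypothetical names. It assumes my reading of the ToUnicode CMap rules, namely that the start and end codes of a bfrange entry must have the same length and may differ only in their last byte, and that the last byte of the destination must not be incremented past 255.

```java
// Hypothetical helper for illustration only; not part of PDFBox.
class BfRangeCheck {

    /**
     * Returns true if a bfrange entry looks usable under the assumed rules:
     * same-length start/end codes, identical in all but the last byte,
     * end not before start, and no overflow of the destination's last byte.
     */
    static boolean isPlausibleBfRange(byte[] start, byte[] end, byte[] dest) {
        if (start.length == 0 || start.length != end.length || dest.length == 0) {
            return false; // empty codes or start/end of different lengths
        }
        int last = start.length - 1;
        // Every byte except the last one has to be identical.
        for (int i = 0; i < last; i++) {
            if (start[i] != end[i]) {
                return false;
            }
        }
        int lo = start[last] & 0xFF;
        int hi = end[last] & 0xFF;
        if (hi < lo) {
            return false; // range runs backwards
        }
        // Incrementing the destination must stay within its last byte.
        int destLast = dest[dest.length - 1] & 0xFF;
        return destLast <= 255 - (hi - lo);
    }
}
```

Under these assumptions the entry {{<0000> <ffff> <0000>}} is rejected because the first bytes of start and end differ, so a parser using such a check would skip it instead of expanding 65536 mappings. The price is the one described above: text extraction for that font then fails unless another source for the Unicode mapping is available.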



> Poor performance with corrupt ToUnicode stream
> ----------------------------------------------
>
>                 Key: PDFBOX-4550
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4550
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Rendering, Text extraction
>    Affects Versions: 2.0.15
>            Reporter: Tilman Hausherr
>            Assignee: Andreas Lehmkühler
>            Priority: Major
>             Fix For: 2.0.16, 3.0.0 PDFBox
>
>         Attachments: PDFBOX-3442-DirectResources.pdf, 
> PDFBOX-3442-DirectResources_unc.pdf, pdnekz1gvl7.pdf
>
>
> A confidential file with lots of corrupt streams has a ToUnicode stream with 
> corrupt contents in the beginbfrange segment, where start and end have 
> different lengths. This leads to poor performance. Such entries can be 
> skipped.



