[ 
https://issues.apache.org/jira/browse/PDFBOX-4550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16844111#comment-16844111
 ] 

Tilman Hausherr edited comment on PDFBOX-4550 at 5/20/19 5:06 PM:
------------------------------------------------------------------

I have that file in the regression tests of PDFTextStripper (in 
{{pdfbox/src/test/resources/input}}), so the stripper is called directly, 
without the check for whether extraction is allowed. Before the change it 
produced text; after the change it no longer does. The reason is that the 
interval is larger than 255 values. Another difference is that with the 
previous version one could display the text bounds in PDFDebugger, which is 
no longer possible. I've also attached an unencrypted version, 
[^PDFBOX-3442-DirectResources_unc.pdf], which shows the same problem with an 
unmodified ExtractText tool.



> Poor performance with corrupt ToUnicode stream
> ----------------------------------------------
>
>                 Key: PDFBOX-4550
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4550
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Rendering, Text extraction
>    Affects Versions: 2.0.15
>            Reporter: Tilman Hausherr
>            Assignee: Andreas Lehmkühler
>            Priority: Major
>             Fix For: 2.0.16, 3.0.0 PDFBox
>
>         Attachments: PDFBOX-3442-DirectResources.pdf, 
> PDFBOX-3442-DirectResources_unc.pdf, pdnekz1gvl7.pdf
>
>
> A confidential file with lots of corrupt streams has a ToUnicode stream with 
> corrupt contents in the beginbfrange segment, where the start and end codes 
> have different lengths. This leads to poor performance. Such entries can be 
> skipped.
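
The skip described in the issue could look roughly like the sketch below. It assumes each bfrange entry is already available as a pair of byte arrays (start code and end code). The names BfRangeFilter, Range, isUsable and dropMalformed are hypothetical and only illustrate the idea; this is not the actual PDFBox CMap parser code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BfRangeFilter {

    /** Hypothetical holder for one bfrange entry: raw start and end codes. */
    static final class Range {
        final byte[] start;
        final byte[] end;

        Range(byte[] start, byte[] end) {
            this.start = start;
            this.end = end;
        }
    }

    /** An entry is only usable if start and end codes have the same byte length. */
    static boolean isUsable(Range r) {
        return r.start.length == r.end.length;
    }

    /** Returns a new list without the entries whose start/end lengths differ. */
    static List<Range> dropMalformed(List<Range> ranges) {
        List<Range> kept = new ArrayList<>();
        for (Range r : ranges) {
            if (isUsable(r)) {
                kept.add(r);
            }
            // Otherwise the entry is silently skipped instead of being expanded.
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Range> ranges = Arrays.asList(
                // Well-formed: both codes are two bytes long.
                new Range(new byte[] {0x00, 0x01}, new byte[] {0x00, 0x05}),
                // Corrupt: one-byte start code, two-byte end code.
                new Range(new byte[] {0x01}, new byte[] {0x00, 0x05}));

        System.out.println(dropMalformed(ranges).size()); // prints 1
    }
}
```

Checking the lengths before expanding a range keeps the cost of a corrupt entry constant, at the price of dropping any mapping that entry might have been meant to carry.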



