[ 
https://issues.apache.org/jira/browse/PDFBOX-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17766479#comment-17766479
 ] 

Andreas Lehmkühler commented on PDFBOX-5682:
--------------------------------------------

I've found the reason for the regression. BaseParser implements a key cache for 
PDFs with huge cross reference tables. As it uses the hashCode of COSObjectKey 
as the key for the underlying HashMap, there was a chance of collisions, so that 
the number of entries in the key cache could be smaller than the number of 
entries in the xref map. In that case every call of BaseParser.getObjectKey led 
to a rebuild of the key cache. The PDF in question contains 100k objects, so in 
the end I'm not surprised that this might take a while. ;-)
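
To illustrate the effect described above, here is a minimal, self-contained 
sketch. It is not the PDFBox code: the class KeyCacheSketch, the Key class, its 
deliberately weak hashCode, and the size comparison in needsRebuild are all 
hypothetical and chosen only to make the collisions visible. It compares a 
cache keyed by hashCode with one keyed by the key object itself.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class KeyCacheSketch {

    /** Hypothetical stand-in for an object key (object number + generation). */
    static final class Key {
        final long number;
        final int generation;

        Key(long number, int generation) {
            this.number = number;
            this.generation = generation;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Key)) {
                return false;
            }
            Key other = (Key) o;
            return number == other.number && generation == other.generation;
        }

        // Deliberately weak hash (assumption for the demo): many keys collide.
        @Override
        public int hashCode() {
            return (int) (number % 1000);
        }
    }

    /** Builds keys numbered 1..count, all with generation 0. */
    static List<Key> sampleKeys(int count) {
        List<Key> keys = new ArrayList<>(count);
        for (int i = 1; i <= count; i++) {
            keys.add(new Key(i, 0));
        }
        return keys;
    }

    /** Cache keyed by hashCode(): keys with equal hashes overwrite each other. */
    static Map<Integer, Key> buildHashKeyedCache(List<Key> xrefKeys) {
        Map<Integer, Key> cache = new HashMap<>();
        for (Key k : xrefKeys) {
            cache.put(k.hashCode(), k);
        }
        return cache;
    }

    /** Cache keyed by the key itself: equals() keeps colliding keys apart. */
    static Map<Key, Key> buildKeyKeyedCache(List<Key> xrefKeys) {
        Map<Key, Key> cache = new HashMap<>();
        for (Key k : xrefKeys) {
            cache.put(k, k);
        }
        return cache;
    }

    /** Assumed rebuild rule: the cache is stale if its size differs from the xref size. */
    static boolean needsRebuild(int cacheSize, int xrefSize) {
        return cacheSize != xrefSize;
    }

    public static void main(String[] args) {
        List<Key> keys = sampleKeys(100_000);
        int hashKeyed = buildHashKeyedCache(keys).size();
        int keyKeyed = buildKeyKeyedCache(keys).size();
        System.out.println("hash-keyed cache size: " + hashKeyed);
        System.out.println("key-keyed cache size:  " + keyKeyed);
        System.out.println("hash-keyed needs rebuild: " + needsRebuild(hashKeyed, keys.size()));
        System.out.println("key-keyed needs rebuild:  " + needsRebuild(keyKeyed, keys.size()));
    }
}
```

With such a rule the hash-keyed cache never reaches the size of the xref map, 
so it would be rebuilt on every lookup, while the cache keyed by the object 
itself is built once.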

However, I've fixed that and now everything is speedy again :-) 
[~tallison] please give it a try if possible, once the snapshot is available.

[~msahyoun] Would it still make sense to have a benchmark?

> Long/permanent hang in PDFBox 3.x
> ---------------------------------
>
>                 Key: PDFBOX-5682
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5682
>             Project: PDFBox
>          Issue Type: Bug
>            Reporter: Tim Allison
>            Assignee: Andreas Lehmkühler
>            Priority: Minor
>             Fix For: 3.0.1 PDFBox, 4.0.0
>
>
> I found two files in the regression tests where we're now getting timeouts at 
> 3 minutes where we weren't before.  Unfortunately, PDFBox's export:text works 
> on both, so it is probably another structural feature, perhaps a problem in 
> Tika?
> This file halts after printing out the header for Table 19 on page 46: 
> https://corpora.tika.apache.org/base/docs/govdocs1/078/078656.pdf
> Pure PDFBox's export:text complains multiple times: "Page skipped due to an 
> invalid or missing type null", but it does finish quickly.
> This file halts after extracting {{"854,793,592"}}: 
> https://corpora.tika.apache.org/base/docs/commoncrawl3_refetched/G7/G7BO7PNCCREVF2BCY5YSYOPYDLMBYASY
> Pure PDFBox's export:text processes this without problem.



