[ 
https://issues.apache.org/jira/browse/PDFBOX-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764199#comment-17764199
 ] 

Andreas Lehmkühler commented on PDFBOX-5682:
--------------------------------------------

{quote}It looks like that causes a full parse of the file?{quote}
"getObjectsByType" searches for all indirect objects of the type FILESPEC, so 
every indirect object has to be loaded on demand, which is more or less the 
whole file. In 2.0.x all objects are already loaded, so calling 
"getObjectsByType" is cheaper there than in 3.0.x.

IMHO there are two possible solutions:
* maybe there is some room for improvement in how all objects are loaded
* don't scan all objects when looking for specific object types such as files. 
The example "org.apache.pdfbox.examples.pdmodel.ExtractEmbeddedFiles" shows how 
to get all files using PD-level objects. In 3.0.x this should be the preferred 
way to go, as it doesn't scan all indirect objects
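
To illustrate the difference between the two approaches, here is a minimal toy model in plain Java. It is only a sketch: "LazyLoadSketch", "LazyDocument", "Obj" and all method names are hypothetical and are not PDFBox classes or API. It assumes a document that parses an indirect object only when it is first dereferenced, and counts how many objects each lookup strategy forces to load.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: none of these types exist in PDFBox.
class LazyLoadSketch {

    // A fake indirect object: a type tag plus object numbers it refers to.
    static final class Obj {
        final String type;
        final List<Integer> refs;
        Obj(String type, List<Integer> refs) { this.type = type; this.refs = refs; }
    }

    // A fake document that parses an object only on first access.
    static final class LazyDocument {
        private final Map<Integer, Obj> onDisk;                        // stands in for the unparsed file
        private final Map<Integer, Obj> loaded = new LinkedHashMap<>(); // objects parsed so far

        LazyDocument(Map<Integer, Obj> onDisk) { this.onDisk = onDisk; }

        Obj get(int number) { return loaded.computeIfAbsent(number, onDisk::get); }

        int loadedCount() { return loaded.size(); }

        // Type scan: every object must be loaded before its type can be checked.
        List<Obj> objectsByType(String type) {
            List<Obj> hits = new ArrayList<>();
            for (int number : onDisk.keySet()) {
                Obj o = get(number);
                if (o.type.equals(type)) hits.add(o);
            }
            return hits;
        }

        // Targeted walk: catalog (object 1) -> name tree -> file specifications.
        List<Obj> fileSpecsViaCatalog() {
            List<Obj> hits = new ArrayList<>();
            Obj names = get(get(1).refs.get(0));
            for (int ref : names.refs) hits.add(get(ref));
            return hits;
        }
    }

    // Builds a document with two file specifications and (total - 4) unrelated objects.
    static LazyDocument build(int total) {
        Map<Integer, Obj> onDisk = new LinkedHashMap<>();
        onDisk.put(1, new Obj("Catalog", List.of(2)));
        onDisk.put(2, new Obj("Names", List.of(3, 4)));
        onDisk.put(3, new Obj("Filespec", List.of()));
        onDisk.put(4, new Obj("Filespec", List.of()));
        for (int n = 5; n <= total; n++) onDisk.put(n, new Obj("Other", List.of()));
        return new LazyDocument(onDisk);
    }

    public static int loadsForTypeScan(int total) {
        LazyDocument doc = build(total);
        doc.objectsByType("Filespec");
        return doc.loadedCount();
    }

    public static int loadsForCatalogWalk(int total) {
        LazyDocument doc = build(total);
        doc.fileSpecsViaCatalog();
        return doc.loadedCount();
    }

    public static void main(String[] args) {
        System.out.println("type scan loaded:    " + loadsForTypeScan(1000));
        System.out.println("catalog walk loaded: " + loadsForCatalogWalk(1000));
    }
}
```

In this model both strategies find the same two file specifications, but the type scan touches every object while the walk from the catalog touches only the objects on the path to them. For the real PD-level code, see the ExtractEmbeddedFiles example mentioned above.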

> Long/permanent hang in PDFBox 3.x
> ---------------------------------
>
>                 Key: PDFBOX-5682
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5682
>             Project: PDFBox
>          Issue Type: Bug
>            Reporter: Tim Allison
>            Priority: Minor
>
> I found two files in the regression tests where we're now getting timeouts at 
> 3 minutes where we weren't before.  Unfortunately, PDFBox's export:text works 
> on both, so it is probably another structural feature, perhaps a problem in 
> Tika?
> This file halts after printing out the header for Table 19 on page 46: 
> https://corpora.tika.apache.org/base/docs/govdocs1/078/078656.pdf
> Pure PDFBox's export:text complains multiple times: "Page skipped due to an 
> invalid or missing type null", but it does finish quickly.
> This file halts after extracting {{"854,793,592"}}: 
> https://corpora.tika.apache.org/base/docs/commoncrawl3_refetched/G7/G7BO7PNCCREVF2BCY5YSYOPYDLMBYASY
> Pure PDFBox's export:text processes this without problem.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

