[ 
https://issues.apache.org/jira/browse/PDFBOX-5809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842883#comment-17842883
 ] 

Marcus Korinth commented on PDFBOX-5809:
----------------------------------------

Thank you for your fast response and fixes.

Staying on 2.0 is not really an option, as 2.0 has a lot of problems when 
splitting documents. 
That was the original trigger for switching to 3.0, which also seemed to fix 
many of the problems present in 2.0.

Unfortunately 3.0.1 introduced a few new bugs. These were fixed in 3.0.2, but 
3.0.2 in turn has the slow splitting problem (which was not the case in 
3.0.1). 
3.0.2 also seems to be able to split documents even if they have a "malformed 
content stream": it just logs a warning and splits them anyway instead of 
throwing an exception.

Regarding (3): A lot of our documents are even bigger, up to 2 GB, but most 
are in the range of 200-300 MB, like the sample I sent you. Our documents also 
have up to 20k pages, but most have around 500-700. The documents are created 
and provided by third parties, so we have no control over what comes in; we 
have to split them and fix as much as we can while doing so, to prevent huge 
file sizes. At the moment we have a very stable splitter which can split 
99.99% of all provided documents without producing bloated files. 
For this reason we also have an implementation of `PDFStreamEngine` (and 
`OperatorProcessor`) which makes sure that images which are not shown on 
the page are removed from the resources...
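
A minimal, library-free sketch of that pruning idea follows. It is not the 
splitter's actual `PDFStreamEngine` implementation. It assumes that "not 
presented" means "never painted by a `Do` operator", and it models the content 
stream as a flat list of tokens. The class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in: models a page's content stream as plain tokens
// and its XObject resources as a name -> object map. Not PDFBox API.
final class UnusedXObjectPruner {

    /** Collects every name operand that directly precedes a "Do" operator. */
    static Set<String> usedXObjectNames(List<String> tokens) {
        Set<String> used = new LinkedHashSet<>();
        for (int i = 1; i < tokens.size(); i++) {
            String operand = tokens.get(i - 1);
            if ("Do".equals(tokens.get(i)) && operand.startsWith("/")) {
                used.add(operand.substring(1)); // drop the leading slash
            }
        }
        return used;
    }

    /** Returns a copy of the resource map that keeps only the used names. */
    static <T> Map<String, T> prune(Map<String, T> xobjects, Set<String> used) {
        Map<String, T> kept = new LinkedHashMap<>(xobjects);
        kept.keySet().retainAll(used);
        return kept;
    }
}
```

A real implementation would also have to follow form XObjects (which carry 
their own resources) and patterns. This sketch ignores both.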

Regarding (3) and `setHighestImportedObjectNumber`: Would it be an option to 
make the call to this method controllable via a config value or a flag?
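
To make the request concrete, here is a rough sketch of what such a switch 
could look like. Everything in it is hypothetical: the property name, both 
classes and the `expensiveScan` placeholder are stand-ins for illustration 
and are not PDFBox API.

```java
// Hypothetical sketch, not PDFBox code.
final class ImportSettings {
    // Assumed property name, chosen only for this example.
    static final String PROP = "example.splitter.scanImportedObjectNumbers";

    /** Defaults to true so behaviour is unchanged unless explicitly disabled. */
    static boolean scanEnabled() {
        return Boolean.parseBoolean(System.getProperty(PROP, "true"));
    }
}

final class PageImporter {
    private final boolean scanObjectNumbers;
    int scans = 0;     // how often the costly step ran
    int imported = 0;  // how many pages were imported

    PageImporter(boolean scanObjectNumbers) {
        this.scanObjectNumbers = scanObjectNumbers;
    }

    void importPage(Object page) {
        if (scanObjectNumbers) {
            expensiveScan(page); // placeholder for the costly per-page work
        }
        imported++;
    }

    private void expensiveScan(Object page) {
        scans++;
    }
}
```

A caller would construct the importer with 
`new PageImporter(ImportSettings.scanEnabled())` and could then disable the 
scan with `-Dexample.splitter.scanImportedObjectNumbers=false`.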

Thank you very much!


> PDDocument#importPage slowed down by factor 1300
> ------------------------------------------------
>
>                 Key: PDFBOX-5809
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5809
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Utilities
>    Affects Versions: 2.0.31, 3.0.2 PDFBox
>            Reporter: Marcus Korinth
>            Priority: Major
>             Fix For: 2.0.32, 4.0.0, 3.0.3 PDFBox
>
>         Attachments: image-2024-04-27-18-50-19-199.png
>
>
> We are using the *PDDocument#importPage* method in our own splitter, where we 
> copy pages from a _SourceDocument_ to a _TargetDocument_. In order to do so 
> we first extract the page by using the following code:
> {code:java}
> final PDPage sourcePage = sourceDocument.getPage(pageNumber);
> {code}
> Immediately afterwards we are calling:
> {code:java}
> final PDPage targetPage = targetDocument.importPage(sourcePage);
> {code}
> This approach worked just fine with *pdfbox 2.0.26*.
> We decided to upgrade to version *3.0.2* since it tackles a lot of the 
> problems.
> Unfortunately the *PDDocument#importPage* method slowed down by a factor of 
> around 1300. In version *2.0.26* it took 15 ms on average. With the latest 
> *3.0.2* it takes 20000 ms on average. That is a huge deal breaker as we 
> usually have to split documents which have several thousand pages.
> Note: The same applies when using *PDDocument#addPage*.
> Note: The problem does not appear in *3.0.1*. But we can't use that since it 
> has other major problems which break our application.
> I have prepared an example document with which you can replicate the issue. 
> Due to the file size limitation I had to prepare a WeTransfer-Link for you: 
> https://we.tl/t-lfN2wz7cAs
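
For anyone wanting to reproduce the per-page numbers quoted above, a small 
timing helper along these lines may be useful. It is only a sketch: the class 
and method names are hypothetical, and the operation being timed is passed in 
as a callback so the helper itself does not depend on PDFBox.

```java
import java.util.function.IntConsumer;

// Hypothetical helper for measuring the average per-page cost of an operation.
final class ImportTiming {

    /** Runs op for page indices 0..pageCount-1 and returns the mean time in ms. */
    static double averageMillis(int pageCount, IntConsumer op) {
        if (pageCount <= 0) {
            throw new IllegalArgumentException("pageCount must be positive");
        }
        long start = System.nanoTime();
        for (int i = 0; i < pageCount; i++) {
            op.accept(i);
        }
        long elapsedNanos = System.nanoTime() - start;
        return elapsedNanos / 1_000_000.0 / pageCount;
    }

    /** Ratio of the current per-page time to the baseline per-page time. */
    static double slowdownFactor(double baselineMs, double currentMs) {
        return currentMs / baselineMs;
    }
}
```

With the figures from the report, `slowdownFactor(15, 20000)` gives roughly 
1333, which matches the "factor 1300" in the title. In a real measurement the 
callback would wrap `targetDocument.importPage(sourceDocument.getPage(i))`; 
since `importPage` declares `IOException`, the lambda would need to catch and 
rethrow it.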


