[ 
https://issues.apache.org/jira/browse/PDFBOX-5809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17841514#comment-17841514
 ] 

Tilman Hausherr edited comment on PDFBOX-5809 at 4/27/24 4:54 PM:
------------------------------------------------------------------

There are four problems:

1) The file has "beads" (an obscure concept of PDF that is intended for 
following the "flow" of magazine / newspaper articles over several pages), 
which results in many orphan pages in the result. This has been in earlier 
versions too :(

2) The file has popup annotations, and some of them indirectly link back to the 
source page, which doesn't belong to the target document, thus more orphan 
pages :(

!image-2024-04-27-18-50-19-199.png!

3) A lot of time is spent in {{setHighestImportedObjectNumber()}}, maybe 
because the file is so big. This is only in 3.0 and higher.

4) The result file is still huge even when removing the beads during 
{{importPage()}}. A look at it with PDFDebugger shows orphan pages in the 
cross reference view.

(3) and (4) may be because of (1) and (2).
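
To illustrate why (1) and (2) lead to orphan pages, here is a toy model in plain Java. It is not PDFBox code and does not reflect how PDFBox clones objects internally; {{Node}}, {{reachable}} and {{buildFirstPage}} are hypothetical names made up for this sketch. It assumes only that importing a page copies every object reachable from it, and that each bead points to its page and to the next bead.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class OrphanPageSketch {

    /** Hypothetical stand-in for a PDF object: a label plus outgoing references. */
    static final class Node {
        final String label;
        final List<Node> refs = new ArrayList<>();

        Node(String label) {
            this.label = label;
        }

        Node link(Node target) {
            refs.add(target);
            return this;
        }
    }

    /** Labels of every object a deep copy starting at root would have to visit. */
    static Set<String> reachable(Node root) {
        Set<Node> seen = new LinkedHashSet<>(); // identity-based, so cycles are safe
        Deque<Node> todo = new ArrayDeque<>();
        todo.push(root);
        while (!todo.isEmpty()) {
            Node n = todo.pop();
            if (seen.add(n)) {
                n.refs.forEach(todo::push);
            }
        }
        Set<String> labels = new LinkedHashSet<>();
        for (Node n : seen) {
            labels.add(n.label);
        }
        return labels;
    }

    /** Builds three pages and returns the first; optionally chains them through beads. */
    static Node buildFirstPage(boolean withBeads) {
        Node page1 = new Node("page1").link(new Node("content1"));
        Node page2 = new Node("page2").link(new Node("content2"));
        Node page3 = new Node("page3").link(new Node("content3"));
        if (withBeads) {
            Node bead1 = new Node("bead1");
            Node bead2 = new Node("bead2");
            Node bead3 = new Node("bead3");
            // Each page lists its bead; each bead points to its page and the next bead.
            page1.link(bead1);
            page2.link(bead2);
            page3.link(bead3);
            bead1.link(page1).link(bead2);
            bead2.link(page2).link(bead3);
            bead3.link(page3).link(bead1);
        }
        return page1;
    }

    public static void main(String[] args) {
        System.out.println("without beads: " + reachable(buildFirstPage(false)));
        System.out.println("with beads:    " + reachable(buildFirstPage(true)));
    }
}
```

Without beads, only the page and its content are reachable. With beads, the other two pages become reachable from the first one, so in this model a copy of a single page would drag them along even though they are never added to the target's page tree. A popup annotation that refers back to a page of the source document would have the same effect.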


was (Author: tilman):
There are four problems:

1) The file has "beads" (an obscure concept of PDF), which results in many 
orphan pages in the result. This has been in earlier versions too :(

2) The file has popup annotations and some of them indirectly link back to the 
page itself, but it doesn't belong to that document, thus more orphan pages :(

!image-2024-04-27-18-50-19-199.png!

3) A lot of time is spent in {{setHighestImportedObjectNumber()}}, maybe 
because the file is so big. This is only in 3.0 and higher.

4) The result file is still huge even when removing the beads during 
{{importPage()}}. A look at it with PDFDebugger shows orphan pages in the 
cross reference view.

(3) and (4) may be because of (1) and (2).

> PDDocument#importPage slowed down by factor 1300
> ------------------------------------------------
>
>                 Key: PDFBOX-5809
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5809
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 3.0.2 PDFBox
>            Reporter: Marcus Korinth
>            Priority: Major
>         Attachments: image-2024-04-27-18-50-19-199.png
>
>
> We are using the *PDDocument#importPage* method in our own splitter, where we 
> split pages from a _SourceDocument_ to a _TargetDocument_. In order to do so, 
> we first extract the page by using the following code:
> {code:java}
> final PDPage sourcePage = sourceDocument.getPage(pageNumber);
> {code}
> Immediately afterwards we call:
> {code:java}
> final PDPage targetPage = targetDocument.importPage(sourcePage);
> {code}
> This approach worked just fine with *pdfbox 2.0.26*.
> We decided to upgrade to version *3.0.2* since it tackles a lot of the 
> problems.
> Unfortunately the *PDDocument#importPage* method slowed down by a factor of 
> around 1300. In version *2.0.26* it took 15 ms on average. With the latest 
> *3.0.2* it takes 20000 ms on average. That is a huge deal breaker as we 
> usually have to split documents which have several thousand pages.
> Note: The same applies when using *PDDocument#addPage*.
> Note: The problem does not appear in *3.0.1*. But we can't use that since it 
> has other major problems which break our application.
> I have prepared an example document with which you can replicate the issue. 
> Due to the file size limitation I had to prepare a WeTransfer-Link for you: 
> https://we.tl/t-lfN2wz7cAs



