[ 
https://issues.apache.org/jira/browse/PDFBOX-5809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17841514#comment-17841514
 ] 

Tilman Hausherr edited comment on PDFBOX-5809 at 4/28/24 6:21 PM:
------------------------------------------------------------------

There are four problems:

1) The file has "beads" (an obscure concept of PDF that is intended for 
following the "flow" of magazine / newspaper articles over several pages), 
which results in many "orphan pages" in the result (source page dictionaries in 
the result document). This has been in earlier versions too :(

2) The file has popup annotations and some of them indirectly link back to the 
page itself. The splitter handles this, except for parents of popup 
annotations that are not in the page annotation list: these are missed and 
still link to the source page dictionary, which becomes an "orphan page" in 
the result document :-(

!image-2024-04-27-18-50-19-199.png!

3) A lot of time is spent in {{setHighestImportedObjectNumber()}}, maybe 
because the file is so big. This is only in 3.0 and higher.

4) The result file is still huge even when removing the beads during 
{{importPage()}}. A look at it with PDFDebugger shows orphan pages in the 
cross reference view.

(3) and (4) may be because of (1) and (2).
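
To illustrate (2), here is a toy model of how a single missed back-reference keeps a whole source page alive in the result. It is only a sketch: {{Node}}, {{reachable()}} and {{buildImportedPage()}} are hypothetical names and not PDFBox classes or methods, and the object graph is a simplified assumption of what the imported page looks like, not a dump of the real file.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Toy model only: Node is a hypothetical stand-in for a PDF dictionary, not a PDFBox class.
class OrphanPageSketch {

    static final class Node {
        final String name;
        final List<Node> refs = new ArrayList<>();

        Node(String name) {
            this.name = name;
        }

        // Adds an outgoing reference and returns this node so calls can be chained.
        Node link(Node target) {
            refs.add(target);
            return this;
        }
    }

    // Names of all objects reachable from root; a writer has to serialize every one of them.
    static Set<String> reachable(Node root) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<Node> todo = new ArrayDeque<>();
        todo.push(root);
        while (!todo.isEmpty()) {
            Node node = todo.pop();
            if (seen.add(node.name)) {
                for (Node ref : node.refs) {
                    todo.push(ref);
                }
            }
        }
        return seen;
    }

    // Builds an imported page whose popup has a parent annotation that is not in the
    // page annotation list. If fixParent is true, the parent's page reference is
    // re-pointed at the imported page instead of the source page.
    static Node buildImportedPage(boolean fixParent) {
        Node sourcePage = new Node("sourcePage");
        sourcePage.link(new Node("sourceContent")).link(new Node("sourceBeads"));

        Node importedPage = new Node("importedPage");
        Node popup = new Node("popup");
        Node parent = new Node("popupParent");

        importedPage.link(popup);   // the popup is in the page annotation list
        popup.link(importedPage);   // its own page reference has already been fixed
        popup.link(parent);         // the parent is only reachable through the popup
        parent.link(fixParent ? importedPage : sourcePage);
        return importedPage;
    }

    public static void main(String[] args) {
        System.out.println("unfixed: " + reachable(buildImportedPage(false)));
        System.out.println("fixed:   " + reachable(buildImportedPage(true)));
    }
}
```

In the unfixed graph the source page and everything hanging off it is reachable from the imported page, so it would be written to the result; once the parent's page reference is re-pointed, only the imported page and its two annotations remain.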


was (Author: tilman):
There are four problems:

1) The file has "beads" (an obscure concept of PDF that is intended for 
following the "flow" of magazine / newspaper articles over several pages), 
which results in many orphan pages in the result. This has been in earlier 
versions too :(

2) The file has popup annotations and some of them indirectly link back to the 
page itself, but it doesn't belong to that document, thus more orphan pages :(

!image-2024-04-27-18-50-19-199.png!

3) A lot of time is spent in {{setHighestImportedObjectNumber()}}, maybe 
because the file is so big. This is only in 3.0 and higher.

4) The result file is still huge even when removing the beads during 
{{importPage()}}. A look at it with PDFDebugger shows orphan pages in the 
cross reference view.

(3) and (4) may be because of (1) and (2).

> PDDocument#importPage slowed down by factor 1300
> ------------------------------------------------
>
>                 Key: PDFBOX-5809
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5809
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Utilities
>    Affects Versions: 2.0.31, 3.0.2 PDFBox
>            Reporter: Marcus Korinth
>            Priority: Major
>             Fix For: 2.0.32, 4.0.0, 3.0.3 PDFBox
>
>         Attachments: image-2024-04-27-18-50-19-199.png
>
>
> We are using the *PDDocument#importPage* Method in our own splitter where we 
> split pages from a _SourceDocument_ to a _TargetDocument_. In order to do so 
> we first extract the page by using the following code:
> {code:java}
> final PDPage sourcePage = sourceDocument.getPage(pageNumber);
> {code}
> Immediately afterwards we call:
> {code:java}
> final PDPage targetPage = targetDocument.importPage(sourcePage);
> {code}
> This approach worked just fine with *pdfbox 2.0.26*.
> We decided to upgrade to version *3.0.2* since it tackles a lot of the 
> problems.
> Unfortunately, the *PDDocument#importPage* method slowed down by a factor of 
> about 1300. In version *2.0.26* it took 15 ms on average; with the latest 
> *3.0.2* it takes 20000 ms on average. That is a huge deal breaker, as we 
> usually have to split documents that have several thousand pages.
> Note: The same applies when using *PDDocument#addPage*.
> Note: The problem does not appear in *3.0.1*, but we can't use that version 
> since it has other major problems that break our application.
> I have prepared an example document with which you can replicate the issue. 
> Due to the file size limitation I had to prepare a WeTransfer-Link for you: 
> https://we.tl/t-lfN2wz7cAs



