Tilman Hausherr created PDFBOX-6145:
---------------------------------------
Summary: Extremely slow text extraction of single page of large PDF
Key: PDFBOX-6145
URL: https://issues.apache.org/jira/browse/PDFBOX-6145
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 3.0.6 PDFBox, 2.0.35
Reporter: Tilman Hausherr
Assignee: Tilman Hausherr
Fix For: 2.0.36, 3.0.7 PDFBox, 4.0.0
This happens with
https://www.mouser.ca/catalog/catalogcad/646/dload/pdf/MOUSER.pdf
It was discovered by showing the first page with PDFDebugger: rendering is done
in a few seconds, but the page is displayed only minutes later. This is because
of the invisible text extraction that happens.
The cause is that the stripper goes through all pages, checks whether there is
content, and only then checks whether the page is to be extracted.
Alternatively, it can be reproduced with this code:
{code:java}
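// doc is assumed to be a PDDocument already loaded from the PDF linked above
PDFTextStripper s = new PDFTextStripper();
s.setStartPage(1); // page numbers are 1-based
s.setEndPage(1);   // extract only the first page
String text = s.getText(doc); // slow although only one page is requested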
{code}
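The ordering problem described above can be illustrated with a small stand-alone sketch. It does not use PDFBox and is not the stripper's actual code: the Page class, hasContents() and both loop methods are hypothetical stand-ins, and the range-first variant is only one plausible way to avoid the cost, not necessarily the committed fix. It counts how many pages have their content inspected when only page 1 of 1000 is requested.

```java
import java.util.ArrayList;
import java.util.List;

public class PageRangeSketch {
    /** Hypothetical stand-in for a page; counts how often its content is inspected. */
    static final class Page {
        int contentChecks = 0;

        boolean hasContents() { // stands in for an expensive per-page check
            contentChecks++;
            return true;
        }
    }

    /** Order described in the report: content check first, page range second. */
    static int contentFirst(List<Page> pages, int startPage, int endPage) {
        int processed = 0;
        int pageNo = 0;
        for (Page p : pages) {
            pageNo++;
            if (p.hasContents() && pageNo >= startPage && pageNo <= endPage) {
                processed++;
            }
        }
        return processed;
    }

    /** Assumed alternative: page range first, so pages outside it are never inspected. */
    static int rangeFirst(List<Page> pages, int startPage, int endPage) {
        int processed = 0;
        int pageNo = 0;
        for (Page p : pages) {
            pageNo++;
            if (pageNo >= startPage && pageNo <= endPage && p.hasContents()) {
                processed++;
            }
        }
        return processed;
    }

    /** Sums the content checks performed across all pages. */
    static int totalChecks(List<Page> pages) {
        int sum = 0;
        for (Page p : pages) {
            sum += p.contentChecks;
        }
        return sum;
    }

    /** Builds a list of n fresh pages. */
    static List<Page> newPages(int n) {
        List<Page> pages = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            pages.add(new Page());
        }
        return pages;
    }

    public static void main(String[] args) {
        List<Page> a = newPages(1000);
        contentFirst(a, 1, 1);
        List<Page> b = newPages(1000);
        rangeFirst(b, 1, 1);
        System.out.println("content checks, content-first: " + totalChecks(a));
        System.out.println("content checks, range-first:   " + totalChecks(b));
    }
}
```

Both variants process the same single page; they differ only in how many pages are inspected along the way, which is where the time goes on a very large document.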