Tilman Hausherr created PDFBOX-6145:
---------------------------------------

             Summary: Extremely slow text extraction of single page of large PDF
                 Key: PDFBOX-6145
                 URL: https://issues.apache.org/jira/browse/PDFBOX-6145
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 3.0.6 PDFBox, 2.0.35
            Reporter: Tilman Hausherr
            Assignee: Tilman Hausherr
             Fix For: 2.0.36, 3.0.7 PDFBox, 4.0.0


This happens with
https://www.mouser.ca/catalog/catalogcad/646/dload/pdf/MOUSER.pdf
It was discovered when showing the first page with PDFDebugger: rendering is 
done in a few seconds, but the page is only displayed minutes later. This is 
because of the invisible text extraction that happens.

The cause is that the stripper goes through all pages, checks whether there is 
content, and only then checks whether the page is to be extracted.
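
The ordering problem described above can be illustrated with a toy model. This is only a sketch and not the PDFBox source: FakePage, hasContent() and both extract methods are hypothetical names, and the content check is assumed to be the expensive step.

```java
import java.util.ArrayList;
import java.util.List;

public class PageRangeSketch {

    /** Hypothetical stand-in for a PDF page; counts how often its content is inspected. */
    static final class FakePage {
        final String text;
        int contentChecks = 0;

        FakePage(String text) {
            this.text = text;
        }

        // Stands in for a potentially expensive per-page content check.
        boolean hasContent() {
            contentChecks++;
            return !text.isEmpty();
        }
    }

    /** Builds pages whose text is "page 1", "page 2", ... */
    static List<FakePage> makePages(int count) {
        List<FakePage> pages = new ArrayList<>();
        for (int i = 1; i <= count; i++) {
            pages.add(new FakePage("page " + i));
        }
        return pages;
    }

    /** Ordering described in the report: content check first, page range second. */
    static String extractContentCheckFirst(List<FakePage> pages, int startPage, int endPage) {
        StringBuilder sb = new StringBuilder();
        int pageNo = 0;
        for (FakePage page : pages) {
            pageNo++;
            if (page.hasContent() && pageNo >= startPage && pageNo <= endPage) {
                sb.append(page.text);
            }
        }
        return sb.toString();
    }

    /** Alternative ordering: page range first, content check only for pages in range. */
    static String extractRangeCheckFirst(List<FakePage> pages, int startPage, int endPage) {
        StringBuilder sb = new StringBuilder();
        int pageNo = 0;
        for (FakePage page : pages) {
            pageNo++;
            if (pageNo > endPage) {
                break; // nothing after the end page can be extracted
            }
            if (pageNo >= startPage && page.hasContent()) {
                sb.append(page.text);
            }
        }
        return sb.toString();
    }

    /** Sums the content checks performed on all pages. */
    static int totalContentChecks(List<FakePage> pages) {
        int total = 0;
        for (FakePage page : pages) {
            total += page.contentChecks;
        }
        return total;
    }

    public static void main(String[] args) {
        List<FakePage> a = makePages(1000);
        extractContentCheckFirst(a, 1, 1);
        System.out.println("content check first: " + totalContentChecks(a) + " checks");

        List<FakePage> b = makePages(1000);
        extractRangeCheckFirst(b, 1, 1);
        System.out.println("range check first:   " + totalContentChecks(b) + " checks");
    }
}
```

In this model both variants return the same text for page 1, but the first inspects every page of the document while the second inspects only the requested one.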

Alternatively, it can be reproduced with this code:

{code:java}
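        // doc is assumed to be an already loaded PDDocument of the file above
        PDFTextStripper s = new PDFTextStripper();
        // restrict extraction to the first page only (page numbers are 1-based)
        s.setStartPage(1);
        s.setEndPage(1);
        String text = s.getText(doc);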
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
