Tilman Hausherr created PDFBOX-5799:
---------------------------------------
Summary: Page with thousands of content streams takes extremely long to render or extract
Key: PDFBOX-5799
URL: https://issues.apache.org/jira/browse/PDFBOX-5799
Project: PDFBox
Issue Type: Bug
Components: Rendering, Text extraction
Affects Versions: 3.0.2 PDFBox
Reporter: Tilman Hausherr

As reported by Erik Branks on the mailing list:
{quote}when attempting text extraction from the PDF at [https://d-nb.info/1324982411/34] , either using PDFBox 3.0.0 or PDFBox 4.0.0-SNAPSHOT, the extraction uses about 1,8 GB heap memory and does not seem to terminate. I cancelled the extraction attempt after roughly 20 minutes. Is this another bad PDF or is there a bug in PDFBox?{quote}
This happens with pages 230 and 231 (maybe others). Both have thousands of content streams in the content stream array. The profiler suggests that most time is spent in {{SequenceRandomAccessRead.seek()}}.

Rendering page 230 with PDFBox 2.0: 50 seconds
Rendering page 230 with PDFBox trunk: 2990 seconds
Rendering page 231 with PDFBox trunk: 4798 seconds
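
To illustrate why a seek across thousands of concatenated streams could dominate the run time, here is a minimal, self-contained Java sketch. It assumes (the report does not confirm this) that the cost comes from locating the segment that contains the target position. `SegmentIndex`, `findLinear` and `findBinary` are hypothetical names invented for this example; this is not the PDFBox `SequenceRandomAccessRead` code and not a proposed patch.

```java
import java.util.Arrays;

/**
 * Illustrative only: a hypothetical index over a concatenation of segments.
 * This is NOT the PDFBox SequenceRandomAccessRead implementation.
 */
final class SegmentIndex {
    // startOffsets[i] is the absolute position of the first byte of segment i.
    private final long[] startOffsets;
    private final long totalLength;

    SegmentIndex(long[] segmentLengths) {
        startOffsets = new long[segmentLengths.length];
        long position = 0;
        for (int i = 0; i < segmentLengths.length; i++) {
            // Assumption of this sketch: empty segments were filtered out beforehand.
            if (segmentLengths[i] <= 0) {
                throw new IllegalArgumentException("segment " + i + " must not be empty");
            }
            startOffsets[i] = position;
            position += segmentLengths[i];
        }
        totalLength = position;
    }

    /** Linear scan: every lookup walks the segment list from the start. */
    int findLinear(long position) {
        checkRange(position);
        int index = 0;
        while (index + 1 < startOffsets.length && startOffsets[index + 1] <= position) {
            index++;
        }
        return index;
    }

    /** Binary search over the cumulative start offsets. */
    int findBinary(long position) {
        checkRange(position);
        int result = Arrays.binarySearch(startOffsets, position);
        // A negative result encodes -(insertionPoint) - 1; the containing
        // segment is the one just before the insertion point.
        return result >= 0 ? result : -result - 2;
    }

    private void checkRange(long position) {
        if (position < 0 || position >= totalLength) {
            throw new IndexOutOfBoundsException("position " + position);
        }
    }
}
```

For segment lengths {3, 5, 2}, both methods map positions 0 to 2 to segment 0, positions 3 to 7 to segment 1, and positions 8 to 9 to segment 2. With n segments, the linear variant costs O(n) per lookup and the binary search O(log n), which would matter if a parser seeks many times on a page with thousands of content streams.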