[ https://issues.apache.org/jira/browse/PDFBOX-2445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179784#comment-14179784 ]
Maruan Sahyoun commented on PDFBOX-2445:
----------------------------------------

The file uses a lot of small images which are duplicated on each page leading to the memory issue.

[~jahewson] couldn’t we probably change PDFTextStripper to not use document.getDocumentCatalog().getAllPages() as I understand that this loads everything? Or did that change already?

> Out of Memory - Extract text for Apache_Solr_4.7_Ref_Guide.pdf
> --------------------------------------------------------------
>
>                 Key: PDFBOX-2445
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2445
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 1.8.7, 2.0.0
>            Reporter: Maruan Sahyoun
>

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)