[ 
https://issues.apache.org/jira/browse/PDFBOX-2445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179784#comment-14179784
 ] 

Maruan Sahyoun commented on PDFBOX-2445:
----------------------------------------

The file contains a lot of small images that are duplicated on every page, 
which leads to the memory issue. 

[~jahewson] could we change PDFTextStripper so that it does not use 
document.getDocumentCatalog().getAllPages()? As I understand it, that call 
loads everything. Or has that already changed?
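
To illustrate the concern, here is a minimal sketch that contrasts materializing every page up front with iterating lazily. It does not use PDFBox: the class LazyPagesSketch, its methods and the simulated "live page" counter are all hypothetical, and whether getAllPages() really behaves like the eager variant is exactly the open question above.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical illustration only: none of this is PDFBox code.
class LazyPagesSketch {
    private int live = 0; // pages currently held in memory (simulated)
    private int peak = 0; // highest value of 'live' observed so far

    // Simulates loading one page together with its resources (e.g. images).
    private int load(int pageNumber) {
        live++;
        peak = Math.max(peak, live);
        return pageNumber;
    }

    // Simulates the page becoming unreachable after it has been processed.
    private void release() {
        live--;
    }

    /** Eager variant: build the complete page list first, then process it. */
    static int peakEager(int pageCount) {
        LazyPagesSketch s = new LazyPagesSketch();
        List<Integer> all = new ArrayList<>();
        for (int i = 0; i < pageCount; i++) {
            all.add(s.load(i));
        }
        for (Integer page : all) {
            s.release(); // "process" the page, then let it go
        }
        return s.peak;
    }

    /** Lazy variant: a page is loaded only when the iterator is advanced. */
    static int peakLazy(int pageCount) {
        LazyPagesSketch s = new LazyPagesSketch();
        Iterator<Integer> pages = new Iterator<Integer>() {
            private int index = 0;

            @Override
            public boolean hasNext() {
                return index < pageCount;
            }

            @Override
            public Integer next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return s.load(index++);
            }
        };
        while (pages.hasNext()) {
            pages.next();
            s.release(); // process and drop before asking for the next page
        }
        return s.peak;
    }

    public static void main(String[] args) {
        System.out.println("eager peak: " + peakEager(100));
        System.out.println("lazy peak:  " + peakLazy(100));
    }
}
```

In this simulation the eager variant holds all pages at once, while the lazy one never holds more than one. A real fix would also depend on how shared resources such as the duplicated images are cached, which the sketch does not model.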

> Out of Memory - Extract text for Apache_Solr_4.7_Ref_Guide.pdf
> --------------------------------------------------------------
>
>                 Key: PDFBOX-2445
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2445
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 1.8.7, 2.0.0
>            Reporter: Maruan Sahyoun
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
