[ 
https://issues.apache.org/jira/browse/PDFBOX-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13849344#comment-13849344
 ] 

Maruan Sahyoun commented on PDFBOX-1808:
----------------------------------------

Hi,

Could you try using PDDocument.loadNonSeq instead of PDDocument.load?
loadNonSeq parses the PDF by following the xref entries (which is in line with
the PDF spec), whereas load parses the file sequentially, which can lead to
errors such as the last one you reported.

BR
Maruan

> PDFTextStripper.getText - hight memory usage
> --------------------------------------------
>
>                 Key: PDFBOX-1808
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1808
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.2, 1.8.3
>         Environment: Windows 7
> Java jdk 1.7.0_45
>            Reporter: Guyenot Jeremy
>            Priority: Critical
>              Labels: performance
>         Attachments: 1808-java char copyof.jpg, 1808-java char 
> copyofrange.jpg, 1808-java usage.jpg, 1808-pdfbox usage.jpg, 
> 1808-snapshot.nps, DOSSIER DE CANDIDATURE_001.pdf, s5-1.png, s5-2.png, 
> s50-1.png, s50-2.png
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Hello,
> I'm trying to extract text from PDFs, but I find that PDFTextStripper uses
> a lot of memory.
> With a PDF that has 2676 pages (4.6 MB in size), it uses 1.5 GB of memory.
> I also noticed that the memory isn't freed after the getText method returns.
> You can see my code below:
> double virgule = Math.pow(10, 2);
> System.out.println("START - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
> PDDocument cd = PDDocument.load(file);
> System.out.println("PDDocument getNumberOfPages - Nombre de pages: " + cd.getNumberOfPages());
> System.out.println("PDDocument load - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
> String pdfText = "";
> try {
>     PDFTextStripper stripper = new PDFTextStripper();
>     pdfText = stripper.getText(cd);
>     System.out.println("PDFTextStripper getText - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
>     stripper.resetEngine();
>     stripper = null;
>     System.out.println("PDFTextStripper resetEngine - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
> }
> finally {
>     if (cd != null) {
>         cd.close();
>         cd = null;
>         System.out.println("PDDocument close - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
>     }
> }
> retour = new TextField(fieldName, pdfText, Field.Store.NO);
> System.out.println("TextField - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
> And the result in my output window:
> START - Total memory (Mo): 95.0
> PDDocument getNumberOfPages - Nombre de pages: 2676
> PDDocument load - Total memory (Mo): 121.0
> PDFTextStripper getText - Total memory (Mo): 757.0
> PDFTextStripper resetEngine - Total memory (Mo): 757.0
> PDDocument close - Total memory (Mo): 757.0
> TextField - Total memory (Mo): 757.0
> pdfText - Total memory (Mo): 757.0
> I also tried calling System.gc(), but the memory usage stays the same.
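
A note on the figures quoted above: Runtime.totalMemory() reports the heap the
JVM currently has reserved, not the part occupied by live objects, and the JVM
does not necessarily hand reserved heap back after a collection. A constant
value after close() or System.gc() may therefore partly reflect what is being
measured. The sketch below reads used heap as totalMemory() minus freeMemory().
The names HeapUsage, usedBytes and toMegabytes are hypothetical helpers, not
part of PDFBox or of the reporter's code. It also divides by a double; the
quoted snippet divides a long by 1000000 first, which truncates to whole
megabytes before rounding.

```java
final class HeapUsage {

    /** Bytes currently occupied by objects: reserved heap minus its free part. */
    static long usedBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Converts bytes to megabytes (10^6 bytes), rounded to two decimals. */
    static double toMegabytes(long bytes) {
        return Math.round(bytes / 1_000_000.0 * 100.0) / 100.0;
    }

    public static void main(String[] args) {
        System.out.println("before - used (MB): " + toMegabytes(usedBytes()));

        // Stand-in for a large result such as the extracted text.
        byte[] block = new byte[32 * 1024 * 1024];
        System.out.println("allocated " + block.length + " bytes - used (MB): "
                + toMegabytes(usedBytes()));

        block = null;
        System.gc(); // only a hint; the JVM may or may not collect right away
        System.out.println("after gc - used (MB): " + toMegabytes(usedBytes()));
    }
}
```

Logging toMegabytes(usedBytes()) at the same points as in the quoted code
would show whether the extracted text is still reachable after close(), or
whether only the reserved heap has stayed large.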



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)
