[ https://issues.apache.org/jira/browse/PDFBOX-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13849340#comment-13849340 ]
Guyenot Jeremy commented on PDFBOX-1808:
----------------------------------------

When a file produces this message in the NetBeans output window:

déc. 16, 2013 5:39:54 PM org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
WARNING: Specified stream length 1550 is wrong. Fall back to reading stream until 'endstream'.

the memory usage increases. Do you know why?

Logs:
-- START - Total memory (Mo): 95.0
-- File : D:\Armoires\DEVEARM\mphh\ocr\2\1450\2 - SITUATION\AUTRES ELEMENTS DE SITUATION\Reprise adulte_001.pdf
déc. 16, 2013 5:39:54 PM org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
WARNING: Specified stream length 1550 is wrong. Fall back to reading stream until 'endstream'.
----- PDFParser.getPDDocument - Total memory (Mo): 95.0
----- PDFTextStripper.getText - Total memory (Mo): 121.0
----- ALL closes - Total memory (Mo): 121.0

> PDFTextStripper.getText - hight memory usage
> --------------------------------------------
>
>                 Key: PDFBOX-1808
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1808
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.2, 1.8.3
>         Environment: Windows 7
>                      Java jdk 1.7.0_45
>            Reporter: Guyenot Jeremy
>            Priority: Critical
>              Labels: performance
>         Attachments: 1808-java char copyof.jpg, 1808-java char copyofrange.jpg, 1808-java usage.jpg, 1808-pdfbox usage.jpg, 1808-snapshot.nps, DOSSIER DE CANDIDATURE_001.pdf, s5-1.png, s5-2.png, s50-1.png, s50-2.png
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Hello,
> I'm trying to extract text from PDFs, but I find that PDFTextStripper uses a lot of memory.
> With a PDF of 2676 pages (4.6 MB file size), it uses 1.5 GB of memory.
> I also noticed that the memory is not freed after the getText method is called.
> You can see my code below:
>     double virgule = Math.pow(10, 2);
>     System.out.println("START - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
>     PDDocument cd = PDDocument.load(file);
>     System.out.println("PDDocument getNumberOfPages - Nombre de pages: " + cd.getNumberOfPages());
>     System.out.println("PDDocument load - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
>     String pdfText = "";
>     try{
>         PDFTextStripper stripper = new PDFTextStripper();
>         pdfText = stripper.getText(cd);
>         System.out.println("PDFTextStripper getText - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
>         stripper.resetEngine();
>         stripper = null;
>         System.out.println("PDFTextStripper resetEngine - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
>     }
>     finally{
>         if( cd!=null ){
>             cd.close();
>             cd = null;
>             System.out.println("PDDocument close - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
>         }
>     }
>     retour = new TextField(fieldName, pdfText, Field.Store.NO);
>     System.out.println("TextField - Total memory (Mo): " + Math.round((Runtime.getRuntime().totalMemory()/1000000) * virgule) / virgule);
> And the result in my output window:
>     START - Total memory (Mo): 95.0
>     PDDocument getNumberOfPages - Nombre de pages: 2676
>     PDDocument load - Total memory (Mo): 121.0
>     PDFTextStripper getText - Total memory (Mo): 757.0
>     PDFTextStripper resetEngine - Total memory (Mo): 757.0
>     PDDocument close - Total memory (Mo): 757.0
>     TextField - Total memory (Mo): 757.0
>     pdfText - Total memory (Mo): 757.0
> I also tried calling System.gc(), but the memory usage stays the same.

--
This message was sent by Atlassian JIRA
(v6.1.4#6159)
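
A note on the measurements quoted above: Runtime.totalMemory() reports the heap the JVM currently has reserved, not the part occupied by objects, and the JVM typically does not hand reserved heap back right after a collection. A figure that stays at 757 after close() or System.gc() may therefore say little about what is still referenced. The following is only a minimal sketch of logging both values; the class HeapUsage and its methods toMegabytes, usedMb, reservedMb and describe are hypothetical helpers for illustration, not part of PDFBox or of the reporter's code.

```java
import java.util.Locale;

// Hypothetical helper (illustration only): logs used and reserved heap side by side.
final class HeapUsage {
    // Decimal megabytes, matching the "Mo" unit used in the quoted logs.
    private static final double BYTES_PER_MB = 1_000_000.0;

    private HeapUsage() {}

    /** Converts bytes to megabytes, rounded to two decimals (floating-point division). */
    public static double toMegabytes(long bytes) {
        return Math.round(bytes / BYTES_PER_MB * 100.0) / 100.0;
    }

    /** Heap currently reserved by the JVM; this is what totalMemory() returns. */
    public static double reservedMb() {
        return toMegabytes(Runtime.getRuntime().totalMemory());
    }

    /** Heap occupied by objects, including garbage that has not been collected yet. */
    public static double usedMb() {
        Runtime rt = Runtime.getRuntime();
        return toMegabytes(rt.totalMemory() - rt.freeMemory());
    }

    /** Builds one log line containing both figures for the given step label. */
    public static String describe(String label) {
        return String.format(Locale.ROOT, "%s - used: %.2f MB, reserved: %.2f MB",
                label, usedMb(), reservedMb());
    }

    public static void main(String[] args) {
        System.out.println(describe("START"));

        // Stand-in for a memory-hungry step such as text extraction.
        byte[] block = new byte[50_000_000];
        block[0] = 1;
        System.out.println(describe("after allocation"));

        block = null;
        System.gc(); // only a hint; the JVM is free to ignore it
        System.out.println(describe("after gc hint"));
    }
}
```

In the reporter's snippet, each println could call HeapUsage.describe(...) with the step name instead. If the "used" figure drops after close() while "reserved" stays high, the heap has grown but the objects have been released. If "used" stays high as well, something is probably still holding references.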