[jira] [Comment Edited] (PDFBOX-3284) Big Pdf parsing to text - Out of memory

John Hewson (JIRA) Fri, 25 Mar 2016 10:01:35 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-3284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212079#comment-15212079
 ]


John Hewson edited comment on PDFBOX-3284 at 3/25/16 5:01 PM:
--------------------------------------------------------------

First of all, -Xmx768M isn't _that_ much memory. I'd recommend 1-2GB. I've 
parsed 100MB+ PDFs with PDFBox with this amount of memory. As Tilman says, 
memory usage is often due to how you're opening files, and sometimes it's due to 
a particular PDF (e.g. a file which includes a single giant image). Remember 
that while the PDF file may only be 23MB, PDFBox has to handle its uncompressed 
contents, parse them into various data structures, and load all the fonts from 
disk and parse those into in-memory structures too, which can start using 
up quite a bit of memory.
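
To illustrate the point about uncompressed contents: PDF content streams are 
typically Flate-compressed, so the size on disk understates what a parser has 
to hold. Below is a minimal sketch using only java.util.zip, not PDFBox itself. 
The class name FlateExpansionDemo and the sample "content stream" text are 
made up for illustration and are not taken from the reporter's file.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class FlateExpansionDemo {

    /** Compresses data with Deflate, the algorithm behind PDF's FlateDecode filter. */
    public static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Decompresses the data again, as a parser must before it can read a stream. */
    public static byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalStateException("truncated or unsupported stream");
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt stream", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    /** Hypothetical sample: repetitive text-showing operators, loosely like a page of text. */
    public static byte[] sampleContentStream(int lines) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < lines; i++) {
            sb.append("BT /F1 10 Tf 72 ").append(700 - (i % 50) * 12)
              .append(" Td (Message definition) Tj ET\n");
        }
        return sb.toString().getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] raw = sampleContentStream(100_000);
        byte[] onDisk = deflate(raw);
        byte[] inMemory = inflate(onDisk);
        System.out.printf("compressed: %d bytes, uncompressed: %d bytes%n",
                onDisk.length, inMemory.length);
    }
}
```

The exact ratio depends entirely on the content, but repetitive page text tends 
to compress well, and the parsed objects built from those bytes take more room 
again than the raw uncompressed stream.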

Personally I've had the best results using a 32-bit JVM and opening the PDF 
directly from a File with no scratch file. Feel free to upload the problem PDF 
and we can see if there's something specific about that file which is causing 
the problem.



was (Author: jahewson):
First of all, -Xmx768M isn't _that_ much memory. I'd recommend 1-2GB. I've 
parsed 100MB+ PDFs with PDFBox with this amount of memory. As Tilman says, 
memory usage is often due to how you're opening files, and sometimes it's due to 
a particular PDF (e.g. a file which includes a single giant image).

Personally I've had the best results using a 32-bit JVM and opening the PDF 
directly from a File with no scratch file. Feel free to upload the problem PDF 
and we can see if there's something specific about that file which is causing 
the problem.

> Big Pdf parsing to text - Out of memory
> ---------------------------------------
>
>                 Key: PDFBOX-3284
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3284
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 1.8.10, 1.8.11, 2.0.0, 2.1.0
>            Reporter: Nicolas Daniels
>
> I'm trying to parse a fairly large PDF (26MB) and convert it to text, but 
> I'm seeing huge memory consumption that leads to an out-of-memory error. Running 
> my test with -Xmx768M always fails; I have to increase it to 1500M to make it 
> work. 
> The resulting text is only 3MB, so I don't understand why it takes so much 
> memory.
> I've tested this code with 1.8.10, 1.8.11 & 2.0.0, with the same result.
> The PDF can be found 
> [here|https://www2.swift.com/uhbonline/books/public/en_uk/clr_3_0_stdsmx_msg_def_rpt_sch/sr2015_mx_clearing_3dot0_mdr2_solution.pdf]
> My code:
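> {code:title=Test.java|borderStyle=solid}
> import java.io.File;
> import java.io.FileInputStream;
> import java.io.FileWriter;
> import java.io.InputStream;
> import org.apache.pdfbox.pdmodel.PDDocument;
> // 1.8.x package; in 2.0.0 this class is org.apache.pdfbox.text.PDFTextStripper
> import org.apache.pdfbox.util.PDFTextStripper;
> import org.junit.Test;
>
> @Test
> public void testParsePdf_Content_Memory() throws Exception {
>     InputStream inputStream = new FileInputStream("c:/tmp/sr2015_mx_clearing_3dot0_mdr2_solution.pdf");
>     FileWriter fileWriter = new FileWriter(new File("c:/tmp/test.txt"));
>     PDDocument document = null;
>     try {
>         // Load from the stream (as in the original report) and write the text to a file
>         document = PDDocument.load(inputStream);
>         PDFTextStripper pdfTextStripper = new PDFTextStripper();
>         pdfTextStripper.writeText(document, fileWriter);
>     } finally {
>         // Release the document, the output file, and the input stream
>         if (document != null) {
>             document.close();
>         }
>         fileWriter.close();
>         inputStream.close();
>     }
> }
> {code}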



