[jira] [Created] (TIKA-1907) Big Pdf parsing to text - Out of memory

Nicolas Daniels (JIRA) Wed, 23 Mar 2016 01:21:07 -0700

Nicolas Daniels created TIKA-1907:
-------------------------------------

             Summary: Big Pdf parsing to text - Out of memory
                 Key: TIKA-1907
                 URL: https://issues.apache.org/jira/browse/TIKA-1907
             Project: Tika
          Issue Type: Bug
    Affects Versions: 1.12
            Reporter: Nicolas Daniels



Linked to PDFBox issue: [https://issues.apache.org/jira/browse/PDFBOX-3284]

I'm duplicating it here to make sure it gets fixed in Tika as well. PDFBox may 
not be the appropriate library to use in such a case.

Trying to read the same PDF using Tika leads to the same problem:

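{code:title=Test.java|borderStyle=solid}
import java.io.File;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;
import org.junit.Test;

@Test
public void testParsePdf_Content_Memory() throws Exception {
    // Write the extracted text directly to a file; try-with-resources closes
    // both the input stream and the writer even if parsing fails.
    try (InputStream inputStream = new FileInputStream("c:/tmp/sr2015_mx_clearing_3dot0_mdr2_solution.pdf");
         FileWriter fileWriter = new FileWriter(new File("c:/tmp/test.txt"))) {
        BodyContentHandler handler = new BodyContentHandler(fileWriter);
        Metadata metadata = new Metadata();
        new PDFParser().parse(inputStream, handler, metadata, new ParseContext());
    }
}
{code}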



