Nicolas Daniels created TIKA-1907:
-------------------------------------

             Summary: Big Pdf parsing to text - Out of memory
                 Key: TIKA-1907
                 URL: https://issues.apache.org/jira/browse/TIKA-1907
             Project: Tika
          Issue Type: Bug
    Affects Versions: 1.12
            Reporter: Nicolas Daniels
Linked to PDFBox issue: [https://issues.apache.org/jira/browse/PDFBOX-3284]

I'm duplicating it here to make sure it gets fixed in Tika as well. Maybe PDFBox is not the appropriate library to use in such a case.

Trying to read the same PDF using Tika leads to the same problem:

{code:title=Test.java|borderStyle=solid}
import java.io.File;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;
import org.junit.Test;

@Test
public void testParsePdf_Content_Memory() throws Exception {
    InputStream inputStream = new FileInputStream("c:/tmp/sr2015_mx_clearing_3dot0_mdr2_solution.pdf");
    try {
        // The extracted text is written straight to a file, not collected in a String.
        FileWriter fileWriter = new FileWriter(new File("c:/tmp/test.txt"));
        BodyContentHandler handler = new BodyContentHandler(fileWriter);
        Metadata metadata = new Metadata();
        new PDFParser().parse(inputStream, handler, metadata, new ParseContext());
        fileWriter.close();
    } finally {
        inputStream.close();
    }
}
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)