[ 
https://issues.apache.org/jira/browse/TIKA-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16298491#comment-16298491
 ] 

chelambarasan commented on TIKA-2496:
-------------------------------------

Hi [[email protected]],

The issue was not with the PDF file but with a zip file of around 1 GB. I tried 
with the 1.16 and 1.17 jars as well.

The Tika processor becomes slow when it picks up the larger zip files and then 
crashes.

> TIKA crashes / runs out of memory on simple PDF
> -----------------------------------------------
>
>                 Key: TIKA-2496
>                 URL: https://issues.apache.org/jira/browse/TIKA-2496
>             Project: Tika
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 1.15
>         Environment: Linux, Java 8
>            Reporter: chelambarasan
>
> We're using TIKA embedded in a webcrawler and today I've encountered a PDF 
> that results in OutOfMemory errors while being processed by TIKA.
> Tried with Xmx set to 5 GB; the PDF files are approximately 50 MB in size.
> Tika version: 1.15
> Error as below:
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>       at 
> org.apache.pdfbox.io.ScratchFileBuffer.addPage(ScratchFileBuffer.java:132)
>       at 
> org.apache.pdfbox.io.ScratchFileBuffer.ensureAvailableBytesInPage(ScratchFileBuffer.java:184)
>       at 
> org.apache.pdfbox.io.ScratchFileBuffer.write(ScratchFileBuffer.java:236)
>       at 
> org.apache.pdfbox.io.RandomAccessOutputStream.write(RandomAccessOutputStream.java:46)
>       at org.apache.pdfbox.cos.COSStream$2.write(COSStream.java:266)
>       at 
> org.apache.pdfbox.pdfparser.COSParser.readValidStream(COSParser.java:1142)
>       at 
> org.apache.pdfbox.pdfparser.COSParser.parseCOSStream(COSParser.java:970)
>       at 
> org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:781)
>       at 
> org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:742)
>       at 
> org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:673)
>       at 
> org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:633)
>       at 
> org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:241)
>       at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:276)
>       at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1132)
>       at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1066)
>       at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:141)
>       at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>       at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>       at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
> Please let us know how to fix this issue



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
