[ 
https://issues.apache.org/jira/browse/TIKA-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16298513#comment-16298513
 ] 

Tim Allison commented on TIKA-2496:
-----------------------------------

bq. Able to replicate the issue with any zip file of size more than 2gb.

Funny you mention this: just yesterday I wrote an "unraveler" for a
[PST file|https://github.com/tballison/tika-addons/tree/1.17/unravel/src/main/java/org/tallison/tika/unravelers].
The idea is that when you have large archive files (pst, mbox, zip, tar), you
either want to preprocess them to extract all of the attachments, or you want
to process them specially so that each embedded file is extracted as its own
standalone "extract".  If you are extracting text for search, for example, a
user would typically not be thrilled to have a 2gb zip file treated as a
single file.

So, would it make sense to do some preprocessing on your large zips to extract 
the contents?
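
As a rough illustration of that kind of preprocessing, here is a minimal 
sketch that streams a zip and writes each entry out as its own standalone 
file, using only java.util.zip from the JDK.  This is not the unraveler code 
linked above; the class name ZipUnraveler and the method unravel are made up 
for this example, and it does not recurse into nested archives.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not part of Tika or tika-addons.
public class ZipUnraveler {

    // Streams the zip entry by entry and writes each file under outputDir.
    // Returns the paths of the extracted files in archive order.
    public static List<Path> unravel(Path zipFile, Path outputDir) throws IOException {
        List<Path> extracted = new ArrayList<>();
        Path root = outputDir.toAbsolutePath().normalize();
        Files.createDirectories(root);
        try (InputStream in = Files.newInputStream(zipFile);
             ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                Path target = root.resolve(entry.getName()).normalize();
                // Refuse entries such as "../../x" that would land outside outputDir.
                if (!target.startsWith(root)) {
                    throw new IOException("Entry escapes output directory: " + entry.getName());
                }
                if (entry.isDirectory()) {
                    Files.createDirectories(target);
                    continue;
                }
                Files.createDirectories(target.getParent());
                // Copies only the current entry; the zip stream itself stays open.
                Files.copy(zip, target, StandardCopyOption.REPLACE_EXISTING);
                extracted.add(target);
            }
        }
        return extracted;
    }
}
```

Each extracted path could then be handed to the parser on its own, so that no 
single parse call has to deal with the whole 2gb container at once.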

Eventually, I'd like to add the unraveler functionality into Tika, but that's a 
good way off.

> TIKA crashes / runs out of memory on simple PDF
> -----------------------------------------------
>
>                 Key: TIKA-2496
>                 URL: https://issues.apache.org/jira/browse/TIKA-2496
>             Project: Tika
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 1.15
>         Environment: Linux, Java 8
>            Reporter: chelambarasan
>
> We're using TIKA embedded in a webcrawler and today I've encountered a PDF 
> that results in OutOfMemory errors while being processed by TIKA.
> Tried with Xmx 5gb and pdf file sizes are approximately 50 mb. 
> Tika version: 1.15
> Error as below:
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>       at org.apache.pdfbox.io.ScratchFileBuffer.addPage(ScratchFileBuffer.java:132)
>       at org.apache.pdfbox.io.ScratchFileBuffer.ensureAvailableBytesInPage(ScratchFileBuffer.java:184)
>       at org.apache.pdfbox.io.ScratchFileBuffer.write(ScratchFileBuffer.java:236)
>       at org.apache.pdfbox.io.RandomAccessOutputStream.write(RandomAccessOutputStream.java:46)
>       at org.apache.pdfbox.cos.COSStream$2.write(COSStream.java:266)
>       at org.apache.pdfbox.pdfparser.COSParser.readValidStream(COSParser.java:1142)
>       at org.apache.pdfbox.pdfparser.COSParser.parseCOSStream(COSParser.java:970)
>       at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:781)
>       at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:742)
>       at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:673)
>       at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:633)
>       at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:241)
>       at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:276)
>       at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1132)
>       at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1066)
>       at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:141)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>       at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
> Please let us know how to fix this issue



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
