[ https://issues.apache.org/jira/browse/TIKA-1631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15975097#comment-15975097 ]

Luis Filipe Nassif commented on TIKA-1631:
------------------------------------------

{quote}
If at all possible, consider running Tika outside of the JVM/VM/machine that 
you're running your indexer/post-processing on
{quote}
Yes, I have been considering it for more than a year. Fortunately these 
problems are very rare, thanks to Tika's active improvement! High throughput 
and low memory consumption are very important to us, and both are affected by 
using multiple processes and by the communication overhead between them. I 
will probably add an optional mode to use ForkParser in the future, thanks to 
the new configurable ForkParser timeout implemented by [~talli...@mitre.org] 
:)
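
For reference, a minimal sketch of what such an optional mode could look like. 
The ForkParser pool setup below is standard API; the timeout setter is left 
commented out because its exact name should be checked against the Tika 
version in use, so treat it as an assumption:

{code:java}
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.fork.ForkParser;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class ForkParserSketch {
    public static void main(String[] args) throws Exception {
        // Parsing runs in forked JVMs, so a runaway allocation kills a
        // fork instead of the indexer process.
        ForkParser parser = new ForkParser(
                ForkParserSketch.class.getClassLoader(), new AutoDetectParser());
        parser.setPoolSize(4); // number of forked server JVMs
        // The configurable timeout mentioned above; the exact setter name
        // may differ per Tika version, so this line is an assumption:
        // parser.setServerParseTimeoutMillis(60_000);
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            BodyContentHandler handler = new BodyContentHandler(-1);
            parser.parse(in, handler, new Metadata(), new ParseContext());
            System.out.println(handler);
        } finally {
            parser.close(); // shuts down the forked JVMs
        }
    }
}
{code}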

And after that, maybe I could contribute a new ForkParser communication 
channel using memory-mapped files instead of sockets; the former is much more 
efficient!
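
A toy sketch of the idea (not the actual ForkParser protocol): two processes 
map the same file and exchange a length-prefixed message through the shared 
region, avoiding the copies a socket would incur. File path and framing here 
are invented for illustration:

{code:java}
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MmapChannelSketch {
    public static void main(String[] args) throws Exception {
        Path file = Paths.get("/tmp/tika-fork-channel");
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.READ,
                StandardOpenOption.WRITE)) {
            // Both sides map the same region of the same file; a write on
            // one side becomes visible to the other without socket copies.
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
            byte[] msg = "parse-request".getBytes(StandardCharsets.UTF_8);
            buf.putInt(0, msg.length); // length prefix at offset 0
            buf.position(4);
            buf.put(msg);              // payload after the prefix
            // The peer process maps the same file, reads the length at
            // offset 0, then the payload; a real channel would also need
            // a handshake/notification mechanism instead of polling.
        }
    }
}
{code}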

> OutOfMemoryException in ZipContainerDetector
> --------------------------------------------
>
>                 Key: TIKA-1631
>                 URL: https://issues.apache.org/jira/browse/TIKA-1631
>             Project: Tika
>          Issue Type: Bug
>          Components: detector
>    Affects Versions: 1.8
>            Reporter: Pavel Micka
>         Attachments: cache.mpgindex
>
>
> When I try to detect a ZIP container, I rarely get this exception. It is 
> caused by the fact that the file looks like a ZIP container (magic bytes), 
> but is in fact random noise. Commons Compress then tries to read the size of 
> the LZW tables (expecting a correct stream), reads a coincidentally huge 
> number (as anything can appear at that position in the stream) and tries to 
> allocate an array several GB in size (hence the exception).
> This bug negatively affects the stability of systems running Tika, as the 
> decompressor can accidentally allocate as much memory as is available, and 
> other parts of the system might then fail to allocate their objects.
> A solution might be to add an additional parameter to the Tika config that 
> limits the size of these arrays, throwing an exception when the limit is 
> exceeded. This change should not be hard, as the method 
> InternalLZWInputStream.initializeTables() is protected. (A hypothetical 
> sketch of such a cap follows the stack trace below.)
> Exception in thread "pool-2-thread-2" java.lang.OutOfMemoryError: Java heap space
>       at org.apache.commons.compress.compressors.z._internal_.InternalLZWInputStream.initializeTables(InternalLZWInputStream.java:111)
>       at org.apache.commons.compress.compressors.z.ZCompressorInputStream.<init>(ZCompressorInputStream.java:52)
>       at org.apache.commons.compress.compressors.CompressorStreamFactory.createCompressorInputStream(CompressorStreamFactory.java:186)
>       at org.apache.tika.parser.pkg.ZipContainerDetector.detectCompressorFormat(ZipContainerDetector.java:106)
>       at org.apache.tika.parser.pkg.ZipContainerDetector.detect(ZipContainerDetector.java:92)
>       at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:61)
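
For illustration only, a hypothetical sketch of the cap the reporter proposes. 
The class, method, and constant names are invented; the real fix would live in 
Commons Compress (e.g. via the protected initializeTables()) rather than here:

{code:java}
// Hypothetical illustration of the proposed cap; names are made up and
// do not reflect the actual Commons Compress internals.
public class BoundedLzwTables {

    // Would come from the Tika config parameter the reporter suggests.
    private static final int MAX_TABLE_SIZE = 1 << 24;

    static int[] allocateTable(int requestedSize) {
        // A table size read from random noise can be absurdly large;
        // fail fast instead of attempting a multi-GB allocation.
        if (requestedSize < 0 || requestedSize > MAX_TABLE_SIZE) {
            throw new IllegalArgumentException("LZW table size "
                    + requestedSize + " exceeds cap " + MAX_TABLE_SIZE
                    + "; input is probably not a real compressed stream");
        }
        return new int[requestedSize];
    }
}
{code}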


