[ https://issues.apache.org/jira/browse/TIKA-416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12983229#action_12983229 ]
Jukka Zitting edited comment on TIKA-416 at 1/18/11 10:35 AM:
--------------------------------------------------------------
An initial version of this feature is now working and included in the latest
trunk.
To illustrate the improvement, here's what I see with one somewhat large
Excel document when the heap is limited to 32 MB:
$ java -Xmx32m -jar tika-app-0.9-SNAPSHOT.jar large.xls
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at org.apache.poi.poifs.storage.RawDataBlock.<init>(RawDataBlock.java:69)
at org.apache.poi.poifs.storage.RawDataBlockList.<init>(RawDataBlockList.java:55)
at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:157)
at org.apache.tika.detect.POIFSContainerDetector.getTopLevelNames(POIFSContainerDetector.java:145)
at org.apache.tika.detect.POIFSContainerDetector.detect(POIFSContainerDetector.java:96)
at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:60)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:126)
at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:94)
at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:273)
at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:80)
The OutOfMemoryError is really troublesome in many container environments where
hitting the memory limit affects all active threads, not just the one using
Tika.
With the new out-of-process parsing feature, it's possible to externalize this
problem into a separate background process:
$ java -Xmx32m -jar tika-app-0.9-SNAPSHOT.jar --fork large.xls
Exception in thread "main" java.io.IOException: Lost connection to a forked server process
at org.apache.tika.fork.ForkClient.waitForResponse(ForkClient.java:149)
at org.apache.tika.fork.ForkClient.call(ForkClient.java:84)
at org.apache.tika.fork.ForkParser.parse(ForkParser.java:78)
at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:94)
at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:273)
at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:80)
An ordinary exception like this IOException is much easier to recover from
than an OutOfMemoryError.
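The general idea can be sketched with nothing but the JDK. The following is
only an illustration of the technique, not Tika's ForkParser implementation:
the class names ForkSketch and Child, the runInChild method and the "hog"
argument are all hypothetical. The parent starts a child JVM with a 32 MB
heap, the child exhausts that heap, and the parent sees the failure as a
plain IOException.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class ForkSketch {

    /** Entry point of the child JVM (hypothetical stand-in for a parser). */
    public static class Child {
        public static void main(String[] args) {
            if (args.length > 0 && "hog".equals(args[0])) {
                // Simulate a document that needs more heap than is available.
                List<byte[]> blocks = new ArrayList<>();
                while (true) {
                    blocks.add(new byte[1024 * 1024]);
                }
            }
        }
    }

    /** Runs Child in a separate 32 MB JVM; a failed child becomes an IOException. */
    static void runInChild(String mode) throws IOException {
        String java = System.getProperty("java.home")
                + File.separator + "bin" + File.separator + "java";
        Process child = new ProcessBuilder(
                java, "-Xmx32m",
                "-cp", System.getProperty("java.class.path"),
                Child.class.getName(), mode)
                // Discard the child's output so its stack trace stays out of the parent's console.
                .redirectErrorStream(true)
                .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                .start();
        try {
            int exitCode = child.waitFor();
            if (exitCode != 0) {
                throw new IOException("Child process failed with exit code " + exitCode);
            }
        } catch (InterruptedException e) {
            child.destroy();
            Thread.currentThread().interrupt();
            throw new IOException("Interrupted while waiting for child process", e);
        }
    }

    public static void main(String[] args) {
        try {
            runInChild("hog");
            System.out.println("Child finished normally");
        } catch (IOException e) {
            // Only the child ran out of memory; the parent JVM carries on.
            System.out.println("Recovered: " + e.getMessage());
        }
    }
}
```

The sketch assumes the compiled classes are on the classpath reported by
java.class.path and that the JDK is version 9 or later (for
Redirect.DISCARD). The parent's own heap is never under pressure, so its
other threads keep running while the failure is handled.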
> Out-of-process text extraction
> ------------------------------
>
> Key: TIKA-416
> URL: https://issues.apache.org/jira/browse/TIKA-416
> Project: Tika
> Issue Type: New Feature
> Components: parser
> Reporter: Jukka Zitting
> Assignee: Jukka Zitting
> Priority: Minor
> Fix For: 0.9
>
>
> There's currently no easy way to guard against JVM crashes or excessive
> memory or CPU use caused by parsing very large, broken, or intentionally
> malicious input documents. To better protect against such cases, and to
> generally improve the manageability of Tika's resource consumption, it would
> be great if we had a way to run Tika parsers in separate JVM processes. This
> could be handled either as a separate "Tika parser daemon" or as an
> explicitly managed pool of forked JVMs.