[ https://issues.apache.org/jira/browse/TIKA-416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jukka Zitting resolved TIKA-416.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 0.9
         Assignee: Jukka Zitting

An initial version of this feature is now working and included in the latest trunk.

To illustrate the improvement, here's what I'm seeing for example with one somewhat large Excel document:

    $ java -Xmx32m -jar tika-app-0.9-SNAPSHOT.jar large.xls
    Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at org.apache.poi.poifs.storage.RawDataBlock.<init>(RawDataBlock.java:69)
        at org.apache.poi.poifs.storage.RawDataBlockList.<init>(RawDataBlockList.java:55)
        at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:157)
        at org.apache.tika.detect.POIFSContainerDetector.getTopLevelNames(POIFSContainerDetector.java:145)
        at org.apache.tika.detect.POIFSContainerDetector.detect(POIFSContainerDetector.java:96)
        at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:60)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:126)
        at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:94)
        at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:273)
        at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:80)

The OutOfMemoryError is really troublesome in many container environments where hitting the memory limit affects all active threads, not just the one using Tika.
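
One way to keep such a failure away from other threads is to run the memory-hungry work in a child JVM with its own heap limit, so that the parent only sees a failed process. The following is a minimal sketch of that idea using only JDK classes. It is not Tika's fork implementation: the `ForkedRun` class and its method names are hypothetical, and it assumes a `java` binary on the PATH.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical helper, not part of Tika: runs a command in a child JVM so
// that an OutOfMemoryError there cannot exhaust the caller's heap.
class ForkedRun {

    // Builds a command line of the same shape as the example above:
    // java -Xmx<maxHeap> -jar <jar> <document>
    static List<String> command(String javaBin, String maxHeap, String jar, String document) {
        return List.of(javaBin, "-Xmx" + maxHeap, "-jar", jar, document);
    }

    // Starts the child, collects its output, and turns a non-zero exit
    // status into an ordinary IOException that the caller can catch.
    // Note: this sketch buffers all output in memory; a real caller would
    // stream it instead.
    static String run(List<String> command) throws IOException, InterruptedException {
        Process child = new ProcessBuilder(command)
                .redirectErrorStream(true) // merge stderr into stdout
                .start();
        String output;
        try (InputStream in = child.getInputStream()) {
            output = new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
        int exit = child.waitFor();
        if (exit != 0) {
            throw new IOException("Child process exited with status " + exit + ": " + output);
        }
        return output;
    }

    public static void main(String[] args) throws InterruptedException {
        if (args.length != 2) {
            System.err.println("usage: ForkedRun <tika-app.jar> <document>");
            return;
        }
        try {
            System.out.print(run(command("java", "32m", args[0], args[1])));
        } catch (IOException e) {
            // The parent JVM is still healthy and can report or retry.
            System.err.println("Extraction failed: " + e.getMessage());
        }
    }
}
```

If the child runs out of heap, it exits with a non-zero status and the parent gets an `IOException` instead of an `OutOfMemoryError` of its own.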

With the new out-of-process parsing feature, it's possible to externalize this problem into a separate background process:

    $ java -Xmx32m -jar tika-app-0.9-SNAPSHOT.jar --fork comlex-document.xls
    Exception in thread "main" java.io.IOException: Lost connection to a forked server process
        at org.apache.tika.fork.ForkClient.waitForResponse(ForkClient.java:149)
        at org.apache.tika.fork.ForkClient.call(ForkClient.java:84)
        at org.apache.tika.fork.ForkParser.parse(ForkParser.java:78)
        at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:94)
        at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:273)
        at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:80)

Such normal exceptions are much easier to recover from.

> Out-of-process text extraction
> ------------------------------
>
>                 Key: TIKA-416
>                 URL: https://issues.apache.org/jira/browse/TIKA-416
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Jukka Zitting
>            Assignee: Jukka Zitting
>            Priority: Minor
>             Fix For: 0.9
>
>
> There's currently no easy way to guard against JVM crashes or excessive
> memory or CPU use caused by parsing very large, broken or intentionally
> malicious input documents. To better protect against such cases and to
> generally improve the manageability of resource consumption by Tika it would
> be great if we had a way to run Tika parsers in separate JVM processes. This
> could be handled either as a separate "Tika parser daemon" or as an
> explicitly managed pool of forked JVMs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.