[jira] Resolved: (TIKA-416) Out-of-process text extraction

Jukka Zitting (JIRA) Tue, 18 Jan 2011 07:35:09 -0800

     [ https://issues.apache.org/jira/browse/TIKA-416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Jukka Zitting resolved TIKA-416.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 0.9
         Assignee: Jukka Zitting

An initial version of this feature is now working and included in the latest 
trunk.

To illustrate the improvement, here's what I see when parsing one somewhat 
large Excel document with a 32 MB heap:

$ java -Xmx32m -jar tika-app-0.9-SNAPSHOT.jar large.xls
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at org.apache.poi.poifs.storage.RawDataBlock.<init>(RawDataBlock.java:69)
        at org.apache.poi.poifs.storage.RawDataBlockList.<init>(RawDataBlockList.java:55)
        at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:157)
        at org.apache.tika.detect.POIFSContainerDetector.getTopLevelNames(POIFSContainerDetector.java:145)
        at org.apache.tika.detect.POIFSContainerDetector.detect(POIFSContainerDetector.java:96)
        at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:60)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:126)
        at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:94)
        at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:273)
        at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:80)

The OutOfMemoryError is especially troublesome in many container environments, 
where exhausting the shared heap affects all active threads, not just the one 
using Tika.

With the new out-of-process parsing feature, it's possible to externalize this 
problem into a separate background process:

$ java -Xmx32m -jar tika-app-0.9-SNAPSHOT.jar --fork comlex-document.xls
Exception in thread "main" java.io.IOException: Lost connection to a forked server process
        at org.apache.tika.fork.ForkClient.waitForResponse(ForkClient.java:149)
        at org.apache.tika.fork.ForkClient.call(ForkClient.java:84)
        at org.apache.tika.fork.ForkParser.parse(ForkParser.java:78)
        at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:94)
        at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:273)
        at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:80)

An ordinary exception like this IOException is much easier to recover from 
than an OutOfMemoryError.
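
As a rough illustration of why this helps, here is a minimal Java sketch of 
the general pattern. It is not how ForkParser is implemented. It only shows a 
child JVM started with its own heap limit, an abnormal exit reported as an 
IOException, and a batch loop that skips the failed document and carries on. 
The ForkRecoverySketch class, the Extractor interface and all method names are 
hypothetical and exist only for this example.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ForkRecoverySketch {

    /** Hypothetical stand-in for one extraction call; not a Tika type. */
    interface Extractor {
        void extract(String path) throws IOException;
    }

    /** Builds a command line for a child JVM with its own heap limit. */
    static List<String> childCommand(int maxHeapMb, String jar, String... args) {
        List<String> command = new ArrayList<>();
        // Reuse the java binary of the JVM that is currently running.
        command.add(System.getProperty("java.home")
                + File.separator + "bin" + File.separator + "java");
        command.add("-Xmx" + maxHeapMb + "m");
        command.add("-jar");
        command.add(jar);
        command.addAll(Arrays.asList(args));
        return command;
    }

    /** Runs the child JVM; an abnormal exit surfaces as an IOException. */
    static void runChild(List<String> command) throws IOException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        try {
            int exitCode = process.waitFor();
            if (exitCode != 0) {
                throw new IOException("Child JVM exited with code " + exitCode);
            }
        } catch (InterruptedException e) {
            process.destroy();
            Thread.currentThread().interrupt();
            throw new IOException("Interrupted while waiting for child JVM", e);
        }
    }

    /** Processes every path and returns the ones that failed. */
    static List<String> extractAll(List<String> paths, Extractor extractor) {
        List<String> failed = new ArrayList<>();
        for (String path : paths) {
            try {
                extractor.extract(path);
            } catch (IOException e) {
                // The parent JVM is unaffected, so just note the failure.
                failed.add(path);
            }
        }
        return failed;
    }
}
```

A caller could combine the pieces as 
extractAll(paths, p -> runChild(childCommand(32, "tika-app-0.9-SNAPSHOT.jar", p))), 
so that a document that exhausts the child's heap only ends up in the returned 
list of failures.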

> Out-of-process text extraction
> ------------------------------
>
>                 Key: TIKA-416
>                 URL: https://issues.apache.org/jira/browse/TIKA-416
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Jukka Zitting
>            Assignee: Jukka Zitting
>            Priority: Minor
>             Fix For: 0.9
>
>
> There's currently no easy way to guard against JVM crashes or excessive 
> memory or CPU use caused by parsing very large, broken or intentionally 
> malicious input documents. To better protect against such cases and to 
> generally improve the manageability of resource consumption by Tika it would 
> be great if we had a way to run Tika parsers in separate JVM processes. This 
> could be handled either as a separate "Tika parser daemon" or as an 
> explicitly managed pool of forked JVMs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
