[ https://issues.apache.org/jira/browse/TIKA-416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12983229#action_12983229 ]

Jukka Zitting edited comment on TIKA-416 at 1/18/11 10:35 AM:
--------------------------------------------------------------

An initial version of this feature is now working and included in the latest 
trunk.

To illustrate the improvement, here's what I'm seeing with a fairly large 
Excel document:

$ java -Xmx32m -jar tika-app-0.9-SNAPSHOT.jar large.xls
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at org.apache.poi.poifs.storage.RawDataBlock.<init>(RawDataBlock.java:69)
        at org.apache.poi.poifs.storage.RawDataBlockList.<init>(RawDataBlockList.java:55)
        at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:157)
        at org.apache.tika.detect.POIFSContainerDetector.getTopLevelNames(POIFSContainerDetector.java:145)
        at org.apache.tika.detect.POIFSContainerDetector.detect(POIFSContainerDetector.java:96)
        at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:60)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:126)
        at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:94)
        at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:273)
        at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:80)

The OutOfMemoryError is especially troublesome in container environments, 
where hitting the memory limit affects all active threads, not just the one 
using Tika.

With the new out-of-process parsing feature, it's possible to externalize this 
problem into a separate background process:

$ java -Xmx32m -jar tika-app-0.9-SNAPSHOT.jar --fork large.xls
Exception in thread "main" java.io.IOException: Lost connection to a forked server process
        at org.apache.tika.fork.ForkClient.waitForResponse(ForkClient.java:149)
        at org.apache.tika.fork.ForkClient.call(ForkClient.java:84)
        at org.apache.tika.fork.ForkParser.parse(ForkParser.java:78)
        at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:94)
        at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:273)
        at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:80)

A normal exception like this is much easier to recover from.
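The same fork-based parsing can also be used programmatically through the new 
ForkParser class. A minimal sketch (the file name and the exact exception 
handling here are illustrative, not from the patch):

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.fork.ForkParser;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class ForkExample {
    public static void main(String[] args) throws Exception {
        // Run an AutoDetectParser inside a forked JVM; the fork inherits
        // this class loader so it can load the parser classes over the pipe.
        ForkParser parser = new ForkParser(
                ForkExample.class.getClassLoader(), new AutoDetectParser());
        try (InputStream in = Files.newInputStream(Paths.get("large.xls"))) {
            parser.parse(in, new BodyContentHandler(),
                    new Metadata(), new ParseContext());
        } catch (IOException e) {
            // An OutOfMemoryError (or crash) in the forked JVM surfaces
            // here as a lost-connection IOException in the client JVM.
            System.err.println("Forked parse failed: " + e.getMessage());
        } finally {
            parser.close(); // shut down the forked server process
        }
    }
}
```

The key point is that the client JVM only ever sees an IOException, so a 
single bad document can't take down the whole process.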

> Out-of-process text extraction
> ------------------------------
>
>                 Key: TIKA-416
>                 URL: https://issues.apache.org/jira/browse/TIKA-416
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Jukka Zitting
>            Assignee: Jukka Zitting
>            Priority: Minor
>             Fix For: 0.9
>
>
> There's currently no easy way to guard against JVM crashes or excessive 
> memory or CPU use caused by parsing very large, broken or intentionally 
> malicious input documents. To better protect against such cases and to 
> generally improve the manageability of resource consumption by Tika it would 
> be great if we had a way to run Tika parsers in separate JVM processes. This 
> could be handled either as a separate "Tika parser daemon" or as an 
> explicitly managed pool of forked JVMs.
