Hi,

I opened a couple of issues to note some parser instability:

https://issues.apache.org/jira/browse/TIKA-815
https://issues.apache.org/bugzilla/show_bug.cgi?id=52372
https://issues.apache.org/bugzilla/show_bug.cgi?id=52373
https://issues.apache.org/jira/browse/COMPRESS-169

TIKA-815 is the umbrella issue: it points out that Tika could use a few
more tests to ensure that the underlying parsers are robust. Because
Tika exposes a common parser interface, the same stress tests can be
applied to every parser, which may be a good idea. The code is simple
and available on GitHub. Feedback appreciated.
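
To make the idea concrete, here is a minimal sketch of a bit-flipping
stress test. It is not the code from GitHub: ContentParser, flipBit and
findUnexpectedFailures are hypothetical names, and the toy parser in
main stands in for a real call such as Tika.parseToString. The
assumption is that a robust parser either succeeds or fails with one
agreed exception type, and anything else counts as instability.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class BitFlipStress {

    /** Hypothetical stand-in for a parser call such as Tika.parseToString. */
    interface ContentParser {
        String parse(byte[] content) throws Exception;
    }

    /** Returns a copy of the input with exactly one bit inverted. */
    static byte[] flipBit(byte[] original, int bitIndex) {
        byte[] copy = original.clone();
        copy[bitIndex / 8] ^= (byte) (1 << (bitIndex % 8));
        return copy;
    }

    /**
     * Flips each bit in turn, parses the corrupted copy, and records the bit
     * positions where the parser failed with anything other than the expected
     * exception type (including Errors such as OutOfMemoryError).
     */
    static List<Integer> findUnexpectedFailures(
            byte[] original, ContentParser parser, Class<? extends Exception> expected) {
        List<Integer> failures = new ArrayList<>();
        for (int bit = 0; bit < original.length * 8; bit++) {
            try {
                parser.parse(flipBit(original, bit));
            } catch (Throwable t) {
                if (!expected.isInstance(t)) {
                    failures.add(bit);
                }
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        byte[] sample = "hello".getBytes(StandardCharsets.US_ASCII);
        // Toy parser: crashes with an unchecked exception when the first byte is negative.
        ContentParser toy = content -> {
            if (content[0] < 0) {
                throw new IllegalStateException("unexpected crash");
            }
            return new String(content, StandardCharsets.US_ASCII);
        };
        // With a real parser, the expected type would be the library's own
        // checked exception (TikaException in Tika's case).
        System.out.println(findUnexpectedFailures(sample, toy, IOException.class));
    }
}
```

Flipping every bit is only practical for small files; for larger ones,
sampling random bit positions would be the obvious variation.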

Now a question that pertains more to the user list. In TIKA-815, Nick
pointed out that one could use ForkParser to improve stability. I didn't
manage to get it to work.

When I use the command-line Tika app, e.g. with

java -jar /tmp/tika-app-1.0.jar -v -t -f  brokenFile.doc

then Tika reports nothing.

But if I try to reproduce something similar programmatically, I run into
strange errors. The first one occurs because my current ClassLoader isn't
serializable and the client tries to serialize it:

org.apache.tika.exception.TikaException: Failed to communicate with a
forked parser process. The process has most likely crashed due to some
error like running out of memory. A new process will be started for
the next parsing request.
        at org.apache.tika.fork.ForkParser.parse(ForkParser.java:123)
        at org.apache.tika.Tika.parseToString(Tika.java:380)
        at org.apache.tika.Tika.parseToString(Tika.java:414)
        at no.finntech.tika.harderner.TikaIndexerHardenerTest.parseContent(TikaIndexerHardenerTest.java:142)
        at no.finntech.tika.harderner.TikaIndexerHardenerTest.flipBitAndIndexContent(TikaIndexerHardenerTest.java:125)
        at no.finntech.tika.harderner.TikaIndexerHardenerTest.originalFileIndexesProperly4(TikaIndexerHardenerTest.java:69)
        at no.finntech.tika.harderner.TikaIndexerHardenerTest.main(TikaIndexerHardenerTest.java:170)
Caused by: java.io.NotSerializableException: sun.misc.Launcher$AppClassLoader
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1164)
        at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1518)
        at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1483)
        at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1400)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1158)
        at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:330)
        at java.util.HashMap.writeObject(HashMap.java:1001)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:945)
        at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1469)
        at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1400)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1158)
        at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1518)
        at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1483)
        at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1400)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1158)
        at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:330)
        at org.apache.tika.fork.ForkObjectInputStream.sendObject(ForkObjectInputStream.java:84)
        at org.apache.tika.fork.ForkClient.sendObject(ForkClient.java:135)
        at org.apache.tika.fork.ForkClient.call(ForkClient.java:108)
        at org.apache.tika.fork.ForkParser.parse(ForkParser.java:120)
        ... 6 more

This is because Tika tries to serialize the ForkParser stored in the
ParseContext. I worked around this by introducing

    private void setContextParser(ParseContext context) {
        Parser p = parser;
        if (parser instanceof ForkParser) {
            // Requires exposing the wrapped parser through a getParser()
            // accessor in ForkParser.
            p = ((ForkParser) parser).getParser();
        }
        context.set(Parser.class, p);
    }

and modifying parseToString(...) with:

            ParseContext context = new ParseContext();
            setContextParser(context);

So there may be a bug here.
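
For what it's worth, the serialization failure can be reproduced
without Tika at all. The sketch below is only an illustration of what
the stack trace suggests is happening: LoaderHoldingParser is a
hypothetical stand-in for a Serializable parser that keeps a
ClassLoader field, and the HashMap plays the role of the map inside
ParseContext.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

final class ClassLoaderSerializationDemo {

    /** Hypothetical stand-in for a Serializable parser holding a ClassLoader. */
    static final class LoaderHoldingParser implements Serializable {
        private static final long serialVersionUID = 1L;
        private final ClassLoader loader;

        LoaderHoldingParser(ClassLoader loader) {
            this.loader = loader;
        }
    }

    /**
     * Serializes the value into memory. Returns the name of the class that
     * blocked serialization, or null if the value serialized cleanly.
     */
    static String firstNonSerializable(Object value) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(value);
            return null;
        } catch (NotSerializableException e) {
            // The exception message is the name of the offending class.
            return e.getMessage();
        }
    }

    public static void main(String[] args) throws IOException {
        // The map itself is Serializable, but one of its values drags in a ClassLoader.
        Map<String, Object> context = new HashMap<>();
        context.put("parser", new LoaderHoldingParser(ClassLoader.getSystemClassLoader()));
        System.out.println("blocked by: " + firstNonSerializable(context));
    }
}
```

The class name printed depends on the JVM in use, so it will not
necessarily match the sun.misc.Launcher$AppClassLoader in my trace.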

This gets rid of the exception, but Tika then reports no error at all
when parsing. It simply parses nothing and returns normally.

Any ideas?

Jerome
