Hi,

I opened a couple of issues to report some parser instability:
https://issues.apache.org/jira/browse/TIKA-815
https://issues.apache.org/bugzilla/show_bug.cgi?id=52372
https://issues.apache.org/bugzilla/show_bug.cgi?id=52373
https://issues.apache.org/jira/browse/COMPRESS-169

TIKA-815 is the umbrella issue: it points out that Tika could use a few more tests to ensure that the underlying parsers are more robust. Because Tika exposes a general parser interface, this kind of stress testing can be applied to all parsers, which may be a good idea. The code is simple and available on GitHub. Feedback appreciated.

Now a question that pertains more to the user list. In TIKA-815, Nick pointed out that one could use ForkParser to improve stability. I didn't manage to get it to work.

When I use the command-line Tika app, e.g. with

    java -jar /tmp/tika-app-1.0.jar -v -t -f brokenFile.doc

then Tika reports nothing. But if I try to reproduce something similar programmatically I run into strange errors, first because my current class loader isn't serializable and the client tries to serialize it:

org.apache.tika.exception.TikaException: Failed to communicate with a forked parser process. The process has most likely crashed due to some error like running out of memory. A new process will be started for the next parsing request.
    at org.apache.tika.fork.ForkParser.parse(ForkParser.java:123)
    at org.apache.tika.Tika.parseToString(Tika.java:380)
    at org.apache.tika.Tika.parseToString(Tika.java:414)
    at no.finntech.tika.harderner.TikaIndexerHardenerTest.parseContent(TikaIndexerHardenerTest.java:142)
    at no.finntech.tika.harderner.TikaIndexerHardenerTest.flipBitAndIndexContent(TikaIndexerHardenerTest.java:125)
    at no.finntech.tika.harderner.TikaIndexerHardenerTest.originalFileIndexesProperly4(TikaIndexerHardenerTest.java:69)
    at no.finntech.tika.harderner.TikaIndexerHardenerTest.main(TikaIndexerHardenerTest.java:170)
Caused by: java.io.NotSerializableException: sun.misc.Launcher$AppClassLoader
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1164)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1518)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1483)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1400)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1158)
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:330)
    at java.util.HashMap.writeObject(HashMap.java:1001)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:945)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1469)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1400)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1158)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1518)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1483)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1400)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1158)
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:330)
    at org.apache.tika.fork.ForkObjectInputStream.sendObject(ForkObjectInputStream.java:84)
    at org.apache.tika.fork.ForkClient.sendObject(ForkClient.java:135)
    at org.apache.tika.fork.ForkClient.call(ForkClient.java:108)
    at org.apache.tika.fork.ForkParser.parse(ForkParser.java:120)
    ... 6 more

This is because Tika tries to serialize the ForkParser stored in the ParseContext. I solved this by introducing

    private void setContextParser(ParseContext context) {
        Parser p = parser;
        if (parser instanceof ForkParser) {
            // requires exposing the parser in ForkParser
            p = ((ForkParser) parser).getParser();
        }
        context.set(Parser.class, p);
    }

and modifying parseToString(...) with:

    ParseContext context = new ParseContext();
    setContextParser(context);

So there may be a bug here. This change gets rid of the exception, but Tika then reports no error at all when parsing: it just doesn't parse anything and returns gracefully.

Any ideas?

Jerome
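
P.S. A minimal, Tika-free sketch of the first failure, in case it helps to reproduce it in isolation. It assumes (going by HashMap.writeObject in the trace above) that the context ends up as a map whose values get Java-serialized. The class and method names (SerializationProbe, findNonSerializable) are made up for illustration and are not part of Tika.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Tika: checks whether an object graph
// survives plain Java serialization.
class SerializationProbe {

    /**
     * Serializes the candidate into an in-memory buffer.
     * Returns null on success, or the name of the first class that is
     * not serializable (the NotSerializableException message).
     */
    static String findNonSerializable(Object candidate) {
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(candidate);
            return null;
        } catch (NotSerializableException e) {
            return e.getMessage();
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrow unchecked.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for the context: a map with only serializable values.
        Map<String, Object> context = new HashMap<>();
        context.put("some.key", "a plain string");
        System.out.println("strings only: " + findNonSerializable(context));

        // Anything that drags a class loader along breaks serialization.
        context.put("loader", ClassLoader.getSystemClassLoader());
        System.out.println("with loader:  " + findNonSerializable(context));
    }
}
```

The second line printed by main names the class loader implementation of the running JVM, so the exact class name depends on the Java version. Running such a check on whatever is placed in the context before handing it to the forked process may help narrow down which entry pulls in the class loader.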