[ https://issues.apache.org/jira/browse/TIKA-830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13176702#comment-13176702 ]
Jukka Zitting commented on TIKA-830:
------------------------------------

The problem here is the basic assumption that the Tika facade class makes about how the configured parser will use the parser instance passed in the ParseContext.

By default (and before we added the constructor that allows a custom parser to be given) the Tika facade will construct and use an AutoDetectParser based on all the available and/or configured format-specific parsers. Format-specific parsers that support embedded documents expect the ParseContext to contain a parser instance that they can delegate parsing tasks to, so to support parsing of embedded documents the Tika facade passes the configured parser instance through the ParseContext.

The ForkParser, on the other hand, assumes that anything in the ParseContext is serializable so that it can be sent to the forked JVM process for use from there. Passing a ForkParser instance to the forked JVM through the ParseContext could easily trigger a recursion of new JVM forks being created, which is why the ForkParser is by design not serializable.

I agree with Nick that the resulting error message could certainly be better, but I don't think it's a good idea to change the basic design of either ForkParser or the Tika facade class in this respect.

If we want the Tika facade class to support forked parsing, I think it would be better to add a separate flag for that, to explicitly make the facade class create and use a ForkParser instance based on the configured normal Parser instance. However, the ForkParser is a pretty complex tool that practically always needs custom configuration (java command, memory limits, class loader, etc.), which is why I don't think we should expose it through the Tika facade, which is mostly designed for simpler use cases.

PS. Instead of the instanceof check we now have in ForkParser (thanks for that, BTW!), it might be a better idea to check for errors from trying to serialize the ParseContext.
That'll capture a much wider range of cases where a ForkParser instance or some other non-serializable resource is being passed to a forked JVM.

> Tika.parseToString() causes ForkParser to try to serialize itself
> -----------------------------------------------------------------
>
>                 Key: TIKA-830
>                 URL: https://issues.apache.org/jira/browse/TIKA-830
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.0
>            Reporter: Jerome Lacoste
>            Priority: Blocker
>         Attachments: 0005-TIKA-830-Tike.parseToString-caused-ForkParser-to-try.patch, 0006-TIKA-830-refactor-tests-for-clarity.patch
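
For illustration only, the PS suggestion in the comment (detect the problem by attempting serialization rather than by an instanceof check) could look roughly like the sketch below. It uses only java.io and does not touch Tika itself: the Context and NonSerializableParser classes and the findSerializationProblem method are hypothetical stand-ins invented for this example, not ForkParser or ParseContext code.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

final class SerializationCheck {

    /** Hypothetical stand-in for a context object holding arbitrary values. */
    static class Context implements Serializable {
        private static final long serialVersionUID = 1L;
        private final Map<String, Object> entries = new HashMap<>();

        void set(String key, Object value) {
            entries.put(key, value);
        }
    }

    /** Hypothetical stand-in for a resource that is deliberately not serializable. */
    static class NonSerializableParser {
    }

    /**
     * Attempts to serialize the candidate into an in-memory buffer.
     * Returns null if serialization succeeds, otherwise a short description
     * of the failure (for NotSerializableException, the exception message).
     */
    static String findSerializationProblem(Object candidate) {
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(candidate);
            return null;
        } catch (NotSerializableException e) {
            // Covers any non-serializable object reachable from the candidate,
            // not just one specific class.
            return e.getMessage();
        } catch (IOException e) {
            return e.toString();
        }
    }

    public static void main(String[] args) {
        Context ok = new Context();
        ok.set("name", "value");
        System.out.println("plain context: " + findSerializationProblem(ok));

        Context bad = new Context();
        bad.set("parser", new NonSerializableParser());
        System.out.println("context holding parser: " + findSerializationProblem(bad));
    }
}
```

A check of this shape fails for any non-serializable value placed in the context, which is the wider coverage the PS asks for. Whether the failure is better reported before forking or while sending the context is a design question this sketch does not address.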