[ 
https://issues.apache.org/jira/browse/TIKA-830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13176702#comment-13176702
 ] 

Jukka Zitting commented on TIKA-830:
------------------------------------

The problem here is the basic assumption that the Tika facade class makes about 
how the configured parser will use the instance passed in the ParseContext.

By default (and before we added the constructor that allows a custom parser to 
be given) the Tika facade will construct and use an AutoDetectParser based on 
all the available and/or configured format-specific parsers. Format-specific 
parsers that support embedded documents expect the ParseContext to contain a 
parser instance that they can delegate parsing tasks to, so to support parsing 
of embedded documents the Tika facade passes the configured parser instance 
through the ParseContext.
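
To make the delegation pattern concrete, here is a minimal, self-contained sketch. It does not use Tika's real classes: SketchContext, SketchParser, DetectingParser and the '|' "container format" are all hypothetical stand-ins invented for illustration. It only shows the idea of a facade-style method placing the configured parser in the context so that a container-aware parser can hand embedded parts back to it.

```java
import java.util.HashMap;
import java.util.Map;

class EmbeddedDelegationSketch {

    /** Hypothetical stand-in for a parse context: a type-keyed object registry. */
    static class SketchContext {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key) {
            // Class.cast(null) returns null, so a missing entry yields null.
            return key.cast(entries.get(key));
        }
    }

    /** Hypothetical stand-in for a parser interface. */
    interface SketchParser {
        String parse(String document, SketchContext context);
    }

    /**
     * Hypothetical composite parser. A document containing '|' is treated as
     * a container whose parts are embedded documents; anything else is text.
     */
    static class DetectingParser implements SketchParser {
        @Override
        public String parse(String document, SketchContext context) {
            if (!document.contains("|")) {
                return document.trim();
            }
            // Embedded parts are handed to whatever parser the context holds.
            SketchParser delegate = context.get(SketchParser.class);
            StringBuilder text = new StringBuilder();
            for (String part : document.split("\\|")) {
                // In this sketch, parts are skipped when no delegate is set.
                if (delegate != null) {
                    text.append(delegate.parse(part, context)).append('\n');
                }
            }
            return text.toString();
        }
    }

    /** Facade-style helper: passes the configured parser through the context. */
    static String parseToString(SketchParser parser, String document) {
        SketchContext context = new SketchContext();
        context.set(SketchParser.class, parser);
        return parser.parse(document, context);
    }
}
```

In this sketch the parser placed in the context is the same instance that the facade-style method calls, which is the arrangement that becomes a problem once that instance is a ForkParser.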

The ForkParser on the other hand assumes that anything in the ParseContext is 
serializable so that it can be sent to the forked JVM process for use from 
there. Passing a ForkParser instance to the forked JVM through the 
ParseContext could easily trigger a recursion of new JVM forks being created, 
which is why the ForkParser by design is not serializable.

I agree with Nick that the resulting error message could certainly be better, 
but I don't think it's a good idea to change the basic design of either ForkParser or 
the Tika facade class in this respect.

If we want the Tika facade class to support forked parsing, I think it would be 
better to add a separate flag for that to explicitly make the facade class 
create and use a ForkParser instance based on the configured normal Parser 
instance. However, the ForkParser is a pretty complex tool that practically 
always needs custom configuration (java command, memory limits, class loader, 
etc.), which is why I don't think we should expose it through the Tika facade 
that's mostly designed for simpler use cases.

PS. Instead of the instanceof check we now have in ForkParser (thanks for that, 
BTW!), it might be a better idea to check for errors from trying to serialize 
the ParseContext. That'll capture a much wider range of cases where a 
ForkParser instance or some other non-serializable resource is being passed to 
a forked JVM.
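
A rough sketch of such a check, using only the standard Java serialization API, might look like the following. The class and method names (SerializationProbe, findSerializationProblem, Forkable, NotForkable) are hypothetical and not part of Tika; the point is only that attempting the serialization up front and catching NotSerializableException yields a readable message naming the offending class, whatever that class happens to be.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class SerializationProbe {

    /** Hypothetical stand-in for a resource that must not cross the JVM boundary. */
    static class NotForkable { }

    /** Hypothetical stand-in for a context value that is safe to send. */
    static class Forkable implements Serializable {
        private static final long serialVersionUID = 1L;
    }

    /**
     * Tries to serialize the given object graph into a throwaway buffer.
     * Returns null on success, or a descriptive message on failure.
     */
    static String findSerializationProblem(Object context) {
        try (ObjectOutputStream out =
                new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(context);
            return null;
        } catch (NotSerializableException e) {
            // The exception message carries the name of the offending class.
            return "Not serializable: " + e.getMessage();
        } catch (IOException e) {
            return "Serialization failed: " + e;
        }
    }
}
```

Because the probe walks the whole object graph, it also reports a non-serializable value nested inside an otherwise serializable container such as a HashMap, which a single instanceof test on the top-level object would miss.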
                
> Tika.parseToString() causes ForkParser to try to serialize itself
> -----------------------------------------------------------------
>
>                 Key: TIKA-830
>                 URL: https://issues.apache.org/jira/browse/TIKA-830
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.0
>            Reporter: Jerome Lacoste
>            Priority: Blocker
>         Attachments: 
> 0005-TIKA-830-Tike.parseToString-caused-ForkParser-to-try.patch, 
> 0006-TIKA-830-refactor-tests-for-clarity.patch
>
>


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
