[ 
https://issues.apache.org/jira/browse/TIKA-456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12885504#action_12885504
 ] 

Jukka Zitting commented on TIKA-456:
------------------------------------

Sounds reasonable, though it would be good if we could somehow prevent a 
runaway thread from continuing to interact with the client application (reading 
the stream, sending SAX events, modifying metadata, accessing the parse 
context) after the TimeoutParser.parse() method has returned. Terminating a 
thread in Java is troublesome, but we could at least call interrupt() on the 
thread if it runs longer than expected. Alternatively (or in addition) we 
could add wrappers around the parse() arguments so that we can disconnect a 
runaway thread from the client-visible objects.
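
To make the two ideas above concrete, here is a rough sketch using only 
java.util.concurrent and java.io. The names DetachableInputStream and 
InterruptingTimeout are hypothetical and not part of Tika; only the input 
stream is wrapped here, and the content handler, metadata and parse context 
would need analogous wrappers.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical wrapper: forwards reads until detach() is called, then fails. */
class DetachableInputStream extends InputStream {
    private final InputStream delegate;
    private volatile boolean detached = false;

    DetachableInputStream(InputStream delegate) {
        this.delegate = delegate;
    }

    /** Cuts the worker thread off from the client's stream. */
    void detach() {
        detached = true;
    }

    @Override
    public int read() throws IOException {
        if (detached) {
            throw new IOException("stream detached after timeout");
        }
        return delegate.read();
    }
}

/** Hypothetical helper, not a Tika class. */
class InterruptingTimeout {
    /**
     * Runs the task on a worker thread. On timeout, interrupts the worker
     * and detaches the stream so a runaway thread can no longer read it.
     */
    static <T> T run(Callable<T> task, DetachableInputStream stream,
                     long timeout, TimeUnit unit) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            future.cancel(true); // delivers Thread.interrupt() to the worker
            stream.detach();     // later reads by the worker now fail
            throw e;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

Note that interrupt() is only a request: a parser stuck in a tight loop that 
never checks the interrupt flag will keep running, which is why the wrapper 
is still useful. A read already in progress when detach() is called may still 
complete, so this narrows the window rather than closing it entirely.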

See also TIKA-416 for a more heavyweight alternative that'll allow us to 
isolate the parsing process even more completely.


> Support timeouts for parsers
> ----------------------------
>
>                 Key: TIKA-456
>                 URL: https://issues.apache.org/jira/browse/TIKA-456
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Ken Krugler
>            Assignee: Chris A. Mattmann
>
> There are a number of reasons why Tika could hang while parsing. One common 
> case is when a parser is fed an incomplete document, such as what happens 
> when limiting the amount of data fetched during a web crawl.
> One solution is to create a TikaCallable that wraps the Tika parser, and 
> then use this with a FutureTask. For example, when using a ParsedDatum POJO 
> for the results of the parse operation, I do something like this:
>     parser = new AutoDetectParser();
>     Callable<ParsedDatum> c = new TikaCallable(parser, contenthandler,
>             inputstream, metadata);
>     FutureTask<ParsedDatum> task = new FutureTask<ParsedDatum>(c);
>     Thread t = new Thread(task);
>     t.start();
>     ParsedDatum result = task.get(MAX_PARSE_DURATION, TimeUnit.SECONDS);
> And TikaCallable() looks like:
> class TikaCallable implements Callable<ParsedDatum> {
>     public TikaCallable(Parser parser, ContentHandler handler,
>             InputStream is, Metadata metadata) {
>         _parser = parser;
>         _handler = handler;
>         _input = is;
>         _metadata = metadata;
>         ...
>     }
>     public ParsedDatum call() throws Exception {
>         ....
>         _parser.parse(_input, _handler, _metadata, new ParseContext());
>         ....
>     }
> }
> This seems like it would be generally useful, as I doubt that we'd ever be 
> able to guarantee that none of the parsers being wrapped by Tika could ever 
> hang.
> One idea is to create a TimeoutParser that wraps a regular Tika Parser. E.g. 
> something like:
>   Parser p = new TimeoutParser(new AutoDetectParser(), 20, TimeUnit.SECONDS);
> Then the call to p.parse(...) would create a Callable (similar to the code 
> above) and use the specified timeout when calling task.get().
> One minus with this approach is that it creates a new thread for each parse 
> request, but I don't think the thread overhead is significant when compared 
> to the typical parser operation.
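
For illustration, the decorator described in the issue might look roughly 
like the sketch below. To keep it self-contained it does not use Tika at all: 
SimpleParser is a hypothetical stand-in for Tika's Parser interface, and 
TimeoutSimpleParser is likewise an invented name, so treat this as the shape 
of the idea rather than a proposed implementation.

```java
import java.io.InputStream;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical stand-in for a parser; NOT Tika's Parser interface. */
interface SimpleParser {
    String parse(InputStream input) throws Exception;
}

/** Decorator that bounds each parse() call, using one new thread per request. */
class TimeoutSimpleParser implements SimpleParser {
    private final SimpleParser delegate;
    private final long timeout;
    private final TimeUnit unit;

    TimeoutSimpleParser(SimpleParser delegate, long timeout, TimeUnit unit) {
        this.delegate = delegate;
        this.timeout = timeout;
        this.unit = unit;
    }

    @Override
    public String parse(InputStream input) throws Exception {
        Callable<String> call = () -> delegate.parse(input);
        FutureTask<String> task = new FutureTask<>(call);
        Thread worker = new Thread(task, "parse-worker");
        worker.setDaemon(true); // a hung parse must not keep the JVM alive
        worker.start();
        try {
            return task.get(timeout, unit);
        } catch (TimeoutException e) {
            task.cancel(true); // best effort: interrupts the worker thread
            throw e;
        } catch (ExecutionException e) {
            // Unwrap so callers see the delegate's own exception.
            Throwable cause = e.getCause();
            if (cause instanceof Exception) {
                throw (Exception) cause;
            }
            throw e;
        }
    }
}
```

Usage would then mirror the one-liner in the issue, for example 
new TimeoutSimpleParser(someParser, 20, TimeUnit.SECONDS).parse(stream). 
Marking the worker as a daemon thread is a design choice made here, not 
something the issue specifies.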

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
