[ https://issues.apache.org/jira/browse/TIKA-456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13900360#comment-13900360 ]
William Palmer commented on TIKA-456:
-------------------------------------

AFAIK it's still deprecated. If a library used by Tika hangs, or Tika itself hangs, that causes my code to hang, and this is less desirable than using Thread.stop(). If there is a non-deprecated way to stop a thread, I will gladly use that approach. I am using Tika with Hadoop to parse files in web archives (93 million files), and a hang causes the map to fail, which then causes the job to fail.

> Support timeouts for parsers
> ----------------------------
>
>                 Key: TIKA-456
>                 URL: https://issues.apache.org/jira/browse/TIKA-456
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Ken Krugler
>            Assignee: Chris A. Mattmann
>
> There are a number of reasons why Tika could hang while parsing. One common
> case is when a parser is fed an incomplete document, such as what happens
> when limiting the amount of data fetched during a web crawl.
>
> One solution is to create a TikaCallable that wraps the Tika parser, and
> then use this with a FutureTask. For example, when using a ParsedDatum POJO
> for the results of the parse operation, I do something like this:
>
>     Parser parser = new AutoDetectParser();
>     Callable<ParsedDatum> c = new TikaCallable(parser, contenthandler,
>             inputstream, metadata);
>     FutureTask<ParsedDatum> task = new FutureTask<ParsedDatum>(c);
>     Thread t = new Thread(task);
>     t.start();
>     ParsedDatum result = task.get(MAX_PARSE_DURATION, TimeUnit.SECONDS);
>
> And TikaCallable looks like:
>
>     class TikaCallable implements Callable<ParsedDatum> {
>         public TikaCallable(Parser parser, ContentHandler handler,
>                 InputStream is, Metadata metadata) {
>             _parser = parser;
>             _handler = handler;
>             _input = is;
>             _metadata = metadata;
>             ...
>         }
>
>         public ParsedDatum call() throws Exception {
>             ....
>             _parser.parse(_input, _handler, _metadata, new ParseContext());
>             ....
>         }
>     }
>
> This seems like it would be generally useful, as I doubt that we'd ever be
> able to guarantee that none of the parsers being wrapped by Tika could ever
> hang.
>
> One idea is to create a TimeoutParser that wraps a regular Tika Parser. E.g.
> something like:
>
>     Parser p = new TimeoutParser(new AutoDetectParser(), 20, TimeUnit.SECONDS);
>
> Then the call to p.parse(...) would create a Callable (similar to the code
> above) and use the specified timeout when calling task.get().
>
> One downside of this approach is that it creates a new thread for each parse
> request, but I don't think the thread overhead is significant when compared
> to the typical parser operation.

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
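
For illustration only, the timeout idea from the issue description could be sketched with a plain ExecutorService, roughly as below. This is not Tika code: TimeoutRunner and its call() method are hypothetical names, and the lambdas in main() merely stand in for a real parser.parse(...) call. On timeout it uses Future.cancel(true), which only requests interruption, so a parser that never checks for interruption would keep running on its (daemon) worker thread. It therefore avoids Thread.stop() but does not fully replace it for a truly stuck library.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper (not part of Tika): runs a task with a time limit. */
final class TimeoutRunner {
    // Daemon threads, so a task that never returns cannot keep the JVM alive.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "timeout-runner");
        t.setDaemon(true);
        return t;
    });

    private TimeoutRunner() {}

    /**
     * Runs the task and waits at most the given time for its result.
     * On timeout the worker is interrupted via Future.cancel(true); a task
     * that ignores interruption keeps running in the background.
     */
    static <T> T call(Callable<T> task, long timeout, TimeUnit unit)
            throws TimeoutException, ExecutionException, InterruptedException {
        Future<T> future = POOL.submit(task);
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            future.cancel(true); // request interruption instead of Thread.stop()
            throw e;
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a parse that finishes in time.
        String ok = call(() -> "parsed", 1, TimeUnit.SECONDS);
        System.out.println(ok);

        // Stand-in for a parse that hangs longer than the allowed time.
        try {
            call(() -> { Thread.sleep(10_000); return "never"; },
                    100, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            System.out.println("timed out");
        }
    }
}
```

Reusing a cached pool instead of creating a new Thread per parse addresses the thread-creation overhead mentioned in the description. When a hang must not be allowed to leak a thread at all, running the parse in a separate process that can be killed is the usual alternative.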