Hi Kirby & others,

On Jan 31, 2011, at 4:39pm, Kirby Bohling wrote:

On Sat, Jan 29, 2011 at 9:03 AM, Ken Krugler
<[email protected]> wrote:
Some comments below.

On Jan 29, 2011, at 5:55am, Julien Nioche wrote:

Hi,

This shows the state of the various threads within a Java process. Most of
them seem to be busy parsing zip archives with Tika. The interesting part is
that the main thread is at the Generation step:

*  at org.apache.nutch.crawl.Generator.generate(Generator.java:431)
 at org.apache.nutch.crawl.Crawl.main(Crawl.java:127)
*
with the "Thread-415331" normalizing the URLs as part of the generation.

So why do we see threads busy parsing these archives? I think this is a
result of the timeout mechanism
(https://issues.apache.org/jira/browse/NUTCH-696) used for the parsing.
Before it, the parsing step could loop on a single document and never
complete. Thanks to Andrzej's patch, the parsing is done in separate threads,
which are abandoned if more than X seconds have passed (default 30, I think).
Obviously these threads are still lurking around in the background and
consuming CPU.
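
To illustrate the behaviour described above, here is a minimal sketch of a timeout built on Future.get(). It is not the actual NUTCH-696 code; the class and method names are hypothetical, and the busy loop stands in for a parser that never checks for interruption. The point is that cancel(true) only sets the interrupt flag, so the worker keeps running after the caller has given up on it.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical demo class, not part of Nutch or Tika.
class AbandonedParseDemo {

    // Returns true if the "parse" timed out and its thread was still running
    // after cancel(true) had been called.
    static boolean timedOutButStillRunning(long timeoutMillis) throws Exception {
        // Escape hatch so that the demo itself terminates; a real runaway parser has none.
        AtomicBoolean stop = new AtomicBoolean(false);
        AtomicBoolean finished = new AtomicBoolean(false);

        ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "parse-worker");
            t.setDaemon(true);
            return t;
        });

        // Stand-in for a parser stuck in a loop that ignores interrupts.
        Future<?> parse = pool.submit(() -> {
            while (!stop.get()) {
                // spin: no blocking call, no interrupt check
            }
            finished.set(true);
        });

        try {
            parse.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return false; // the parse completed within the timeout
        } catch (TimeoutException e) {
            parse.cancel(true);        // only sets the worker's interrupt flag
            Thread.sleep(100);         // give the worker a chance to react
            return !finished.get();    // it did not react: still burning CPU
        } finally {
            stop.set(true);            // let the demo thread exit
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("still running after timeout: " + timedOutButStillRunning(200));
    }
}
```

In a long-lived process, every such timed-out task leaves one more spinning thread behind.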

This is only an issue when calling the Crawl command. When using the
separate commands for the various steps, the runaway threads die with the
main process; since Crawl uses a single process, however, these timed-out
threads keep going.

I am not an expert in multithreading and don't know whether these
threads could be killed somehow. Andrzej, any clue?

This is a fundamental problem with run-away threads - there is no safe,
reliable way to kill them off.

And if you parse enough documents, you will run into a number that currently cause Tika to hang. Zip files for sure, but we ran into the same issue with
FLV files.

Over in Tika-land, Jukka has a patch that fires up a child JVM and runs
parsers there. See https://issues.apache.org/jira/browse/TIKA-416

-- Ken


All,

 Just an observation, but the general approach to this problem is to
use Thread.interrupt().  Virtually all code in the JDK treats the
thread being interrupted as a request to cancel.  Java Concurrency in
Practice (JCIP) has a whole chapter on this topic (Chapter 7).  IMHO,
any general purpose library code that swallows "InterruptedException"
and isn't implementing the Thread cancellation policy has a bug in it
(the cancellation policy can only be implemented by the owner of the
thread; unless the library is a task/thread library, it cannot be
implementing the cancellation policy).  Any place you see:

[snip]

One exception is that socket read/write operations don't operate this
way: the socket must be closed to interrupt a read/write. The approach
JCIP suggests is to tie the socket and thread together in such a way
that interrupt() closes the sockets that would be reading/writing
inside that thread.
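
A minimal sketch of that idea, in the spirit of the JCIP approach but not copied from the book: a thread subclass that owns a socket and overrides interrupt() to close it, which makes a blocked read() fail with an exception. The class name, the exitedOnClose flag, and the loopback demo are assumptions made for illustration only.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical example class: a reader thread whose interrupt() also closes its socket.
class SocketReaderThread extends Thread {
    private final Socket socket;
    volatile boolean exitedOnClose = false;

    SocketReaderThread(Socket socket) {
        this.socket = socket;
    }

    @Override
    public void interrupt() {
        try {
            socket.close(); // a read() blocked on this socket now throws SocketException
        } catch (IOException ignored) {
            // nothing useful to do if close itself fails
        } finally {
            super.interrupt(); // still set the normal interrupt flag
        }
    }

    @Override
    public void run() {
        try {
            InputStream in = socket.getInputStream();
            byte[] buf = new byte[1024];
            while (in.read(buf) >= 0) {
                // process the bytes that were read
            }
        } catch (IOException e) {
            exitedOnClose = true; // the socket was closed underneath us: exit the thread
        }
    }

    // Loopback demo: the peer never sends data, so read() blocks until interrupt().
    static boolean demo() throws Exception {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket server = new ServerSocket(0, 1, loopback)) {
            Socket client = new Socket(loopback, server.getLocalPort());
            try (Socket accepted = server.accept()) {
                SocketReaderThread reader = new SocketReaderThread(client);
                reader.start();
                Thread.sleep(200);   // let the reader block in read()
                reader.interrupt();  // closes the socket and sets the flag
                reader.join(2000);
                return !reader.isAlive() && reader.exitedOnClose;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("reader stopped by interrupt(): " + demo());
    }
}
```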

Excellent input, as I need to solve some issues with aborting HTTP requests.

[snip]

Not sure exactly what the problems inside of Tika are, but getting it
to respect interruption would be a wonderful thing for everybody that
uses it.  The problem might be getting all underlying libraries it
uses to do so.

Yes, that's exactly the issue in the cases I've seen. The libraries used to do the actual parsing can get caught in loops when processing unexpected data. There are no checks for interrupts; e.g. the code is walking some data structure and doesn't realize that it's in a loop (e.g. the offset to the next chunk is set to zero, so the same chunk is endlessly reprocessed).
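
As a sketch of what respecting interruption could look like in such a loop (a made-up chunk format, not code from Tika or any parser library): the walker below follows "next chunk" offsets and checks the interrupt flag on every iteration, so even corrupt data that points a chunk back at itself can be cancelled from outside. The ChunkWalker name and the int[] layout are assumptions for illustration.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical example: nextOffset[i] is the index of the chunk after chunk i, -1 ends the walk.
class ChunkWalker {

    static int walk(int[] nextOffset) throws InterruptedException {
        int visited = 0;
        int pos = 0;
        while (pos >= 0) {
            // Cooperative cancellation point: without this check, a cycle never ends.
            if (Thread.interrupted()) {
                throw new InterruptedException("parse cancelled");
            }
            visited++;
            pos = nextOffset[pos];
        }
        return visited;
    }

    // Walks corrupt data (chunk 0 points to itself) and cancels it from another thread.
    static boolean cancelsOnInterrupt() throws InterruptedException {
        int[] corrupt = {0};
        AtomicBoolean cancelled = new AtomicBoolean(false);
        Thread worker = new Thread(() -> {
            try {
                walk(corrupt);
            } catch (InterruptedException e) {
                cancelled.set(true);
            }
        });
        worker.setDaemon(true);
        worker.start();
        Thread.sleep(100);
        worker.interrupt();
        worker.join(2000);
        return cancelled.get() && !worker.isAlive();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("chunks in well-formed data: " + walk(new int[] {1, 2, -1}));
        System.out.println("corrupt data cancelled: " + cancelsOnInterrupt());
    }
}
```

The catch, as noted above, is that this check has to live inside the third-party parsing code itself.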

Occasionally we can get the underlying libraries to fix issues, but each new release has the potential for new and exciting hangs.

That's why Jukka went down the admittedly hard-core and heavy-weight path of providing an option to run parses in a child JVM.
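
For readers unfamiliar with the general idea, a rough sketch of running work in a child process and killing it on timeout follows. This is not the TIKA-416 patch; the class and method names are hypothetical, and it assumes Java 9+ for ProcessBuilder.Redirect.DISCARD. Unlike a thread, a process can be forcibly destroyed regardless of what loop it is stuck in.

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Tika or Nutch.
class ChildJvmRunner {

    // Runs the command and returns true if it exited within the timeout.
    // Otherwise the child is killed and false is returned.
    static boolean runWithTimeout(List<String> command, long timeoutMillis) throws Exception {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.redirectErrorStream(true);
        // Discard output so the child can never block on a full pipe (Java 9+).
        pb.redirectOutput(ProcessBuilder.Redirect.DISCARD);
        Process child = pb.start();
        if (child.waitFor(timeoutMillis, TimeUnit.MILLISECONDS)) {
            return true;
        }
        child.destroyForcibly(); // the OS reclaims the process, runaway loops included
        child.waitFor();
        return false;
    }

    // Path to the java launcher of the JVM that is currently running.
    static String javaBinary() {
        return System.getProperty("java.home") + File.separator + "bin" + File.separator + "java";
    }

    public static void main(String[] args) throws Exception {
        // "-version" is only a placeholder for a real parser entry point.
        boolean ok = runWithTimeout(Arrays.asList(javaBinary(), "-version"), 30_000);
        System.out.println("child finished in time: " + ok);
    }
}
```

In a real setup the command would name a main class that parses one document and writes its result somewhere the parent can read it. That transfer, plus JVM startup, is where the heavy-weight cost comes from.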

If there's another solution, we'd love to hear about it :)

Thanks,

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g




