Re: nutch crawl command takes 98% of cpu

Kirby Bohling Mon, 31 Jan 2011 16:40:46 -0800

On Sat, Jan 29, 2011 at 9:03 AM, Ken Krugler
<[email protected]> wrote:
> Some comments below.
>
> On Jan 29, 2011, at 5:55am, Julien Nioche wrote:
>
>> Hi,
>>
>> This shows the state of the various threads within a Java process. Most of
>> them seem to be busy parsing zip archives with Tika. The interesting part
>> is that the main thread is at the Generation step:
>>
>> *  at org.apache.nutch.crawl.Generator.generate(Generator.java:431)
>>  at org.apache.nutch.crawl.Crawl.main(Crawl.java:127)
>> *
>> with the "Thread-415331" normalizing the URLs as part of the generation.
>>
>> So why do we see threads busy at parsing these archives? I think this is a
>> result of the Timeout mechanism (
>> https://issues.apache.org/jira/browse/NUTCH-696) used for the parsing.
>> Before it, we used to have the parsing step loop on a single document and
>> never complete. Thanks to Andrzej's patch, the parsing is done in separate
>> threads which are abandoned if more than X seconds have passed (default 30,
>> I think). Obviously these threads are still lurking around in the
>> background and consuming CPU.
>>
>> This is an issue when calling the Crawl command only. When using the
>> separate commands for the various steps, the runaway threads die with the
>> main process, however since the Crawl uses a single process, these timeout
>> threads keep going.
>>
>> Am not an expert in multithreading and don't have an idea of whether these
>> threads could be killed somehow. Andrzej, any clue?
>
> This is a fundamental problem with run-away threads - there is no safe,
> reliable way to kill them off.
>
> And if you parse enough documents, you will run into a number that currently
> cause Tika to hang. Zip files for sure, but we ran into the same issue with
> FLV files.
>
> Over in Tika-land, Jukka has a patch that fires up a child JVM and runs
> parsers there. See https://issues.apache.org/jira/browse/TIKA-416
>
> -- Ken
>


All,

  Just an observation, but the general approach to this problem is to
use Thread.interrupt().  Virtually all code in the JDK treats the
thread being interrupted as a request to cancel.  Java Concurrency in
Practice (JCIP) has a whole chapter on this topic (Chapter 7).  IMHO,
any general-purpose library code that swallows InterruptedException
without implementing the thread's cancellation policy has a bug in it.
Only the owner of the thread can implement that policy, so unless the
library is itself a task/thread library, it cannot be doing so.  Any
place you see:

catch (InterruptedException ex) {
// Ignore
}

just plan on having a hard-to-track-down bug at some point in the
future.  At the very least, reset the interruption status like so:

catch (InterruptedException ex) {
   // Reset the interruption status to avoid losing the cancellation request.
   Thread.currentThread().interrupt();
   // Twiddle any state necessary to bail out in a timely manner...
}
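
For what it's worth, here is a minimal sketch of how the restored
status lets the thread's owner actually stop the work.  The class and
method names (RestoreInterruptDemo, pause, workerStopsAfterInterrupt)
are made up for illustration and are not from Nutch or Tika:

```java
import java.util.concurrent.CountDownLatch;

class RestoreInterruptDemo {

    /** Library-style helper that cannot propagate InterruptedException. */
    static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException ex) {
            // Restore the status so the thread's owner still sees the request.
            Thread.currentThread().interrupt();
        }
    }

    /** Starts a worker, interrupts it, and reports whether it stopped in time. */
    static boolean workerStopsAfterInterrupt() {
        CountDownLatch started = new CountDownLatch(1);
        Thread worker = new Thread(() -> {
            started.countDown();
            // The owner's cancellation policy: stop once interrupted.
            while (!Thread.currentThread().isInterrupted()) {
                pause(50);
            }
        });
        worker.setDaemon(true);
        worker.start();
        try {
            started.await();
            worker.interrupt();
            worker.join(2000);
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
        }
        return !worker.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("worker stopped: " + workerStopsAfterInterrupt());
    }
}
```

If pause() swallowed the exception instead, the loop condition would
never become true and the worker would keep spinning.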

  The problem with using the interruption status for cancellation is
that it fails if a bug anywhere in any library swallows the
InterruptedException (in many ways it is similar to a data race).
This is a fundamental problem with threading: there is no way to share
a memory space and still have a reliable cancel that a bug can't
subvert.  An infinite loop while holding a lock is the canonical
example of the problem; killing the thread could leave an invariant
broken.

  One simple way, if you control the creation of Threads, is to
override Thread.interrupt() to record that it was called (and thus
that cancellation of the thread/work was requested).  At the top of
the outermost loop, check whether the cancel flag is set and, if so,
bail out.  That assumes you do at some point get back to the top of
the loop.  If you're stuck in an inner loop, fix that inner loop to
respect cancellation/interruption.
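
A rough sketch of that idea, assuming you create the thread yourself.
CancellableThread, sloppyLibraryCall and demo are hypothetical names;
the "library" call deliberately swallows the interruption to show that
the recorded flag still gets the loop to exit:

```java
class CancellableThread extends Thread {
    // Set by interrupt(); survives even if library code swallows the exception.
    private volatile boolean cancelRequested = false;

    @Override
    public void interrupt() {
        cancelRequested = true;
        super.interrupt();
    }

    boolean isCancelRequested() {
        return cancelRequested;
    }

    @Override
    public void run() {
        // Check the flag at the top of the outermost loop.
        while (!cancelRequested) {
            sloppyLibraryCall();
        }
    }

    /** Stand-in for buggy library code that swallows the interruption. */
    private static void sloppyLibraryCall() {
        try {
            Thread.sleep(20);
        } catch (InterruptedException ex) {
            // Ignore (this is the bug being worked around)
        }
    }

    /** Starts the thread, interrupts it, and reports whether it exited. */
    static boolean demo() {
        CancellableThread t = new CancellableThread();
        t.setDaemon(true);
        t.start();
        try {
            Thread.sleep(100);
            t.interrupt();
            t.join(2000);
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
        }
        return t.isCancelRequested() && !t.isAlive();
    }
}
```

This only helps when each call inside the loop eventually returns; it
does nothing for a call that never comes back.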

  There are several gotchas in dealing with interruptions.  Most
blocking APIs in Java respect cancellation: they throw
InterruptedException if the thread is already interrupted rather than
start a potentially blocking operation, and they wake up and throw the
exception if interrupted in the middle of one.  One exception is
socket read/write operations, which don't work this way: the socket
must be closed to interrupt a read/write.  The approach JCIP suggests
is to tie the socket and thread together so that interrupt() closes
the sockets being read or written inside that thread.
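
Something along these lines, written from memory in the spirit of the
JCIP example rather than copied from it.  SocketReaderThread and demo
are names I made up, and the demo uses a loopback connection purely so
there is something to block on:

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

class SocketReaderThread extends Thread {
    private final Socket socket;

    SocketReaderThread(Socket socket) {
        this.socket = socket;
    }

    @Override
    public void interrupt() {
        try {
            socket.close(); // unblocks a read() pending on this socket
        } catch (IOException ignored) {
            // Nothing useful to do; still deliver the interrupt below.
        } finally {
            super.interrupt();
        }
    }

    @Override
    public void run() {
        byte[] buf = new byte[1024];
        try (InputStream in = socket.getInputStream()) {
            while (in.read(buf) >= 0) {
                // A real reader would process the bytes here.
            }
        } catch (IOException ex) {
            // Socket closed by interrupt() or a genuine I/O failure: exit.
        }
    }

    /** Blocks a reader on an idle loopback socket, then interrupts it. */
    static boolean demo() {
        try (ServerSocket server =
                 new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client =
                 new Socket(server.getInetAddress(), server.getLocalPort());
             Socket accepted = server.accept()) {
            SocketReaderThread reader = new SocketReaderThread(accepted);
            reader.setDaemon(true);
            reader.start();
            Thread.sleep(200); // give the reader time to block in read()
            reader.interrupt();
            reader.join(2000);
            return !reader.isAlive();
        } catch (IOException ex) {
            return false;
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```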

I believe the NIO code does respect interruption as long as the
Channel is an InterruptibleChannel, which the stock network
implementations should be.  Selector.select() does not throw
InterruptedException; to unblock it deliberately, call wakeup() on the
Selector, analogous to closing the socket.
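
A small sketch of unblocking a select() with wakeup(), in case it
helps.  SelectorWakeupDemo and demo are hypothetical names, and the
selector has no channels registered, so the only thing that can end
the select() here is the wakeup() call:

```java
import java.io.IOException;
import java.nio.channels.Selector;

class SelectorWakeupDemo {

    /** Blocks a thread in select(), wakes it up, and reports whether it exited. */
    static boolean demo() {
        try (Selector selector = Selector.open()) {
            Thread t = new Thread(() -> {
                try {
                    selector.select(); // blocks: nothing is registered
                } catch (IOException ex) {
                    // Treat an I/O failure as a reason to exit as well.
                }
            });
            t.setDaemon(true);
            t.start();
            Thread.sleep(200);
            // If select() has not started yet, the next select() returns at once.
            selector.wakeup();
            t.join(2000);
            return !t.isAlive();
        } catch (IOException ex) {
            return false;
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```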

Not sure exactly what the problems inside of Tika are, but getting it
to respect interruption would be a wonderful thing for everybody that
uses it.  The problem might be getting all underlying libraries it
uses to do so.

Kirby
