Hi Kirby & others,
On Jan 31, 2011, at 4:39pm, Kirby Bohling wrote:
On Sat, Jan 29, 2011 at 9:03 AM, Ken Krugler
<[email protected]> wrote:
Some comments below.
On Jan 29, 2011, at 5:55am, Julien Nioche wrote:
Hi,
This shows the state of the various threads within a Java process.
Most of them seem to be busy parsing zip archives with Tika. The
interesting part is that the main thread is at the Generation step:
    at org.apache.nutch.crawl.Generator.generate(Generator.java:431)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:127)
with the "Thread-415331" normalizing the URLs as part of the generation.
So why do we see threads busy parsing these archives? I think this is
a result of the timeout mechanism
(https://issues.apache.org/jira/browse/NUTCH-696) used for the parsing.
Before it, the parsing step could loop on a single document and never
complete. Thanks to Andrzej's patch, the parsing is done in separate
threads, which are abandoned if more than X seconds have passed
(default 30, I think). Obviously these threads are still lurking around
in the background and consuming CPU.
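To illustrate why an abandoned thread keeps burning CPU, here is a minimal Java sketch. It is not the NUTCH-696 patch itself; the class name, the method names and the 200 ms timeout are made up for the demo, and the "parse" is just a loop that never checks for interruption. The caller gets control back after the timeout, but the worker carries on.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicLong;

public class AbandonedParseDemo {
    // Exists only so this demo can stop its own runaway task at the end.
    static volatile boolean demoOver = false;

    /** Runs the task; gives up and returns null once timeoutMs has passed. */
    static <T> T runWithTimeout(ExecutorService pool, Callable<T> task, long timeoutMs) {
        Future<T> future = pool.submit(task);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // only sets the worker's interrupt flag
            return null;
        } catch (InterruptedException e) {
            future.cancel(true);
            Thread.currentThread().interrupt(); // hand the request back to our caller
            return null;
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }

    /** Returns true if the timed-out task is still making progress afterwards. */
    static boolean workerKeepsRunningAfterTimeout() {
        demoOver = false;
        ExecutorService pool = Executors.newSingleThreadExecutor();
        AtomicLong iterations = new AtomicLong();
        // Hypothetical stand-in for a parser stuck in a loop: it never checks
        // the interrupt flag, so cancel(true) has no effect on it.
        Callable<String> stuckParse = () -> {
            while (!demoOver) {
                iterations.incrementAndGet();
            }
            return "done";
        };
        try {
            String result = runWithTimeout(pool, stuckParse, 200);
            long before = iterations.get();
            Thread.sleep(100);
            long after = iterations.get();
            return result == null && after > before;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            demoOver = true; // a real runaway parse has no such switch
            pool.shutdown();
        }
    }
}
```

In a long-lived process every timed-out document can leave one such thread behind, which would be consistent with the thread dump described above.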
This is an issue only when calling the Crawl command. When using the
separate commands for the various steps, the runaway threads die with
the main process; however, since the Crawl command uses a single
process, these timed-out threads keep going.
I am not an expert in multithreading and don't know whether these
threads could be killed somehow. Andrzej, any clue?
This is a fundamental problem with run-away threads - there is no safe,
reliable way to kill them off.
And if you parse enough documents, you will run into a number that
currently cause Tika to hang. Zip files for sure, but we ran into the
same issue with FLV files.
Over in Tika-land, Jukka has a patch that fires up a child JVM and runs
parsers there. See https://issues.apache.org/jira/browse/TIKA-416
-- Ken
All,
Just an observation, but the general approach to this problem is to
use Thread.interrupt(). Virtually all code in the JDK treats the
thread being interrupted as a request to cancel. Java Concurrency in
Practice (JCIP) has a whole chapter on this topic (Chapter 7). IMHO,
any general purpose library code that swallows "InterruptedException"
and isn't implementing the Thread cancellation policy has a bug in it
(the cancellation policy can only be implemented by the owner of the
thread, unless the library is a task/thread library it cannot be
implementing the cancellation policy). Any place you see:
[snip]
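As a general illustration of the point about not swallowing InterruptedException (this is not the code snipped from the message above), here is a small Java sketch. The class and method names are hypothetical. The helper does not own the thread, so instead of discarding the exception it restores the interrupt flag and lets the thread's owner decide what to do.

```java
public class InterruptAwareTask {

    /**
     * Hypothetical library-style helper: waits for the given time, but does
     * not decide what an interrupt means. Returns false if it was cut short.
     */
    static boolean pauseQuietly(long millis) {
        try {
            Thread.sleep(millis);
            return true;
        } catch (InterruptedException e) {
            // sleep() cleared the flag when it threw; set it again so the
            // owner of this thread can still see the cancellation request.
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Returns true if the interrupt is still visible after the helper returns. */
    static boolean interruptIsPreserved() {
        Thread.currentThread().interrupt();          // simulate a cancel request
        boolean completed = pauseQuietly(1000);      // returns at once, flag is set
        boolean flagStillSet = Thread.interrupted(); // reads and clears the flag
        return !completed && flagStillSet;
    }
}
```

Had the catch block been left empty, the flag would be gone and the caller could not tell that a cancellation had been requested.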
One exception is that socket read/write operations don't operate this
way: the socket must be closed to interrupt a read/write. The approach
JCIP suggests is to tie the socket and thread together in such a way
that interrupt() closes the sockets that would be reading/writing
inside that thread.
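A rough Java sketch of that idea follows: a thread subclass whose interrupt() also closes the socket it reads from. The class name and the loopback demo are assumptions made for illustration; JCIP's own listing differs in detail.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class SocketReaderThread extends Thread {
    private final Socket socket;

    SocketReaderThread(Socket socket) {
        this.socket = socket;
    }

    @Override
    public void interrupt() {
        try {
            socket.close(); // makes a read() blocked on this socket throw
        } catch (IOException ignored) {
            // nothing useful to do; still deliver the normal interrupt below
        } finally {
            super.interrupt();
        }
    }

    @Override
    public void run() {
        try {
            InputStream in = socket.getInputStream();
            byte[] buf = new byte[1024];
            while (in.read(buf) != -1) {
                // process the bytes here
            }
        } catch (IOException e) {
            // socket was closed (e.g. by interrupt()); let the thread exit
        }
    }

    /** Demo: the peer never writes, so only interrupt() can end the read. */
    static boolean interruptUnblocksRead() {
        try (ServerSocket server = new ServerSocket(0)) {
            Socket client = new Socket(InetAddress.getLoopbackAddress(),
                                       server.getLocalPort());
            Socket peer = server.accept();
            SocketReaderThread reader = new SocketReaderThread(client);
            reader.start();
            Thread.sleep(200);   // give the reader time to block in read()
            reader.interrupt();  // closes the socket, then sets the flag
            reader.join(2000);
            peer.close();
            return !reader.isAlive();
        } catch (IOException e) {
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

The same pattern should carry over to aborting an HTTP request, provided the code can get hold of the underlying socket or connection object.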
Excellent input, as I need to solve some issues around aborting HTTP
requests.
[snip]
Not sure exactly what the problems inside of Tika are, but getting it
to respect interruption would be a wonderful thing for everybody that
uses it. The problem might be getting all underlying libraries it
uses to do so.
Yes, that's exactly the issue in the cases I've seen. The libraries
used to do the actual parsing can get caught in loops when processing
unexpected data. There are no checks for interruption: it's code that
is walking some data structure and doesn't realize that it's in a loop
(e.g. the offset to the next chunk is set to zero, so the same chunk is
endlessly reprocessed).
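For what it's worth, here is a small Java sketch of what an interruption check in such a loop could look like. The "chunk table" is a made-up format (an array where each entry holds the offset of the next chunk, -1 marking the end), not the layout of any real file type or parsing library.

```java
public class ChunkWalker {

    /**
     * Counts chunks in a hypothetical table where nextOffset[i] is the offset
     * of the chunk that follows chunk i, and -1 marks the last chunk.
     */
    static int countChunks(int[] nextOffset) throws InterruptedException {
        int count = 0;
        int offset = 0;
        while (offset != -1) {
            // Cooperative cancellation: without this check, a corrupt table
            // that points back at itself would keep the loop going forever.
            if (Thread.interrupted()) {
                throw new InterruptedException("chunk walk cancelled");
            }
            count++;
            offset = nextOffset[offset];
        }
        return count;
    }

    /** Demo: a table whose only chunk points to offset zero, i.e. to itself. */
    static boolean corruptWalkCanBeCancelled() {
        final int[] corrupt = {0};
        final boolean[] cancelled = {false};
        Thread walker = new Thread(() -> {
            try {
                countChunks(corrupt);
            } catch (InterruptedException e) {
                cancelled[0] = true;
            }
        });
        walker.start();
        try {
            Thread.sleep(50);    // let the walker get into its loop
            walker.interrupt();
            walker.join(2000);   // join() makes cancelled[0] visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return cancelled[0] && !walker.isAlive();
    }
}
```

A well-formed table such as `{1, 2, -1}` is walked normally and yields 3. The catch, as noted above, is that every loop in every underlying library would need a check like this.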
Occasionally we can get the underlying libraries to fix issues, but
each new release has the potential for new and exciting hangs.
That's why Jukka went down the admittedly hard-core and heavy-weight
path of providing an option to run parses in a child JVM.
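As a rough idea of what the child-JVM approach involves, here is a generic Java sketch. It is not the TIKA-416 patch; the class and method names are made up, and it only shows the core mechanism: start a second JVM, wait with a timeout, and kill the process if it does not finish. Unlike a thread, a process can be killed reliably from outside.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

public class ChildJvmRunner {

    /**
     * Starts a new JVM with the given arguments. Returns its exit code, or
     * null if it had to be killed (timeout reached or caller interrupted).
     */
    static Integer runInChildJvm(List<String> jvmArgs, long timeoutMs) {
        // Reuse the java binary of the JVM we are currently running in.
        String javaBin = System.getProperty("java.home")
                + File.separator + "bin" + File.separator + "java";
        List<String> command = new ArrayList<>();
        command.add(javaBin);
        command.addAll(jvmArgs);

        Process child;
        try {
            child = new ProcessBuilder(command).inheritIO().start();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        try {
            if (child.waitFor(timeoutMs, TimeUnit.MILLISECONDS)) {
                return child.exitValue();
            }
            child.destroyForcibly(); // hung child: kill the whole process
            child.waitFor();
            return null;
        } catch (InterruptedException e) {
            child.destroyForcibly();
            Thread.currentThread().interrupt();
            return null;
        }
    }
}
```

In practice the arguments would be something like `-cp <classpath> <parser main class> <file>`, where the main class is whatever entry point wraps the parser. The cost is JVM startup time plus getting the parse results back across the process boundary, which is why this is the heavy-weight option.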
If there's another solution, we'd love to hear about it :)
Thanks,
-- Ken
--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g