Hi

I'd like to summarize a discussion I had w/ Robert and Mike last night on
IRC, about the parallelism of tasks in Benchmark:

For some reason, ever since parallel tasks were introduced, when I run 'ant
test' from the contrib/benchmark folder (or the root), the tests just hang
at some point, after WriteLineDocTaskTest finishes. What's very weird is
that it seems I'm the only one experiencing this, and so for a long time I
thought it was just a problem w/ my environment ... until yesterday, when I
did a fresh checkout of trunk to a fresh folder and project, and the tests
still got stuck.

The thread dump does not show anything relevant to Lucene code, but rather
to Ant. The main thread is waiting on
org/apache/tools/ant/taskdefs/Parallel.spinThreads, another on
org/apache/tools/ant/taskdefs/Execute.waitFor and two others on
java/io/FileInputStream.read. But nothing is related to Lucene code,
directly. Also annoyingly, but conveniently for debugging the issue, it
happens very consistently on my machine - the test sometimes passes, but
about 90% of the runs hang.
Running w/ -Drunsequential=1 consistently succeeds.

We've explored different ways to understand the cause of the problem, and
came up with several improvements and a workaround, but unfortunately no
definite resolution:

* As a last resort, we can add a runsequential property to the benchmark
build.xml, which forces Benchmark tests to run sequentially. Since that's a
tiny package which takes a few seconds to run anyway, and parallelism
doesn't improve much (it actually runs slower, when it passes, on my
machine: parallel=15 sec, seq=11 sec), this might be acceptable.

* Move the junit temp files (such as the flag file) into the temp directory
each test uses. This is actually a good thing to do anyway (thanks Robert
for spotting that), because it avoids accidental commits of such files :)
and doesn't clutter the main environment. We did that because when I hit
Ctrl+C to stop one of the runs which hung, we got an FNFE on a junit flag
file ("file is being accessed by another process", or something like that),
and thought it might be related to the hangs I'm seeing. Anyway, multiple
JVMs attempt to access this file concurrently, which seems bad.

* Explore the JUnit Formatter code under src/test, since it uses file
locking. I disabled the locks (using NoLockFactory), but the test still
hung.

* Change common-build.xml threadsPerProcessor to '1' instead of '2'. We
think that might be a good thing to do anyway - if people run on machines
with just one CPU, threading is not expected to help much, as opposed to
running on multiple CPUs. But we don't want to enforce it on anyone, so
we're thinking of changing the default to '1' and introducing a property
'threadsPerProcessor' which users will be able to set explicitly.
** Surprisingly, when I set it to '1' or '10' (I run on a dual-core Thinkpad
W500), the test consistently passes - it just doesn't like the value '2'. At
least it passed for as long as I ran it; maybe a hang is still lurking
around the corner for me somewhere.

* We made sure the benchmark tests indeed read/write the test data files
from/to unique directories. But like I said - there is no hang in Lucene
code reported in the thread dump.
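
To make the threadsPerProcessor proposal above a bit more concrete, here is
a rough sketch in Java of the defaulting logic we have in mind. This is not
the Ant or Lucene code: the class and method names are made up for
illustration, and it assumes the worker count is simply the number of
processors times threadsPerProcessor, with '1' used when the property is
unset or invalid.

```java
public class ThreadsPerProcessorSketch {

    // Property name from the proposal; reading it as a JVM system property
    // is an assumption made only for this sketch.
    static final String PROP = "threadsPerProcessor";

    // Proposed new default ('1' instead of the current '2').
    static final int DEFAULT_THREADS_PER_PROCESSOR = 1;

    // Parse the user-supplied value, falling back to the default when it is
    // missing, not a number, or not positive.
    static int threadsPerProcessor(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return DEFAULT_THREADS_PER_PROCESSOR;
        }
        try {
            int value = Integer.parseInt(raw.trim());
            return value > 0 ? value : DEFAULT_THREADS_PER_PROCESSOR;
        } catch (NumberFormatException e) {
            return DEFAULT_THREADS_PER_PROCESSOR;
        }
    }

    // Assumed rule: total workers = processors * threadsPerProcessor.
    static int threadCount(int processors, String raw) {
        return Math.max(1, processors) * threadsPerProcessor(raw);
    }

    public static void main(String[] args) {
        int cpus = Runtime.getRuntime().availableProcessors();
        String raw = System.getProperty(PROP);
        System.out.println("processors=" + cpus
                + " threads=" + threadCount(cpus, raw));
    }
}
```

Under these assumptions, running with -DthreadsPerProcessor=2 on my
dual-core machine would give 4 workers (today's behavior), while leaving the
property unset would give 2.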

It was very late last night when we stopped, and my eyes were tired, so I
didn't summarize it right away. Robert, I hope I've captured everything we
did, if not please add.

Anyone's got any suggestions? It's unfortunate that I'm the only one running
into this problem, because whatever the suggestions are, you'll probably
need me to confirm them :). And I'm going away for 3 days (camping - no
internet ... well at least no laptop :)), so unless someone has a suggestion
within the coming few hours, we can continue that when I get back.

Shai
