Hi, I'd like to summarize a discussion I had with Robert and Mike last night on IRC about the parallelism of tasks in Benchmark:
For some reason, ever since parallel tasks were introduced, when I run 'ant test' from the contrib/benchmark folder (or the root), the tests just hang at some point after WriteLineDocTaskTest finishes. What's very weird is that I seem to be the only one experiencing this, so for a long time I thought it was just a problem with my environment ... until yesterday, when I did a fresh checkout of trunk to a fresh folder and project, and the tests still got stuck.

The thread dump does not show anything related to Lucene code, only to Ant. The main thread is waiting on org/apache/tools/ant/taskdefs/Parallel.spinThreads, another on org/apache/tools/ant/taskdefs/Execute.waitFor, and two others on java/io/FileInputStream.read. Nothing is directly related to Lucene code.

Annoyingly, but conveniently for debugging, it happens very consistently on my machine - sometimes the test passes, but it hangs in 90% of the runs. Running with -Drunsequential=1 consistently succeeds.

We explored different ways to understand the cause of the problem and came up with several improvements and a workaround, but unfortunately no definite resolution:

* As a last resort, we can add a runsequential property to the benchmark build.xml, which forces the Benchmark tests to run sequentially. Since it's a tiny package which takes a few seconds to run anyway, and parallelism doesn't improve much (on my machine it actually runs slower when it passes: parallel=15 sec, seq=11 sec), this might be acceptable.
* Move the JUnit temp files (such as the flag file mentioned below) into the temp directory each test uses. This is a good thing to do anyway (thanks Robert for spotting that), because it avoids accidental commits of such files :) and doesn't clutter the main environment. We did that because when I hit Ctrl+C to stop one of the runs that hung, we got a FileNotFoundException (FNFE) on a JUnit flag file, with a message like "file is being accessed by another process", and thought it was related to the hangs I'm seeing. Either way, multiple JVMs attempt to access this file concurrently, which seems bad.
* Explore the JUnit Formatter code under src/test, since it uses file locking. I disabled the locks (using NoLockFactory), but the test still hung.
* Change threadsPerProcessor in common-build.xml to '1' instead of '2'. We think that might be a good thing to do anyway - if people run on machines with just one CPU, threading is not expected to help much, as opposed to running on multiple CPUs. But we don't want to force it on anyone, so we're thinking of changing the default to '1' and introducing a 'threadsPerProcessor' property which users can set explicitly.
** Surprisingly, when I set it to '1' or '10' (I run on a dual-core Thinkpad W500), the test consistently passes - it just doesn't like the value '2'. At least, it passed every time I ran it; maybe a hang is still lurking around the corner somewhere.
* We made sure the benchmark tests indeed read/write the test data files from/to unique directories. But like I said, the thread dump reports no hang in Lucene code.

It was very late last night when we stopped, and my eyes were tired, so I didn't summarize it right away. Robert, I hope I've captured everything we did; if not, please add.

Does anyone have any suggestions? It's unfortunate that I'm the only one running into this problem, because whatever the suggestions are, you'll probably need me to confirm them :). And I'm going away for 3 days (camping - no internet ... well, at least no laptop :)), so unless someone has a suggestion within the coming few hours, we can continue when I get back.

Shai