On 16/05/2013 15:50, David Chase wrote:
:
Parallel performance is a little harder to reason about on big x86 boxes (both
Intel and AMD), so I am leaving the threshold high. Dave Dice thought this
might be an artifact of cores being put into a power-saving mode and being slow
to wake (the particular benchmark I wrote would have been pessimal for this,
since it alternated between serial and parallel phases). The eventual speedups
were often impressive (6x-12x) but it was unclear how many hardware threads
(out of the 32-64 available) I was using to obtain this. Yes, I need to plug
this into JMH for fine-tuning. I'm using the system fork-join pool because
that initially seemed like the good-citizen thing to do (balance CRC/Adler
needs against those of anyone else who might be doing work) but I am starting
to wonder if it would make more sense to establish a small private pool with a
bounded number of threads, so that I don't need to worry about being a good
citizen so much. It occurs to me, late in the game, that using big-ish units
of work is another, different way to be a bad citizen. (I would prefer to get
this checked in if it represents a net improvement, and then work on the tuning
afterwards.)

I'm sure Doug or Brian or David Holmes will have opinions on this point
but I would think using the common pool is right. If parallel sort,
CRC32 and other specific usages each created their own thread pool then
I could imagine a lot of thread pools hanging around and competing. Plus
there are cases like EE where no-parallelism might be the right answer
and one wouldn't want to have to configure each usage.
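
A minimal sketch of the two options discussed above, assuming a plain byte sum as a stand-in for the real CRC32/Adler32 work (combining partial checksums is a separate problem not shown here). The class name, method names and THRESHOLD value are hypothetical and chosen only for illustration. The same task runs either in the common pool or in a small private pool with a bounded thread count.

```java
import java.util.Arrays;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class PoolChoiceSketch {
    // Hypothetical cutoff: ranges at or below this size are processed serially.
    static final int THRESHOLD = 1 << 16;

    // Sums the unsigned byte values in [lo, hi), splitting while the range is large.
    static final class ByteSum extends RecursiveTask<Long> {
        private final byte[] data;
        private final int lo;
        private final int hi;

        ByteSum(byte[] data, int lo, int hi) {
            this.data = data;
            this.lo = lo;
            this.hi = hi;
        }

        @Override
        protected Long compute() {
            if (hi - lo <= THRESHOLD) {
                long sum = 0;
                for (int i = lo; i < hi; i++) {
                    sum += data[i] & 0xFF;
                }
                return sum;
            }
            int mid = (lo + hi) >>> 1;
            ByteSum left = new ByteSum(data, lo, mid);
            left.fork();                                   // run the left half asynchronously
            long right = new ByteSum(data, mid, hi).compute();
            return left.join() + right;
        }
    }

    // Option 1: share the common pool with every other user in the JVM.
    static long sumInCommonPool(byte[] data) {
        return ForkJoinPool.commonPool().invoke(new ByteSum(data, 0, data.length));
    }

    // Option 2: a private pool with a bounded number of threads.
    static long sumInPrivatePool(byte[] data, int threads) {
        ForkJoinPool pool = new ForkJoinPool(threads);
        try {
            return pool.invoke(new ByteSum(data, 0, data.length));
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        byte[] data = new byte[1 << 20];
        Arrays.fill(data, (byte) 1);
        System.out.println(sumInCommonPool(data));
        System.out.println(sumInPrivatePool(data, 4));
    }
}
```

With the common pool, a deployment can limit parallelism in one place through the java.util.concurrent.ForkJoinPool.common.parallelism system property, whereas each private pool would need its own configuration.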

In any case, this looks like really good work. One thing that might be worth
checking is startup/warm-up. I have a vague memory of this being a
concern in the past with Adler32; Sherman might remember the details.

-Alan.