On 07.05.22 01:55, MG wrote:
Hi Groovy devs,
I have finally managed to extract a sample from our Groovy framework
that only uses our most basic modules and still exhibits the 2-3x
performance degradation between (non-indy) Groovy 3 and Groovy 4 I
already described in multiple posts a while ago.
The code was built using the latest Groovy releases (3.0.10 / 4.0.1),
you can find the 2 JARs & a test script at:
https://github.com/mgroovy/groovyperformance/tree/trunk/groovy4_performance_public_sample
The problem is that the jars contain compiled Groovy code only, which
means we do not know their source... that makes it a bit harder to
find out what is actually going wrong by trying to reproduce it.
To check the performance:
1. Open a GroovyConsole window for Groovy 3.0.10
2. Do Script\Add JAR(s) to ClassPath: groovyperformance-3.0.10.jar
3. Load & execute groovysql_performance_groovy4_2_xx_yy_zzzz.groovy
Analogously for Groovy 4.0.1 with groovyperformance-4.0.1.jar.
On 3 different PCs I consistently get results as shown below (each timed
test step executes the same code 100x, each time creating the same
simple SQL GString using our framework):
*Groovy 3.0.10*
0) dt = 7.99 ms
1) dt = 2.01 ms
2) dt = 1.53 ms
3) dt = 1.65 ms
4) dt = 1.36 ms
*Groovy 4.0.1*
0) dt = 16.51 ms
1) dt = 7.14 ms
2) dt = 5.83 ms
3) dt = 6.6 ms
4) dt = 6.24 ms
Just a note, since you mention this later: this is Groovy 3 non-indy vs.
Groovy 4 (always indy).
Throwing away the first loop, which includes framework setup time,
Groovy 3 outperforms Groovy 4 by a factor of about 3 (Note: On my
notebook the factor is closer to 2).
I am not sure that throwing out the first loop is actually the right
thing to do.
As always with benchmarks, the problem is partially figuring out what
you are really measuring. My assumption is that the initial callsite
creation is much slower in indy than it was with non-indy. If you have
a lot of one-time-only invocations, this piles up quite a bit.
Execution times generally decrease when starting the script multiple
times, but the best dt I have observed on the PC where the above
measurements were taken was 3.3 ms, whereas Groovy 3.0.10 goes down
below 1 ms (in any case this is irrelevant for practical applications,
since here a short, identical code segment will not be executed
n x 500 times in a loop).
[...]
You have to imagine it like this for non-indy...
You do a method call, Groovy finds the method that is called and
generates bytecode for a call to this method, guarded basically by some
class checks (receiver and parameters). Subsequent calls are then done
using this optimized path, and all you pay compared to normal Java is
the class checks, some special logic for category methods and like
2-3 method calls. All of this can be inlined pretty well, but won't be
with this low number of calls. Still, the path itself is pretty fast
already on the second execution.
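To make that shape concrete, here is a rough, hypothetical sketch in
plain Java of what such a generated callsite amounts to (all names are
made up; the real thing is bytecode generated by the Groovy runtime,
with more checks, e.g. for active category methods):

// Hypothetical sketch of a non-indy callsite once the first call has
// selected the method (made-up names, not actual generated code):
public class NonIndyCallSiteShape {

    static class Foo {
        String foo(String s) { return "foo:" + s; }
    }

    // the "generated" callsite: cheap class checks guarding a direct call
    static class FooCallSite {
        private final Class<?> expectedReceiver = Foo.class;

        Object call(Object receiver, Object arg) {
            // the guard: receiver and parameter class checks
            if (receiver.getClass() == expectedReceiver && arg instanceof String) {
                // direct, inlinable call to the selected method
                return ((Foo) receiver).foo((String) arg);
            }
            // guard failed: do the full method selection again
            return reselectAndCall(receiver, arg);
        }

        private Object reselectAndCall(Object receiver, Object arg) {
            throw new UnsupportedOperationException("reselection elided in this sketch");
        }
    }

    public static void main(String[] args) {
        FooCallSite cs = new FooCallSite();
        System.out.println(cs.call(new Foo(), "bar")); // prints foo:bar
    }
}

The important property is that the hot path is an ordinary guarded Java
call, which the JIT can inline.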
The problem here is that the first execution is actually done from
within the callsite code. This has to happen, as somebody has to do the
invocation in the end. And even the nth invocation is done from
Groovy-created code. If you have to deal with the Java module system,
then this is a killer. But anyway. Let us note the first-time call
performance:
method selection + small (unverified) bytecode generation + actual
invocation, with a call depth of 2-3 methods and some type
transformations before you get to the actual method
With indy you have the bootstrap method, which is called first; here we
basically do the method selection and produce method handles, method
types, guards and type transformations. There is also some kind of
cache...
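Roughly, the result of such a bootstrap has this shape (an illustrative,
heavily simplified sketch with made-up names; Groovy's real logic lives
in IndyInterface and does far more): selection result, type
transformation and guard are all composed out of method handles instead
of generated bytecode.

import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

public class IndyShape {

    static boolean receiverIsString(Object receiver) {
        return receiver instanceof String;
    }

    public static void main(String[] args) throws Throwable {
        MethodHandles.Lookup lookup = MethodHandles.lookup();

        // "method selection": pick String.concat as the target
        MethodHandle selected = lookup.findVirtual(String.class, "concat",
                MethodType.methodType(String.class, String.class));

        // the callsite type as the compiler emits it: (Object,Object)Object
        MethodType callType = MethodType.methodType(Object.class,
                Object.class, Object.class);

        // type transformation as a handle (non-indy generates code for this)
        MethodHandle adapted = selected.asType(callType);

        // the guard; in a real callsite the failing branch would lead to a
        // full reselection, here it is just a constant as a placeholder
        MethodHandle test = lookup.findStatic(IndyShape.class, "receiverIsString",
                MethodType.methodType(boolean.class, Object.class));
        MethodHandle reselect = MethodHandles.dropArguments(
                MethodHandles.constant(Object.class, "<would reselect here>"),
                0, Object.class, Object.class);
        MethodHandle guarded = MethodHandles.guardWithTest(test, adapted, reselect);

        System.out.println(guarded.invoke("foo", "bar")); // prints foobar
        System.out.println(guarded.invoke(42, "bar"));    // guard fails
    }
}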
Now the question is what is going wrong. The method selection is not
the cost, since that is the same in both. Type transformations are done
differently in the two: in the callsite code we basically generate code
for them, in indy we create a handle for them.
Basically I see 2 possible troubles:
1) creating the initial selection is too expensive
2) the caching is not actually working
The first one is the one I suspect. In my opinion indy is slower because
of the type transformation code, because of how MethodTypes are produced
for indy, and because an invocation of generated code is much faster
than a method handle (which is kind of interpreted before Java creates
bytecode for it). Being a bit slower would be no problem of course.
I spent the whole Saturday looking at the code and trying to understand
what is happening.
Well... the structure seems to be that there is a size-4 LRU cache per
callsite. But since the threshold is so high, nothing is ever going to
be cached. I set the threshold to -1 (0 actually would not work, since
there is a comparison with >, not >=). Every cache entry can be seen as
a pair of something that will call the target and a fallback. The
fallback seems to be to select the method again. Imho that should lead
back to the cache; otherwise the cache makes no sense in the case of
reselection of the cached method and basically degrades to a size-1
cache.
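Spelled out (a hypothetical reconstruction of the check, names made up,
just to show the off-by-one):

// Hypothetical shape of the threshold check:
class ThresholdShape {
    static boolean shouldCache(int callCount, int cacheThreshold) {
        // with ">" a threshold of 0 never caches the very first call,
        // because the call count is still 0 and 0 > 0 is false; only a
        // threshold of -1 (or a ">=" comparison) lets it cache
        return callCount > cacheThreshold;
    }
    // shouldCache(0, 0)  -> false   (threshold 0: nothing cached)
    // shouldCache(0, -1) -> true    (threshold -1: caching happens)
}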
But besides that... there are like a thousand callsites generated by
your code, which just means that this is no micro-benchmark.
Something does look strange though. For example, the call to
maxNameLength seems not to be properly cached. The cache stores
selectMethod as the target every time, even if the threshold is set to
0. But then again, it became 3:00 in the morning and I ran out of time.
As it should work imho:
On first execution we build a handle that can be called later on, and
store it in the cache. The fallback of this handle should be the cache.
ASCII-Art for foo():

     handle
     /    \
   foo    cache
The cache would need a key of the runtime type signature of the call.
But since that could be expensive, the receiver alone would already help
a little (that is what is done already). So if no old selection applies,
a new method should be selected:
     handle
     /    \
   foo    cache
          /   \
        hit   miss
         |      |
       foo'   handle'
or in some "simple" code:
const MH callCache = ...

// first run: do the full selection and install the result
static initial run: install method selectMethodAndCreateCache(*) {
    MH target = selectMethod(*, fallback: callCache)
    CachedCallSite cs = new CachedCallSite()
    cs.install(target, receiverName)
}

// entered when the guard of the installed target fails:
// try the cache first, reselect only on a miss
callCache(CachedCallSite cs, Object receiver, *) {
    target = cs.cache.findCachedMethod(receiver)
    if (target == null) {
        target = selectMethod(*)
        cs.install(target, receiverName(receiver))
    }
    target.invokeExact(*)
}

and

findCachedMethod(receiver) {
    // if we land here although the receiver matches the current
    // target, caching is not helping -> stop caching
    if (receiver of current target == receiver)
        disableCaching()
    return get(receiver)
}

install(target, receiverName) {
    super.setTarget(target)
    currentReceiverName = receiverName
    put(receiverName, target)
}
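For concreteness, a runnable sketch of that scheme in plain Java
method-handle code (all names like CachingCallSite, lookupOrSelect and
selectMethod are made up, this is NOT the actual Groovy runtime; for
simplicity only no-arg instance calls with callsite type (Object)Object
are supported):

import java.lang.invoke.CallSite;
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.invoke.MutableCallSite;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public final class CachingCallSite extends MutableCallSite {

    private static final MethodHandles.Lookup LOOKUP = MethodHandles.lookup();
    private static final MethodHandle LOOKUP_OR_SELECT;
    private static final MethodHandle SAME_CLASS;

    static {
        try {
            LOOKUP_OR_SELECT = LOOKUP.findVirtual(CachingCallSite.class,
                    "lookupOrSelect", MethodType.methodType(Object.class, Object.class));
            SAME_CLASS = LOOKUP.findStatic(CachingCallSite.class,
                    "sameClass", MethodType.methodType(boolean.class, Class.class, Object.class));
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    private final String methodName;
    private final Map<Class<?>, MethodHandle> cache = new ConcurrentHashMap<>();

    public CachingCallSite(String methodName, MethodType type) {
        super(type);
        this.methodName = methodName;
        // initial target goes straight into the cache/selection path
        setTarget(LOOKUP_OR_SELECT.bindTo(this).asType(type));
    }

    // what an invokedynamic bootstrap method for this scheme would return
    public static CallSite bootstrap(MethodHandles.Lookup caller, String name, MethodType type) {
        return new CachingCallSite(name, type);
    }

    // the "cache" leg: hit -> reuse cached handle, miss -> select and cache
    Object lookupOrSelect(Object receiver) throws Throwable {
        Class<?> receiverClass = receiver.getClass();
        MethodHandle target = cache.get(receiverClass);
        if (target == null) {
            target = selectMethod(receiverClass); // the expensive part
            cache.put(receiverClass, target);
        }
        install(target, receiverClass);
        return target.invoke(receiver);
    }

    // install the selected method as the direct target, guarded by a
    // receiver-class check whose miss branch leads back into the cache
    private void install(MethodHandle target, Class<?> receiverClass) {
        MethodHandle guard = SAME_CLASS.bindTo(receiverClass);
        MethodHandle cacheLeg = LOOKUP_OR_SELECT.bindTo(this).asType(type());
        setTarget(MethodHandles.guardWithTest(guard, target, cacheLeg));
    }

    // stand-in for Groovy's full runtime method selection
    private MethodHandle selectMethod(Class<?> receiverClass) throws ReflectiveOperationException {
        return LOOKUP.unreflect(receiverClass.getMethod(methodName)).asType(type());
    }

    static boolean sameClass(Class<?> expected, Object receiver) {
        return receiver.getClass() == expected;
    }

    public static void main(String[] args) throws Throwable {
        MethodHandle invoker = new CachingCallSite("toString",
                MethodType.methodType(Object.class, Object.class)).dynamicInvoker();
        System.out.println(invoker.invoke("abc"));  // first call: select + cache
        System.out.println(invoker.invoke("abc"));  // guarded direct call
        System.out.println(invoker.invoke(42));     // class change: cache leg again
    }
}

The essential point of the shape is that the miss branch of the
installed guard leads back into the cache lookup, not straight into a
full reselection.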
The structure that my debugging suggests seems to be different from
that. Furthermore...
I tried a bunch of small improvements that would not be suitable for
production (like disabling almost all guards), but none of them changed
the numbers significantly. I know this situation from back in 1.6, when
I did the primitive optimizations. It was not until I could eliminate a
call to synchronized that I got better performance. In general,
concurrency barriers are very bad for inlining. The cache code involves
at least two of them, but I am not sure removing them would actually
make things fast.
The only thing I can imagine is building up the code slowly from "fast,
but cannot do much of anything" to "fully capable" and seeing at what
point the performance breaks down. And there might be multiple such
points. Sadly I lack the time to do a project like that right now.
bye Jochen