On 07.05.22 01:55, MG wrote:
Hi Groovy devs,
I have finally managed to extract a sample from our Groovy framework
that only uses our most basic modules and still exhibits the 2-3x
performance degradation between (non-indy) Groovy 3 and Groovy 4 I
already described in multiple posts a while ago.
The code was built using the latest Groovy releases (3.0.10 / 4.0.1),
you can find the 2 JARs & a test script at:
https://github.com/mgroovy/groovyperformance/tree/trunk/groovy4_performance_public_sample
The problem is that the jars contain compiled Groovy code only, which
means we do not know their source... that makes it a bit harder to
find out what is actually going wrong by trying to reproduce it.
To check the performance:
1. Open a GroovyConsole window for Groovy 3.0.10
2. Do Script\Add JAR(s) to ClassPath: groovyperformance-3.0.10.jar
3. Load & execute groovysql_performance_groovy4_2_xx_yy_zzzz.groovy
Analogously for Groovy 4.0.1 with groovyperformance-4.0.1.jar.
On 3 different PCs I consistently get results as shown below (each timed
test step executes the same code 100x, each time creating the same
simple SQL GString using our framework):
*Groovy 3.0.10*
0) dt = 7.99 ms
1) dt = 2.01 ms
2) dt = 1.53 ms
3) dt = 1.65 ms
4) dt = 1.36 ms
*Groovy 4.0.1*
0) dt = 16.51 ms
1) dt = 7.14 ms
2) dt = 5.83 ms
3) dt = 6.6 ms
4) dt = 6.24 ms
Just a note, since you mention this later: this is Groovy 3 non-indy vs.
Groovy 4 (always indy).
Throwing away the first loop, which includes framework setup time,
Groovy 3 outperforms Groovy 4 by a factor of about 3 (Note: On my
notebook the factor is closer to 2).
I am not sure that throwing out the first loop is actually the right
thing to do.
As always with benchmarks, the problem is partially figuring out what
you are really measuring. My assumption is that the initial callsite
creation is much slower in indy than it was with non-indy. If you have
a lot of one-time-only invocations, this piles up quite a bit.
Execution times generally decrease when starting the script multiple
times, but the best dt I have observed on the PC where the above
measurements were taken was 3.3 ms, whereas Groovy 3.0.10 goes down
below 1 ms (in any case this is irrelevant for practical applications,
since here a short, identical code segment will not be executed
n x 500 times in a loop).
[...]
You have to imagine it like this for non-indy...
You do a method call, Groovy finds the method that is called and
generates bytecode for a call to this method, guarded basically by some
class checks (receiver and parameters). Subsequent calls are then done
using this optimized path, and all you pay compared to normal Java is
the class checks, some special logic for category methods and like
2-3 method calls. All of this can be inlined pretty well, but won't be
with this low number of calls. Still, the path itself is pretty fast
already on the second execution.
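To make that shape concrete, here is a rough, hypothetical sketch in
plain Java of what such a generated callsite amounts to (all names are
made up; the real thing is bytecode generated by the Groovy runtime,
with more checks, e.g. for active category methods):

// Hypothetical sketch of a non-indy callsite once the first call has
// selected the method (made-up names, not actual generated code):
public class NonIndyCallSiteShape {

    static class Foo {
        String foo(String s) { return "foo:" + s; }
    }

    // the "generated" callsite: cheap class checks guarding a direct call
    static class FooCallSite {
        private final Class<?> expectedReceiver = Foo.class;

        Object call(Object receiver, Object arg) {
            // the guard: receiver and parameter class checks
            if (receiver.getClass() == expectedReceiver && arg instanceof String) {
                // direct, inlinable call to the selected method
                return ((Foo) receiver).foo((String) arg);
            }
            // guard failed: do the full method selection again
            return reselectAndCall(receiver, arg);
        }

        private Object reselectAndCall(Object receiver, Object arg) {
            throw new UnsupportedOperationException("reselection elided in this sketch");
        }
    }

    public static void main(String[] args) {
        FooCallSite cs = new FooCallSite();
        System.out.println(cs.call(new Foo(), "bar")); // prints foo:bar
    }
}

The important property is that the hot path is an ordinary guarded Java
call, which the JIT can inline.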
The problem here is that the first execution is actually done from
within the callsite code. This has to happen, as somebody has to do the
invocation in the end. And even the nth invocation is done from
Groovy-created code. If you have to deal with the Java module system,
then this is a killer. But anyway. Let us note the first-time call
performance:
method selection + small (unverified) bytecode generation + actual
invocation, with a call depth of 2-3 methods and some type
transformations before you get to the actual method
With indy you have the bootstrap method, which is called first; here we
basically do the method selection and produce method handles, method
types, guards and type transformations. There is also some kind of
cache...
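Roughly, the result of such a bootstrap has this shape (an illustrative,
heavily simplified sketch with made-up names; Groovy's real logic lives
in IndyInterface and does far more): selection result, type
transformation and guard are all composed out of method handles instead
of generated bytecode.

import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

public class IndyShape {

    static boolean receiverIsString(Object receiver) {
        return receiver instanceof String;
    }

    public static void main(String[] args) throws Throwable {
        MethodHandles.Lookup lookup = MethodHandles.lookup();

        // "method selection": pick String.concat as the target
        MethodHandle selected = lookup.findVirtual(String.class, "concat",
                MethodType.methodType(String.class, String.class));

        // the callsite type as the compiler emits it: (Object,Object)Object
        MethodType callType = MethodType.methodType(Object.class,
                Object.class, Object.class);

        // type transformation as a handle (non-indy generates code for this)
        MethodHandle adapted = selected.asType(callType);

        // the guard; in a real callsite the failing branch would lead to a
        // full reselection, here it is just a constant as a placeholder
        MethodHandle test = lookup.findStatic(IndyShape.class, "receiverIsString",
                MethodType.methodType(boolean.class, Object.class));
        MethodHandle reselect = MethodHandles.dropArguments(
                MethodHandles.constant(Object.class, "<would reselect here>"),
                0, Object.class, Object.class);
        MethodHandle guarded = MethodHandles.guardWithTest(test, adapted, reselect);

        System.out.println(guarded.invoke("foo", "bar")); // prints foobar
        System.out.println(guarded.invoke(42, "bar"));    // guard fails
    }
}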
Now the question is what is going wrong. The method selection is not
the cost, since that is the same in both. Type transformations are done
differently in the two: in the callsite code we basically generate code
for them, in indy we create a handle for them.
Basically I see 2 possible troubles:
1) creating the initial selection is too expensive
2) the caching is not actually working
The first one is the one I suspect. In my opinion indy is slower because
of the type transformation code, because of how MethodTypes are produced
for indy, and because an invocation of generated code is much faster
than a method handle (which is kind of interpreted before Java creates
bytecode for it). Being a bit slower would be no problem of course.
I spent the whole Saturday looking at the code and trying to understand
what is happening.
Well... the structure seems to be that there is a size-4 LRU cache per
callsite. But since the threshold is so high, nothing is ever going to
be cached. I set the threshold to -1 (0 actually would not work, since
there is a comparison with >, not >=). Every cache entry can be seen as
a pair of something that will call the target and a fallback. The
fallback seems to be to select the method again. Imho that should lead
back to the cache; otherwise the cache makes no sense in the case of
reselection of the cached method and basically degrades to a size-1
cache.
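Spelled out (a hypothetical reconstruction of the check, names made up,
just to show the off-by-one):

// Hypothetical shape of the threshold check:
class ThresholdShape {
    static boolean shouldCache(int callCount, int cacheThreshold) {
        // with ">" a threshold of 0 never caches the very first call,
        // because the call count is still 0 and 0 > 0 is false; only a
        // threshold of -1 (or a ">=" comparison) lets it cache
        return callCount > cacheThreshold;
    }
    // shouldCache(0, 0)  -> false   (threshold 0: nothing cached)
    // shouldCache(0, -1) -> true    (threshold -1: caching happens)
}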
But besides that... there are like a thousand callsites generated by
your code, which just means that this is no micro-benchmark.
Something does look strange though. For example, the call to
maxNameLength seems not to be properly cached. The cache stores
selectMethod as the target every time, even if the threshold is set to
0. But then again, it became 3:00 in the morning and I ran out of time.
As it should work imho:
On first execution we build a handle that can be called later on, and
store it in the cache. The fallback of this handle should be the cache.
ASCII-Art for foo():

     handle
     /    \
   foo    cache
The cache would need a key of the runtime type signature of the call.
But since that could be expensive, the receiver alone would already help
a little (that is what is done already). So if no old selection applies,
a new method should be selected:
     handle
     /    \
   foo    cache
          /   \
        hit   miss
         |      |
       foo'   handle'
or in some "simple" code:
const MH callCache = ...

// first run: do the full selection and install the result
static initial run: install method selectMethodAndCreateCache(*) {
    MH target = selectMethod(*, fallback: callCache)
    CachedCallSite cs = new CachedCallSite()
    cs.install(target, receiverName)
}

// entered when the guard of the installed target fails:
// try the cache first, reselect only on a miss
callCache(CachedCallSite cs, Object receiver, *) {
    target = cs.cache.findCachedMethod(receiver)
    if (target == null) {
        target = selectMethod(*)
        cs.install(target, receiverName(receiver))
    }
    target.invokeExact(*)
}

and

findCachedMethod(receiver) {
    // if we land here although the receiver matches the current
    // target, caching is not helping -> stop caching
    if (receiver of current target == receiver)
        disableCaching()
    return get(receiver)
}

install(target, receiverName) {
    super.setTarget(target)
    currentReceiverName = receiverName
    put(receiverName, target)
}
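For concreteness, a runnable sketch of that scheme in plain Java
method-handle code (all names like CachingCallSite, lookupOrSelect and
selectMethod are made up, this is NOT the actual Groovy runtime; for
simplicity only no-arg instance calls with callsite type (Object)Object
are supported):

import java.lang.invoke.CallSite;
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.invoke.MutableCallSite;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public final class CachingCallSite extends MutableCallSite {

    private static final MethodHandles.Lookup LOOKUP = MethodHandles.lookup();
    private static final MethodHandle LOOKUP_OR_SELECT;
    private static final MethodHandle SAME_CLASS;

    static {
        try {
            LOOKUP_OR_SELECT = LOOKUP.findVirtual(CachingCallSite.class,
                    "lookupOrSelect", MethodType.methodType(Object.class, Object.class));
            SAME_CLASS = LOOKUP.findStatic(CachingCallSite.class,
                    "sameClass", MethodType.methodType(boolean.class, Class.class, Object.class));
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    private final String methodName;
    private final Map<Class<?>, MethodHandle> cache = new ConcurrentHashMap<>();

    public CachingCallSite(String methodName, MethodType type) {
        super(type);
        this.methodName = methodName;
        // initial target goes straight into the cache/selection path
        setTarget(LOOKUP_OR_SELECT.bindTo(this).asType(type));
    }

    // what an invokedynamic bootstrap method for this scheme would return
    public static CallSite bootstrap(MethodHandles.Lookup caller, String name, MethodType type) {
        return new CachingCallSite(name, type);
    }

    // the "cache" leg: hit -> reuse cached handle, miss -> select and cache
    Object lookupOrSelect(Object receiver) throws Throwable {
        Class<?> receiverClass = receiver.getClass();
        MethodHandle target = cache.get(receiverClass);
        if (target == null) {
            target = selectMethod(receiverClass); // the expensive part
            cache.put(receiverClass, target);
        }
        install(target, receiverClass);
        return target.invoke(receiver);
    }

    // install the selected method as the direct target, guarded by a
    // receiver-class check whose miss branch leads back into the cache
    private void install(MethodHandle target, Class<?> receiverClass) {
        MethodHandle guard = SAME_CLASS.bindTo(receiverClass);
        MethodHandle cacheLeg = LOOKUP_OR_SELECT.bindTo(this).asType(type());
        setTarget(MethodHandles.guardWithTest(guard, target, cacheLeg));
    }

    // stand-in for Groovy's full runtime method selection
    private MethodHandle selectMethod(Class<?> receiverClass) throws ReflectiveOperationException {
        return LOOKUP.unreflect(receiverClass.getMethod(methodName)).asType(type());
    }

    static boolean sameClass(Class<?> expected, Object receiver) {
        return receiver.getClass() == expected;
    }

    public static void main(String[] args) throws Throwable {
        MethodHandle invoker = new CachingCallSite("toString",
                MethodType.methodType(Object.class, Object.class)).dynamicInvoker();
        System.out.println(invoker.invoke("abc"));  // first call: select + cache
        System.out.println(invoker.invoke("abc"));  // guarded direct call
        System.out.println(invoker.invoke(42));     // class change: cache leg again
    }
}

The essential point of the shape is that the miss branch of the
installed guard leads back into the cache lookup, not straight into a
full reselection.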
The structure that my debugging suggests seems to be different from
that. Furthermore...
I tried a bunch of small improvements that would not be suitable for
production (like disabling almost all guards), but none of them changed
the numbers significantly. I know this situation from back in 1.6, when
I did the primitive optimizations. It was not until I could eliminate a
call to synchronized that I got better performance. In general,
concurrency barriers are very bad for inlining. The cache code involves
at least two of them, but I am not sure removing them would actually
make things fast.
The only thing I can imagine is building up the code slowly from "fast,
but cannot do much of anything" to "fully capable" and seeing at what
point the performance breaks down. And there might be multiple such
points. Sadly I lack the time to do a project like that right now.
bye Jochen