Paul King created GROOVY-12034:
----------------------------------
Summary: Apply "fat-free lambda" patterns to Groovy — DGM `*With`
family + `curryWith` helper
Key: GROOVY-12034
URL: https://issues.apache.org/jira/browse/GROOVY-12034
Project: Groovy
Issue Type: Improvement
Reporter: Paul King
Assignee: Paul King
h2. Summary
Donald Raab's blog post "Fat-Free Lambdas in Java"
([https://donraab.medium.com/fat-free-lambdas-in-java-bf228da0613b]) describes
a pattern where stateless lambdas, combined with library APIs that take an
extra parameter, produce zero per-call allocation. This ticket assesses
applying that pattern to Groovy, backed by a JMH benchmark.
*Result:* the optimisation already works for {{@CompileStatic}} lambdas
targeting a functional interface (GROOVY-11905), but Groovy's GDK does not
expose an API surface that lets idiomatic call sites _use_ it. The dominant
Groovy idiom for "don't capture" — {{Closure.rcurry}} — measures *worse on
every axis* than just capturing. Adding a small {{*With}} family to
{{DefaultGroovyMethods}}, plus a {{curryWith}} helper, would deliver the
fat-free benefit to ordinary Groovy code with no breaking changes.
h2. Background — the technique
Three pillars stack to make the optimisation work in Java:
# *Compiler hoisting of stateless lambdas* — javac lowers non-capturing {{(x)
-> ...}} to a static {{LambdaMetafactory}} call site whose linked {{CallSite}}
returns the same singleton forever.
# *API that prefers passing over capturing* — Eclipse Collections'
{{anySatisfyWith(Predicate2<T,P>, P)}} accepts a 2-arg functional interface
plus the would-be-captured value, so the lambda body becomes stateless.
# *Method references* — {{String::startsWith}} is unbound; naturally
non-capturing.
h2. Mapping to Groovy
h3. Pillar 1 — Compiler hoisting
*Already done for {{@CompileStatic}} lambdas targeting a functional interface
(GROOVY-11905).*
* {{StaticTypesLambdaAnalyzer.isNonCapturing()}} at
{{StaticTypesLambdaWriter.java:72-75}} detects non-capturing lambdas.
* Non-capturing lambdas get {{ACC_STATIC}} on the synthetic {{doCall}}
({{StaticTypesLambdaWriter.java:339}}).
* Emitted as {{invokedynamic}} → {{LambdaMetafactory.metafactory}} with
{{H_INVOKESTATIC}} ({{StaticTypesLambdaWriter.java:179-183, :429}}). The JDK
returns the singleton.
*Cannot be applied to plain {{Closure}}* (and probably should not be):
* Every {{Closure}} carries {{owner}}, {{thisObject}}, {{delegate}},
{{resolveStrategy}}, {{directive}}, {{parameterTypes}},
{{maximumNumberOfParameters}}, {{bcw}} ({{Closure.java:260-274}}) — ~56-72
bytes minimum, dwarfing a Java capturing lambda.
* {{Closure}} instances are externally mutable: callers may set {{.delegate}},
{{.resolveStrategy}}, call {{.curry(...)}}. Sharing a singleton would let one
caller's mutation leak globally.
* In dynamic compilation, {{LambdaWriter}} delegates to {{ClosureWriter}}
({{LambdaWriter.java:27-56}}), so even a non-capturing lambda becomes a fresh
{{Closure}} subclass per evaluation.
h3. Pillar 2 — API surface (the gap)
Groovy's GDK has *no equivalent* of Eclipse Collections' {{anySatisfyWith}}.
Confirmed by inspection of {{DefaultGroovyMethods.java}}:
*
{{findAll}}/{{find}}/{{any}}/{{every}}/{{collect}}/{{inject}}/{{count}}/{{groupBy}}
(lines 5484, 5376, 600, 5267, 2336, 8787, 3607, 7850) all take a single
{{Closure}}.
* The only {{BiPredicate}}/{{BiFunction}} overloads in DGM are {{zipWithNext}}
({{DGM.java:18515}}) and {{groupConsecutive}} ({{DGM.java:18696}}) — both pair
_adjacent elements_, not element-plus-external-arg.
Groovy's existing "don't capture" tool is {{Closure.rcurry/curry/ncurry}}
({{Closure.java:673, 706, 755}}). But every call allocates:
{code:java}
super(uncurriedClosure.clone()); // CurriedClosure.java:67 — clones the
base Closure
this.curriedArguments = arguments; // CurriedClosure.java:70 — fresh
Object[]
{code}
So {{rcurry(value)}} costs a clone + wrapper + array, _heavier than just
capturing_. (Confirmed by the benchmark below.)
h3. Pillar 3 — Method references
Already work in {{@CompileStatic}}. Same indy path, same singleton behaviour,
no synthetic class needed (the method handle points straight at the target
method).
h2. Benchmark
Sidecar code added at:
*
{{subprojects/performance/src/jmh/groovy/org/apache/groovy/bench/FatFreeLambda.groovy}}
*
{{subprojects/performance/src/jmh/groovy/org/apache/groovy/bench/FatFreeLambdaBench.java}}
Seven variants compared, all traversing the full list (prefix never matches)
with identical per-element work:
||Tag||Variant||Per-call allocation shape||
|A|{{data.any \{ it.startsWith(prefix) \}}} — capturing closure literal|1 ×
{{Closure}} subclass + per-element wrap|
|B|{{data.any(pred.rcurry(prefix))}} — closure + rcurry|1 × {{Closure}} + 1 ×
{{CurriedClosure}} + per-element args|
|C|{{anyWith(data, (s,p) -> s.startsWith(p), prefix)}} — candidate API +
lambda|0 (indy singleton via {{LambdaMetafactory}})|
|D|{{anyWith(data, String::startsWith, prefix)}} — candidate API + method ref|0
(same; no synthetic class)|
|E|hand-written {{for}} loop|0 (baseline)|
|F|{{static final}} hoisted closure + rcurry|1 × {{CurriedClosure}} +
per-element args (no per-call literal)|
|G|{{anyPred(data, curryWith(STARTS_WITH, prefix))}} — fat-free curry helper|1
small capturing lambda per call|
Setup: JMH 1.37, JDK 17.0.18 Zulu, macOS, 3×2s warmup + 5×2s × 2 forks, GC
profiler enabled, size sweep {{@Param({"100","1000","10000"})}}. Two passes:
default ({{indy=true}}) and {{-Pindy=false}}. Each pass ~21 min wall.
Run command:
{noformat}
./gradlew :perf:jmh -PbenchInclude=FatFreeLambda
{noformat}
h2. Results
h3. Throughput (ops/µs, higher = better, {{indy=true}})
||Variant||size=100||size=1000||size=10000||
|C {{anyWith}} + lambda|9.926 ± 0.105|1.001 ± 0.004|0.096 ± 0.001|
|D {{anyWith}} + method ref|9.622 ± 0.114|0.998 ± 0.009|0.096 ± 0.001|
|E baseline {{for}} loop|8.215 ± 0.050|0.768 ± 0.007|0.072 ± 0.001|
|G {{curryWith}}|7.850 ± 0.237|0.927 ± 0.012|0.078 ± 0.001|
|A capture closure|0.584 ± 0.024|0.062 ± 0.002|0.007 ± 0.001|
|B rcurry|0.252 ± 0.010|0.025 ± 0.001|0.003 ± 0.001|
|F static-final + rcurry|0.248 ± 0.001|0.025 ± 0.002|0.003 ± 0.001|
h3. Allocation (bytes/op, lower = better, {{indy=true}})
||Variant||size=100||size=1000||size=10000||Per-element rate||
|C/D|≈ 10⁻⁴|≈ 10⁻³|0.031|*0*|
|E baseline|≈ 10⁻³|0.004|32 (GC-thread noise)|*0*|
|G {{curryWith}}|*128*|*128*|*128*|*0 — constant per call*|
|A capture closure|7,304|72,080|720,096|~72 B/elem + ~80 fixed|
|B rcurry|9,808|96,208|960,209|~96 B/elem + ~80 fixed|
|F shared closure + rcurry|9,764|96,176|960,153|matches B (60 B/call less)|
h3. Indy on vs off (size=1000)
||Variant||indy=on ops/µs||indy=off ops/µs||Δ||indy=on B/op||indy=off B/op||
|A capture|0.062|0.057|−8%|72,080|72,080|
|B rcurry|0.025|0.024|−4%|96,208|96,580|
|C {{anyWith}}+lambda|1.001|1.001|0%|0.003|0.003|
|D {{anyWith}}+methodRef|0.998|0.999|0%|0.003|0.003|
|E baseline|0.768|0.905|+18%|0.003|0.003|
|F shared+rcurry|0.025|0.025|0%|96,176|96,472|
|G curryWith|0.927|0.894|−4%|128|120|
h2. Findings
# *The fat-free claim holds in Groovy, with margin.* C and D allocate
effectively zero bytes regardless of size. At size=10000, C/D move 0.03 B/op vs
720 KB/op for A and 960 KB/op for B — *5 orders of magnitude apart*. The
optimisation is real and usable today via {{@CompileStatic}} + functional
interface targets; the GDK just doesn't currently offer the API to reach it.
# *The Groovy idiom "curry instead of capture" is empirically backwards.*
Across every size and both indy modes, B ({{rcurry}}) is *~2.5× slower* than A
(capturing closure) and allocates *~33% more bytes*. The cause is
{{CurriedClosure.call}} concatenating {{curriedArguments}} with the incoming
args into a fresh {{Object[]}} on every element. At size=10000 that's 10,000
small arrays per call.
# *Hoisting the base closure to {{static final}} is essentially worthless.* F
matches B to within ~60 bytes per call across all configurations. The per-call
closure-literal allocation is noise; the per-_element_ {{CurriedClosure}} cost
is everything. Advising users to "extract closure literals to constants"
doesn't solve the rcurry problem.
# *{{curryWith}} is the right shape for the adaptation gap.* G allocates a flat
*128 B per call* regardless of workload size. At size=10000 that's *~7,500×
less allocation than {{rcurry}}* for the same logical operation.
Throughput-wise G is within 10% of the hand-written {{for}} loop. This covers
the case where users _can't_ avoid producing a unary {{Predicate}}/{{Function}}
(handing off to {{Stream.filter}}, {{Collection.removeIf}}, etc.).
# *Indy mode is irrelevant to this story.* Allocations are bit-identical
between {{-Pindy=true}} and {{-Pindy=false}}; throughputs move ≤8% for the
closure variants and 0% for the lambda variants. *The proposal helps indy and
non-indy users equally.*
# *C and D systematically beat E by 21-33%* across every size, {{indy=true}}.
Reproducible across both runs. Most likely the {{BiPredicate.test →
static-method-handle → s.startsWith(p)}} chain inlines into a slightly tighter
loop body than the direct {{for (String s : data) s.startsWith(prefix)}}. Not
load-bearing for the proposal, but worth noting as the strongest possible
refutation of "Groovy lambdas can't match Java loops".
h2. Proposed actions (_proposed pending team review_)
h3. Primary: add {{*With}} overloads to {{DefaultGroovyMethods}}
Additive — no source or binary breakage. ~8–12 methods cover the high-traffic
surface:
{code:java}
public static <T,P> List<T> findAll(Iterable<T> self, BiPredicate<? super
T, ? super P> test, P p);
public static <T,P> boolean any (Iterable<T> self, BiPredicate<? super T,
? super P> test, P p);
public static <T,P> boolean every (Iterable<T> self, BiPredicate<? super T,
? super P> test, P p);
public static <T,P> Number count (Iterable<T> self, BiPredicate<? super T,
? super P> test, P p);
public static <T,P,R> List<R> collect(Iterable<T> self, BiFunction<? super T,
? super P, ? extends R> f, P p);
public static <T,P> void each(Iterable<T> self, BiConsumer<? super T, ?
super P> c, P p);
public static <T,P> T find(Iterable<T> self, BiPredicate<? super T, ?
super P> test, P p);
public static <T,P,K> Map<K,List<T>> groupBy(Iterable<T> self, BiFunction<?
super T, ? super P, K> classifier, P p);
// + inject/reduce with seed; *With variants on Map<K,V>
{code}
Open naming question: {{With}}-suffixed overloads (mirrors Eclipse Collections)
vs. unsuffixed overloads disambiguated by argument types. The latter is the
more usual Groovy approach; the former is more obviously discoverable.
_Proposed pending team review._
h3. Primary: add a {{curryWith}} helper
{code:groovy}
public static <T,P> Predicate<T> curryWith(BiPredicate<? super T, ? super P>
bi, P p);
public static <T,P,R> Function<T,R> curryWith(BiFunction<? super T, ? super P,
? extends R> f, P p);
public static <T,P> Consumer<T> curryWith(BiConsumer<? super T, ? super P> c,
P p);
{code}
Right-curry semantics (the value lands in the second/last parameter slot,
matching what {{*With}} consumers want). Should ship alongside {{*With}} — the
{{rcurry}} → {{*With}} → {{curryWith}} ratio (~33,000× → 1× → ~7,500×) is the
strongest argument for shipping the helper in the same release, so users have a
fat-free path for every shape.
h3. Bonus ticket A: {{CurriedClosure}} per-element allocation
{{CurriedClosure.call}} concatenates {{curriedArguments}} with the incoming
args into a fresh {{Object[]}} on every invocation
({{CurriedClosure.java:81-99}}). The dominant cost in every closure+curry path
in this benchmark. A no-allocation variant for the common case
({{curriedArguments.length == 1 && args.length == 1}}) would save ~24 B/element
on every rcurry — a lift for everyone already using rcurry in production code,
without anyone needing to change source.
h3. Bonus ticket B: {{LambdaWriter}} {{Reference}} wrapping for
effectively-final method parameters
LambdaWriter wraps effectively-final method parameters in
{{groovy.lang.Reference}} — visible in G's 128 B/call constant (synthetic
lambda constructor signature: {{(Object, Object, Reference, Reference)V}}).
javac captures such parameters by value into the synthetic lambda's final
fields with no wrapping; Groovy could do the same, eliminating the per-capture
{{Reference}} allocations. Lower priority than bonus A because JIT
escape-analysis often elides these wrappers in hot paths, but they show up in
cold paths, megamorphic call sites, and interpreter-only execution.
h2. Caveats
* Benchmark uses 1000-element traversal with a cheap predicate
({{startsWith}}); closure overhead is maximally exposed. With heavier
per-element work the proportional gap narrows.
* Single dev machine; no CPU pinning. Throughputs are reproducible to ~10%,
allocations to <1 byte.
* JEP 450 compact object headers would further widen the C/D advantage but
require Java 25+; Groovy's minimum supported JDK is currently 17 (commit
{{e9b9cf1ec5}} / GROOVY-12027).
* The proposed {{*With}} family pays off most under {{@CompileStatic}} + lambda
syntax. Dynamic-Groovy callers using closures get no allocation win — but the
closure overloads remain, so no regression.
* *Related:* GROOVY-11905 (non-capturing lambda optimisation — landed)
h2. Appendix — raw bench output
Two full-fidelity passes saved at:
* Run 1 ({{indy=true}}): {{/tmp/fatfree-bench/run1-indy-on.\{log,txt\}}}
* Run 2 ({{indy=false}}): {{/tmp/fatfree-bench/run2-indy-off.\{log,txt\}}}
Bench source:
{{subprojects/performance/src/jmh/groovy/org/apache/groovy/bench/FatFreeLambda\{,Bench\}.\{groovy,java\}}}
To reproduce with GC profiler:
{noformat}
# Add `profilers = ['gc']` to the jmh { } block in
# build-logic/src/main/groovy/org.apache.groovy-performance.gradle
./gradlew :perf:jmh -PbenchInclude=FatFreeLambda # indy=true
./gradlew :perf:jmh -PbenchInclude=FatFreeLambda -Pindy=false # indy=false
{noformat}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)