costin opened a new pull request, #16129:
URL: https://github.com/apache/lucene/pull/16129

   `NumericDocValues.longValues()` was added in #15149 to amortize virtual call 
overhead when reading values in bulk. The current codec implementation 
(`Lucene90DocValuesProducer`) delegates to a per-doc loop via 
`advanceExact()`/`longValue()`, each going through `DirectReader.get()` with 
bit-unpacking per value.
   
   This PR adds a contiguous-range bulk decode path: when the requested doc IDs 
form a contiguous range (`docs[size-1] - docs[0] == size - 1`, a sufficient 
test as long as the IDs are sorted and distinct), the packed values are read 
as a single byte array and decoded in a tight loop. The contiguous case is the 
common path for aggregation collectors processing full segment scans via 
`DocIdStream.intoArray()` + `longValues()`.
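
As a rough illustration of the contiguity test described above, here is a minimal sketch. The class and method names are hypothetical and not taken from the PR, and it assumes the doc IDs in `docs[0..size)` are sorted ascending with no duplicates:

```java
// Hypothetical helper, not the PR's code: shows only the range test.
final class ContiguousRangeCheck {

  /**
   * Returns true when docs[0..size) is a run of consecutive doc IDs.
   * Assumption: the IDs are sorted ascending and contain no duplicates,
   * so comparing the first and last entries is enough.
   */
  static boolean isContiguous(int[] docs, int size) {
    return size > 0 && docs[size - 1] - docs[0] == size - 1;
  }

  public static void main(String[] args) {
    System.out.println(isContiguous(new int[] {10, 11, 12, 13}, 4)); // true
    System.out.println(isContiguous(new int[] {10, 12, 13, 14}, 4)); // false
  }
}
```

The check is O(1) per batch, so it adds essentially nothing to the cost of the fallback path when the batch turns out not to be contiguous.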
   
   The scalar decode loop auto-vectorizes well: C2 generates `vmovdqu` + 
`vpmovsxdq` + `vpandd` + `vmovdqu32` on AVX-512 and similar on AVX2 (see 
assembly below). A Panama Vector API implementation is included for comparison 
but shows no consistent improvement over the auto-vectorized scalar path and 
should likely be removed before merge.
   
   Non-contiguous (strided/scattered) doc IDs fall back to the existing per-doc 
path. Scattered access requires gather operations, which cannot be converted to 
sequential I/O.
   
   ### Benchmarks
   
   **NumericDocValuesBulkDecodeBenchmark** (1M docs, batch=1024, `longValues()` 
throughput)
   
   ```
   JDK 25.0.3+9-LTS (Temurin), JMH 1.37
   JVM args: --add-modules jdk.incubator.vector -Xmx2g -Xms2g -XX:+AlwaysPreTouch
   Warmup: 2 x 2s, Measurement: 3 x 2s, Fork: 1
   Mode: Throughput (ops/s, higher is better)
   ```
   
   Run with:
   ```
   java --add-modules jdk.incubator.vector -Xmx2g -Xms2g -XX:+AlwaysPreTouch \
     --module-path lucene/benchmark-jmh/build/benchmarks \
     --module org.apache.lucene.benchmark.jmh NumericDocValuesBulkDecodeBenchmark \
     -f 1 -wi 2 -i 3 -p batchSize=1024
   ```
   
   **AMD EPYC 7R32 (c5a.xlarge, AVX2)**
   
   | BPV | Baseline (ops/s) | Bulk decode (ops/s) | Change |
   |-----|-----------------|---------------------|--------|
   | 8   | 702,323 | 745,447 | +6% |
   | 16  | 686,010 | 939,773 | +37% |
   | 24  | 603,410 | 747,713 | +24% |
   | 32  | 742,785 | 1,148,963 | **+55%** |
   | 40  | 640,131 | 667,380 | +4% |
   | 48  | 563,024 | 725,941 | +29% |
   | 56  | 555,943 | 744,923 | +34% |
   | 64  | 624,181 | 817,362 | +31% |
   
   **Intel Xeon 8375C (c6i.xlarge, AVX-512)**
   
   | BPV | Baseline (ops/s) | Bulk decode (ops/s) | Change |
   |-----|-----------------|---------------------|--------|
   | 8   | 884,786 | 1,047,888 | +18% |
   | 16  | 875,663 | 900,786 | +3% |
   | 24  | 746,087 | 874,352 | +17% |
   | 32  | 850,549 | 1,297,329 | **+53%** |
   | 40  | 776,460 | 792,153 | +2% |
   | 48  | 684,416 | 854,355 | +25% |
   | 56  | 684,192 | 839,406 | +23% |
   | 64  | 686,372 | 842,594 | +23% |
   
   Baseline = strided access (per-doc `DirectReader.get()`), equivalent to the 
unpatched code path. Bulk decode = contiguous access triggering the new bulk 
read path. Both use the default (scalar) provider.
   
   ### Scalar vs Panama SIMD
   
   The PR includes a Panama Vector API implementation 
(`PanamaDocValuesBulkDecodeSupport`) following the pattern from #16050. 
However, benchmarks show the scalar path auto-vectorizes effectively and Panama 
adds no consistent benefit:
   
   | BPV | Scalar bulk (ops/s) | Panama SIMD (ops/s) | Delta |
   |-----|---------------------|---------------------|-------|
   | 32  | 1,297,329 | 1,303,320 | +0.5% |
   | 64  | 842,594 | 841,012 | -0.2% |
   | 24  | 874,352 | 1,000,915 | +14% |
   
   (Intel Xeon 8375C, AVX-512. Non-power-of-two widths like BPV=24 show some 
Panama benefit due to the odd-stride read pattern being harder for the JIT to 
auto-vectorize.)
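
To make the "odd-stride" point concrete, here is a minimal sketch of what a scalar 24-bit decode loop might look like. It is not the PR's code: the class name, method signature, and the little-endian 3-bytes-per-value layout are all assumptions made for illustration.

```java
// Hypothetical sketch, not the PR's code. Assumes each value occupies
// 3 consecutive bytes in little-endian order.
final class Decode24Sketch {

  /** Decodes count 24-bit values from packed into dst[0..count). */
  static void decode24(byte[] packed, long[] dst, int count) {
    // The source offset advances by 3 bytes per value, so loads do not line
    // up with 4- or 8-byte lanes the way they do for BPV=32 or BPV=64.
    for (int i = 0, off = 0; i < count; i++, off += 3) {
      dst[i] = (packed[off] & 0xFFL)
          | (packed[off + 1] & 0xFFL) << 8
          | (packed[off + 2] & 0xFFL) << 16;
    }
  }
}
```

With a 3-byte stride, a vectorized version has to shuffle bytes into lanes before widening, which is plausibly where an explicit Vector API implementation can do better than the auto-vectorizer.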
   
   Unless reviewers see value in keeping the Panama path, I plan to remove it 
before merging to reduce complexity.
   
   ### C2 assembly (decode32, Intel Ice Lake, AVX-512)
   
   The hot loop of the scalar `decode32` method after C2 compilation:
   
   ```asm
   vmovdqu  0x10(%rsi,%rbp,1),%ymm2      ; load 8 ints (256-bit)
   vpmovsxdq %ymm2,%zmm2                  ; sign-extend 8 x int32 → 8 x int64 (512-bit)
   vpandd   %zmm2,%zmm1,%zmm2             ; AND with mask to clear the sign-extended upper bits (zero-extend)
   vmovdqu32 %zmm2,0x10(%rcx,%rbx,8)     ; store 8 longs (512-bit)
   ; repeated 4x unrolled → 32 values per iteration
   ```
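
For orientation, a Java loop of roughly the following shape would be expected to compile to the sign-extend-then-mask sequence shown above. This is a sketch only: the PR's actual `decode32` signature is not shown in this message, and the `int[]` source used here is a simplification of the byte-array read that the description mentions.

```java
// Hypothetical sketch, not the PR's code: an int[] source stands in for the
// packed byte array to keep the example short.
final class Decode32Sketch {

  /** Widens count unsigned 32-bit values from src into dst[0..count). */
  static void decode32(int[] src, long[] dst, int count) {
    for (int i = 0; i < count; i++) {
      // int -> long conversion sign-extends; the mask then clears the upper
      // 32 bits so the value is treated as unsigned.
      dst[i] = src[i] & 0xFFFFFFFFL;
    }
  }
}
```

The loop body has no cross-iteration dependency and a fixed stride, which is the kind of pattern the JIT's auto-vectorizer handles well.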


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

