iemejia opened a new issue, #3495:
URL: https://github.com/apache/parquet-java/issues/3495

   ### Describe the enhancement requested
   
   `PlainValuesWriter` (used for PLAIN-encoded INT32, INT64, FLOAT, DOUBLE, and BINARY
   columns) currently writes each value through two layers of abstraction:
   
   ```
   PlainValuesWriter -> LittleEndianDataOutputStream -> CapacityByteArrayOutputStream
   ```
   
   On each `writeInt()` call, `LittleEndianDataOutputStream` decomposes the int into
   4 bytes in a temporary `writeBuffer[8]` array and calls `out.write(writeBuffer, 0, 4)`,
   which dispatches through the `OutputStream` chain into `CapacityByteArrayOutputStream`.
   That path performs:
   
   - 4 byte-shift operations for little-endian decomposition
   - 1 intermediate `writeBuffer[8]` array write
   - 2 levels of virtual dispatch
   - 1 bounds check in `write(byte[], off, len)`
   - 1 `System.arraycopy` for 4 bytes
   
   Since `CapacityByteArrayOutputStream` already buffers into `ByteBuffer` slabs
   internally, the entire chain can be collapsed into a single `ByteBuffer.putInt()`
   call, which is a HotSpot intrinsic that compiles to a single unaligned store on
   x86/ARM when the buffer is in `LITTLE_ENDIAN` order.
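
To illustrate why the collapse is safe, here is a minimal sketch showing that manual little-endian decomposition and `ByteBuffer.putInt()` on a `LITTLE_ENDIAN` buffer produce the same bytes. The class and method names are hypothetical and only approximate what `LittleEndianDataOutputStream` does; this is not the parquet-java code.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

// Hypothetical illustration class, not part of parquet-java.
final class LittleEndianEquivalence {

    // Old-style path (approximation): shift out 4 bytes, lowest byte first,
    // into a scratch array, then copy the 4 used bytes out.
    static byte[] manualLittleEndian(int v) {
        byte[] writeBuffer = new byte[8]; // scratch array, only 4 bytes used here
        writeBuffer[0] = (byte) v;
        writeBuffer[1] = (byte) (v >>> 8);
        writeBuffer[2] = (byte) (v >>> 16);
        writeBuffer[3] = (byte) (v >>> 24);
        return Arrays.copyOfRange(writeBuffer, 0, 4);
    }

    // Proposed-style path: one putInt() on a little-endian buffer.
    static byte[] viaByteBuffer(int v) {
        ByteBuffer slab = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN);
        slab.putInt(v);
        return slab.array();
    }

    public static void main(String[] args) {
        int v = 0x12345678;
        // Both paths should yield the bytes 78 56 34 12.
        System.out.println(Arrays.equals(manualLittleEndian(v), viaByteBuffer(v)));
    }
}
```

Note that a freshly allocated `ByteBuffer` defaults to `BIG_ENDIAN`, which is why the byte order has to be set explicitly on each slab.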
   
   ### Proposal
   
   1. In `CapacityByteArrayOutputStream`:
      - Set `ByteOrder.LITTLE_ENDIAN` on newly allocated slabs in `addSlab()`.
      - Add `writeInt(int)` and `writeLong(long)` methods that call
        `currentSlab.putInt(v)` / `currentSlab.putLong(v)` directly, with a
        single remaining-check that grows the slab if needed.
   
   2. In `PlainValuesWriter`:
      - Remove the `LittleEndianDataOutputStream` field entirely.
      - `writeInteger(v)` -> `arrayOut.writeInt(v)`
      - `writeLong(v)` -> `arrayOut.writeLong(v)`
      - `writeFloat(v)` -> `arrayOut.writeInt(Float.floatToIntBits(v))`
      - `writeDouble(v)` -> `arrayOut.writeLong(Double.doubleToLongBits(v))`
      - `writeBytes(Binary v)` -> `arrayOut.writeInt(v.length()); v.writeTo(arrayOut);`
      - `getBytes()` no longer needs to flush a buffering layer.
      - `close()` no longer closes the defunct stream.
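
The two steps above can be sketched roughly as follows. This is a simplified, hypothetical stand-in (`SlabWriterSketch`) and not the real `CapacityByteArrayOutputStream`: it assumes a single heap buffer that grows by copying, whereas the real class allocates additional slabs, and it takes a `byte[]` where `PlainValuesWriter` takes a `Binary`.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch of the proposed write path; names are illustrative only.
final class SlabWriterSketch {
    private ByteBuffer currentSlab;

    SlabWriterSketch(int initialCapacity) {
        // Little-endian order is set once, when the slab is allocated.
        currentSlab = ByteBuffer.allocate(initialCapacity).order(ByteOrder.LITTLE_ENDIAN);
    }

    // The single remaining-check. Growth here copies into a larger buffer,
    // which is a simplification of adding a new slab.
    private void ensureRemaining(int needed) {
        if (currentSlab.remaining() >= needed) {
            return;
        }
        int newCapacity = Math.max(currentSlab.capacity() * 2, currentSlab.position() + needed);
        ByteBuffer bigger = ByteBuffer.allocate(newCapacity).order(ByteOrder.LITTLE_ENDIAN);
        currentSlab.flip();
        bigger.put(currentSlab);
        currentSlab = bigger;
    }

    void writeInt(int v) {
        ensureRemaining(4);
        currentSlab.putInt(v);
    }

    void writeLong(long v) {
        ensureRemaining(8);
        currentSlab.putLong(v);
    }

    // FLOAT and DOUBLE reuse the integer paths via their raw bit patterns.
    void writeFloat(float v) {
        writeInt(Float.floatToIntBits(v));
    }

    void writeDouble(double v) {
        writeLong(Double.doubleToLongBits(v));
    }

    // BINARY: 4-byte little-endian length prefix followed by the raw bytes.
    void writeBytes(byte[] v) {
        writeInt(v.length);
        ensureRemaining(v.length);
        currentSlab.put(v);
    }

    int size() {
        return currentSlab.position();
    }

    byte[] toByteArray() {
        return Arrays.copyOf(currentSlab.array(), currentSlab.position());
    }

    public static void main(String[] args) {
        SlabWriterSketch out = new SlabWriterSketch(8);
        out.writeInt(1);
        out.writeLong(2L);
        out.writeFloat(1.5f);
        out.writeBytes("abc".getBytes(StandardCharsets.UTF_8));
        System.out.println(out.size()); // 4 + 8 + 4 + (4 + 3) = 23
    }
}
```

Because every write lands directly in the slab, there is no intermediate buffering layer left to flush before the bytes are read back.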
   
   What was eliminated per `writeInt` call:
   
   - 4 byte-shift operations for little-endian decomposition
   - 1 intermediate `writeBuffer[8]` array write
   - 2 levels of virtual dispatch
   - 1 bounds check in `write(byte[], off, len)`
   - 1 `System.arraycopy` for 4 bytes
   
   Replaced with:
   
   - 1 remaining-check on the slab `ByteBuffer`
   - 1 `ByteBuffer.putInt()` call (single JVM intrinsic, ~1 store instruction on
     little-endian architectures)
   
   ### Benchmark results
   
   `IntEncodingBenchmark.encodePlain` (100,000 INT32 values per invocation, JMH
   `-wi 3 -i 5 -f 1`):
   
   | Pattern          | Before (ops/s) | After (ops/s) | Improvement |
   |------------------|---------------:|--------------:|------------:|
   | SEQUENTIAL       |     26,817,451 |    52,953,193 | **+97.5% (2.0x)** |
   | RANDOM           |     28,517,312 |    37,774,036 | **+32.5%** |
   | LOW_CARDINALITY  |     28,705,158 |    52,819,678 | **+84.0%** |
   | HIGH_CARDINALITY |     28,595,519 |    37,862,571 | **+32.4%** |
   
   The improvement varies by pattern: SEQUENTIAL and LOW_CARDINALITY see roughly
   1.8x to 2x because the slab `putInt()` path has highly predictable branching
   (slab rarely runs out for sequential writes). RANDOM and HIGH_CARDINALITY still
   see a solid +32% improvement.
   
   The same code path also benefits `writeLong()`, `writeFloat()`, `writeDouble()`,
   and the length prefix written by `writeBytes(Binary)`.
   
   Decode round-trip verified: re-reading the encoded data with `PlainValuesReader`
   produces identical values at ~1.15B ops/s.
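
A round-trip check of this kind can be sketched as below. It is only an illustration: it uses plain `ByteBuffer` reads instead of `PlainValuesReader`, and the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

// Hypothetical round-trip sketch; stands in for encode + PlainValuesReader decode.
final class PlainRoundTripSketch {

    // Encode INT32 values as consecutive 4-byte little-endian words.
    static byte[] encodeInts(int[] values) {
        ByteBuffer buf = ByteBuffer.allocate(values.length * 4).order(ByteOrder.LITTLE_ENDIAN);
        for (int v : values) {
            buf.putInt(v);
        }
        return buf.array();
    }

    // Decode the same layout back into ints.
    static int[] decodeInts(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
        int[] out = new int[bytes.length / 4];
        for (int i = 0; i < out.length; i++) {
            out[i] = buf.getInt();
        }
        return out;
    }

    public static void main(String[] args) {
        int[] values = {0, 1, -1, Integer.MIN_VALUE, Integer.MAX_VALUE};
        int[] decoded = decodeInts(encodeInts(values));
        System.out.println(Arrays.equals(values, decoded));
    }
}
```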
   
   ### Validation
   
   All 573 `parquet-column` tests and 308 `parquet-common` tests pass with the
   change applied.
   
   ### Component(s)
   
   Core

