jhrotko commented on PR #949: URL: https://github.com/apache/arrow-java/pull/949#issuecomment-3730857009
Below are the benchmark results comparing the two versions, (1) buffer indexing from [e35ed7c](https://github.com/apache/arrow-java/pull/949/commits/e35ed7c3585e8cf681c4e11aa9d91a7c7b041e55) and (2) a UUID holder that stores two longs (MSB and LSB), across 5 scales (1k, 10k, 100k, 1M, 10M elements) and 7 different operations. There were no major differences, with the exception of the `getWithUuidHolder` test, where version 2 was roughly 60-67% slower. In the table, "Old" is version 1 and "New" is version 2. I can provide full data for all methods.

| Scale | Old (µs/op) | New (µs/op) | Diff (µs) | % Change | New/Old Ratio |
|-------|-------------|-------------|-----------|----------|---------------|
| **1k** | 1.000 | 1.666 | +0.666 | +66.6% | 1.67x slower |
| **10k** | 9.954 | 16.462 | +6.508 | +65.4% | 1.65x slower |
| **100k** | 102.048 | 162.946 | +60.898 | +59.7% | 1.60x slower |
| **1M** | 1043.961 | 1669.155 | +625.194 | +59.9% | 1.60x slower |
| **10M** | 10121.154 | 16718.845 | +6597.691 | +65.2% | 1.65x slower |

The `getWithUuidHolder` regression is due to the different implementation approaches:

**Old (ArrowBuf reference)**:
```java
holder.buffer = getDataBuffer();       // Copy pointer (8 bytes)
holder.start = getStartOffset(index);  // Copy offset (4 bytes)
// Total: 12 bytes copied, NO data reading (deferred work)
```

**New (MSB/LSB longs)**:
```java
holder.mostSigBits = Long.reverseBytes(dataBuffer.getLong(start));       // Read + reverse
holder.leastSigBits = Long.reverseBytes(dataBuffer.getLong(start + 8));  // Read + reverse
// Total: 16 bytes READ + 2 byte-order reversals (immediate work)
```

The old version defers the work (it only stores references), while the new version does the work upfront (it reads and byte-reverses the data on every call -> O(N)).

---

Given the performance difference, I would prefer to implement version 1 with buffer indexing in order to maintain execution performance. Other holder types already take advantage of this, namely Decimal, Varchar and Varbinary.
Given that UUID takes 16 bytes, it could also take advantage of the zero-copy data access design provided by these types.
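
To make the trade-off concrete, here is a minimal, self-contained sketch of the two holder styles. It is an illustration only, not the code from the PR: it uses a little-endian `java.nio.ByteBuffer` as a stand-in for `ArrowBuf`, and the names `UuidHolderSketch`, `RefHolder`, `LongsHolder`, `pack`, `getRef`, `getLongs` and `decode` are hypothetical. The reference-style holder only records where the 16 bytes live and decodes them when asked, while the longs-style holder reads and byte-reverses both halves as soon as it is filled.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.UUID;

// Hypothetical sketch: a plain ByteBuffer stands in for ArrowBuf.
class UuidHolderSketch {

    /** Version 1 style: stores a reference and an offset, reads no data. */
    static final class RefHolder {
        ByteBuffer buffer;
        int start;
    }

    /** Version 2 style: stores the decoded value as two longs. */
    static final class LongsHolder {
        long mostSigBits;
        long leastSigBits;
    }

    /** Packs UUIDs as 16 big-endian bytes each into a little-endian buffer. */
    static ByteBuffer pack(UUID... values) {
        ByteBuffer buf = ByteBuffer.allocate(values.length * 16)
                .order(ByteOrder.LITTLE_ENDIAN);
        for (int i = 0; i < values.length; i++) {
            // Reversing before a little-endian write yields big-endian bytes.
            buf.putLong(i * 16, Long.reverseBytes(values[i].getMostSignificantBits()));
            buf.putLong(i * 16 + 8, Long.reverseBytes(values[i].getLeastSignificantBits()));
        }
        return buf;
    }

    /** Deferred: copies a reference and an offset, touches no payload bytes. */
    static void getRef(ByteBuffer data, int index, RefHolder holder) {
        holder.buffer = data;
        holder.start = index * 16;
    }

    /** Eager: reads 16 bytes and reverses both longs on every call. */
    static void getLongs(ByteBuffer data, int index, LongsHolder holder) {
        int start = index * 16;
        holder.mostSigBits = Long.reverseBytes(data.getLong(start));
        holder.leastSigBits = Long.reverseBytes(data.getLong(start + 8));
    }

    /** The deferred work, paid only by callers that need the value. */
    static UUID decode(RefHolder holder) {
        long msb = Long.reverseBytes(holder.buffer.getLong(holder.start));
        long lsb = Long.reverseBytes(holder.buffer.getLong(holder.start + 8));
        return new UUID(msb, lsb);
    }

    public static void main(String[] args) {
        UUID original = UUID.randomUUID();
        ByteBuffer data = pack(original);

        RefHolder ref = new RefHolder();
        getRef(data, 0, ref);

        LongsHolder longs = new LongsHolder();
        getLongs(data, 0, longs);

        // Both styles recover the same value; they differ in when the read happens.
        System.out.println(decode(ref).equals(original));
        System.out.println(new UUID(longs.mostSigBits, longs.leastSigBits).equals(original));
    }
}
```

In this sketch a caller that only passes the holder along (for example, to copy the value into another vector) never pays for the decode with `RefHolder`, whereas `LongsHolder` pays for it on every `getLongs` call whether or not the longs are used.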
