mbutrovich opened a new issue, #4280:
URL: https://github.com/apache/datafusion-comet/issues/4280
## What is the problem the feature request solves?
`CometPlainVector.getUTF8String(int)` and `CometPlainVector.getBinary(int)`
dereference the offset buffer address on every call:
```java
long offsetBufferAddress = varWidthVector.getOffsetBuffer().memoryAddress();
int offset = Platform.getInt(null, offsetBufferAddress + rowId * 4L);
```
That is one Arrow method call, one buffer dereference, and one
`memoryAddress` call per row, duplicating work `valueBufferAddress` already
avoids (the value buffer address is cached as a `final long` field in the
constructor).
For workloads that read VarChar or VarBinary columns heavily through
`CometPlainVector` (native scan, shuffle, and the JVM UDF dispatch path
introduced in #4267), the per-row overhead is pure bookkeeping.
## Describe the potential solution
Hoist the offset buffer address into a `final long offsetBufferAddress`
field set in the constructor alongside `valueBufferAddress`, and use it
directly in `getUTF8String` / `getBinary`:
```java
private final long offsetBufferAddress;
public CometPlainVector(ValueVector vector, boolean useDecimal128, boolean
isUuid, boolean isReused) {
// ...
if (vector instanceof BaseVariableWidthVector) {
this.offsetBufferAddress = ((BaseVariableWidthVector)
vector).getOffsetBuffer().memoryAddress();
} else {
this.offsetBufferAddress = -1;
}
}
public UTF8String getUTF8String(int rowId) {
if (isNullAt(rowId)) return null;
if (!isBaseFixedWidthVector) {
int offset = Platform.getInt(null, offsetBufferAddress + rowId * 4L);
int length = Platform.getInt(null, offsetBufferAddress + (rowId + 1L) *
4L) - offset;
return UTF8String.fromAddress(null, valueBufferAddress + offset, length);
}
// fixed-width branch unchanged
}
```
Same pattern for `getBinary`. JIT treats the address as a register-cached
constant across the hot loop.
## Additional context
- `valueBufferAddress` already uses this pattern; this proposal extends it
for symmetry and for the same reason it was used there.
- Arrow guarantees `getOffsetBuffer()` returns the same buffer object for
the lifetime of the vector, so caching the address at construction is safe.
- No API surface change. No behavior change. Strictly reduces per-row work
for variable-width reads.
- Benefits every current caller of `CometPlainVector`, not just one code
path.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]