cloud-fan opened a new pull request, #56104:
URL: https://github.com/apache/spark/pull/56104
### What changes were proposed in this pull request?
Introduce a generic `BinaryView` physical-value class that holds a
non-owning pointer to a contiguous chunk of bytes living either on-heap or
off-heap, and migrate the GEOMETRY and GEOGRAPHY types onto it.
`BinaryView` is modelled on `UTF8String`'s `(Object base, long offset, int
numBytes)` shape so its accessors plug directly into existing
`Platform.copyMemory` / `Platform.get*` call sites. Lifetime discipline matches
`UTF8String`: callers that need to retain a value past the source buffer's
lifetime must call `copy()`.
Spark already separates logical type from physical type. GEOMETRY and
GEOGRAPHY are different logical types but share the same physical layout: an
opaque chunk of bytes. So they fold into a single physical type:
- Delete `PhysicalGeometryType` and `PhysicalGeographyType`. Add
`PhysicalBinaryViewType` whose `InternalType` is `BinaryView`. Both
`GeometryType` and `GeographyType` map to it.
- Delete `GeometryVal` and `GeographyVal` (they were `byte[]` marker
wrappers).
- `SpecializedGetters.getGeometry` / `getGeography` collapse into one
`getBinaryView(int)`, mirroring how `getUTF8String` is the single accessor
regardless of `StringType` / `CharType` / `VarcharType`.
- `UnsafeWriter.write(int, GeometryVal/GeographyVal)` collapses into
`write(int, BinaryView)`.
- `STUtils` overloads that previously dispatched on `GeometryVal` vs
`GeographyVal` are renamed to explicit `stGeom*` / `stGeog*` pairs; the ST
expressions pick the right variant from the input's logical `DataType` at
runtime-replacement time.
The on-disk `UnsafeRow` layout is unchanged.
Zero-copy reads (the actual perf win):
- `UnsafeRow.getBinaryView` and `UnsafeArrayData.getBinaryView` now do
`BinaryView.fromAddress(baseObject, baseOffset + offset, size)` instead of
`getBinary()` + `fromBytes()` — drops one `byte[]` allocation +
`Platform.copyMemory` per read, exactly mirroring `getUTF8String`.
- `UnsafeWriter.write(int, BinaryView)` writes via `(getBaseObject,
getBaseOffset, numBytes)` instead of `getBytes()`.
- The three `copy()` methods that materialize a `GenericInternalRow` from a
columnar source (`ColumnarRow`, `ColumnarBatchRow`, `MutableColumnarRow`) now
call `.copy()` on the `BinaryView` defensively, matching the existing
`getUTF8String(i).copy()` discipline on the line right above.
### Why are the changes needed?
1. Today reading GEOMETRY / GEOGRAPHY out of an `UnsafeRow` or
`UnsafeArrayData` allocates a fresh `byte[]` and `Platform.copyMemory`s the
bytes into it, even though `UTF8String` shows the zero-copy view pattern is
already established in the same code paths.
2. `GeometryVal` and `GeographyVal` were marker classes whose only content
was forwarding to `byte[]`. Once `BinaryView` exists, keeping them as separate
types — and keeping separate `PhysicalGeometryType` / `PhysicalGeographyType`
that differ only in the wrapper class — is circular: the physical layer
distinguishes them solely because the Java type system distinguished them.
3. Same shape as the rest of Spark: `StringType` / `CharType` /
`VarcharType` → `PhysicalStringType` → `UTF8String`. Geo had been the odd one
out.
### Does this PR introduce _any_ user-facing change?
No. GEOMETRY and GEOGRAPHY are still unreleased, and the user-facing
`org.apache.spark.sql.types.Geometry` / `org.apache.spark.sql.types.Geography`
APIs are unchanged. Only physical-layer internals move.
### How was this patch tested?
- New `BinaryViewSuite` (13 tests) covering on-heap, off-heap (via
`MemoryAllocator.UNSAFE`), slice, copy independence, primitive readers,
unsigned lexicographic `compareTo`, `equals` / `hashCode` across heap and
off-heap, `ByteBuffer` round-trip on both paths, `writeToMemory`, and Java +
Kryo serialization round-trips.
- `unsafe/checkstyle` clean. ASCII-only spot-check on new files clean.
- Existing `GeometryValSuite` and `GeographyValSuite` were deleted (the
classes are gone). Their assertions (other than the deliberately-throwing
`compareTo`) are subsumed by `BinaryViewSuite` and the existing
`GeometryExecutionSuite` / `GeographyExecutionSuite` / `STUtilsSuite` /
`CatalystTypeConvertersSuite` / `LiteralExpressionSuite` /
`UnsafeRowWriterSuite` / `ArrowWriterSuite` / `GenerateUnsafeProjectionSuite` /
`ParquetDelta{Byte,Length}ArrayEncodingSuite` / `Geo{metry,graphy}TypeSuite`
which have all been updated for the new API.
### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude code (Opus 4.7)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]