Re: [PR] [Variant] Fix the variant shred logic [arrow-rs]

via GitHub Mon, 29 Jun 2026 07:25:10 -0700


scovich commented on PR #10157:
URL: https://github.com/apache/arrow-rs/pull/10157#issuecomment-4833641167

Ah, after re-reading I realize there's a fifth level of aggressiveness (0/ -
no type changes whatsoever), and _that's_ the proposed new behavior of `as_xxx`
methods? In that case it does make sense to eliminate `as_uxx` methods because
as you say those don't exist in the underlying variant and so would always
return None. They only make sense if we choose 1/ or 2/.

One reason for choosing 1/ or 2/ is that it increases the effectiveness of
shredding. For example, when shredding as i32 it's annoying for small i64
values to not shred when they fit in i32, and even more annoying for i8 and i16
values to not shred (they _always_ fit in i32). Especially when you observe
that variant is really just JSON, and JSON just has a generic numeric type.
It's completely implementation-defined which of the "exact numeric" equivalence
class types the JSON literal `42` parses to (`int8`, `int16`, `int32`, `int64`,
`decimal4`, `decimal8`, or `decimal16`). And the [variant
spec](https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#encoding-types)
state that:

> The Equivalence Class column [of the _Variant primitive types_ table]
indicates logical equivalence of physically encoded types. For example, a user
expression operating on a string value containing "hello" should behave the
same, whether it is encoded with the short string optimization, or long string
encoding. Similarly, user expressions operating on an int8 value of 1 should
behave the same as a decimal16 with scale 2 and unscaled value 100.

Furthermore, the [JSON
spec](https://datatracker.ietf.org/doc/html/rfc7159#section-6) encourages
implementations to treat numeric types as IEEE 754 double precision values, so
we have to consider `double` as well:

> This specification allows implementations to set limits on the range and
precision of numbers accepted. Since software that implements IEEE 754-2008
binary64 (double precision) numbers is generally available and widely used,
good interoperability can be achieved by implementations that expect no more
precision or range than these provide, in the sense that implementations will
approximate JSON numbers within the expected precision.
> ...
> Note that when such software is used, numbers that are integers and are in
the range `[-(2**53)+1, (2**53)-1]` are interoperable in the sense that
implementations will agree exactly on their numeric values.

(I'm not sure why/when we'd ever see a `float` value, but we may as well
handle it if we anyway have to handle `double`)

All in all, the above suggests that `as_xxx` methods should take approach 2/
in order to maximize conformance with both variant and JSON specs while also
minimizing surprise to users. But there's plenty of room for varying opinions,
e.g.:
* What, exactly, counts as a "user expression" in the variant spec
* Whether we _really_ care about the JSON spec's suggested 53 bits of
precision, when variant specifically includes a 256-bit decimal type

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] [Variant] Fix the variant shred logic [arrow-rs]

Reply via email to