scovich commented on PR #10157: URL: https://github.com/apache/arrow-rs/pull/10157#issuecomment-4833641167
Ah, after re-reading I realize there's a fifth level of aggressiveness (0/ - no type changes whatsoever), and _that's_ the proposed new behavior of `as_xxx` methods? In that case it does make sense to eliminate `as_uxx` methods because, as you say, those types don't exist in the underlying variant and so the methods would always return `None`. They only make sense if we choose 1/ or 2/.

One reason for choosing 1/ or 2/ is that it increases the effectiveness of shredding. For example, when shredding as i32 it's annoying for small i64 values to not shred when they fit in i32, and even more annoying for i8 and i16 values to not shred (they _always_ fit in i32). Especially when you observe that variant is really just JSON, and JSON just has a generic numeric type. It's completely implementation-defined which of the "exact numeric" equivalence class types the JSON literal `42` parses to (`int8`, `int16`, `int32`, `int64`, `decimal4`, `decimal8`, or `decimal16`).

And the [variant spec](https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#encoding-types) states that:

> The Equivalence Class column [of the _Variant primitive types_ table] indicates logical equivalence of physically encoded types. For example, a user expression operating on a string value containing "hello" should behave the same, whether it is encoded with the short string optimization, or long string encoding. Similarly, user expressions operating on an int8 value of 1 should behave the same as a decimal16 with scale 2 and unscaled value 100.

Furthermore, the [JSON spec](https://datatracker.ietf.org/doc/html/rfc7159#section-6) encourages implementations to treat numeric types as IEEE 754 double precision values, so we have to consider `double` as well:

> This specification allows implementations to set limits on the range and precision of numbers accepted.
> Since software that implements IEEE 754-2008 binary64 (double precision) numbers is generally available and widely used, good interoperability can be achieved by implementations that expect no more precision or range than these provide, in the sense that implementations will approximate JSON numbers within the expected precision.
> ...
> Note that when such software is used, numbers that are integers and are in the range `[-(2**53)+1, (2**53)-1]` are interoperable in the sense that implementations will agree exactly on their numeric values.

(I'm not sure why/when we'd ever see a `float` value, but we may as well handle it if we have to handle `double` anyway.)

All in all, the above suggests that `as_xxx` methods should take approach 2/ in order to maximize conformance with both variant and JSON specs while also minimizing surprise to users. But there's plenty of room for varying opinions, e.g.:
* What, exactly, counts as a "user expression" in the variant spec
* Whether we _really_ care about the JSON spec's suggested 53 bits of precision, when variant specifically includes a 128-bit decimal type (`decimal16`)
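
To make the value-preserving conversion argued for above concrete, here is a minimal sketch of an `as_i32`-style accessor that succeeds whenever the stored number is exactly representable as an `i32`, regardless of which physical type it was encoded with. Everything in it is an assumption for illustration: `Num` is a hypothetical stand-in and not the crate's actual `Variant` type, and `as_i32_lossless` and `MAX_SAFE_INT` are made-up names, not existing or proposed API.

```rust
/// Hypothetical stand-in for a variant's numeric payload (illustration only;
/// this is NOT the real `Variant` enum).
#[derive(Debug, Clone, Copy)]
enum Num {
    Int8(i8),
    Int16(i16),
    Int32(i32),
    Int64(i64),
    /// Unscaled integer plus a base-10 scale, e.g. (100, 2) means 1.00.
    Decimal { unscaled: i128, scale: u32 },
    Double(f64),
}

/// 2^53 - 1: the bound of the integer range that RFC 7159 calls interoperable
/// for binary64 (double precision) numbers.
const MAX_SAFE_INT: f64 = 9_007_199_254_740_991.0;

/// Returns the value as `i32` only when the conversion loses no information.
fn as_i32_lossless(n: Num) -> Option<i32> {
    match n {
        // Narrower integers always fit, so widening cannot fail.
        Num::Int8(v) => Some(i32::from(v)),
        Num::Int16(v) => Some(i32::from(v)),
        Num::Int32(v) => Some(v),
        // Wider integers fit only when the value is within i32's range.
        Num::Int64(v) => i32::try_from(v).ok(),
        // A decimal converts only if it has no fractional part.
        Num::Decimal { unscaled, scale } => {
            let divisor = 10i128.checked_pow(scale)?;
            if unscaled % divisor != 0 {
                return None;
            }
            i32::try_from(unscaled / divisor).ok()
        }
        // A double converts only if it is an integer in the exact range.
        // NaN and infinities are rejected by the `fract` check.
        Num::Double(v) => {
            if v.fract() != 0.0 || v.abs() > MAX_SAFE_INT {
                return None;
            }
            // `v` is integral and within ±(2^53 - 1), so this cast is exact.
            i32::try_from(v as i64).ok()
        }
    }
}
```

Under these assumptions, `Num::Int8(1)`, `Num::Int64(1)`, `Num::Double(1.0)` and `Num::Decimal { unscaled: 100, scale: 2 }` all yield `Some(1)`, which matches the spec's example of an int8 value of 1 behaving the same as a decimal16 with scale 2 and unscaled value 100. Values that would need rounding or truncation yield `None` instead.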
