scovich commented on PR #10157:
URL: https://github.com/apache/arrow-rs/pull/10157#issuecomment-4833641167

   Ah, after re-reading I realize there's a fifth level of aggressiveness (0/ - 
no type changes whatsoever), and _that's_ the proposed new behavior of `as_xxx` 
methods? In that case it does make sense to eliminate `as_uxx` methods because 
as you say those don't exist in the underlying variant and so would always 
return None. They only make sense if we choose 1/ or 2/.
   
   One reason for choosing 1/ or 2/ is that it increases the effectiveness of 
shredding. For example, when shredding as i32 it's annoying for small i64 
values to not shred when they fit in i32, and even more annoying for i8 and i16 
values to not shred (they _always_ fit in i32). Especially when you observe 
that variant is really just JSON, and JSON just has a generic numeric type. 
It's completely implementation-defined which of the "exact numeric" equivalence 
class types the JSON literal `42` parses to (`int8`, `int16`, `int32`, `int64`, 
`decimal4`, `decimal8`, or `decimal16`). And the [variant 
spec](https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#encoding-types)
 state that:
    
   > The Equivalence Class column [of the _Variant primitive types_ table] 
indicates logical equivalence of physically encoded types. For example, a user 
expression operating on a string value containing "hello" should behave the 
same, whether it is encoded with the short string optimization, or long string 
encoding. Similarly, user expressions operating on an int8 value of 1 should 
behave the same as a decimal16 with scale 2 and unscaled value 100.
   
   Furthermore, the [JSON 
spec](https://datatracker.ietf.org/doc/html/rfc7159#section-6) encourages 
implementations to treat numeric types as IEEE 754 double precision values, so 
we have to consider `double` as well:
   
   > This specification allows implementations to set limits on the range and 
precision of numbers accepted.  Since software that implements IEEE 754-2008 
binary64 (double precision) numbers is generally available and widely used, 
good interoperability can be achieved by implementations that expect no more 
precision or range than these provide, in the sense that implementations will 
approximate JSON numbers within the expected precision. 
   > ...
   > Note that when such software is used, numbers that are integers and are in 
the range `[-(2**53)+1, (2**53)-1]` are interoperable in the sense that 
implementations will agree exactly on their numeric values.
   
   (I'm not sure why/when we'd ever see a `float` value, but we may as well 
handle it if we anyway have to handle `double`)
   
   All in all, the above suggests that `as_xxx` methods should take approach 2/ 
in order to maximize conformance with both variant and JSON specs while also 
minimizing surprise to users. But there's plenty of room for varying opinions, 
e.g.:
   * What, exactly, counts as a "user expression" in the variant spec
   * Whether we _really_ care about the JSON spec's suggested 53 bits of 
precision, when variant specifically includes a 256-bit decimal type


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to