scovich commented on issue #7715:
URL: https://github.com/apache/arrow-rs/issues/7715#issuecomment-3059224365

   
   > What should the "path" argument be? A String? A JSON path? Some structured thing (`Vec`)?
   
   I tried a bit of prototyping yesterday and ran into exactly the same question. I actually wonder whether the low-level library needs to do this at all? My local prototype ended up with the following new method on `Variant`:
   ```rust
    /// Returns the value of the named field if `self` is an object
    /// containing it; otherwise `None`.
    pub fn get_object_field(&self, field_name: &str) -> Option<Self> {
        match self {
            Variant::Object(object) => object.get(field_name),
            _ => None,
        }
    }
   ```
   (plus the analogous `get_array_element` method)
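   To make the idea concrete, here is a self-contained sketch: the toy `Variant` enum below is only a stand-in for the real arrow-rs type (which wraps metadata/value buffers rather than owned Rust data), but it shows how the two accessors compose to drill down without any path type at all:
   ```rust
   // Toy stand-in for the real arrow-rs `Variant`; illustration only.
   #[derive(Clone, Debug, PartialEq)]
   enum Variant {
       Int(i64),
       List(Vec<Variant>),
       Object(Vec<(String, Variant)>),
   }

   impl Variant {
       // Mirrors the prototype method above, adapted to the toy type.
       fn get_object_field(&self, field_name: &str) -> Option<Self> {
           match self {
               Variant::Object(fields) => fields
                   .iter()
                   .find(|(name, _)| name == field_name)
                   .map(|(_, value)| value.clone()),
               _ => None,
           }
       }

       // The analogous array accessor mentioned above.
       fn get_array_element(&self, index: usize) -> Option<Self> {
           match self {
               Variant::List(items) => items.get(index).cloned(),
               _ => None,
           }
       }
   }

   fn main() {
       let v = Variant::Object(vec![(
           "scores".to_string(),
           Variant::List(vec![Variant::Int(1), Variant::Int(2)]),
       )]);
       // Drill down to v["scores"][1] by composing the two accessors.
       let item = v
           .get_object_field("scores")
           .and_then(|list| list.get_array_element(1));
       assert_eq!(item, Some(Variant::Int(2)));
       println!("ok");
   }
   ```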
   
   At least for the moment, I suspect those would be sufficient building blocks 
for engines to drill down however they like. 
   
   For example, those are the sorts of methods an engine would rely on when it co-recurses into a shredding schema and a variant value; it wouldn't actually _have_ a path to supply in that case.
   
   That said, an array-builder style implementation would likely find it a lot more convenient to convert the schema into a list of builders, with each leaf builder holding its own variant path. That means more string traversals (unfortunate), but probably better code regularity and cache locality, which should at least partly compensate.
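   As a sketch of that builder-oriented shape: `PathStep`, `follow_path`, and the toy `Variant` below are all invented names for illustration (none of this is arrow-rs API), but they show how each leaf builder could hold a precomputed path and walk it per row:
   ```rust
   // Toy stand-in for the real arrow-rs `Variant`; illustration only.
   #[derive(Clone, Debug, PartialEq)]
   enum Variant {
       Int(i64),
       List(Vec<Variant>),
       Object(Vec<(String, Variant)>),
   }

   // Hypothetical per-leaf path representation an engine might precompute
   // when converting a shredding schema into a list of builders.
   enum PathStep {
       Field(String),
       Index(usize),
   }

   // Walks one precomputed path, step by step, using only the kind of
   // low-level field/element accessors discussed above.
   fn follow_path(root: &Variant, path: &[PathStep]) -> Option<Variant> {
       let mut current = root.clone();
       for step in path {
           let next = match (step, &current) {
               (PathStep::Field(name), Variant::Object(fields)) => fields
                   .iter()
                   .find(|(n, _)| n == name)
                   .map(|(_, v)| v.clone())?,
               (PathStep::Index(i), Variant::List(items)) => items.get(*i).cloned()?,
               _ => return None, // path does not match the value's shape
           };
           current = next;
       }
       Some(current)
   }

   fn main() {
       let row = Variant::Object(vec![(
           "measurements".to_string(),
           Variant::List(vec![Variant::Int(7)]),
       )]);
       let path = [
           PathStep::Field("measurements".to_string()),
           PathStep::Index(0),
       ];
       assert_eq!(follow_path(&row, &path), Some(Variant::Int(7)));
       println!("ok");
   }
   ```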
   
   > Should we also provide a "requested data type" field? Similar to the Databricks function
   
   At first I was strongly against that idea (let the engine worry about the complexity of casting semantics).
   
   Then I realized that an engine can still control the casting very easily, by extracting a variant column and then processing the result however it sees fit (basically, handling the "get" and "cast" steps separately).
   
   There's also a certain appeal to having the casting built right in (variant isn't as cheap to navigate as strongly typed values), especially for widening casts, which should hopefully be uncontroversial; lossless narrowing casts could also make sense given the loose semantics of JSON.
   
   The problem is that every engine has its own specific ideas about which casting semantics are "correct," and this quickly becomes a slippery slope. Spark, for example, is _extremely_ permissive: its cast function will happily truncate non-integers to ints, convert anything to string, and even try to string-parse values into the requested type (which gets murky for, e.g., timestamps).
   
   Worse, a literal interpretation of the variant spec suggests that readers are _required_ to cast interchangeably between all types of the same "[equivalence class](https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#encoding-types)" -- e.g., one could argue that somebody who requests int8 _must_ receive back zeros for any decimal16(scale=38) values that happen to live there, because the two types share an equivalence class. That seems broken/wrong to me (severe information loss, and float/double aren't even in that equivalence class), but the spec kind of says it ☹️ .
   
   Overall, I would favor a `variant_get` that does "safe" casts only 
(widening, or lossless narrowing), as a compromise. Many engines will hopefully 
like that sane default, and those that find it too strict or too loose can 
always request variant and cast the result however they like.
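   The "safe casts only" compromise could look something like this minimal sketch (function names invented for illustration; this is not the arrow-rs `variant_get` API):
   ```rust
   // Widening cast: always lossless, so it always succeeds.
   fn widen_i8_to_i64(x: i8) -> i64 {
       i64::from(x)
   }

   // Lossless narrowing only: succeed iff the value fits. A Spark-style
   // permissive cast would truncate or wrap here instead of refusing.
   fn narrow_i64_to_i8(x: i64) -> Option<i8> {
       i8::try_from(x).ok()
   }

   fn main() {
       assert_eq!(widen_i8_to_i64(42), 42_i64);
       assert_eq!(narrow_i64_to_i8(42), Some(42_i8));
       assert_eq!(narrow_i64_to_i8(1_000), None); // lossy: rejected
       println!("ok");
   }
   ```
   Engines that want looser semantics would request variant and apply their own cast, as described above.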

