scovich commented on issue #7715: URL: https://github.com/apache/arrow-rs/issues/7715#issuecomment-3059224365
> What should the "path" argument be? A String? A JSON path? Some structured thing (`Vec`)?

I tried a bit of prototyping yesterday and ran into exactly the same question. I actually wonder if the low-level library needs to do this at all? My local prototype ended up with the following new method on `Variant`:
```rust
pub fn get_object_field(&self, field_name: &str) -> Option<Self> {
    match self {
        Variant::Object(object) => object.get(field_name),
        _ => None,
    }
}
```
(plus the analogous `get_array_element` method)

At least for the moment, I suspect those would be sufficient building blocks for engines to drill down however they like. For example, those are the sorts of methods an engine would rely on if it co-recurses into a shredding schema and variant. It wouldn't actually _have_ a path to supply in such cases.

That said, an array builder type implementation would likely find it a lot more convenient to convert the schema to a list of builders, with each leaf builder holding its own variant path. That means more string traversals (unfortunate), but probably better code regularity and cache locality, which compensate at least partly for the cost.

> Should we also provide a "requested data type" field? Similar to the Databricks function

At first I was strongly against that idea (let the engine worry about the complexity of casting semantics). Then I realized that an engine can control the casting very easily by extracting a variant column and then processing the result however it sees fit (basically, handling the "get" and "cast" steps separately).

There's also a certain appeal to having the casting built right in (variant isn't as cheap to navigate as strongly typed values), especially for widening casts, which should hopefully be uncontroversial; lossless narrowing casts could also make sense given the loose semantics of JSON. The problem is that every engine has its own specific ideas of what casting semantics are "correct", and this quickly becomes a slippery slope.
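To make the "building blocks instead of paths" idea concrete, here is a minimal sketch using a simplified stand-in for `Variant` (the real arrow-rs type borrows from encoded metadata/value buffers, but the navigation pattern is the same). `PathElement` and `get_path` are hypothetical names showing how an engine could layer path traversal on top of the two primitive accessors:

```rust
// Simplified stand-in for Variant; NOT the real arrow-rs representation.
#[derive(Debug, Clone, PartialEq)]
enum Variant {
    String(String),
    Object(Vec<(String, Variant)>),
    Array(Vec<Variant>),
}

// Hypothetical engine-side path step: either an object field or array index.
enum PathElement<'a> {
    Field(&'a str),
    Index(usize),
}

impl Variant {
    fn get_object_field(&self, field_name: &str) -> Option<&Self> {
        match self {
            Variant::Object(fields) => fields
                .iter()
                .find(|(name, _)| name.as_str() == field_name)
                .map(|(_, value)| value),
            _ => None, // not an object
        }
    }

    fn get_array_element(&self, index: usize) -> Option<&Self> {
        match self {
            Variant::Array(items) => items.get(index),
            _ => None, // not an array
        }
    }

    /// Drill down a whole path by chaining the two primitive accessors,
    /// stopping at the first miss.
    fn get_path(&self, path: &[PathElement]) -> Option<&Self> {
        path.iter().try_fold(self, |current, step| match step {
            PathElement::Field(name) => current.get_object_field(name),
            PathElement::Index(i) => current.get_array_element(*i),
        })
    }
}

fn main() {
    // Equivalent of the JSON value {"tags": ["a", "b"]}
    let v = Variant::Object(vec![(
        "tags".to_string(),
        Variant::Array(vec![
            Variant::String("a".to_string()),
            Variant::String("b".to_string()),
        ]),
    )]);
    let hit = v.get_path(&[PathElement::Field("tags"), PathElement::Index(1)]);
    assert_eq!(hit, Some(&Variant::String("b".to_string())));
    // A path miss yields None rather than an error.
    assert!(v.get_path(&[PathElement::Field("missing")]).is_none());
}
```

The point of the sketch is that `get_path` lives entirely in engine code; the library only needs to expose the two single-step accessors.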
Spark, for example, is _extremely_ permissive. Its cast function will happily truncate non-integers to ints, convert anything to string, and even try to string-parse values to the requested type (which gets murky for e.g. timestamps).

Worse, a literal interpretation of the variant spec suggests that readers are _required_ to cast interchangeably between all types of the same "[equivalence class](https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#encoding-types)" -- e.g. one could argue that somebody who requests int8 _must_ receive back zeros for any decimal16(scale=38) values that might have lived there, because they're in the same equivalence class. That seems broken/wrong to me (severe information loss, float/double aren't in that equivalence class, etc.), but the spec kind of says it ☹️.

Overall, I would favor a `variant_get` that does "safe" casts only (widening, or lossless narrowing) as a compromise. Many engines will hopefully like that sane default, and those that find it too strict or too loose can always request variant and cast the result however they like.
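A rough sketch of what the "safe casts only" compromise could mean in practice: widening always succeeds, narrowing succeeds only when the value round-trips losslessly, and anything else yields `None` so the engine has to cast explicitly. The function names here are hypothetical, not an existing arrow-rs API:

```rust
// Widening is always safe and infallible.
fn safe_cast_i8_to_i64(value: i8) -> i64 {
    i64::from(value)
}

// Lossless narrowing: succeeds for -128..=127, None otherwise
// (rather than Spark-style silent truncation).
fn safe_cast_i64_to_i8(value: i64) -> Option<i8> {
    i8::try_from(value).ok()
}

// Narrow a float only if the value survives a round trip unchanged.
fn safe_cast_f64_to_f32(value: f64) -> Option<f32> {
    let narrowed = value as f32;
    if f64::from(narrowed) == value || value.is_nan() {
        Some(narrowed)
    } else {
        None
    }
}

fn main() {
    assert_eq!(safe_cast_i8_to_i64(-5), -5);
    assert_eq!(safe_cast_i64_to_i8(42), Some(42));
    assert_eq!(safe_cast_i64_to_i8(300), None); // would truncate: rejected
    assert_eq!(safe_cast_f64_to_f32(1.5), Some(1.5)); // exactly representable
    assert_eq!(safe_cast_f64_to_f32(0.1), None); // loses precision: rejected
}
```

Under these semantics, the decimal16(scale=38)-to-int8 case from the equivalence-class reading above would simply return `None`/null instead of zeros.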