scovich commented on issue #7715:
URL: https://github.com/apache/arrow-rs/issues/7715#issuecomment-3059224365

   
   > What should the "path" argument be? A String? A JSON path? Some structured thing (`Vec`)?
   
   I tried a bit of prototyping yesterday and ran into exactly the same question. I actually wonder whether the low-level library needs to do this at all? My local prototype ended up with the following new method on `Variant`:
   ```rust
    /// Returns the value of the named field if `self` is an object
    /// containing it; otherwise `None`.
    pub fn get_object_field(&self, field_name: &str) -> Option<Self> {
        match self {
            Variant::Object(object) => object.get(field_name),
            _ => None,
        }
    }
   ```
   (plus the analogous `get_array_element` method)
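   To make the idea concrete, here is a self-contained sketch: the toy `Variant` enum below is only a stand-in for the real arrow-rs type (which wraps metadata/value buffers rather than owned Rust data), but it shows how the two accessors compose to drill down without any path type at all:
   ```rust
   // Toy stand-in for the real arrow-rs `Variant`; illustration only.
   #[derive(Clone, Debug, PartialEq)]
   enum Variant {
       Int(i64),
       List(Vec<Variant>),
       Object(Vec<(String, Variant)>),
   }

   impl Variant {
       // Mirrors the prototype method above, adapted to the toy type.
       fn get_object_field(&self, field_name: &str) -> Option<Self> {
           match self {
               Variant::Object(fields) => fields
                   .iter()
                   .find(|(name, _)| name == field_name)
                   .map(|(_, value)| value.clone()),
               _ => None,
           }
       }

       // The analogous array accessor mentioned above.
       fn get_array_element(&self, index: usize) -> Option<Self> {
           match self {
               Variant::List(items) => items.get(index).cloned(),
               _ => None,
           }
       }
   }

   fn main() {
       let v = Variant::Object(vec![(
           "scores".to_string(),
           Variant::List(vec![Variant::Int(1), Variant::Int(2)]),
       )]);
       // Drill down to v["scores"][1] by composing the two accessors.
       let item = v
           .get_object_field("scores")
           .and_then(|list| list.get_array_element(1));
       assert_eq!(item, Some(Variant::Int(2)));
       println!("ok");
   }
   ```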
   
   At least for the moment, I suspect those would be sufficient building blocks 
for engines to drill down however they like. 
   
   For example, those are the sorts of methods an engine would rely on when it co-recurses into a shredding schema and a variant value; it wouldn't actually _have_ a path to supply in that case.
   
   That said, an array-builder style implementation would likely find it a lot more convenient to convert the schema into a list of builders, with each leaf builder holding its own variant path. That means more string traversals (unfortunate), but probably better code regularity and cache locality, which should at least partly compensate.
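   As a sketch of that builder-oriented shape: `PathStep`, `follow_path`, and the toy `Variant` below are all invented names for illustration (none of this is arrow-rs API), but they show how each leaf builder could hold a precomputed path and walk it per row:
   ```rust
   // Toy stand-in for the real arrow-rs `Variant`; illustration only.
   #[derive(Clone, Debug, PartialEq)]
   enum Variant {
       Int(i64),
       List(Vec<Variant>),
       Object(Vec<(String, Variant)>),
   }

   // Hypothetical per-leaf path representation an engine might precompute
   // when converting a shredding schema into a list of builders.
   enum PathStep {
       Field(String),
       Index(usize),
   }

   // Walks one precomputed path, step by step, using only the kind of
   // low-level field/element accessors discussed above.
   fn follow_path(root: &Variant, path: &[PathStep]) -> Option<Variant> {
       let mut current = root.clone();
       for step in path {
           let next = match (step, &current) {
               (PathStep::Field(name), Variant::Object(fields)) => fields
                   .iter()
                   .find(|(n, _)| n == name)
                   .map(|(_, v)| v.clone())?,
               (PathStep::Index(i), Variant::List(items)) => items.get(*i).cloned()?,
               _ => return None, // path does not match the value's shape
           };
           current = next;
       }
       Some(current)
   }

   fn main() {
       let row = Variant::Object(vec![(
           "measurements".to_string(),
           Variant::List(vec![Variant::Int(7)]),
       )]);
       let path = [
           PathStep::Field("measurements".to_string()),
           PathStep::Index(0),
       ];
       assert_eq!(follow_path(&row, &path), Some(Variant::Int(7)));
       println!("ok");
   }
   ```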
   
   > Should we also provide a "requested data type" field? Similar to the Databricks function
   
   At first I was strongly against that idea (let the engine worry about the complexity of casting semantics).
   
   Then I realized that an engine can still control the casting very easily, by extracting a variant column and then processing the result however it sees fit (basically, handling the "get" and "cast" steps separately).
   
   There's also a certain appeal to having the casting built right in (variant isn't as cheap to navigate as strongly typed values), especially for widening casts, which should hopefully be uncontroversial; lossless narrowing casts could also make sense given the loose semantics of JSON.
   
   The problem is that every engine has its own specific ideas about which casting semantics are "correct," and this quickly becomes a slippery slope. Spark, for example, is _extremely_ permissive: its cast function will happily truncate non-integers to ints, convert anything to string, and even try to string-parse values into the requested type (which gets murky for, e.g., timestamps).
   
   Worse, a literal interpretation of the variant spec suggests that readers are _required_ to cast interchangeably between all types of the same "[equivalence class](https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#encoding-types)" -- e.g., one could argue that somebody who requests int8 _must_ receive back zeros for any decimal16(scale=38) values that happen to live there, because the two types share an equivalence class. That seems broken/wrong to me (severe information loss, and float/double aren't even in that equivalence class), but the spec kind of says it ☹️ .
   
   Overall, I would favor a `variant_get` that does "safe" casts only 
(widening, or lossless narrowing), as a compromise. Many engines will hopefully 
like that sane default, and those that find it too strict or too loose can 
always request variant and cast the result however they like.
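   The "safe casts only" compromise could look something like this minimal sketch (function names invented for illustration; this is not the arrow-rs `variant_get` API):
   ```rust
   // Widening cast: always lossless, so it always succeeds.
   fn widen_i8_to_i64(x: i8) -> i64 {
       i64::from(x)
   }

   // Lossless narrowing only: succeed iff the value fits. A Spark-style
   // permissive cast would truncate or wrap here instead of refusing.
   fn narrow_i64_to_i8(x: i64) -> Option<i8> {
       i8::try_from(x).ok()
   }

   fn main() {
       assert_eq!(widen_i8_to_i64(42), 42_i64);
       assert_eq!(narrow_i64_to_i8(42), Some(42_i8));
       assert_eq!(narrow_i64_to_i8(1_000), None); // lossy: rejected
       println!("ok");
   }
   ```
   Engines that want looser semantics would request variant and apply their own cast, as described above.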

