scovich commented on issue #7715:
URL: https://github.com/apache/arrow-rs/issues/7715#issuecomment-3059331649
> I can't quite figure out how it [Iceberg shredding kernel] works, but it
looks like you provide a variant shredding function to the writer
I'm not a huge fan of that approach, mostly because Rust can't benefit from
JIT compilation the way the JVM does. The callback could be wildly expensive to
execute because it would likely be built up from a bunch of smaller callbacks
that navigate the variant paths and process individual values. That is, unless
the engine has a really beefy codegen capability, but I'm not aware of codegen
and Rust getting along super well; it would probably have to be accessed via FFI.
Seems better to have the shredding kernel take a specification (whatever
public form it takes) and convert it to an interpreter-style state machine
internally, in hopes of being reasonably efficient. That is super error-prone,
but not under user control and so hopefully containable. But definitely complex.
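To make the "specification in, interpreter inside" idea concrete, here is a minimal sketch. Everything in it is hypothetical: the toy `Variant` enum, the `Step` instruction type, and the `x.y::INT` spec syntax are stand-ins for illustration, not arrow-rs APIs. The spec is turned into a flat list of steps once, and each row is then processed by a plain loop rather than a tree of nested callbacks.

```rust
use std::collections::BTreeMap;

/// Toy stand-in for a decoded variant value (hypothetical, not the arrow-rs type).
#[derive(Debug, Clone, PartialEq)]
enum Variant {
    Int(i64),
    Str(String),
    Object(BTreeMap<String, Variant>),
}

/// One instruction of the compiled path program (hypothetical).
#[derive(Debug, Clone, PartialEq)]
enum Step {
    /// Descend into the named object field.
    Field(String),
    /// Require the current value to be an integer.
    CastInt,
}

/// Turn a spec such as `a.b::INT` into a flat list of steps.
/// The spec syntax here is an assumption made up for this sketch.
fn compile(spec: &str) -> Vec<Step> {
    let (path, cast) = match spec.split_once("::") {
        Some((p, c)) => (p, Some(c)),
        None => (spec, None),
    };
    let mut steps: Vec<Step> = path
        .split('.')
        .filter(|s| !s.is_empty())
        .map(|s| Step::Field(s.to_string()))
        .collect();
    if cast == Some("INT") {
        steps.push(Step::CastInt);
    }
    steps
}

/// Interpret the compiled steps against one value.
/// Returns `None` if the path is missing or the cast does not apply.
fn run<'a>(steps: &[Step], root: &'a Variant) -> Option<&'a Variant> {
    let mut cur = root;
    for step in steps {
        cur = match (step, cur) {
            (Step::Field(name), Variant::Object(fields)) => fields.get(name)?,
            (Step::CastInt, Variant::Int(_)) => cur,
            _ => return None,
        };
    }
    Some(cur)
}
```

The compile step runs once per spec, so the per-row cost is one loop over a short vector. A real kernel would operate on whole arrays and on the binary variant encoding rather than on decoded values.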
> What does `ShreddingSpecification` look like (we could look at the API in
iceberg-java) to figure this out
What if the writer just passes a shredding schema? For example, if they
want to shred `v:x::INT`, they could provide:
```
{ -- top-level variant
  metadata: BINARY,
  value: BINARY,
  typed_value: {
    x: { -- nested variant
      value: BINARY,
      typed_value: INT,
    }
  }
}
```
Values of `v:x` that cast to INT will end up in `typed_value.x.typed_value`,
and those that do not end up in `typed_value.x.value` instead. Rows for
which `v:x` doesn't even exist end up in `value`.
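The three-way routing just described can be sketched as follows. The `Variant` and `Destination` types and the `route_x` function are hypothetical names for illustration only, and "casts to INT" is simplified here to "is already an integer"; a real kernel would presumably accept more source types.

```rust
use std::collections::BTreeMap;

/// Toy stand-in for a decoded variant value (hypothetical).
#[derive(Debug, Clone, PartialEq)]
enum Variant {
    Int(i64),
    Str(String),
    Object(BTreeMap<String, Variant>),
}

/// Which column a row's `v:x` lands in (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
enum Destination {
    /// `typed_value.x.typed_value`: the value shredded as INT.
    TypedValue(i64),
    /// `typed_value.x.value`: the field exists but is not an INT.
    NestedValue(Variant),
    /// Top-level `value`: the row has no field `x` at all.
    TopLevelValue,
}

/// Decide where `v:x` goes for a single row.
fn route_x(row: &Variant) -> Destination {
    match row {
        Variant::Object(fields) => match fields.get("x") {
            Some(Variant::Int(i)) => Destination::TypedValue(*i),
            Some(other) => Destination::NestedValue(other.clone()),
            None => Destination::TopLevelValue,
        },
        // Non-object rows have no field `x` either.
        _ => Destination::TopLevelValue,
    }
}
```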
If the caller is confident that `v:x::INT` shreds _perfectly_, they could
instead pass:
```
{ -- top-level variant
  metadata: BINARY,
  value: BINARY,
  typed_value: {
    x: INT, -- no fallback
  }
}
```
... with the caveat that the shredding spec forbids `v:x` from residing in the
`value` column for a (partially) shredded struct. So if shredding encounters a
value of `v:x` that does not cast to INT, there's no fallback and shredding
would fail. I suppose one could use a concept of "strict mode" and allow
engines to disable the error (writing NULL instead), but that would be
opting into data loss, which always makes me nervous.
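A minimal sketch of what that "strict mode" switch could look like for the no-fallback schema. The `OnCastFailure` enum and `shred_x_no_fallback` function are hypothetical names invented for this example, and the cast is again simplified to "is already an integer".

```rust
/// Toy stand-in for a decoded variant value (hypothetical).
#[derive(Debug, Clone, PartialEq)]
enum Variant {
    Int(i64),
    Str(String),
}

/// What to do when `v:x` exists but does not cast to INT (hypothetical knob).
#[derive(Debug, Clone, Copy, PartialEq)]
enum OnCastFailure {
    /// Strict: fail the shredding operation.
    Error,
    /// Lossy: write NULL and drop the original value.
    WriteNull,
}

/// Shred `v:x` into an INT column that has no `value` fallback.
/// `x` is `None` when the row has no field `x`.
fn shred_x_no_fallback(
    x: Option<&Variant>,
    mode: OnCastFailure,
) -> Result<Option<i64>, String> {
    match (x, mode) {
        (None, _) => Ok(None),
        (Some(Variant::Int(i)), _) => Ok(Some(*i)),
        (Some(other), OnCastFailure::Error) => {
            Err(format!("cannot shred {:?} as INT and there is no fallback", other))
        }
        // Data loss: an uncastable `x` becomes indistinguishable from a missing one.
        (Some(_), OnCastFailure::WriteNull) => Ok(None),
    }
}
```

Note that in the lossy mode a missing `x` and an uncastable `x` both come back as NULL, which is exactly the data loss in question.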
We'd also need to decide whether it's legal to specify a shredding spec
where some of the `value` columns are missing (the result is a valid shredded
variant, but it would be similar to the perfect shredding schema, in that
there's no fallback for values that don't shred):
```
{ -- top-level variant
  metadata: BINARY,
  typed_value: { -- no fallback
    x: { -- nested variant
      typed_value: INT, -- no fallback
    }
  }
}
```
> Do we need an `unshred_variant` kernel
Yes.
If nothing else, we need a way for engines that don't support shredding to
correctly consume shredded variant. The same goes for engines that do support
shredding to some degree, but don't want the high complexity of propagating
shredded variant all through the query plan above the scan.
Engines also need a way to convert a variant column to string for display,
and just calling naive to-json on a shredded variant column doesn't produce the
right output (all the `value` and `typed_value` columns end up mixed in).
Though I suppose one could make a _really_ fancy shredding-aware to-json
implementation.
Finally, if the reader requested e.g. `v:a.b.c` (without casting to any
specific type) the result is VARIANT. But if `v:a.b.c` was perfectly shredded
in a given file, there is no variant value to give back. And if `v:a.b.c` is
itself a struct with fields and sub-fields of its own, then converting that
back to a variant value the query expects requires quite some work. And it has
to be done on a per-file basis, because other files may not have shredded
perfectly, or could have shredded differently.
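As a rough illustration of the per-row work an `unshred_variant` kernel would do for the `x` field in the schemas above, here is a sketch. The types and the functions `unshred_field` and `unshred_row` are hypothetical, and the sketch handles only a single INT field rather than nested structs; treating "both columns populated" as an error for a primitive field is a choice made for this example.

```rust
use std::collections::BTreeMap;

/// Toy stand-in for a decoded variant value (hypothetical).
#[derive(Debug, Clone, PartialEq)]
enum Variant {
    Int(i64),
    Str(String),
    Object(BTreeMap<String, Variant>),
}

/// Rebuild field `x` from its two shredded columns.
/// Returns `Ok(None)` when the field is missing in this row.
fn unshred_field(
    value: Option<Variant>,
    typed_value: Option<i64>,
) -> Result<Option<Variant>, String> {
    match (typed_value, value) {
        (Some(i), None) => Ok(Some(Variant::Int(i))),
        (None, Some(v)) => Ok(Some(v)),
        (None, None) => Ok(None),
        // Treated as an error in this sketch for a primitive field.
        (Some(_), Some(_)) => Err("both value and typed_value are set".to_string()),
    }
}

/// Merge the rebuilt field back into the residual (unshredded) fields.
fn unshred_row(residual: BTreeMap<String, Variant>, x: Option<Variant>) -> Variant {
    let mut fields = residual;
    if let Some(v) = x {
        fields.insert("x".to_string(), v);
    }
    Variant::Object(fields)
}
```

For a struct-valued path like `v:a.b.c`, the same merge would have to recurse through every shredded sub-field, which is where the "quite some work" comes from.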
> What should happen if the input `variant_array` already has some shredded
columns
That gets very complex very fast, and provides more uses for that
`unshred_variant` kernel, if the requested read shredding doesn't match the
specific write shredding in a given file.
The simplest approach is to just unshred the whole thing and re-shred the
result. But even more sophisticated approaches would ultimately need the
ability to unshred certain paths (see above).