scovich commented on issue #7715: URL: https://github.com/apache/arrow-rs/issues/7715#issuecomment-3059331649
> I can't quite figure out how it [Iceberg shredding kernel] works, but it looks like you provide a variant shredding function to the writer

I'm not a huge fan of that approach, mostly because Rust won't be able to benefit from JIT like the JVM does. The callback could be wildly expensive to execute because it would likely be built up from a bunch of smaller callbacks that navigate the variant paths and process individual values. Unless the engine has a really beefy codegen capability, but I'm not aware of codegen and Rust getting along super well; it would probably have to be accessed via FFI.

Seems better to have the shredding kernel take a specification (whatever its public form takes) and convert it to an interpreter-type state machine internally, in hopes of being reasonably efficient. Super error-prone, but not under user control and so hopefully containable. But definitely complex.

> What does `ShreddingSpecification` look like (we could look at the API in iceberg-java) to figure this out

What if the writer just passes a shredding schema? For example, if they want to shred `v:x::INT`, they could provide:

```
{ -- top-level variant
  metadata: BINARY,
  value: BINARY,
  typed_value: {
    x: { -- nested variant
      value: BINARY,
      typed_value: INT,
    }
  }
}
```

Values of `v:x` that cast to INT will end up in `typed_value.x.typed_value`, and those that do not end up in `typed_value.x.value` instead. Rows for which `v:x` doesn't even exist end up in `value`.

If the caller is confident that `v:x::INT` shreds _perfectly_, they could instead pass:

```
{ -- top-level variant
  metadata: BINARY,
  value: BINARY,
  typed_value: {
    x: INT, -- no fallback
  }
}
```

... with the caveat that the shredding spec forbids `v:x` from residing in the `value` column for a (partially) shredded struct. So if the shredding encounters a value of `v:x` that does not cast to INT, there's no fallback and shredding would fail.
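As a rough illustration of the routing rules above, here is a minimal Rust sketch for a single shredded field `x`. Everything in it is a hypothetical stand-in invented for this example (the toy `Variant` enum, `ShreddedRow`, `shred_x`, the `has_fallback` flag); none of it is the parquet-variant or arrow-rs API. It also simplifies "casts to INT" to "is already an integer".

```rust
use std::collections::BTreeMap;

/// Toy variant value (hypothetical; not the real parquet-variant type).
#[derive(Debug, Clone, PartialEq)]
enum Variant {
    Int(i64),
    Str(String),
    Object(BTreeMap<String, Variant>),
}

/// One shredded row for the schema `typed_value: { x: { value, typed_value: INT } }`.
#[derive(Debug, Default, PartialEq)]
struct ShreddedRow {
    value: Option<Variant>,     // top-level `value`: whatever was not shredded
    x_value: Option<Variant>,   // `typed_value.x.value`: fallback for non-INT `x`
    x_typed_value: Option<i64>, // `typed_value.x.typed_value`
}

/// Shred `v:x::INT`. With `has_fallback == false` this models the
/// "perfect shredding" schema, where a non-INT `x` is an error.
fn shred_x(row: &Variant, has_fallback: bool) -> Result<ShreddedRow, String> {
    let mut out = ShreddedRow::default();
    let fields = match row {
        Variant::Object(fields) if fields.contains_key("x") => fields,
        // `v:x` does not exist: the whole row stays in the top-level `value`.
        _ => {
            out.value = Some(row.clone());
            return Ok(out);
        }
    };
    match &fields["x"] {
        Variant::Int(i) => out.x_typed_value = Some(*i),
        other if has_fallback => out.x_value = Some(other.clone()),
        other => return Err(format!("v:x = {:?} is not INT and there is no fallback", other)),
    }
    // Fields other than `x` remain in the top-level `value` column.
    let rest: BTreeMap<String, Variant> = fields
        .iter()
        .filter(|(k, _)| k.as_str() != "x")
        .map(|(k, v)| (k.clone(), v.clone()))
        .collect();
    if !rest.is_empty() {
        out.value = Some(Variant::Object(rest));
    }
    Ok(out)
}

/// Small helper to build an object from key/value pairs.
fn obj(pairs: &[(&str, Variant)]) -> Variant {
    Variant::Object(pairs.iter().map(|(k, v)| (k.to_string(), v.clone())).collect())
}

fn main() {
    let good = obj(&[("x", Variant::Int(7)), ("y", Variant::Str("hi".to_string()))]);
    let bad = obj(&[("x", Variant::Str("oops".to_string()))]);
    println!("{:?}", shred_x(&good, true));
    println!("{:?}", shred_x(&bad, true)); // lands in the fallback column
    println!("{:?}", shred_x(&bad, false)); // no fallback: shredding fails
}
```

A real kernel would work on whole Arrow columns rather than one row at a time, and would be driven by the schema instead of a hard-coded field name; the sketch only shows where each kind of value ends up.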
I suppose one could use a concept of "strict mode" and allow engines to disable the error (writing NULL instead), but that would be opting into data loss, which always makes me nervous.

We'd also need to decide whether it's legal to specify a shredding spec where some of the `value` columns are missing (the result is valid shredded variant, but it would be similar to the perfect shredding schema, in that there's no fallback for values that don't shred):

```
{ -- top-level variant
  metadata: BINARY,
  typed_value: { -- no fallback
    x: { -- nested variant
      typed_value: INT, -- no fallback
    }
  }
}
```

> Do we need an `unshred_variant` kernel

Yes. If nothing else, we need a way for engines that don't support shredding to correctly consume shredded variant. The same goes for engines that do support shredding to some degree, but don't want the high complexity of propagating shredded variant all through the query plan above the scan.

Engines also need a way to convert a variant column to string for display, and just calling naive to-json on a shredded variant column doesn't produce the right output (all the `value` and `typed_value` columns mixed in). Tho I suppose one could make a _really_ fancy shredding-aware to-json implementation.

Finally, if the reader requested e.g. `v:a.b.c` (without casting to any specific type), the result is VARIANT. But if `v:a.b.c` was perfectly shredded in a given file, there is no variant value to give back. And if `v:a.b.c` is itself a struct with fields and sub-fields of its own, then converting that back to the variant value the query expects requires quite some work. And it has to be done on a per-file basis, because other files may not have shredded perfectly, or could have shredded differently.

> What should happen if the input `variant_array` already has some shredded columns

That gets very complex very fast, and provides more uses for that `unshred_variant` kernel, if the requested read shredding doesn't match the specific write shredding in a given file.
The simplest approach is to just unshred the whole thing and re-shred the result. But even more sophisticated approaches would ultimately need the ability to unshred certain paths (see above).
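To make the "unshred" step concrete, here is a small Rust sketch of rebuilding a variant value from a shredded struct, one row at a time. The types (`Variant`, `Typed`, `Shredded`) and the functions `unshred` and `example_row` are hypothetical names made up for this illustration and do not correspond to any existing arrow-rs or parquet-variant API. The sketch assumes that a struct's `value` holds only the fields that were not shredded.

```rust
use std::collections::BTreeMap;

/// Toy variant value (hypothetical; not the real parquet-variant type).
#[derive(Debug, Clone, PartialEq)]
enum Variant {
    Int(i64),
    Str(String),
    Object(BTreeMap<String, Variant>),
}

/// Strongly typed side of a shredded value: a leaf INT or a struct of shredded fields.
#[derive(Debug, Clone)]
enum Typed {
    Int(i64),
    Struct(BTreeMap<String, Shredded>),
}

/// A `(value, typed_value)` pair; both `None` means the field is missing in this row.
#[derive(Debug, Clone, Default)]
struct Shredded {
    value: Option<Variant>,
    typed_value: Option<Typed>,
}

/// Rebuild the variant value for one row, or `None` if the field is missing.
fn unshred(s: &Shredded) -> Option<Variant> {
    match (&s.typed_value, &s.value) {
        (Some(Typed::Int(i)), _) => Some(Variant::Int(*i)),
        (Some(Typed::Struct(fields)), residual) => {
            // Start from the leftover (unshredded) fields, if any.
            let mut out = match residual {
                Some(Variant::Object(rest)) => rest.clone(),
                _ => BTreeMap::new(),
            };
            // Recursively unshred each shredded field; skip the missing ones.
            for (name, field) in fields {
                if let Some(v) = unshred(field) {
                    out.insert(name.clone(), v);
                }
            }
            Some(Variant::Object(out))
        }
        (None, Some(v)) => Some(v.clone()),
        (None, None) => None,
    }
}

/// Example row: `b` shredded as INT, `c` in its fallback, `d` missing, `e` never shredded.
fn example_row() -> Shredded {
    let mut fields = BTreeMap::new();
    fields.insert("b".to_string(), Shredded { value: None, typed_value: Some(Typed::Int(1)) });
    fields.insert(
        "c".to_string(),
        Shredded { value: Some(Variant::Str("n/a".to_string())), typed_value: None },
    );
    fields.insert("d".to_string(), Shredded::default());
    let mut rest = BTreeMap::new();
    rest.insert("e".to_string(), Variant::Str("kept".to_string()));
    Shredded { value: Some(Variant::Object(rest)), typed_value: Some(Typed::Struct(fields)) }
}

fn main() {
    println!("{:?}", unshred(&example_row()));
}
```

Under these assumptions, the "unshred everything, then re-shred" approach amounts to running something like `unshred` on each row and feeding the result to the shredder with the new schema. A path-aware version would apply the same recursion only below the requested path.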