scovich commented on PR #8166: URL: https://github.com/apache/arrow-rs/pull/8166#issuecomment-3260318512
Attn @alamb @carpecodeum @liamzwbao : One thing I noticed today -- * while making an early pass over https://github.com/apache/arrow-rs/pull/8241 * and re-reviewing https://github.com/apache/arrow-rs/pull/8235 I noticed that `cast_to_variant` follows the same problematic column-oriented approach as the output builder this PR replaces with a row builder. In particular, it converts entire columns to variant, which means repeatedly re-encoding deeply nested variant values, transferring their bytes to new arrays, and recreating incomplete metadata columns that immediately get thrown away. I spent some time today, trying to refactor it to be row-oriented (while keeping a similar code structure), but late in the process I realized that several array types break it because they need to do some holistic column-level transformation, which cannot be captured efficiently by a row-wise approach: * maps - cast the key column to string * dictionaries - convert the dictionary value column and then clone values from it as needed * run-end encoding - similar to dictionary, but _also_ need to convert the incoming logical index to physical run index, which is a linear search in row-oriented code (= quadratic cost!) I'm pretty sure the actual solution will be to merge the new variant row builder infrastructure in this PR, and then rework `cast_to_variant` to use it. That way, the row builder's constructor can do any column-level transformations that might be needed, before row-oriented visiting begins. In other words, this PR is probably the most important (= biggest bottleneck) variant PR currently open. NOTE: Even tho `variant_get` and `cast_to_variant` would both use the new row builder infra, they are not doing the same thing. The former (when unshredding) needs to use a read-only metadata builder because the `metadata` column already exists; the latter would use a normal metadata builder. So `cast_to_variant` may actually be lower hanging fruit for converting "stuff" to binary variant, with `variant_get` support for the same landing later when we sort out the read-only metadata thing. Thoughts? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org