scovich commented on PR #8166:
URL: https://github.com/apache/arrow-rs/pull/8166#issuecomment-3260318512

   Attn @alamb @carpecodeum @liamzwbao :
   One thing I noticed today --  
   * while making an early pass over 
https://github.com/apache/arrow-rs/pull/8241
   * and re-reviewing https://github.com/apache/arrow-rs/pull/8235
   
   I noticed that `cast_to_variant` follows the same problematic 
column-oriented approach as the output builder this PR replaces with a row 
builder. In particular, it converts entire columns to variant, which means 
repeatedly re-encoding deeply nested variant values, transferring their bytes 
to new arrays, and recreating incomplete metadata columns that immediately get 
thrown away.
   
   I spent some time today, trying to refactor it to be row-oriented (while 
keeping a similar code structure), but late in the process I realized that 
several array types break it because they need to do some holistic column-level 
transformation, which cannot be captured efficiently by a row-wise approach:
   * maps - cast the key column to string
   * dictionaries - convert the dictionary value column and then clone values 
from it as needed
   * run-end encoding - similar to dictionary, but _also_ need to convert the 
incoming logical index to physical run index, which is a linear search in 
row-oriented code (= quadratic cost!)
   
   I'm pretty sure the actual solution will be to merge the new variant row 
builder infrastructure in this PR, and then rework `cast_to_variant` to use it. 
That way, the row builder's constructor can do any column-level transformations 
that might be needed, before row-oriented visiting begins.
   
   In other words, this PR is probably the most important (= biggest 
bottleneck) variant PR currently open.
   
   NOTE: Even tho `variant_get` and `cast_to_variant` would both use the new 
row builder infra, they are not doing the same thing. The former (when 
unshredding) needs to use a read-only metadata builder because the `metadata` 
column already exists; the latter would use a normal metadata builder. So 
`cast_to_variant` may actually be lower hanging fruit for converting "stuff" to 
binary variant, with `variant_get` support for the same landing later when we 
sort out the read-only metadata thing.
   
   Thoughts?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to