adriangb commented on issue #42069:
URL: https://github.com/apache/arrow/issues/42069#issuecomment-2696048756

   A thought on data shredding / subcolumnarization.
   
   My understanding is that in Parquet the shredded fields are stored as truly 
individual columns. Should the Arrow type need to collapse / collect them? In 
particular, it seems important to me that a query like `variant_get(col, 
'shredded-key')` never has to touch the potentially much larger unshredded 
data. There is little value in shredding if the data gets recombined at the 
Arrow layer: you've already paid the price of downloading the data, decoding 
Parquet, etc. Since it seems to me the query engine will always have to be 
aware of the shredding, could Arrow just avoid dealing with shredding 
altogether and leave it up to the query engine? The query engine would have 
to know to rewrite queries to hit the shredded data, or to reconstitute it 
when needed. I think that would make it easier for filter pushdown, stats 
pruning, etc. to "just work".
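   A minimal sketch of the rewrite idea (pure Python; the column names and 
storage layout here are illustrative assumptions, not the Parquet Variant 
spec): a shredded key lives in its own cheap typed column alongside an 
opaque binary remainder, and the engine answers `variant_get` without ever 
touching the remainder.

```python
# Hypothetical physical layout for one row group: the shredded key "price"
# is a plain typed column; "col.value" holds the large unshredded binary
# variant remainder. Names are made up for illustration.
row_group = {
    "col.typed_value.price": [9.99, 19.99, 4.5],   # shredded: cheap to scan
    "col.value": [b"<large binary variant>"] * 3,  # unshredded: expensive to decode
}

def variant_get(storage, path):
    """Sketch of a query-engine rewrite: serve the request from the shredded
    column when one exists, so storage["col.value"] is never read or decoded."""
    shredded_col = f"col.typed_value.{path}"
    if shredded_col in storage:
        return storage[shredded_col]
    # Only keys that were not shredded would force decoding the binary variant.
    raise NotImplementedError("fall back to decoding the binary remainder")

print(variant_get(row_group, "price"))  # [9.99, 19.99, 4.5]
```

   The point of the sketch is that the rewrite is a query-planning concern: 
filter pushdown and stats pruning apply to the shredded column exactly as 
they would to any ordinary typed column.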


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]