pitrou commented on PR #47377:
URL: https://github.com/apache/arrow/pull/47377#issuecomment-4302948923

   Hey @zanmato1984, sorry for not responding earlier. Reviewing this is a 
non-trivial task, especially as I'm worried about the API and efficiency 
implications of this. I'm also worried about what I perceive as additional 
complexity in the execution layer, which is already overly complicated with 
lots of weird cases (I'm afraid bugs and hidden inefficiencies will hide there).
   
   I'm also skeptical that a selection vector that is purely a vector of 
selection indices is a good solution performance-wise. In many cases you might 
have long spans of contiguous selected values, but with this solution we lose 
any opportunity of efficient batch execution (not to mention potential SIMD 
vectorization). This might be especially painful with chunked inputs (does this 
PR work with chunked arrays at all?).
   
   This is also my experience following the take/filter implementation work 
that we have been doing heroic efforts to try and improve the indirect indexing 
story while IMHO we should really be talking in terms of contiguous spans, or 
more precisely a mixed representation (spans/individual indices).
   
   All in all, I'm afraid this is committing to a lot of choices that both add 
complexity while painting us in a corner performance-wise.
   
   What do you think? You probably have thought about all this already.
   
   (also for the record my current professional situation is that I'm available 
for smaller maintenance and PR reviews but it's difficult to commit non-client 
time for such large enhancements)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to