pitrou commented on PR #47377: URL: https://github.com/apache/arrow/pull/47377#issuecomment-4302948923
Hey @zanmato1984, sorry for not responding earlier. Reviewing this is a non-trivial task, especially as I'm worried about the API and efficiency implications of this. I'm also worried about what I perceive as additional complexity in the execution layer, which is already overly complicated with lots of weird cases (I'm afraid bugs and hidden inefficiencies will hide there). I'm also skeptical that a selection vector that is purely a vector of selection indices is a good solution performance-wise. In many cases you might have long spans of contiguous selected values, but with this solution we lose any opportunity of efficient batch execution (not to mention potential SIMD vectorization). This might be especially painful with chunked inputs (does this PR work with chunked arrays at all?). This is also my experience following the take/filter implementation work that we have been doing heroic efforts to try and improve the indirect indexing story while IMHO we should really be talking in terms of contiguous spans, or more precisely a mixed representation (spans/individual indices). All in all, I'm afraid this is committing to a lot of choices that both add complexity while painting us in a corner performance-wise. What do you think? You probably have thought about all this already. (also for the record my current professional situation is that I'm available for smaller maintenance and PR reviews but it's difficult to commit non-client time for such large enhancements) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
