pitrou commented on PR #47377:
URL: https://github.com/apache/arrow/pull/47377#issuecomment-4304952483

   First, I forgot another problem: the PR currently uses 32-bit indices, but we 
really want 64-bit indices, right? (Perhaps UInt64, to match what `sort_indices` 
outputs, though that's not strictly necessary.)
   
   > I agree the execution layer is already overly complicated.
   
   Separately from this PR, can we think about ways to make it simpler? Perhaps 
there are internal execution "options" that aren't really useful.
   
   > On performance: the selection path is only intended to be exercised from 
the upcoming special-form work (#47374), as a narrow and semantically explicit 
entry point.
   
   Yes, but we would like it to be more generally useful for execution engines, 
right?
   
   > in summary, sparse execution wins strongly at low selectivity, while the 
worst regressions (up to ~4x) are from the generic dense fallback when a kernel 
doesn’t provide `selective_exec` (extra gather/scatter calls)
   
   The regression might be much worse on chunked inputs?
   
   > If you have a preferred API shape (hybrid runs+indices, a generalized 
“selection” object, or a different exec signature), I’d really appreciate 
guidance - I’d rather adjust before we cement the API.
   
   I'm not sure what it should look like, and we can probably add some 
complexity piecewise if we agree the API remains experimental.
   
   Ideally I'd like something that can be used internally for take/filter as 
well.
   
   A conceptual sketch could look like:
   ```c++
   #include <cstdint>
   #include <variant>

   struct ContiguousSpan {
     int64_t start_offset;
     int64_t length;
   };
   struct FilteredSpan {
     int64_t start_offset;
     int64_t length;
     /* followed by a filter bitmap with `length` bits */
   };
   struct DiscreteSpan {
     int64_t length;
     /* followed by `length` 64-bit indices */
   };
   
   using SelectionSpan = std::variant<ContiguousSpan, FilteredSpan, DiscreteSpan>;
   ```
   
   (In practice, though, SelectionSpan would be encoded using some bit-twiddling, 
and a selection vector would be a Buffer containing a sequence of SelectionSpans.)
   
   That's of course quite a bit of work and DiscreteSpan might be the only 
implemented variant at the start.
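
   To make the "bit-twiddling" encoding idea more concrete, here is a minimal 
sketch of one possible layout. Everything in it is an assumption made for 
illustration, not part of the PR or of any Arrow API: the 2-bit kind tag stored 
in the top bits of a 64-bit header word, the names `SpanKind`, `PackHeader`, 
`AppendContiguous`, `AppendFiltered`, `AppendDiscrete` and `ExpandSelection`, 
and the use of a `std::vector<uint64_t>` as a stand-in for a Buffer.
   ```c++
   #include <cassert>
   #include <cstddef>
   #include <cstdint>
   #include <vector>

   // Hypothetical layout (not Arrow API): each span starts with one 64-bit
   // header word. Top 2 bits = kind tag, low 62 bits = span length.
   enum class SpanKind : uint64_t { kContiguous = 0, kFiltered = 1, kDiscrete = 2 };

   constexpr int kTagShift = 62;
   constexpr uint64_t kLengthMask = (uint64_t{1} << kTagShift) - 1;

   inline uint64_t PackHeader(SpanKind kind, int64_t length) {
     assert(length >= 0 && static_cast<uint64_t>(length) <= kLengthMask);
     return (static_cast<uint64_t>(kind) << kTagShift) |
            static_cast<uint64_t>(length);
   }
   inline SpanKind HeaderKind(uint64_t header) {
     return static_cast<SpanKind>(header >> kTagShift);
   }
   inline int64_t HeaderLength(uint64_t header) {
     return static_cast<int64_t>(header & kLengthMask);
   }

   // Contiguous span: header, then start_offset.
   void AppendContiguous(std::vector<uint64_t>* out, int64_t start, int64_t length) {
     out->push_back(PackHeader(SpanKind::kContiguous, length));
     out->push_back(static_cast<uint64_t>(start));
   }

   // Filtered span: header, start_offset, then ceil(length / 64) bitmap words
   // (bit i of the bitmap, LSB first, selects row start_offset + i).
   void AppendFiltered(std::vector<uint64_t>* out, int64_t start,
                       const std::vector<bool>& mask) {
     out->push_back(PackHeader(SpanKind::kFiltered,
                               static_cast<int64_t>(mask.size())));
     out->push_back(static_cast<uint64_t>(start));
     const std::size_t base = out->size();
     out->resize(base + (mask.size() + 63) / 64, 0);
     for (std::size_t i = 0; i < mask.size(); ++i) {
       if (mask[i]) (*out)[base + i / 64] |= uint64_t{1} << (i % 64);
     }
   }

   // Discrete span: header, then `length` 64-bit indices.
   void AppendDiscrete(std::vector<uint64_t>* out,
                       const std::vector<int64_t>& indices) {
     out->push_back(PackHeader(SpanKind::kDiscrete,
                               static_cast<int64_t>(indices.size())));
     for (int64_t index : indices) out->push_back(static_cast<uint64_t>(index));
   }

   // Decode an encoded selection vector into plain row indices.
   std::vector<int64_t> ExpandSelection(const std::vector<uint64_t>& words) {
     std::vector<int64_t> rows;
     std::size_t pos = 0;
     while (pos < words.size()) {
       const uint64_t header = words[pos++];
       const int64_t length = HeaderLength(header);
       switch (HeaderKind(header)) {
         case SpanKind::kContiguous: {
           const int64_t start = static_cast<int64_t>(words[pos++]);
           for (int64_t i = 0; i < length; ++i) rows.push_back(start + i);
           break;
         }
         case SpanKind::kFiltered: {
           const int64_t start = static_cast<int64_t>(words[pos++]);
           for (int64_t i = 0; i < length; ++i) {
             if ((words[pos + i / 64] >> (i % 64)) & 1) rows.push_back(start + i);
           }
           pos += static_cast<std::size_t>((length + 63) / 64);
           break;
         }
         case SpanKind::kDiscrete: {
           for (int64_t i = 0; i < length; ++i) {
             rows.push_back(static_cast<int64_t>(words[pos++]));
           }
           break;
         }
         default:
           return rows;  // unknown tag: stop decoding
       }
     }
     return rows;
   }
   ```
   In this sketch a kernel that only understands DiscreteSpan could still 
accept the other variants by running them through something like 
`ExpandSelection` first, which is one way the variants could be introduced 
piecewise.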
   

