Weston Pace created ARROW-16289:
-----------------------------------

             Summary: [C++] (eventually) abandon scalar columns of an ExecBatch 
in favor of RLE encoded arrays
                 Key: ARROW-16289
                 URL: https://issues.apache.org/jira/browse/ARROW-16289
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Weston Pace


This JIRA is a proposal / discussion.  I am not asserting this is the way to go 
but I would like to consider it.

From the execution engine's perspective an exec batch's columns are always 
either arrays or scalars.  The only time we make use of scalars today is for 
the four augmented columns (e.g. __filename).  Once we have support for RLE 
arrays a scalar could easily be encoded as an RLE array and there would be no 
need to use scalars here.
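
To make the encoding concrete, here is a minimal sketch of a scalar column 
expressed as a single run.  RleColumn and MakeConstantColumn are hypothetical 
names invented for illustration; this is not Arrow's RLE layout or API, only 
the general run-end idea.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified run-length encoded column. This is NOT Arrow's
// actual RLE layout; it only illustrates the idea.
template <typename T>
struct RleColumn {
  std::vector<int64_t> run_ends;  // exclusive logical end index of each run
  std::vector<T> values;          // one value per run

  int64_t length() const { return run_ends.empty() ? 0 : run_ends.back(); }

  // Logical lookup: find the first run whose end is greater than i.
  const T& At(int64_t i) const {
    assert(i >= 0 && i < length());
    auto it = std::upper_bound(run_ends.begin(), run_ends.end(), i);
    return values[static_cast<std::size_t>(it - run_ends.begin())];
  }
};

// A "scalar" column spanning batch_length rows is just a single run whose
// end equals the batch length.
template <typename T>
RleColumn<T> MakeConstantColumn(T value, int64_t batch_length) {
  assert(batch_length > 0);
  return RleColumn<T>{{batch_length}, {std::move(value)}};
}
```

Under these assumptions an augmented column such as __filename would carry 
one run end and one value no matter how many rows the batch has.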

The advantage would be reducing the complexity in exec nodes and avoiding 
issues like ARROW-16288.  It is already rather difficult to explain the idea of 
"scalar" and "vector" functions, and we then have to turn around and explain 
that the word "scalar" has an entirely different meaning when talking about 
field shape.

I think it's worth considering taking this even further and removing the 
scalar concept from the compute layer entirely.  Kernel functions that want to 
have special logic for scalars could do so using the RLE array.  This would be 
a significant change to many kernels which currently declare the ANY shape and 
determine which logic to apply within the kernel itself (i.e. there is one 
array OR scalar kernel and not one kernel for each).
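
As a rough sketch of what "special logic for scalars using the RLE array" 
might look like, the kernel below accepts only arrays and takes a fast path 
when the RLE input is a single run.  RleInt64 and AddRle are hypothetical 
names, not existing Arrow compute kernels or types.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical run-length encoded int64 column (not Arrow's actual layout).
struct RleInt64 {
  std::vector<int64_t> run_ends;  // exclusive logical end index of each run
  std::vector<int64_t> values;    // one value per run

  int64_t length() const { return run_ends.empty() ? 0 : run_ends.back(); }
  // A single run covering the whole column plays the role of a scalar.
  bool IsConstant() const { return run_ends.size() == 1; }
};

// Adds an RLE column to a plain (dense) column of the same length.
std::vector<int64_t> AddRle(const std::vector<int64_t>& dense,
                            const RleInt64& rle) {
  assert(static_cast<int64_t>(dense.size()) == rle.length());
  std::vector<int64_t> out(dense.size());

  if (rle.IsConstant()) {
    // Fast path: equivalent to today's "array + scalar" case.
    const int64_t c = rle.values[0];
    for (std::size_t i = 0; i < dense.size(); ++i) out[i] = dense[i] + c;
    return out;
  }

  // General path: walk the runs, applying each run's value to its rows.
  std::size_t i = 0;
  for (std::size_t r = 0; r < rle.run_ends.size(); ++r) {
    const std::size_t end = static_cast<std::size_t>(rle.run_ends[r]);
    for (; i < end; ++i) out[i] = dense[i] + rle.values[r];
  }
  return out;
}
```

The constant check is a single size comparison, so the kernel keeps one entry 
point while still being able to special-case scalar-like input.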

Admittedly, handling an RLE scalar probably takes a few more instructions and 
a few more bytes than the scalar we have today.  However, these are just 
different flavors of O(1) and are not likely to have a significant impact.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
