[jira] [Commented] (ARROW-16289) [C++] (eventually) abandon scalar columns of an ExecBatch in favor of RLE encoded arrays

2022-04-23 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17526811#comment-17526811
 ] 

Antoine Pitrou commented on ARROW-16289:


RLE arrays are a pie-in-the-sky feature for now, so I'm not sure this is 
really worth discussing.

Note that ExecBatch is supposed to support a selection vector, but that has 
never been implemented in any kernel that I know of.

> [C++] (eventually) abandon scalar columns of an ExecBatch in favor of RLE 
> encoded arrays
>
> Key: ARROW-16289
> URL: https://issues.apache.org/jira/browse/ARROW-16289
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Weston Pace
> Priority: Major
>
> This JIRA is a proposal / discussion.  I am not asserting this is the way to 
> go but I would like to consider it.
> From the execution engine's perspective an exec batch's columns are always 
> either arrays or scalars.  The only time we make use of scalars today is for 
> the four augmented columns (e.g. __filename).  Once we have support for RLE 
> arrays a scalar could easily be encoded as an RLE array and there would be no 
> need to use scalars here.
> The advantage would be reducing the complexity in exec nodes and avoiding 
> issues like ARROW-16288.  It is already rather difficult to explain the idea 
> of a "scalar" and "vector" function and then have to turn around and explain 
> that the word "scalar" has an entirely different meaning when talking about 
> field shape.
> I think it's worth considering taking this even further and removing the 
> concept from the compute layer entirely.  Kernel functions that want to have 
> special logic for scalars could do so using the RLE array.  This would be a 
> significant change to many kernels which currently declare the ANY shape and 
> determine which logic to apply within the kernel itself (e.g. there is one 
> array OR scalar kernel and not one kernel for each).
> Admittedly, handling an RLE scalar probably takes a few more instructions and 
> a few more bytes than the scalar we have today.  However, these are just 
> different flavors of O(1) and are not likely to have a significant impact.
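
To make the idea in the description concrete, here is a minimal, library-independent sketch of a scalar encoded as a single-run RLE array. It is only an illustration under stated assumptions: `RleInt64Array`, `MakeConstant` and `AddColumns` are hypothetical names invented for this example and are not part of Arrow's API, and the layout (exclusive run ends plus one value per run) is one possible encoding, not necessarily the one Arrow would adopt.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified run-length-encoded column of int64 values.
// run_ends[i] is the exclusive logical end row of run i; values[i] is its value.
struct RleInt64Array {
  std::vector<int64_t> run_ends;
  std::vector<int64_t> values;

  // Logical number of rows covered by all runs.
  int64_t length() const { return run_ends.empty() ? 0 : run_ends.back(); }

  // A column is "scalar-like" when one run covers every row.
  bool is_constant() const { return run_ends.size() == 1; }

  // Logical value at row i, found by binary search over the run ends.
  int64_t at(int64_t i) const {
    assert(i >= 0 && i < length());
    auto it = std::upper_bound(run_ends.begin(), run_ends.end(), i);
    return values[static_cast<std::size_t>(it - run_ends.begin())];
  }
};

// Encode a scalar broadcast over num_rows rows as a single run.
inline RleInt64Array MakeConstant(int64_t value, int64_t num_rows) {
  return RleInt64Array{{num_rows}, {value}};
}

// A kernel can keep special logic for the constant case by inspecting the
// RLE array, instead of receiving a separate scalar shape.
inline std::vector<int64_t> AddColumns(const std::vector<int64_t>& lhs,
                                       const RleInt64Array& rhs) {
  assert(static_cast<int64_t>(lhs.size()) == rhs.length());
  std::vector<int64_t> out(lhs.size());
  if (rhs.is_constant()) {
    // Fast path: the cost of detecting it is one size check.
    const int64_t c = rhs.values[0];
    for (std::size_t i = 0; i < lhs.size(); ++i) out[i] = lhs[i] + c;
    return out;
  }
  // General path: walk the runs in step with the rows.
  std::size_t run = 0;
  for (std::size_t i = 0; i < lhs.size(); ++i) {
    while (static_cast<int64_t>(i) >= rhs.run_ends[run]) ++run;
    out[i] = lhs[i] + rhs.values[run];
  }
  return out;
}
```

In this sketch the constant column costs one run end and one value regardless of the number of rows, which is the "different flavors of O(1)" trade-off the description mentions.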



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16289) [C++] (eventually) abandon scalar columns of an ExecBatch in favor of RLE encoded arrays

2022-04-22 Thread Eduardo Ponce (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17526656#comment-17526656
 ] 

Eduardo Ponce commented on ARROW-16289:
---

The term Scalar is used in different (but related) contexts: for example, 
Scalar values, Scalar kernels, and Scalar expressions.

I recall an ad-hoc conversation last year in which it was suggested that we 
consider treating Scalars as 1-element Arrays to make the compute layer logic 
more straightforward. The front-end API would still have the concept of a 
Scalar, but it would be disguised as an Array for execution purposes.

I think such a proposal has its merits, but we should determine where the 
concept of Scalar will remain and make these distinctions clear.
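
As a rough illustration of the 1-element Array idea, the sketch below keeps a scalar type in a front-end API but converts it to a length-1 array before the kernel runs. Everything here is an assumption made for the example: `Int64Scalar`, `Int64Array`, `Datum`, `ToExecArray` and `Add` are hypothetical stand-ins and do not reflect Arrow's actual types or signatures.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <variant>
#include <vector>

// Hypothetical front-end types: callers may pass a scalar or an array.
struct Int64Scalar {
  int64_t value;
};
using Int64Array = std::vector<int64_t>;
using Datum = std::variant<Int64Scalar, Int64Array>;

// For execution, a scalar is disguised as a 1-element array.
inline Int64Array ToExecArray(const Datum& d) {
  if (const auto* s = std::get_if<Int64Scalar>(&d)) return Int64Array{s->value};
  return std::get<Int64Array>(d);
}

// The kernel body sees arrays only; a length-1 input is broadcast.
inline Int64Array Add(const Datum& lhs, const Datum& rhs) {
  const Int64Array a = ToExecArray(lhs);
  const Int64Array b = ToExecArray(rhs);
  assert(a.size() == b.size() || a.size() == 1 || b.size() == 1);
  // Output length follows the side that is not being broadcast.
  const std::size_t n = (a.size() == 1) ? b.size() : a.size();
  Int64Array out(n);
  for (std::size_t i = 0; i < n; ++i) {
    out[i] = a[a.size() == 1 ? 0 : i] + b[b.size() == 1 ? 0 : i];
  }
  return out;
}
```

One limitation of this particular sketch is that a genuine one-row array cannot be told apart from a disguised scalar once it reaches the kernel, which is part of why the places where the Scalar concept remains would need to be spelled out.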



[jira] [Commented] (ARROW-16289) [C++] (eventually) abandon scalar columns of an ExecBatch in favor of RLE encoded arrays

2022-04-22 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17526637#comment-17526637
 ] 

David Li commented on ARROW-16289:
--

The concept of scalars would still exist (e.g. in expressions and options), so 
there's still potential for confusion, though this would reduce it. 
Aggregations would presumably still return scalars, too.

It does seem that being able to accept scalars is more confusing than it's 
worth, though.



[jira] [Commented] (ARROW-16289) [C++] (eventually) abandon scalar columns of an ExecBatch in favor of RLE encoded arrays

2022-04-22 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17526627#comment-17526627
 ] 

Weston Pace commented on ARROW-16289:
-

CC [~lidavidm] [~edponce] [~apitrou] [~michalno] [~yibocai]
