[ https://issues.apache.org/jira/browse/ARROW-16289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17526656#comment-17526656 ]
Eduardo Ponce commented on ARROW-16289:
---------------------------------------

The term Scalar is used in different (but related) contexts: for example, Scalar values, Scalar kernels, and Scalar expressions. I recall an ad-hoc conversation last year in which it was discussed that we should consider treating Scalars as 1-element Arrays to make the compute layer logic more straightforward. The front-end API would still have the concept of a Scalar, but it would be disguised as an Array for execution purposes. I think such a proposal has its merits, but we should establish where the concept of Scalar will remain and make these distinctions clear.

> [C++] (eventually) abandon scalar columns of an ExecBatch in favor of RLE encoded arrays
> ----------------------------------------------------------------------------------------
>
>                 Key: ARROW-16289
>                 URL: https://issues.apache.org/jira/browse/ARROW-16289
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Weston Pace
>            Priority: Major
>
> This JIRA is a proposal / discussion. I am not asserting this is the way to
> go but I would like to consider it.
>
> From the execution engine's perspective an exec batch's columns are always
> either arrays or scalars. The only time we make use of scalars today is for
> the four augmented columns (e.g. __filename). Once we have support for RLE
> arrays, a scalar could easily be encoded as an RLE array and there would be no
> need to use scalars here.
>
> The advantage would be reducing the complexity in exec nodes and avoiding
> issues like ARROW-16288. It is already rather difficult to explain the idea
> of a "scalar" and "vector" function and then have to turn around and explain
> that the word "scalar" has an entirely different meaning when talking about
> field shape.
>
> I think it's worth considering taking this even further and removing the
> concept from the compute layer entirely. Kernel functions that want to have
> special logic for scalars could do so using the RLE array. This would be a
> significant change to many kernels which currently declare the ANY shape and
> determine which logic to apply within the kernel itself (e.g. there is one
> array OR scalar kernel and not one kernel for each).
>
> Admittedly there are probably a few more instructions and a few more bytes
> needed to handle an RLE scalar than the scalar we have today. However, these
> are just different flavors of O(1) and not likely to have significant impact.
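
To make the idea in the quoted description concrete, here is a minimal sketch of a kernel that special-cases a "scalar" encoded as a single-run RLE column. It is only an illustration under stated assumptions: RleColumn, MakeConstant and Add are hypothetical names, the layout (one exclusive run end plus one value per run) is assumed for the example, and none of this is the Arrow C++ API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified RLE column (NOT an Arrow type).
// run_ends[i] is the exclusive logical end row of run i; values[i] is its value.
struct RleColumn {
  std::vector<int64_t> run_ends;
  std::vector<int64_t> values;

  int64_t length() const { return run_ends.empty() ? 0 : run_ends.back(); }
  // A "scalar" column is simply one run that spans the whole batch.
  bool is_constant() const { return run_ends.size() == 1; }
};

// Encode a scalar broadcast over `length` rows as a single run.
RleColumn MakeConstant(int64_t value, int64_t length) {
  assert(length > 0);
  return RleColumn{{length}, {value}};
}

// Element-wise add of a plain array and an RLE column of the same length.
std::vector<int64_t> Add(const std::vector<int64_t>& left, const RleColumn& right) {
  assert(static_cast<int64_t>(left.size()) == right.length());
  std::vector<int64_t> out(left.size());
  if (right.is_constant()) {
    // Scalar fast path: one value, no per-run bookkeeping.
    const int64_t v = right.values[0];
    for (std::size_t i = 0; i < left.size(); ++i) out[i] = left[i] + v;
    return out;
  }
  // General path: walk the runs in order.
  std::size_t row = 0;
  for (std::size_t run = 0; run < right.run_ends.size(); ++run) {
    const std::size_t end = static_cast<std::size_t>(right.run_ends[run]);
    for (; row < end; ++row) out[row] = left[row] + right.values[run];
  }
  return out;
}
```

In this sketch the kernel has a single signature (array, RLE column), and the scalar case costs only one extra size check, which is the "different flavors of O(1)" trade-off mentioned above.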