tustvold opened a new issue, #2596:
URL: https://github.com/apache/arrow-rs/issues/2596
**Is your feature request related to a problem or challenge? Please describe
what you are trying to do.**
Now that we provide scalar comparison kernels, it is relatively rare that a
query needs to perform array-array comparison. In fact, DataFusion now has only
one test that calls into this code, as the join filter tests happen to have a
predicate comparing across columns. Further, comparing two dictionary arrays,
or a dictionary array to some other type of array, is a rare case of a rare
case.
Unfortunately, generating the comparison kernels for dictionary arrays is
hugely expensive, as it requires generating code for each unique combination of
index and value type. For many use-cases this is a significant compile-time
overhead for no benefit.
The scalar kernels do not have this issue as the predicate is evaluated
against the underlying values first, and then this boolean array is unpacked
based on the dictionary. As neither stage is parameterized on both the key and
value types, the combinatorial explosion is avoided.
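The two-stage approach described above can be sketched as follows. This is a minimal illustration using plain slices, not the arrow-rs implementation: the function names `eq_scalar_values`, `unpack_by_keys`, and `demo` are hypothetical, and null handling is ignored. The point is that the first function is generic only over the value type and the second only over the key type, so the compiler never instantiates code per (key, value) pair.

```rust
use std::convert::TryInto;

/// Stage 1 (hypothetical name): evaluate the predicate once per distinct
/// dictionary value. Generic over the value type only.
fn eq_scalar_values<V: PartialEq>(values: &[V], scalar: &V) -> Vec<bool> {
    values.iter().map(|v| v == scalar).collect()
}

/// Stage 2 (hypothetical name): unpack the per-value results through the
/// dictionary keys. Generic over the key type only.
fn unpack_by_keys<K: Copy + TryInto<usize>>(keys: &[K], per_value: &[bool]) -> Vec<bool> {
    keys.iter()
        .map(|&k| {
            // Assumes every key is a valid, non-negative index into the values.
            let idx: usize = k.try_into().ok().expect("key must be a valid index");
            per_value[idx]
        })
        .collect()
}

/// Compare a small dictionary-encoded column against the scalar "a".
fn demo() -> Vec<bool> {
    let values = vec!["a", "b", "c"]; // distinct dictionary values
    let keys: Vec<i32> = vec![0, 2, 1, 0]; // logical column: a, c, b, a
    let per_value = eq_scalar_values(&values, &"a");
    unpack_by_keys(&keys, &per_value)
}
```

Calling `demo()` compares the three distinct values once and then fans the result out over the four keys. Supporting another key width or another value type in this sketch adds one instantiation of one stage, not one per combination.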
**Describe the solution you'd like**
I would like a feature flag that disables code generation for dictionary
comparison kernels.
On my local machine, this yields the following for a release build:
```
________________________________________________________
Executed in   21.08 secs    fish           external
   usr time  153.14 secs  549.00 micros  153.14 secs
   sys time    6.23 secs  106.00 micros    6.23 secs
```
Compared to master:
```
________________________________________________________
Executed in   84.87 secs    fish           external
   usr time  224.12 secs  677.00 micros  224.11 secs
   sys time    6.44 secs    0.00 micros    6.44 secs
```
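Going by the wall-clock figures above (21.08 secs vs 84.87 secs), that is a ~75% reduction in build time, or roughly 4x faster.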
**Describe alternatives you've considered**
I thought about other feature flags, and I think we could definitely consider
some of these as follow-ups, but `dyn_cmp_dict` is the big hitter:
* `dyn_cmp` - just feature flag the dyn comparison kernels as a whole
* `dyn_cmp_distinct` - just feature flag dyn comparison against different
types of array
* `dict_non_i32` - feature flag support for non-i32 dictionaries
(potentially really fiddly to implement)
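As a rough sketch of how a flag such as `dyn_cmp_dict` could gate the dictionary kernels, the example below uses `#[cfg(feature = ...)]` on both the kernel and the dispatch arm. Everything here is hypothetical and heavily simplified: `Kind`, `eq_dict`, and `eq_dyn` are illustrative stand-ins rather than arrow-rs items, the "dictionary" kernel is a placeholder, and the feature is assumed to be declared under `[features]` in `Cargo.toml`.

```rust
/// Hypothetical, simplified type tag; a stand-in for a real data type enum.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Kind {
    Primitive,
    Dictionary,
}

/// Placeholder for the expensive dictionary kernels. It is only compiled
/// when the assumed `dyn_cmp_dict` feature is enabled.
#[cfg(feature = "dyn_cmp_dict")]
fn eq_dict(left: &[i32], right: &[i32]) -> Result<Vec<bool>, String> {
    Ok(left.iter().zip(right).map(|(l, r)| l == r).collect())
}

/// Dynamic dispatch entry point. With the feature off, the dictionary arm
/// compiles to a cheap runtime error instead of pulling in the kernels.
fn eq_dyn(kind: Kind, left: &[i32], right: &[i32]) -> Result<Vec<bool>, String> {
    match kind {
        Kind::Primitive => Ok(left.iter().zip(right).map(|(l, r)| l == r).collect()),
        #[cfg(feature = "dyn_cmp_dict")]
        Kind::Dictionary => eq_dict(left, right),
        #[cfg(not(feature = "dyn_cmp_dict"))]
        Kind::Dictionary => Err(
            "comparing dictionary arrays requires the `dyn_cmp_dict` feature".to_string(),
        ),
    }
}
```

With this shape, users who never compare dictionary arrays pay no compile-time cost for those kernels, and anyone who does hit that path gets an error naming the feature to enable.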
**Additional context**
Related to #2594
@alamb @psvri