Hi Olo,
In the case of filtering a single column by values in that column, then
at least theoretically iterating the array twice may be more
inefficient, correct. This design predates my involvement in the project
so I can't speak to the exact design but some thoughts for your
consideration:
* Most query use-cases will compute the predicate and then apply it to
more than one column, computing it once saves effort for a very low
additional memory overhead
* Knowing the BooleanArray up front allows cheaply counting the set bits
within filter to determine:
* The amount of capacity to allocate, avoiding potentially
expensive reallocations
* The optimal iteration strategy, e.g. copying large consecutive
ranges using SIMD instructions
* Elide bounds checks
* Having kernels that evaluate predicates and select simultaneously,
e.g. by passing in a generic predicate function
* Would likely result in a codegen explosion from monomorphisation,
hurting compile times and binary size
* Are extremely unlikely to vectorise correctly, LLVM struggles
with practically any conditional branch in the loop body, even
Iterator::next
I therefore think the current design makes sense, but we're always open
to suggestions for improvements. I am sure if you filed an issue with
some compelling benchmark results, it would spark something interesting
Kind Regards,
Raphael Taylor-Davies
On 04/11/2022 17:12, Olo Sawyerr wrote:
Hi Raphael,
Thanks for the info. Yes, makes sense now and the example in the PR is
clear.
I had a question about the separation design however, doesn't that
mean the Array is iterated through twice? - Once to determine what to
filter out, and then a second time to actually perform the filtering,
would this not be inefficient? How come there's no option to filter
out on the first iteration rather than a 2-step process?
Regards,
Olo
------------------------------------------------------------------------
*From:* Raphael Taylor-Davies <[email protected]>
*Sent:* 04 November 2022 02:05
*To:* [email protected] <[email protected]>
*Subject:* Re: [RUST] Compute kernel filtering
Hi Olo,
Typically you will evaluate some predicate in order to obtain a
BooleanArray, potentially using one of the comparison
<https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdocs.rs%2Farrow%2Flatest%2Farrow%2Fcompute%2Fkernels%2Fcomparison%2Findex.html&data=05%7C01%7C%7C7aa4e73b0390462f0ff408dabe09199c%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C638031243542288038%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=If5U9ltVdnMcKDb5NKuCy9bcjk%2Bg1nWJc3Mpfyg5cbw%3D&reserved=0>
kernels, and then pass this to the filter kernel to filter the actual
values. This separation not only allows both kernels to be implemented
more efficiently, but allows filtering an array with a predicate
evaluated on a different array.
I've created this PR
<https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Farrow-rs%2Fpull%2F3014&data=05%7C01%7C%7C7aa4e73b0390462f0ff408dabe09199c%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C638031243542288038%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=E%2BP4SYzTWqvs0eXpJxt3i1XV3xxfJhaDiT%2BesLgX%2Bmc%3D&reserved=0>
to add an example doctest of this. Please let me know if anything
isn't clear
Kind Regards,
Raphael Taylor-Davies
On 04/11/2022 06:48, Olo Sawyerr wrote:
Hi there,
Hope you're well.
I have a question about the *arrow
<https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdocs.rs%2Farrow%2Flatest%2Farrow%2Findex.html&data=05%7C01%7C%7C7aa4e73b0390462f0ff408dabe09199c%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C638031243542288038%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=ADzef7HqEmsOGDjDRBv0ZAhD86nHO068oE%2FewsxWWPY%3D&reserved=0>::compute
<https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdocs.rs%2Farrow%2Flatest%2Farrow%2Fcompute%2Findex.html&data=05%7C01%7C%7C7aa4e73b0390462f0ff408dabe09199c%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C638031243542288038%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=DTVQVNFxVjdkNjhNrybcV%2FulmoLGGe96ejViJAbDnUk%3D&reserved=0>::kernels
<https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdocs.rs%2Farrow%2Flatest%2Farrow%2Fcompute%2Fkernels%2Findex.html&data=05%7C01%7C%7C7aa4e73b0390462f0ff408dabe09199c%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C638031243542288038%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=OzchweJlbpEyBqy%2F5NBuvkWWpDqKQ8c0e8CN3JuorrU%3D&reserved=0>::filter
<https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdocs.rs%2Farrow%2Flatest%2Farrow%2Fcompute%2Fkernels%2Ffilter%2Ffn.filter.html%23&data=05%7C01%7C%7C7aa4e73b0390462f0ff408dabe09199c%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C638031243542288038%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=htFFG42e9sRpHAKYR0FXb8e2KE2%2BKzwz6bjA4R4DGRU%3D&reserved=0>::filter
<https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdocs.rs%2Farrow%2Flatest%2Farrow%2Fcompute%2Fkernels%2Ffilter%2Findex.html&data=05%7C01%7C%7C7aa4e73b0390462f0ff408dabe09199c%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C638031243542288038%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=9SpH0ldv1w3vLe5qQk3mgx2f52wu8J%2Bl7Im%2FU68o8Lg%3D&reserved=0>()***function*.* The
function takes a predicate which is a BooleanArray to do the
filtering. How should this be used in practice? One usually
doesn'tknow which item should be filtered out or not in advance
before you actually do the filter, so won't know what the
BooleanArray should contain.
I would have thought that the predicate would be a function instead
to test if each item should be filtered out or not.
How is arrow
<https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdocs.rs%2Farrow%2Flatest%2Farrow%2Findex.html&data=05%7C01%7C%7C7aa4e73b0390462f0ff408dabe09199c%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C638031243542288038%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=ADzef7HqEmsOGDjDRBv0ZAhD86nHO068oE%2FewsxWWPY%3D&reserved=0>::compute
<https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdocs.rs%2Farrow%2Flatest%2Farrow%2Fcompute%2Findex.html&data=05%7C01%7C%7C7aa4e73b0390462f0ff408dabe09199c%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C638031243542288038%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=DTVQVNFxVjdkNjhNrybcV%2FulmoLGGe96ejViJAbDnUk%3D&reserved=0>::kernels
<https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdocs.rs%2Farrow%2Flatest%2Farrow%2Fcompute%2Fkernels%2Findex.html&data=05%7C01%7C%7C7aa4e73b0390462f0ff408dabe09199c%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C638031243542288038%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=OzchweJlbpEyBqy%2F5NBuvkWWpDqKQ8c0e8CN3JuorrU%3D&reserved=0>::filter
<https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdocs.rs%2Farrow%2Flatest%2Farrow%2Fcompute%2Fkernels%2Ffilter%2Ffn.filter.html%23&data=05%7C01%7C%7C7aa4e73b0390462f0ff408dabe09199c%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C638031243542288038%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=htFFG42e9sRpHAKYR0FXb8e2KE2%2BKzwz6bjA4R4DGRU%3D&reserved=0>::filter()
meant to be used in practice?
Thanks.