Hi Olo,

In the case of filtering a single column by values in that column, then at least theoretically iterating the array twice may be more inefficient, correct. This design predates my involvement in the project so I can't speak to the exact design but some thoughts for your consideration:


* Most query use-cases will compute the predicate and then apply it to more than one column, computing it once saves effort for a very low additional memory overhead

* Knowing the BooleanArray up front allows cheaply counting the set bits within filter to determine:

    * The amount of capacity to allocate, avoiding potentially expensive reallocations

    * The optimal iteration strategy, e.g. copying large consecutive ranges using SIMD instructions

    * Elide bounds checks

* Having kernels that evaluate predicates and select simultaneously, e.g. by passing in a generic predicate function

    * Would likely result in a codegen explosion from monomorphisation, hurting compile times and binary size

    * Are extremely unlikely to vectorise correctly, LLVM struggles with practically any conditional branch in the loop body, even Iterator::next


I therefore think the current design makes sense, but we're always open to suggestions for improvements. I am sure if you filed an issue with some compelling benchmark results, it would spark something interesting


Kind Regards,


Raphael Taylor-Davies


On 04/11/2022 17:12, Olo Sawyerr wrote:
Hi Raphael,

Thanks for the info. Yes, makes sense now and the example in the PR is clear.

I had a question about the separation design however, doesn't that mean the Array is iterated through twice? - Once to determine what to filter out, and then a second time to actually perform the filtering, would this not be inefficient? How come there's no option to filter out on the first iteration rather than a 2-step process?

Regards,

Olo


------------------------------------------------------------------------
*From:* Raphael Taylor-Davies <[email protected]>
*Sent:* 04 November 2022 02:05
*To:* [email protected] <[email protected]>
*Subject:* Re: [RUST] Compute kernel filtering

Hi Olo,


Typically you will evaluate some predicate in order to obtain a BooleanArray, potentially using one of the comparison <https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdocs.rs%2Farrow%2Flatest%2Farrow%2Fcompute%2Fkernels%2Fcomparison%2Findex.html&data=05%7C01%7C%7C7aa4e73b0390462f0ff408dabe09199c%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C638031243542288038%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=If5U9ltVdnMcKDb5NKuCy9bcjk%2Bg1nWJc3Mpfyg5cbw%3D&reserved=0> kernels, and then pass this to the filter kernel to filter the actual values. This separation not only allows both kernels to be implemented more efficiently, but allows filtering an array with a predicate evaluated on a different array.


I've created this PR <https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Farrow-rs%2Fpull%2F3014&data=05%7C01%7C%7C7aa4e73b0390462f0ff408dabe09199c%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C638031243542288038%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=E%2BP4SYzTWqvs0eXpJxt3i1XV3xxfJhaDiT%2BesLgX%2Bmc%3D&reserved=0> to add an example doctest of this. Please let me know if anything isn't clear


Kind Regards,


Raphael Taylor-Davies


On 04/11/2022 06:48, Olo Sawyerr wrote:
Hi there,

Hope you're well.

I have a question about the *arrow <https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdocs.rs%2Farrow%2Flatest%2Farrow%2Findex.html&data=05%7C01%7C%7C7aa4e73b0390462f0ff408dabe09199c%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C638031243542288038%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=ADzef7HqEmsOGDjDRBv0ZAhD86nHO068oE%2FewsxWWPY%3D&reserved=0>::compute <https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdocs.rs%2Farrow%2Flatest%2Farrow%2Fcompute%2Findex.html&data=05%7C01%7C%7C7aa4e73b0390462f0ff408dabe09199c%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C638031243542288038%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=DTVQVNFxVjdkNjhNrybcV%2FulmoLGGe96ejViJAbDnUk%3D&reserved=0>::kernels <https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdocs.rs%2Farrow%2Flatest%2Farrow%2Fcompute%2Fkernels%2Findex.html&data=05%7C01%7C%7C7aa4e73b0390462f0ff408dabe09199c%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C638031243542288038%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=OzchweJlbpEyBqy%2F5NBuvkWWpDqKQ8c0e8CN3JuorrU%3D&reserved=0>::filter <https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdocs.rs%2Farrow%2Flatest%2Farrow%2Fcompute%2Fkernels%2Ffilter%2Ffn.filter.html%23&data=05%7C01%7C%7C7aa4e73b0390462f0ff408dabe09199c%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C638031243542288038%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=htFFG42e9sRpHAKYR0FXb8e2KE2%2BKzwz6bjA4R4DGRU%3D&reserved=0>::filter <https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdocs.rs%2Farrow%2Flatest%2Farrow%2Fcompute%2Fkernels%2Ffilter%2Findex.html&data=05%7C01%7C%7C7aa4e73b0390462f0ff408dabe09199c%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C638031243542288038%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=9SpH0ldv1w3vLe5qQk3mgx2f52wu8J%2Bl7Im%2FU68o8Lg%3D&reserved=0>()***function*.* The function takes a predicate which is a BooleanArray to do the filtering. How should this be used in practice? One usually doesn'tknow which item should be filtered out or not in advance before you actually do the filter, so won't know what the BooleanArray should contain.

I would have thought that the predicate would be a function instead to test if each item should be filtered out or not.

How is arrow <https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdocs.rs%2Farrow%2Flatest%2Farrow%2Findex.html&data=05%7C01%7C%7C7aa4e73b0390462f0ff408dabe09199c%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C638031243542288038%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=ADzef7HqEmsOGDjDRBv0ZAhD86nHO068oE%2FewsxWWPY%3D&reserved=0>::compute <https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdocs.rs%2Farrow%2Flatest%2Farrow%2Fcompute%2Findex.html&data=05%7C01%7C%7C7aa4e73b0390462f0ff408dabe09199c%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C638031243542288038%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=DTVQVNFxVjdkNjhNrybcV%2FulmoLGGe96ejViJAbDnUk%3D&reserved=0>::kernels <https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdocs.rs%2Farrow%2Flatest%2Farrow%2Fcompute%2Fkernels%2Findex.html&data=05%7C01%7C%7C7aa4e73b0390462f0ff408dabe09199c%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C638031243542288038%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=OzchweJlbpEyBqy%2F5NBuvkWWpDqKQ8c0e8CN3JuorrU%3D&reserved=0>::filter <https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdocs.rs%2Farrow%2Flatest%2Farrow%2Fcompute%2Fkernels%2Ffilter%2Ffn.filter.html%23&data=05%7C01%7C%7C7aa4e73b0390462f0ff408dabe09199c%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C638031243542288038%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=htFFG42e9sRpHAKYR0FXb8e2KE2%2BKzwz6bjA4R4DGRU%3D&reserved=0>::filter() meant to be used in practice?

Thanks.

Reply via email to