Hi Micah,

Thank you for the detailed response. Apologies for not responding earlier.

a.) I looked at the latencies with and without filtering, using just the foreach-based implementation, and the latency is dominated by the Parquet write operation. So I’m going to go with what I have, which already provides a substantial improvement for my use case.

b.) I would like to contribute an implementation of ANY over booleans in the Arrow compute kernels. I am waiting for permission to come through.

I’m also interested in contributing an Azure/ADLS filesystem, but the library I was looking at is C++14: https://github.com/Azure/azure-sdk-for-cpp
Is C++14 a no-go as a dependency in Arrow (even a conditional one)?

Thank you
Yesh

> On Feb 28, 2021, at 2:09 PM, Micah Kornfield <[email protected]> wrote:
>
> Hi Yeshwanth,
> I think you can do the first part of the filtering using the Equals kernel
> and the IsIn kernel on the child arrays of the Map. I took a quick look, but I
> don't think there is anything implemented that would allow you to map
> the resulting bitmaps to the parent lists. It seems that we would want to add
> an "Any" function for List<Bool> that returns a Bool array indicating whether
> any of the elements are true. There is already one for flat Boolean arrays [1],
> but I don't think that is useful here.
>
> So I think the logic that you would ultimately want, in pseudo-code, is:
>
> children_bitmap = Equals(map.key, "some string") && IsIn(map.struct.id, [“aaa”, “bee”, “see”])
> list = MakeList(map.offsets, children_bitmap)
> final_selection = Any(list)
>
> Is the new kernel something you would be interested in contributing?
>
> -Micah
>
> [1] https://github.com/apache/arrow/pull/8294
>
> On Sun, Feb 28, 2021 at 9:05 AM Yeshwanth Sriram <[email protected]> wrote:
>> I am using C++/Arrow to filter large Parquet files, and I’m able to do this
>> successfully. The current PoC implementation is based on nested for loops,
>> which I would like to avoid. Instead I would like to use the built-in
>> filter/take functions, or get some recommendations on how to extract
>> (take functions?) arrays of indices or booleans to filter out rows.
>>
>> The input (data) array/column type is MapArray[key:String,
>> value:StructArray[id:String, …]]
>>
>> The input filter is a {filter_key: “some string”, filter_ids: [“aaa”, “bee”,
>> “see”, ..] }
>> - Where filter_key and filter_ids are to be matched against the contents of
>>   the input MapArray
>>
>> The output I’m looking for is either an array of booleans or the indices of
>> the input array that match the input filter.
>>
>> Thank you
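
For reference, the three pseudo-code steps discussed in this thread (Equals && IsIn on the flattened children, regrouping by the map offsets, then Any per list) can be sketched without Arrow at all. This is only an illustration of the logic on plain std::vector data, under the assumption that the map column is laid out as an offsets array plus flattened key and id children. MapColumn, ChildrenBitmap and AnyPerList are hypothetical names invented for this sketch; they are not Arrow APIs, and this is not the proposed kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical plain-vector stand-in for a Map<String, Struct<id: String>> column.
// Row i owns the flattened entries [offsets[i], offsets[i + 1]).
struct MapColumn {
  std::vector<int32_t> offsets;   // size = number of rows + 1
  std::vector<std::string> keys;  // flattened map keys
  std::vector<std::string> ids;   // flattened value.id field
};

// Step 1: children_bitmap = Equals(key, filter_key) && IsIn(id, filter_ids)
std::vector<bool> ChildrenBitmap(const MapColumn& col,
                                 const std::string& filter_key,
                                 const std::unordered_set<std::string>& filter_ids) {
  assert(col.keys.size() == col.ids.size());
  std::vector<bool> out(col.keys.size());
  for (std::size_t j = 0; j < col.keys.size(); ++j) {
    out[j] = col.keys[j] == filter_key && filter_ids.count(col.ids[j]) > 0;
  }
  return out;
}

// Steps 2 and 3: view the children bitmap through the parent offsets and
// reduce each row with "any". An empty row yields false.
std::vector<bool> AnyPerList(const std::vector<int32_t>& offsets,
                             const std::vector<bool>& children) {
  assert(!offsets.empty());
  std::vector<bool> out(offsets.size() - 1, false);
  for (std::size_t i = 0; i + 1 < offsets.size(); ++i) {
    for (int32_t j = offsets[i]; j < offsets[i + 1]; ++j) {
      if (children[static_cast<std::size_t>(j)]) {
        out[i] = true;
        break;  // one matching entry is enough for this row
      }
    }
  }
  return out;
}
```

The result of AnyPerList has one boolean per input row, which is the shape a filter function would expect as its selection mask. Null handling is deliberately left out of this sketch.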
