I think C++14 is fine for optional dependencies and shouldn't block
any development work right now. Note that we should be able to upgrade
to require a minimum of C++14 as soon as April or May of this year
since we will stop having to support one of the last gcc < 5
toolchains (for R 3.5, IIUC).

On Wed, Mar 3, 2021 at 5:41 PM Yeshwanth Sriram <[email protected]> wrote:
>
> Hi Micah,
>
> Thank you for the detailed response. Apologies for not responding earlier.
>
> a.) I looked at the latencies with and without filtering, using just the 
> foreach loops, and the latency is dominated by the Parquet write operation. 
> So I’m going to go with what I have, which already provides a substantial 
> improvement for my use case.
>
> b.) I would like to contribute an implementation of ANY over booleans as an 
> Arrow compute kernel. I am waiting for permission to come through.
>
> I’m also interested in contributing an Azure/ADLS filesystem, but the library 
> I was looking at, https://github.com/Azure/azure-sdk-for-cpp , requires C++14. 
> Is C++14 a no-go as a dependency in Arrow (even a conditional one)?
>
> Thank you
> Yesh
>
> On Feb 28, 2021, at 2:09 PM, Micah Kornfield <[email protected]> wrote:
>
> Hi Yeshwanth,
> I think you can do the first part of the filtering using the Equals kernel 
> and IsIn kernel on the child arrays of the Map.  I took a quick look but I 
> don't think that there is anything implemented that would allow you to map 
> the resulting bitmaps to the parent lists. It seems that we would want to add 
> an "Any" function for List<Bool> that returns a Bool array if any of the 
> elements are true. There is already one for flat Boolean Arrays [1] but I 
> don't think that is useful here.
>
> So I think the logic that you would ultimately want in pseudo-code:
>
> children_bitmap = Equals(map.key, "some string") && IsIn(map.struct.id, 
> [“aaa”, “bee”, “see”])
> list = MakeList(map.offsets, children_bitmap)
> final_selection = Any(list)
>
> Is the new Kernel something you would be interested in contributing?
>
> -Micah
>
> [1] https://github.com/apache/arrow/pull/8294
>
> On Sun, Feb 28, 2021 at 9:05 AM Yeshwanth Sriram <[email protected]> 
> wrote:
>>
>> I am using C++/Arrow to filter large Parquet files, and I’m able to do this 
>> successfully. The current PoC implementation is based on nested for loops, 
>> which I would like to avoid. Instead I would like to use the built-in 
>> filter/take functions, or get some recommendations on how to extract arrays 
>> of indices or booleans to filter out rows.
>>
>> The input (data) array/column type is MapArray[key:String, 
>> value:StructArray[id:String, …]]
>>
>> The input filter is {filter_key: “some string”, filter_ids: [“aaa”, “bee”, 
>> “see”, ..] }
>>   - where filter_key and filter_ids are matched against the contents of the 
>>     input MapArray
>>
>> The output I’m looking for is either an array of booleans or the indices of 
>> the input array that match the input filter.
>>
>> Thank you
>
>
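
For illustration, the three pseudo-code steps quoted above (a per-entry mask
from Equals and IsIn, regrouping by the map offsets, then Any per list) can be
sketched in plain C++. This is only a sketch under stated assumptions: it uses
std::vector rather than Arrow arrays or Arrow compute kernels, and the names
FlatMapColumn, ChildMask and AnyPerRow are hypothetical, not part of Arrow.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical flattened view of MapArray[key:String, value:Struct[id:String, ...]].
// Row i owns the child entries in the half-open range [offsets[i], offsets[i + 1]).
struct FlatMapColumn {
  std::vector<int32_t> offsets;   // size is number of rows + 1
  std::vector<std::string> keys;  // one key per child entry
  std::vector<std::string> ids;   // one struct id per child entry
};

// Step 1: per-entry mask, the analogue of Equals(key) && IsIn(id).
std::vector<bool> ChildMask(const FlatMapColumn& col,
                            const std::string& filter_key,
                            const std::unordered_set<std::string>& filter_ids) {
  std::vector<bool> mask(col.keys.size(), false);
  for (std::size_t j = 0; j < col.keys.size(); ++j) {
    mask[j] = col.keys[j] == filter_key && filter_ids.count(col.ids[j]) > 0;
  }
  return mask;
}

// Steps 2 and 3: regroup the mask by the offsets and reduce each group with
// "any". An empty group (a row with no entries) yields false.
std::vector<bool> AnyPerRow(const std::vector<int32_t>& offsets,
                            const std::vector<bool>& child_mask) {
  const std::size_t num_rows = offsets.empty() ? 0 : offsets.size() - 1;
  std::vector<bool> selected(num_rows, false);
  for (std::size_t i = 0; i < num_rows; ++i) {
    for (int32_t j = offsets[i]; j < offsets[i + 1]; ++j) {
      if (child_mask[static_cast<std::size_t>(j)]) {
        selected[i] = true;
        break;  // one matching entry is enough for this row
      }
    }
  }
  return selected;
}
```

Calling AnyPerRow(col.offsets, ChildMask(col, key, ids)) gives one boolean per
input row, which is the shape of selection mask the original question asks for.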
