Re: pyarrow.compute case_when

Ryan Kuhns Tue, 15 Nov 2022 05:57:17 -0800

Hi Joris,

I appreciate the ticket. I like the proposed functionality (and the use of a 
keyword to switch between map and replace behavior).


Thanks also for the help figuring out the use of make_struct. I got an error 
on the values portion when using unpacking (*map.values()). Passing the values 
explicitly works in pyarrow:

>>> import pyarrow as pa
>>> import pyarrow.compute as pc
>>>
>>> map = {"a": 1, "b": 2, "c": 3}
>>> input_array = pa.array(["a", "b", "c", "a"])
>>> expected_output = pa.array([1, 2, 3, 1])
>>>
>>> cond = pc.make_struct(*[pc.equal(input_array, val) for val in map.keys()])
>>> pc.case_when(cond, 1, 2, 3)

Thanks,

Ryan

> On Nov 15, 2022, at 2:33 AM, Joris Van den Bossche 
> <[email protected]> wrote:
> 
> And as an answer to how you can use pyarrow.compute.case_when for this:
> 
> >>> map = {"a": 1, "b": 2, "c": 3}
> >>> cond = pc.make_struct(*[pc.equal(input_array, val) for val in map.keys()])
> >>> pc.case_when(cond, *map.values())
> <pyarrow.lib.Int64Array object at 0x7f44a99f32e0>
> [
>  1,
>  2,
>  3,
>  1
> ]
> 
> The "case_when" compute function takes the multiple conditions as a
> StructArray, which you can compose using the "make_struct" compute
> function.
> It's not the most user-friendly or obvious way, so we should
> add some examples to the docstring on how to achieve this.
> 
> Also, for this specific case where you already have this "mapping"
> of values you want to replace, I think we should have a specialized
> kernel, avoiding the need to materialize a boolean array for each
> value -> https://issues.apache.org/jira/browse/ARROW-10641
> 
> Joris
> 
> 
>> On Mon, 14 Nov 2022 at 19:51, Ryan Kuhns <[email protected]> wrote:
>> 
>> Hi,
>> 
>> I’ve got one more question as a follow-up to my prior question on working 
>> with multi-file zipped CSVs. [1] I figured it was worth asking in another 
>> thread so it would be easier for others to find the specific question about 
>> case_when.
>> 
>> I’m trying to accomplish something like pandas Series.map, where I 
>> map the values of an Arrow array to new values.
>> 
>> pyarrow.compute.case_when looks like a candidate to solve this, but after 
>> reading the docs, I’m still not clear on how to structure the argument to 
>> the "cond" parameter or whether there is alternative functionality that 
>> would be better.
>> 
>> Example input, mapping and expected output:
>> 
>> import pyarrow as pa
>> import pyarrow.compute as pc
>> 
>> map = {"a": 1, "b": 2, "c": 3}
>> input_array = pa.array(["a", "b", "c", "a"])
>> expected_output = pa.array([1, 2, 3, 1])
>> 
>> Logic I’m hoping for would be the equivalent of the following SQL:
>> 
>> CASE
>>    WHEN input_array = 'a' THEN 1
>>    WHEN input_array = 'b' THEN 2
>>    WHEN input_array = 'c' THEN 3
>>    ELSE input_array
>> END
>> 
>> Or alternatively, if the input array were a pandas Series, then 
>> input_array.map(map).
>> 
>> Thanks again,
>> 
>> Ryan
>> 
>> 
>> 
>> 
>> 
>> [1] https://www.mail-archive.com/[email protected]/msg02379.html
