Hi Joris,
I appreciate the ticket. I like the proposed functionality (and the use of a
keyword to switch between map and replace behavior).
I also appreciate the help figuring out the use of make_struct. I got an error
on the values portion when unpacking map.values(), but passing the values
explicitly works in pyarrow:
>>> import pyarrow as pa
>>> import pyarrow.compute as pc
>>>
>>> map = {"a": 1, "b": 2, "c": 3}
>>> input_array = pa.array(["a", "b", "c", "a"])
>>> expected_output = pa.array([1, 2, 3, 1])
>>>
>>> cond = pc.make_struct(*[pc.equal(input_array, val) for val in map.keys()])
>>> pc.case_when(cond, 1, 2, 3)
Thanks,
Ryan
> On Nov 15, 2022, at 2:33 AM, Joris Van den Bossche
> <[email protected]> wrote:
>
> And as an answer to how you can use pyarrow.compute.case_when for this:
>
>>>> map = {"a": 1, "b": 2, "c": 3}
>>>> cond = pc.make_struct(*[pc.equal(input_array, val) for val in map.keys()])
>>>> pc.case_when(cond, *map.values())
> <pyarrow.lib.Int64Array object at 0x7f44a99f32e0>
> [
> 1,
> 2,
> 3,
> 1
> ]
>
> The "case_when" compute function takes the multiple conditions as a
> StructArray, which you can compose using the "make_struct" compute
> function.
> It's certainly not the most user-friendly or obvious way, so we should
> add some examples to the docstring on how to achieve this.
>
> Also, for this specific case where you already have this "mapping"
> of values you want to replace, I think we should have a specialized
> kernel, avoiding the need to materialize a boolean array for each
> value -> https://issues.apache.org/jira/browse/ARROW-10641
>
> Joris
>
>
>> On Mon, 14 Nov 2022 at 19:51, Ryan Kuhns <[email protected]> wrote:
>>
>> Hi,
>>
>> I’ve got one more question as a follow-up to my prior question on working
>> with multi-file zipped CSVs. [1] I figured it was worth asking in another
>> thread so it would be easier for others to see the specific question about
>> case_when.
>>
>> I’m trying to accomplish something like pandas Series.map, where I
>> map the values of an Arrow array to new values.
>>
>> pyarrow.compute.case_when looks like a candidate to solve this, but after
>> reading the docs, I’m still not clear on how to structure the argument to
>> the “cond” parameter or if there is alternative functionality that would be
>> better.
>>
>> Example input, mapping and expected output:
>>
>> import pyarrow as pa
>> import pyarrow.compute as pc
>>
>> map = {"a": 1, "b": 2, "c": 3}
>> input_array = pa.array(["a", "b", "c", "a"])
>> expected_output = pa.array([1, 2, 3, 1])
>>
>> Logic I’m hoping for would be the equivalent of the following SQL:
>>
>> Case
>> when input_array = 'a' then 1
>> when input_array = 'b' then 2
>> when input_array = 'c' then 3
>> else input_array
>> End
>>
>> Or alternatively, if the input array were a pandas Series, then
>> input_array.map(map).
>>
>> Thanks again,
>>
>> Ryan
>>
>>
>>
>>
>>
>> [1] https://www.mail-archive.com/[email protected]/msg02379.html