Hi Ryan,

Others will be able to guide you better than me on your actual question,
but I wanted to mention an alternative approach just in case.

For my own needs, I've found it very productive to express my queries in
duckdb (where you can choose between SQL and its relational API) on top of
a pyarrow dataset (or a scanner, if you prefer). If you're coming from SQL,
this approach lets you stay in SQL, for example.
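
To make that concrete, here is a rough sketch of what I mean, using the
example data from your message below (the connection setup and the
"vals"/"mapped" names are just mine for illustration):

import duckdb
import pyarrow as pa

input_array = pa.array(["a", "b", "c", "a"])
arrow_table = pa.table({"vals": input_array})

con = duckdb.connect()
con.register("arrow_table", arrow_table)  # expose the Arrow table to SQL

mapped = con.execute(
    """
    SELECT CASE vals
               WHEN 'a' THEN 1
               WHEN 'b' THEN 2
               WHEN 'c' THEN 3
           END AS mapped
    FROM arrow_table
    """
).arrow()  # fetch the result back as a pyarrow Table

The same pattern works if you register a pyarrow dataset (or scanner)
instead of an in-memory table.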

Adam

On Mon, Nov 14, 2022, 1:51 PM Ryan Kuhns <[email protected]> wrote:

> Hi,
>
> I’ve got one more question as a follow-up to my prior question on working
> with multi-file zipped CSVs. [1] Figured it was worth asking in another
> thread so it would be easier for others to see the specific question about
> case_when.
>
> I’m trying to accomplish something like pandas Series.map, where I map the
> values of an Arrow array to new values.
>
> pyarrow.compute.case_when looks like a candidate to solve this, but after
> reading the docs, I’m still not clear on how to structure the argument to
> the “cond” parameter or if there is alternative functionality that would be
> better.
>
> Example input, mapping and expected output:
>
> import pyarrow as pa
> import pyarrow.compute as pc
>
> map = {"a": 1, "b": 2, "c": 3}
> input_array = pa.array(["a", "b", "c", "a"])
> expected_output = pa.array([1, 2, 3, 1])
>
> The logic I’m hoping for would be the equivalent of the following SQL:
>
> CASE
>     WHEN input_array = 'a' THEN 1
>     WHEN input_array = 'b' THEN 2
>     WHEN input_array = 'c' THEN 3
>     ELSE input_array
> END
>
> Or alternatively, if input_array were a pandas Series, then
> input_array.map(map).
>
> Thanks again,
>
> Ryan
>
> [1] https://www.mail-archive.com/[email protected]/msg02379.html
