Hi Ryan,

Others will be able to guide you better than me on your actual question, but I wanted to mention an alternative approach just in case.
For my own needs, I tended to find it very productive to express my queries
in duckdb (for which you can choose SQL or its relational API) on top of a
pyarrow dataset (or a scanner, if you prefer). If you're coming from SQL,
this approach could let you remain in SQL, for example (there's a rough
sketch at the end of this message).

Adam

On Mon, Nov 14, 2022, 1:51 PM Ryan Kuhns <[email protected]> wrote:

> Hi,
>
> I’ve got one more question as a follow-up to my prior question on working
> with multi-file zipped CSVs. [1] Figured it was worth asking in another
> thread so it would be easier for others to see the specific question about
> case_when.
>
> I’m trying to accomplish something like pandas Series.map, where I map the
> values of an arrow array to new values.
>
> pyarrow.compute.case_when looks like a candidate to solve this, but after
> reading the docs, I’m still not clear on how to structure the argument to
> the "cond" parameter, or whether there is alternative functionality that
> would be better.
>
> Example input, mapping and expected output:
>
> import pyarrow as pa
> import pyarrow.compute as pc
>
> map = {"a": 1, "b": 2, "c": 3}
> input_array = pa.array(["a", "b", "c", "a"])
> expected_output = pa.array([1, 2, 3, 1])
>
> The logic I’m hoping for would be the equivalent of the following SQL:
>
> CASE
>     WHEN input_array = 'a' THEN 1
>     WHEN input_array = 'b' THEN 2
>     WHEN input_array = 'c' THEN 3
>     ELSE input_array
> END
>
> Or alternatively, if the input array were a pandas Series, then
> input_array.map(map).
>
> Thanks again,
>
> Ryan
>
> [1] https://www.mail-archive.com/[email protected]/msg02379.html
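
In case a concrete example helps, here is a rough, untested sketch of what I
mean. The CSV path and the "letter" column name are made up for illustration;
the point is that duckdb can run SQL (including a CASE WHEN mapping like the
one in your question) directly against a pyarrow dataset that is in scope as
a Python variable:

import duckdb
import pyarrow.dataset as ds

# Hypothetical multi-file CSV dataset; any pyarrow Dataset (or Scanner)
# can be queried the same way.
dataset = ds.dataset("data/", format="csv")

con = duckdb.connect()
# duckdb's replacement scans let the SQL refer to the `dataset` variable by
# name, so the query runs over the dataset without loading it all up front.
result = con.execute(
    """
    SELECT
        CASE letter
            WHEN 'a' THEN 1
            WHEN 'b' THEN 2
            WHEN 'c' THEN 3
        END AS mapped
    FROM dataset
    """
).fetch_arrow_table()

If you'd rather not write SQL strings, the relational API is the other route;
if I remember correctly, con.from_arrow(dataset) gives you a relation you can
build the same projection on.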
