Hi Adam,

That’s a good point. I will definitely consider using duckdb on the analysis side of things.
I’m hoping to use this as part of my process for storing the data as a hive-partitioned parquet dataset for downstream use. It is the last part of that process that I don’t have implemented directly in pyarrow, so I’m hoping I can figure out a compute function to accomplish this.

-Ryan

> On Nov 14, 2022, at 2:52 PM, Kirby, Adam <[email protected]> wrote:
>
> Hi Ryan,
>
> Others will be able to guide you better than me on your actual question, but
> I wanted to mention an alternative approach just in case.
>
> For my own needs, I tended to find it very productive to express my queries
> in duckdb (for which you can choose SQL or its relational API) on top of
> pyarrow dataset (or a scanner, if you prefer). If you're coming from SQL,
> this approach could let you remain in SQL, for example.
>
> Adam
>
>> On Mon, Nov 14, 2022, 1:51 PM Ryan Kuhns <[email protected]> wrote:
>> Hi,
>>
>> I’ve got one more question as a follow-up to my prior question on working
>> with multi-file zipped CSVs. [1] Figured it was worth asking in another
>> thread so it would be easier for others to see the specific question about
>> case_when.
>>
>> I’m trying to accomplish something like pandas Series.map, where I
>> map the values of an arrow array to new values.
>>
>> pyarrow.compute.case_when looks like a candidate to solve this, but after
>> reading the docs, I’m still not clear on how to structure the argument to
>> the “cond” parameter or if there is alternative functionality that would be
>> better.
>>
>> Example input, mapping and expected output:
>>
>> import pyarrow as pa
>> import pyarrow.compute as pc
>>
>> map = {"a": 1, "b": 2, "c": 3}
>> input_array = pa.array(["a", "b", "c", "a"])
>> expected_output = pa.array([1, 2, 3, 1])
>>
>> Logic I’m hoping for would be the equivalent of the following SQL:
>>
>> CASE
>>   WHEN input_array = 'a' THEN 1
>>   WHEN input_array = 'b' THEN 2
>>   WHEN input_array = 'c' THEN 3
>>   ELSE input_array
>> END
>>
>> Or alternatively, if the input array was a pandas Series, then
>> input_array.map(map).
>>
>> Thanks again,
>>
>> Ryan
>>
>> [1] https://www.mail-archive.com/[email protected]/msg02379.html
