Hi Adam,

That’s a good point. I will definitely consider using duckdb on the analysis 
side of things. 

I’m hoping to use this as part of my process for storing the data as a
hive-partitioned parquet dataset for downstream use. It is the last part of
that process that I don’t have implemented directly in pyarrow, so I’m hoping
I can find a compute function to accomplish this.

-Ryan

> On Nov 14, 2022, at 2:52 PM, Kirby, Adam <[email protected]> wrote:
> 
> 
> Hi Ryan,
> 
> Others will be able to guide you better than me on your actual question, but 
> I wanted to mention an alternative approach just in case.
> 
> For my own needs, I have found it very productive to express my queries
> in duckdb (for which you can choose SQL or its relational API) on top of a
> pyarrow dataset (or a scanner, if you prefer). If you're coming from SQL,
> this approach could let you remain in SQL, for example.
> 
> Adam
> 
>> On Mon, Nov 14, 2022, 1:51 PM Ryan Kuhns <[email protected]> wrote:
>> Hi,
>> 
>> I’ve got one more question as a follow up to my prior question on working
>> with multi-file zipped CSVs. [1] I figured it was worth asking in another
>> thread so it would be easier for others to see the specific question about
>> case_when.
>> 
>> I’m trying to accomplish something like pandas Series.map, where I
>> map each value of an Arrow array to a new value.
>> 
>> pyarrow.compute.case_when looks like a candidate to solve this, but after 
>> reading the docs, I’m still not clear on how to structure the argument to 
>> the “cond” parameter or if there is alternative functionality that would be 
>> better.
>> 
>> Example input, mapping and expected output:
>> 
>> import pyarrow as pa
>> import pyarrow.compute as pc
>> 
>> map = {"a": 1, "b": 2, "c": 3}
>> input_array = pa.array(["a", "b", "c", "a"])
>> expected_output = pa.array([1, 2, 3, 1])
>> 
>> Logic I’m hoping for would be the equivalent of the following SQL:
>> 
>> Case
>>     when input_array = 'a' then 1
>>     when input_array = 'b' then 2
>>     when input_array = 'c' then 3
>>     else input_array
>> End
>> 
>> Or alternatively, if input_array were a pandas Series, then
>> input_array.map(map).
>> 
>> Thanks again,
>> 
>> Ryan
>> 
>> [1] https://www.mail-archive.com/[email protected]/msg02379.html
