Hi Gary,

I'm not sure I fully understand the use case, but it sounds like this might be best handled by a join, which doesn't currently exist in Arrow (but I think the functionality exists in Pandas?).
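Roughly, the pandas version might look like the sketch below. This is just a sketch: it assumes the column names from your snippet, that the two condition columns are pure equality keys, and that dataset 2 fits in memory (if not, you could run the merge one parquet file at a time and concatenate the matches before counting).

```
import pandas as pd
import pyarrow.parquet as pq

# Load both datasets once (dataset 2 is ~60M rows, so check it fits in RAM).
ds1_df = pq.read_table('dataset1.parquet').to_pandas()
ds2_df = pq.read_table('./path/dataset2/').to_pandas()

# Equi-join on the two condition columns; pandas merges can't express the
# timestamp range directly, so apply it as a post-join filter.
merged = ds1_df.reset_index().merge(
    ds2_df,
    left_on=['condition1', 'condition2'],
    right_on=['condition1_non_null', 'condition2_with_nulls'],
)
merged = merged[
    (merged['timestamp_col'] >= merged['start_date'])
    & (merged['timestamp_col'] <= merged['end_date'])
]

# Distinct count per original dataset-1 row, written back in one pass.
counts = merged.groupby('index')['condition1_non_null'].nunique()
ds1_df['count_condition1_distinct_records'] = (
    counts.reindex(ds1_df.index, fill_value=0).astype(int)
)
```

The win here is that the 60M rows get scanned once and joined with vectorized operations, instead of 1.5 million separate reads of the whole directory.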
-Micah

On Sun, Sep 6, 2020 at 7:30 AM Gary Clark <gclark...@gmail.com> wrote:

> Hi all,
>
> I am looking for some suggestions on how I can speed up this analytical
> workload I am currently working on. This would really help me prove out a
> use case here for Apache Arrow to become our go-to approach for
> data-intensive operations.
>
> I have 2 datasets:
>
> dataset 1 has 1.5 million rows in 1 parquet file
> dataset 2 has 60 million rows in 8 parquet files within a directory
>
> For each row in dataset 1, there are several columns that become the
> conditions for how I filter for matching records in dataset 2. I am
> applying these filters upon reading in dataset 2. Here is my code:
>
> ```
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pandas as pd
> import time
>
> ds1 = pq.read_table('dataset1.parquet')
>
> ds1_df = ds1.to_pandas().head(5)  # testing performance with a sample
>
> for i, r in ds1_df.iterrows():
>     start = time.time()
>     filters = [
>         ('timestamp_col', '>=', r.start_date),
>         ('timestamp_col', '<=', r.end_date),
>         ('condition1_non_null', '=', r.condition1),
>         ('condition2_with_nulls', '=', r.condition2)
>     ]
>
>     ds2 = pq.read_table('./path/dataset2/', filters=filters)
>
>     ds2_df = ds2.to_pandas()
>
>     # write back via .loc; mutating the row returned by iterrows()
>     # does not persist into the DataFrame
>     ds1_df.loc[i, 'count_condition1_distinct_records'] = \
>         ds2_df['condition1_non_null'].nunique()
>
>     print(time.time() - start)
> ```
>
> Is this the fastest approach for doing something like this? I added some
> timestamps, and it takes ~5-6 secs per row, which works out to roughly
> ~80 days for the full 1.5 million rows.
>
> One thought I had was to load dataset 2 into memory once instead of on
> every iteration, but then I struggled to understand whether PyArrow has
> the ability to apply the filters to an in-memory table yet. One thing I
> noticed was that in the filter conditions I couldn't figure out how to
> filter on ACTUAL null values (as in condition 2, which has nulls), so
> instead I created the parquet files with the NULLs written as strings.
>
> Any guidance is appreciated. Thanks!
>
> --
> Gary Clark
> *Data Scientist & Data Engineer*
> *B.S. Mechanical Engineering, Howard University '13*
> +1 (717) 798-6916
> gclark...@gmail.com
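P.S. On filtering actual NULLs: I don't believe the tuple-style filters can express that, but the pyarrow.dataset API can, via is_valid() on a field expression. A minimal, untested sketch (assuming a recent pyarrow with the dataset module), which would also let you keep real NULLs in the files instead of encoding them as strings:

```
import pyarrow.dataset as ds

dataset = ds.dataset('./path/dataset2/', format='parquet')

# Rows where condition2_with_nulls holds a real value (NULLs dropped)...
non_null = dataset.to_table(filter=ds.field('condition2_with_nulls').is_valid())

# ...and, inverted, exactly the rows where it is NULL.
nulls_only = dataset.to_table(filter=~ds.field('condition2_with_nulls').is_valid())
```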