Hi all,
I am looking for suggestions on how I can speed up an analytical
workload I am currently working on. This would really help me prove out a
use case for Apache Arrow to become our go-to approach for
data-intensive operations.
I have 2 datasets:
dataset 1 has 1.5 million rows in 1 parquet file
dataset 2 has 60 million rows in 8 parquet files within a directory
For each row in dataset 1, several columns become the conditions I use to
filter for matching records in dataset 2. Right now I am applying these
filters while reading dataset 2. Here is my code:
```
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
import time
ds1 = pq.read_table('dataset1.parquet')
ds1_df = ds1.to_pandas().head(5)  # testing performance with a sample

for i, r in ds1_df.iterrows():
    start = time.time()
    # each row of dataset 1 supplies the filter values for dataset 2
    filters = [
        ('timestamp_col', '>=', r.start_date),
        ('timestamp_col', '<=', r.end_date),
        ('condition1_non_null', '=', r.condition1),
        ('condition2_with_nulls', '=', r.condition2)
    ]
    # re-reads and re-filters all 8 parquet files on every iteration
    ds2 = pq.read_table('./path/dataset2/', filters=filters)
    ds2_df = ds2.to_pandas()
    # assign via .loc so the result actually lands in ds1_df
    # (iterrows hands back a copy, so r['...'] = ... would be lost)
    ds1_df.loc[i, 'count_condition1_distinct_records'] = (
        ds2_df['condition1_non_null'].nunique()
    )
    print(time.time() - start)
```
Is this the fastest approach for something like this? I added some timing,
and it takes ~5-6 seconds per row, which extrapolates to something like
~80 days across the full 1.5 million rows. One thought I had was to load
dataset 2 into memory once instead of re-reading it on every iteration,
but I struggled to work out whether PyArrow can apply these per-row
filters to an in-memory table yet.
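To make that concrete, the sketch below is roughly what I was picturing,
under the assumption that pyarrow.compute kernels can be combined this way;
the count_distinct_matches helper is just my own name for illustration, and
I have not verified that this is the intended API or that it is actually
faster:
```
import pyarrow.parquet as pq
import pyarrow.compute as pc

# read the 60M rows once, up front, instead of once per row of dataset 1
ds2_table = pq.read_table('./path/dataset2/')

def count_distinct_matches(row):
    # boolean mask over the in-memory table instead of read-time filters
    in_window = pc.and_(
        pc.greater_equal(ds2_table['timestamp_col'], row.start_date),
        pc.less_equal(ds2_table['timestamp_col'], row.end_date),
    )
    matches = pc.and_(
        pc.equal(ds2_table['condition1_non_null'], row.condition1),
        pc.equal(ds2_table['condition2_with_nulls'], row.condition2),
    )
    matched = ds2_table.filter(pc.and_(in_window, matches))
    # distinct count without converting the filtered table to pandas
    return pc.count_distinct(matched['condition1_non_null']).as_py()
```
The loop body would then shrink to something like
ds1_df.loc[i, 'count_condition1_distinct_records'] = count_distinct_matches(r).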
conditions I couldn't figure out how to filter out ACTUAL null values (like
for condition 2 which has nulls), so instead, i created the parquet files
where the NULLs come in as strings instead.
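If it helps to show what I mean, it looks like the dataset/expression API
has an is_null(), so maybe the per-row filter could be written like the
sketch below instead of relying on the string workaround. This is just my
best reading of the docs, so the exact spelling may well be off:
```
import pyarrow.dataset as ds

dataset2 = ds.dataset('./path/dataset2/', format='parquet')

# same per-row filter as in the loop above, but with "condition2 is null"
# expressed directly instead of matching a placeholder string
expr = (
    (ds.field('timestamp_col') >= r.start_date)
    & (ds.field('timestamp_col') <= r.end_date)
    & (ds.field('condition1_non_null') == r.condition1)
    & ds.field('condition2_with_nulls').is_null()
)
ds2 = dataset2.to_table(filter=expr)
```
And for the in-memory route above, I assume
pc.is_null(ds2_table['condition2_with_nulls']) would play the same role for
the rows where r.condition2 is null.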
Any guidance is appreciated. Thanks!
--
Gary Clark
*Data Scientist & Data Engineer*
*B.S. Mechanical Engineering, Howard University '13*
+1 (717) 798-6916
[email protected]