Hi all,

I am looking for some suggestions on how I can speed up an analytical
workload I am currently working on. This would really help me prove out a
use case for Apache Arrow to become our go-to approach for data-intensive
operations.

I have 2 datasets:

dataset 1 has 1.5 million rows in one Parquet file
dataset 2 has 60 million rows across 8 Parquet files in a directory

For each row in dataset 1, several columns become the conditions I use to
filter for matching records in dataset 2. I am applying these filters when
reading dataset 2. Here is my code:

```
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
import time

ds1 = pq.read_table('dataset1.parquet')

ds1_df = ds1.to_pandas().head(5)  # testing performance with a small sample

for i, r in ds1_df.iterrows():
    start = time.time()

    # Filter conditions for dataset 2, driven by the current dataset 1 row.
    filters = [
        ('timestamp_col', '>=', r.start_date),
        ('timestamp_col', '<=', r.end_date),
        ('condition1_non_null', '=', r.condition1),
        ('condition2_with_nulls', '=', r.condition2)
    ]

    # This re-reads (and re-filters) all 8 Parquet files on every iteration.
    ds2 = pq.read_table('./path/dataset2/', filters=filters)
    ds2_df = ds2.to_pandas()

    # Write back via .loc; assigning to the iterrows() row copy 'r' would not
    # modify ds1_df.
    ds1_df.loc[i, 'count_condition1_distinct_records'] = (
        ds2_df['condition1_non_null'].nunique()
    )

    print(time.time() - start)
```
Is this the fastest approach for doing something like this? I added some
timing, and it takes ~5-6 seconds per row, which works out to roughly
90-100 days for the full 1.5 million rows.

One thought I had was to load dataset 2 into memory once instead of
re-reading it on every iteration, but I struggled to work out whether
PyArrow can apply these kinds of filters to an in-memory table yet. I also
couldn't figure out how to filter on actual NULL values with the read-time
filters (condition 2 has NULLs), so as a workaround I wrote the Parquet
files with the NULLs stored as placeholder strings instead.
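
For concreteness, the sketch below is roughly what I was imagining for the
in-memory version: read dataset 2 once (only the columns I actually need)
and then build boolean masks with pyarrow.compute for each dataset 1 row. I
am assuming a PyArrow version where Table.filter() and these compute
kernels (greater_equal, less_equal, equal, is_null, and_, unique) are
available; I may well be misusing the API, which is partly what I am asking.

```
import pandas as pd
import pyarrow.parquet as pq
import pyarrow.compute as pc

# Read dataset 2 once, keeping only the columns the filters and the count need.
ds2_full = pq.read_table(
    './path/dataset2/',
    columns=['timestamp_col', 'condition1_non_null', 'condition2_with_nulls'],
)

for i, r in ds1_df.iterrows():
    # Condition 2 has real NULLs, so match them with is_null() instead of a
    # sentinel string when the dataset 1 value is missing.
    if pd.isna(r.condition2):
        cond2_mask = pc.is_null(ds2_full['condition2_with_nulls'])
    else:
        cond2_mask = pc.equal(ds2_full['condition2_with_nulls'], r.condition2)

    # One boolean mask per condition, combined with AND.
    mask = pc.and_(
        pc.and_(
            pc.greater_equal(ds2_full['timestamp_col'], r.start_date),
            pc.less_equal(ds2_full['timestamp_col'], r.end_date),
        ),
        pc.and_(
            pc.equal(ds2_full['condition1_non_null'], r.condition1),
            cond2_mask,
        ),
    )

    matched = ds2_full.filter(mask)
    ds1_df.loc[i, 'count_condition1_distinct_records'] = len(
        pc.unique(matched['condition1_non_null'])
    )
```
The idea is to pay the 60-million-row read once and then only pay for the
per-row compute, but I don't have a feel for whether that actually beats
the read-time filters, so corrections welcome.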

Any guidance is appreciated. Thanks!

-- 
Gary Clark
*Data Scientist & Data Engineer*
*B.S. Mechanical Engineering, Howard University '13*
+1 (717) 798-6916
[email protected]
