In this case, it sounds like Gary would be served by the ability to filter a
table after reading it, using a mask (or several masks) generated from his
first dataframe.  I still think the right answer is "do this in Pandas for
now; the dataframe-like APIs you are after don't yet exist in PyArrow".
This means you will bear the memory overhead of reading the full dataset and
also of the conversion to pandas (worst case, peak memory utilization roughly
double the dataset size).  I do remember a recent addition to the Table API
related to filtering, but I haven't personally tested it, and I don't think
it's available in all official builds yet (I believe it is present in the
conda releases on Linux, though).
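
For reference, here is roughly what that could look like once the compute
kernels and Table.filter are available in your build.  I haven't benchmarked
this, and the column names are just taken from Gary's snippet, so treat it as
a sketch rather than a tested recipe:

```
import pyarrow.parquet as pq
import pyarrow.compute as pc

# Read dataset 2 once, keeping only the columns the filters touch
ds2 = pq.read_table('./path/dataset2/',
                    columns=['timestamp_col',
                             'condition1_non_null',
                             'condition2_with_nulls'])

def count_distinct(table, start_date, end_date, cond1, cond2):
    # Build one boolean mask per condition and AND them together
    ts = table.column('timestamp_col')
    mask = pc.and_(
        pc.and_(pc.greater_equal(ts, start_date),
                pc.less_equal(ts, end_date)),
        pc.and_(pc.equal(table.column('condition1_non_null'), cond1),
                pc.equal(table.column('condition2_with_nulls'), cond2)))

    # To match real nulls instead of "NULL" strings, pc.is_null(column)
    # produces a mask as well
    filtered = table.filter(mask)

    # I'm not aware of a distinct-count kernel, so drop to pandas for the
    # final aggregation
    return filtered.column('condition1_non_null').to_pandas().nunique()
```

If the kernels complain about plain Python scalars, wrapping the values with
pa.scalar(...) should help, but I haven't verified that part.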

On Wed, Sep 9, 2020 at 10:28 PM Micah Kornfield <emkornfi...@gmail.com>
wrote:

> Hi Gary,
> I'm not sure I fully understand the use-case, but it sounds like this
> might be best handled by a join, which doesn't currently exist with Arrow
> (but I think the functionality exists in Pandas?).
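>
> Something along these lines might work (untested, and using the column
> names from your snippet, so adjust as needed): do an equi-join on the two
> condition columns, then filter the timestamp range and aggregate.
>
> ```
> import pandas as pd
>
> # ds1_df and ds2_df are the two datasets loaded as pandas DataFrames
> merged = ds1_df.reset_index().merge(
>     ds2_df,
>     left_on=['condition1', 'condition2'],
>     right_on=['condition1_non_null', 'condition2_with_nulls'])
>
> # keep only the rows whose timestamp falls inside each dataset-1 window
> in_window = merged[(merged['timestamp_col'] >= merged['start_date']) &
>                    (merged['timestamp_col'] <= merged['end_date'])]
>
> # distinct condition1 values per original dataset-1 row
> counts = in_window.groupby('index')['condition1_non_null'].nunique()
> ```
>
> The intermediate merge on 60 million rows could get large, so you may
> still need to process dataset 2 in chunks, but it avoids re-reading the
> files for every row.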
>
> -Micah
>
> On Sun, Sep 6, 2020 at 7:30 AM Gary Clark <gclark...@gmail.com> wrote:
>
>> Hi all,
>>
>> I am looking for suggestions on how I can speed up an analytical
>> workload I am currently working on. This would really help me prove out a
>> use case for Apache Arrow to become our go-to approach for data-intensive
>> operations.
>>
>> I have 2 datasets:
>>
>> dataset 1 has 1.5 million rows in 1 parquet file
>> dataset 2 has 60 million rows in 8 parquet files within a directory
>>
>> For each row in dataset 1, there are several columns that become the
>> conditions for how I filter for matching records in dataset 2. I am
>> applying these filters when reading dataset 2. Here is my code:
>>
>> ```
>> import pyarrow as pa
>> import pyarrow.parquet as pq
>> import pandas as pd
>> import time
>>
>> ds1 = pq.read_table('dataset1.parquet')
>>
>> ds1_df = ds1.to_pandas().head(5) # testing performance with a sample
>>
>> for i, r in ds1_df.iterrows():
>>     start = time.time()
>>     filters = [
>>         ('timestamp_col', '>=', r.start_date),
>>         ('timestamp_col', '<=', r.end_date),
>>         ('condition1_non_null', '=', r.condition1),
>>         ('condition2_with_nulls', '=', r.condition2)
>>     ]
>>
>>     ds2 = pq.read_table('./path/dataset2/', filters=filters)
>>
>>     ds2_df = ds2.to_pandas()
>>
>>     # .loc writes the result back onto ds1_df (assigning into the
>>     # iterrows() row copy would not persist it)
>>     ds1_df.loc[i, 'count_condition1_distinct_records'] = (
>>         ds2_df['condition1_non_null'].nunique())
>>
>>     print(time.time() - start)
>> ```
>> Is this the fastest approach for doing something like this? I added some
>> timing, and it takes ~5-6 seconds per row, which works out to roughly ~80
>> days for all 1.5 million rows.
>>
>> One thought I had was to load dataset 2 into memory once instead of on
>> every iteration, but I struggled to understand whether PyArrow can apply
>> these filters to an already-loaded table yet. I also noticed that in the
>> filter conditions I couldn't figure out how to filter on ACTUAL null values
>> (e.g. for condition 2, which has nulls), so instead I created the parquet
>> files with the NULLs written as strings.
>>
>> Any guidance is appreciated. Thanks!
>>
>> --
>> Gary Clark
>> *Data Scientist & Data Engineer*
>> *B.S. Mechanical Engineering, Howard University '13*
>> +1 (717) 798-6916
>> gclark...@gmail.com
>>
>
