Your code is extremely inefficient in pandas (using iterrows should only ever be a last resort) - you should ask on StackOverflow for suggestions.
Simply merging and filtering is likely to work and be very fast.
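
Roughly, the merge-and-filter version would look like the sketch below - untested, and assuming condition1/condition2 in dataset 1 correspond to condition1_non_null/condition2_with_nulls in dataset 2 (I'm also mirroring the per-row nunique from your snippet):

```
import pandas as pd
import pyarrow.parquet as pq

# Read both datasets once, up front.
ds1_df = pq.read_table('dataset1.parquet').to_pandas()
ds2_df = pq.read_table('./path/dataset2/').to_pandas()

# One many-to-many merge on the equality conditions instead of 1.5M filtered reads.
merged = ds1_df.reset_index().merge(
    ds2_df,
    left_on=['condition1', 'condition2'],
    right_on=['condition1_non_null', 'condition2_with_nulls'],
)

# Keep only dataset-2 rows whose timestamp falls inside each dataset-1 row's window.
merged = merged[merged['timestamp_col'].between(merged['start_date'], merged['end_date'])]

# Distinct count per original dataset-1 row; rows with no matches come back as NaN.
counts = merged.groupby('index')['condition1_non_null'].nunique()
ds1_df['count_condition1_distinct_records'] = counts
```

If the equality columns aren't very selective the intermediate merge can get large, so keep an eye on memory, but it avoids 1.5 million filtered reads of the parquet directory.
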
> On Sep 10, 2020, at 1:37 PM, Matthew Corley <mattcor...@gmail.com> wrote:
>
> In this case, it looks like Gary would be served by the ability to filter a
> table after read, using a mask (or several masks, it looks like) generated by
> his first dataframe. I still think the right answer is "do this in Pandas
> for now, the dataframe-like APIs you are after don't yet exist in PyArrow".
> This means that you will bear the memory overhead of reading the full dataset
> and also the conversion to pandas (worst case, peak memory utilization double
> the dataset size). I do remember there being some recent addition to the
> Table API related to filtering, but I haven't personally tested it and I
> think it's not available in all official builds yet (I think it is present in
> conda releases on Linux, though).
>
>> On Wed, Sep 9, 2020 at 10:28 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>> Hi Gary,
>> I'm not sure I fully understand the use-case, but it sounds like this might
>> be best handled by a join, which doesn't currently exist with Arrow (but I
>> think the functionality exists in Pandas?).
>>
>> -Micah
>>
>>> On Sun, Sep 6, 2020 at 7:30 AM Gary Clark <gclark...@gmail.com> wrote:
>>> Hi all,
>>>
>>> I am looking for some suggestions on how I can speed up this analytical
>>> workload I am currently working on. This would really help me prove out a
>>> use case here for Apache Arrow to become our go-to approach for
>>> data-intensive operations.
>>>
>>> I have 2 datasets:
>>>
>>> dataset 1 has 1.5 million rows in 1 parquet file
>>> dataset 2 has 60 million rows in 8 parquet files within a directory
>>>
>>> For each row in dataset 1, there are several columns that become the
>>> conditions for how I filter for matching records in dataset 2. I am
>>> applying these filters upon reading in dataset 2. Here is my code:
>>>
>>> ```
>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>> import pandas as pd
>>> import time
>>>
>>> ds1 = pq.read_table('dataset1.parquet')
>>>
>>> ds1_df = ds1.to_pandas().head(5)  # testing performance with a sample
>>>
>>> for i, r in ds1_df.iterrows():
>>>     start = time.time()
>>>     filters = [
>>>         ('timestamp_col', '>=', r.start_date),
>>>         ('timestamp_col', '<=', r.end_date),
>>>         ('condition1_non_null', '=', r.condition1),
>>>         ('condition2_with_nulls', '=', r.condition2)
>>>     ]
>>>
>>>     ds2 = pq.read_table('./path/dataset2/', filters=filters)
>>>
>>>     ds2_df = ds2.to_pandas()
>>>
>>>     # store the count back on ds1_df (assigning to r would not persist)
>>>     ds1_df.loc[i, 'count_condition1_distinct_records'] = ds2_df['condition1_non_null'].nunique()
>>>
>>>     print(time.time() - start)
>>> ```
>>> Is this the fastest approach for doing something like this? I added some
>>> timestamps, and it takes ~5-6 secs per row, which works out to roughly ~80 days.
>>>
>>> One thought I had was to load dataset 2 into memory once instead of on each
>>> iteration, but then applying the filters is where I struggled to understand
>>> whether PyArrow has this ability yet. One thing I noticed was that in the
>>> filtering conditions I couldn't figure out how to filter out ACTUAL null
>>> values (like for condition 2, which has nulls), so instead I created the
>>> parquet files where the NULLs come in as strings instead.
>>>
>>> Any guidance is appreciated. Thanks!
>>>
>>> --
>>> Gary Clark
>>> Data Scientist & Data Engineer
>>> B.S. Mechanical Engineering, Howard University '13
>>> +1 (717) 798-6916
>>> gclark...@gmail.com
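
P.S. If you want to experiment with the in-memory Table filtering Matthew mentioned, the rough shape is below - untested, it needs a fairly recent pyarrow build for pyarrow.compute and Table.filter, and the column/scalar handling reflects my assumptions rather than anything I've run against your data:

```
import pyarrow.parquet as pq
import pyarrow.compute as pc

# Read the 60M-row dataset into memory once, instead of once per dataset-1 row.
ds2 = pq.read_table('./path/dataset2/')

def count_matches(table, r):
    """Build a boolean mask from one dataset-1 row and filter the Table with it."""
    mask = pc.and_(
        pc.greater_equal(table['timestamp_col'], r.start_date),
        pc.less_equal(table['timestamp_col'], r.end_date),
    )
    mask = pc.and_(mask, pc.equal(table['condition1_non_null'], r.condition1))
    mask = pc.and_(mask, pc.equal(table['condition2_with_nulls'], r.condition2))
    # Nulls in condition2_with_nulls produce nulls in the mask, and Table.filter
    # drops those rows by default (if I remember right), so real NULLs should be
    # excluded without the string-placeholder workaround.
    filtered = table.filter(mask)
    return filtered.to_pandas()['condition1_non_null'].nunique()
```

Double-check the compute helpers and the scalar conversions (r.start_date etc.) against the pyarrow version you actually have installed.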