Your code is extremely inefficient in pandas (using iterrows should only ever be a last resort) - you should ask on StackOverflow for suggestions.
Simply merging and filtering is likely to work and be very fast.
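
Roughly, the merge-and-filter version would look like the sketch below - untested, and assuming condition1/condition2 in dataset 1 correspond to condition1_non_null/condition2_with_nulls in dataset 2 (I'm also mirroring the per-row nunique from your snippet):

```
import pandas as pd
import pyarrow.parquet as pq

# Read both datasets once, up front.
ds1_df = pq.read_table('dataset1.parquet').to_pandas()
ds2_df = pq.read_table('./path/dataset2/').to_pandas()

# One many-to-many merge on the equality conditions instead of 1.5M filtered reads.
merged = ds1_df.reset_index().merge(
    ds2_df,
    left_on=['condition1', 'condition2'],
    right_on=['condition1_non_null', 'condition2_with_nulls'],
)

# Keep only dataset-2 rows whose timestamp falls inside each dataset-1 row's window.
merged = merged[merged['timestamp_col'].between(merged['start_date'], merged['end_date'])]

# Distinct count per original dataset-1 row; rows with no matches come back as NaN.
counts = merged.groupby('index')['condition1_non_null'].nunique()
ds1_df['count_condition1_distinct_records'] = counts
```

If the equality columns aren't very selective the intermediate merge can get large, so keep an eye on memory, but it avoids 1.5 million filtered reads of the parquet directory.
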
> On Sep 10, 2020, at 1:37 PM, Matthew Corley <mattcor...@gmail.com> wrote:
>
> In this case, it looks like Gary would be served by the ability to filter a
> table after read, using a mask (or several masks, it looks like) generated by
> his first dataframe. I still think the right answer is "do this in Pandas
> for now, the dataframe-like APIs you are after don't yet exist in PyArrow".
> This means that you will bear the memory overhead of reading the full dataset
> and also the conversion to pandas (worst case, peak memory utilization double
> the dataset size). I do remember there being some recent addition to the
> Table API related to filtering, but I haven't personally tested it and I
> think it's not available in all official builds yet (I think it is present in
> conda releases on Linux, though).
>
>> On Wed, Sep 9, 2020 at 10:28 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>> Hi Gary,
>> I'm not sure I fully understand the use-case, but it sounds like this might
>> be best handled by a join, which doesn't currently exist with Arrow (but I
>> think the functionality exists in Pandas?).
>>
>> -Micah
>>
>>> On Sun, Sep 6, 2020 at 7:30 AM Gary Clark <gclark...@gmail.com> wrote:
>>> Hi all,
>>>
>>> I am looking for some suggestions on how I can speed up this analytical
>>> workload I am currently working on. This would really help me prove out a
>>> use case here for Apache Arrow to become our go-to approach for
>>> data-intensive operations.
>>>
>>> I have 2 datasets:
>>>
>>> dataset 1 has 1.5 million rows in 1 parquet file
>>> dataset 2 has 60 million rows in 8 parquet files within a directory
>>>
>>> For each row in dataset 1, there are several columns that become the
>>> conditions for how I filter for matching records in dataset 2. I am
>>> applying these filters upon reading in dataset 2. Here is my code:
>>>
>>> ```
>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>> import pandas as pd
>>> import time
>>>
>>> ds1 = pq.read_table('dataset1.parquet')
>>>
>>> ds1_df = ds1.to_pandas().head(5)  # testing performance with a sample
>>>
>>> for i, r in ds1_df.iterrows():
>>>     start = time.time()
>>>     filters = [
>>>         ('timestamp_col', '>=', r.start_date),
>>>         ('timestamp_col', '<=', r.end_date),
>>>         ('condition1_non_null', '=', r.condition1),
>>>         ('condition2_with_nulls', '=', r.condition2)
>>>     ]
>>>
>>>     ds2 = pq.read_table('./path/dataset2/', filters=filters)
>>>
>>>     ds2_df = ds2.to_pandas()
>>>
>>>     # store the count back on ds1_df (assigning to r would not persist)
>>>     ds1_df.loc[i, 'count_condition1_distinct_records'] = ds2_df['condition1_non_null'].nunique()
>>>
>>>     print(time.time() - start)
>>> ```
>>> Is this the fastest approach for doing something like this? I added some
>>> timestamps, and it takes ~5-6 secs per row, which works out to roughly ~80 days.
>>>
>>> One thought I had was to load dataset 2 into memory once instead of on each
>>> iteration, but then applying the filters is where I struggled to understand
>>> whether PyArrow has this ability yet. One thing I noticed was that in the
>>> filtering conditions I couldn't figure out how to filter out ACTUAL null
>>> values (like for condition 2, which has nulls), so instead I created the
>>> parquet files where the NULLs come in as strings instead.
>>>
>>> Any guidance is appreciated. Thanks!
>>>
>>> --
>>> Gary Clark
>>> Data Scientist & Data Engineer
>>> B.S. Mechanical Engineering, Howard University '13
>>> +1 (717) 798-6916
>>> gclark...@gmail.com
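
P.S. If you want to experiment with the in-memory Table filtering Matthew mentioned, the rough shape is below - untested, it needs a fairly recent pyarrow build for pyarrow.compute and Table.filter, and the column/scalar handling reflects my assumptions rather than anything I've run against your data:

```
import pyarrow.parquet as pq
import pyarrow.compute as pc

# Read the 60M-row dataset into memory once, instead of once per dataset-1 row.
ds2 = pq.read_table('./path/dataset2/')

def count_matches(table, r):
    """Build a boolean mask from one dataset-1 row and filter the Table with it."""
    mask = pc.and_(
        pc.greater_equal(table['timestamp_col'], r.start_date),
        pc.less_equal(table['timestamp_col'], r.end_date),
    )
    mask = pc.and_(mask, pc.equal(table['condition1_non_null'], r.condition1))
    mask = pc.and_(mask, pc.equal(table['condition2_with_nulls'], r.condition2))
    # Nulls in condition2_with_nulls produce nulls in the mask, and Table.filter
    # drops those rows by default (if I remember right), so real NULLs should be
    # excluded without the string-placeholder workaround.
    filtered = table.filter(mask)
    return filtered.to_pandas()['condition1_non_null'].nunique()
```

Double-check the compute helpers and the scalar conversions (r.start_date etc.) against the pyarrow version you actually have installed.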