Anyone? On Sun, Aug 27, 2023 at 2:21 AM PASSWORD ADMINISTRATOR < ultimatepwdmas...@gmail.com> wrote:
> First time using a mailing list so bear with me. > > I am trying to run a simple query on full NYC taxi dataset (my local copy > on HDD), which counts number of rows per group, i.e group by X then count > (*) > > In R-arrow, this can be done using > > > nyc_taxi = arrow::open_dataset('aria_nyc/',partitioning = c('year','month')) > pickup <- nyc_taxi |> > filter( > !is.na(pickup_longitude), > !is.na(pickup_latitude), > ) |> > mutate( > x = as.integer(pickup_longitude), > y = as.integer(pickup_latitude) > ) |> > count(x, y, name = "pickup") |> > collect() > > This takes 2m 47s on my system. I just couldn't find equivalent API in > pyarrow. So, I utilized a for loop over dataset in pyarrow, and it was > taking forever. To simplify, I tried to tried to just run the loop till > completion. It took over 5mins! > > > nyc = ds.dataset("aria_nyc",partitioning=['yr','month']) > l = [] > for bat in tqdm( > nyc.to_batches( > batch_size=1_000_000, > filter=~(ds.field('pickup_longitude').is_null() | > ds.field('pickup_latitude').is_null()) > ,columns={ > > 'pickup_long_int':pc.round(ds.field('pickup_longitude')).cast('int32'), > #'pickup_lat_int':pc.round(ds.field('pickup_latitude')).cast('int32') > } > ) > ): > l.append(bat.num_rows) > > I am pretty sure I'm doing something wrong. API also suggests using > .scanner on a dataset. That continued to give me memory error. What's the > correct and fastest way to group by and count(*) or pandas' > .groupby('x').size() in pyarrow over a larger than memory dataset. >