First time using a mailing list, so bear with me.

I am trying to run a simple query on the full NYC taxi dataset (my local
copy, on HDD), which counts the number of rows per group, i.e. group by x, y
then count(*).

In R arrow, this can be done with:


library(dplyr)  # for filter/mutate/count/collect

nyc_taxi <- arrow::open_dataset('aria_nyc/', partitioning = c('year', 'month'))

pickup <- nyc_taxi |>
  filter(
    !is.na(pickup_longitude),
    !is.na(pickup_latitude)
  ) |>
  mutate(
    # bucket coordinates into integer cells
    x = as.integer(pickup_longitude),
    y = as.integer(pickup_latitude)
  ) |>
  count(x, y, name = "pickup") |>  # group by (x, y), count rows per group
  collect()

This takes 2m 47s on my system. I just couldn't find an equivalent API in
pyarrow, so I used a for loop over the dataset, and it was taking forever.
To simplify, I stripped out the aggregation and just ran the loop to
completion. It took over 5 minutes!


import pyarrow.compute as pc
import pyarrow.dataset as ds
from tqdm import tqdm

nyc = ds.dataset("aria_nyc", partitioning=['yr', 'month'])

l = []
for bat in tqdm(
    nyc.to_batches(
        batch_size=1_000_000,
        filter=~(ds.field('pickup_longitude').is_null()
                 | ds.field('pickup_latitude').is_null()),
        columns={
            'pickup_long_int': pc.round(ds.field('pickup_longitude')).cast('int32'),
            # 'pickup_lat_int': pc.round(ds.field('pickup_latitude')).cast('int32'),
        },
    )
):
    l.append(bat.num_rows)  # no aggregation here, just counting rows per batch
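
What I was actually aiming for is something like the sketch below: a partial
group-by count per batch, combined with one final group-by at the end. Since
count(*) is distributive, summing the per-batch counts should equal one
global group-by. It leans on Table.group_by/aggregate, which I believe needs
pyarrow 7.0 or newer, and the 'x'/'y' column names are just mine. No idea if
this is idiomatic:

import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.dataset as ds

nyc = ds.dataset("aria_nyc", partitioning=['yr', 'month'])

partials = []
for bat in nyc.to_batches(
    batch_size=1_000_000,
    filter=~(ds.field('pickup_longitude').is_null()
             | ds.field('pickup_latitude').is_null()),
    columns={
        'x': pc.round(ds.field('pickup_longitude')).cast('int32'),
        'y': pc.round(ds.field('pickup_latitude')).cast('int32'),
    },
):
    # Partial aggregate: rows per (x, y) within this batch only.
    # ([], 'count_all') is the nullary row-count aggregation, as I read the docs.
    t = pa.Table.from_batches([bat])
    partials.append(t.group_by(['x', 'y']).aggregate([([], 'count_all')]))

# Combine the partials: sum the per-batch counts per (x, y) group.
# The result column comes out named 'count_all_sum'.
pickup = (
    pa.concat_tables(partials)
    .group_by(['x', 'y'])
    .aggregate([('count_all', 'sum')])
)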

I am pretty sure I'm doing something wrong. The API also suggests using
.scanner() on a dataset, but that kept giving me memory errors. What's the
correct and fastest way to do a group by with count(*) (i.e. pandas'
.groupby('x').size()) in pyarrow over a larger-than-memory dataset?
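
For what it's worth, if only the two projected int32 columns need to fit in
RAM (and not the whole dataset), I guess one could also materialize just
those and group in memory. This assumes to_table() accepts the same
expression projections as to_batches(), which is how I read the docs:

import pyarrow.compute as pc
import pyarrow.dataset as ds

nyc = ds.dataset("aria_nyc", partitioning=['yr', 'month'])

# Only two int32 columns survive the projection, so the materialized
# table is far smaller than the dataset on disk.
tab = nyc.to_table(
    filter=~(ds.field('pickup_longitude').is_null()
             | ds.field('pickup_latitude').is_null()),
    columns={
        'x': pc.round(ds.field('pickup_longitude')).cast('int32'),
        'y': pc.round(ds.field('pickup_latitude')).cast('int32'),
    },
)
pickup = tab.group_by(['x', 'y']).aggregate([([], 'count_all')])

But I'd still like to know the proper streaming way to do this.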
