Anyone?

On Sun, Aug 27, 2023 at 2:21 AM PASSWORD ADMINISTRATOR <
ultimatepwdmas...@gmail.com> wrote:

> First time using a mailing list, so bear with me.
>
> I am trying to run a simple query on the full NYC taxi dataset (my local
> copy on HDD) that counts the number of rows per group, i.e. group by X
> then count(*).
>
> In R-arrow, this can be done using
>
>
> nyc_taxi <- arrow::open_dataset('aria_nyc/', partitioning = c('year', 'month'))
> pickup <- nyc_taxi |>
>   filter(
>     !is.na(pickup_longitude),
>     !is.na(pickup_latitude)
>   ) |>
>   mutate(
>     x = as.integer(pickup_longitude),
>     y = as.integer(pickup_latitude)
>   ) |>
>   count(x, y, name = "pickup") |>
>   collect()
>
> This takes 2m 47s on my system. I just couldn't find an equivalent API in
> pyarrow, so I used a for loop over the dataset, and it was taking forever.
> To simplify, I tried to just run the loop to completion. It took over
> 5 minutes!
>
>
> import pyarrow.dataset as ds
> import pyarrow.compute as pc
> from tqdm import tqdm
>
> nyc = ds.dataset("aria_nyc", partitioning=['yr', 'month'])
> l = []
> for bat in tqdm(
>     nyc.to_batches(
>         batch_size=1_000_000,
>         filter=~(ds.field('pickup_longitude').is_null()
>                  | ds.field('pickup_latitude').is_null()),
>         columns={
>             'pickup_long_int':
>                 pc.round(ds.field('pickup_longitude')).cast('int32'),
>             # 'pickup_lat_int':
>             #     pc.round(ds.field('pickup_latitude')).cast('int32'),
>         },
>     )
> ):
>     l.append(bat.num_rows)
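>
> (The closest I could get to aggregating inside the loop is the sketch
> below: group within each batch using Table.group_by, which I believe
> exists since pyarrow 7.0 along with the ([], 'count_all') aggregation,
> then merge the partial counts in a Counter. The names 'counts' and
> 'part' are just mine, and I haven't verified this is the intended
> pattern.)
>
> import collections
> import pyarrow as pa
> import pyarrow.compute as pc
> import pyarrow.dataset as ds
>
> nyc = ds.dataset("aria_nyc", partitioning=['yr', 'month'])
> counts = collections.Counter()  # running tally per rounded longitude
> for bat in nyc.to_batches(
>     batch_size=1_000_000,
>     filter=~(ds.field('pickup_longitude').is_null()
>              | ds.field('pickup_latitude').is_null()),
>     columns={
>         'pickup_long_int':
>             pc.round(ds.field('pickup_longitude')).cast('int32'),
>     },
> ):
>     # Aggregate within the batch, then fold into the global tally.
>     part = pa.Table.from_batches([bat]).group_by(
>         ['pickup_long_int']
>     ).aggregate([([], 'count_all')])
>     for key, n in zip(part['pickup_long_int'].to_pylist(),
>                       part['count_all'].to_pylist()):
>         counts[key] += n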
>
> I am pretty sure I'm doing something wrong. The API docs also suggest
> using .scanner() on a dataset, but that kept giving me a memory error.
> What is the correct (and fastest) way to do a group by with count(*),
> i.e. pandas' .groupby('x').size(), in pyarrow over a larger-than-memory
> dataset?
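>
> For a table that fits in memory, the equivalent I'd expect is the
> sketch below (again assuming Table.group_by, and that Dataset.to_table
> accepts a dict mapping output names to expressions for columns), but
> to_table materializes the whole filtered dataset in RAM, which is
> exactly what I'm trying to avoid:
>
> import pyarrow.compute as pc
> import pyarrow.dataset as ds
>
> nyc = ds.dataset("aria_nyc", partitioning=['yr', 'month'])
> tbl = nyc.to_table(  # loads the whole (filtered) dataset into memory
>     filter=~(ds.field('pickup_longitude').is_null()
>              | ds.field('pickup_latitude').is_null()),
>     columns={
>         'x': pc.round(ds.field('pickup_longitude')).cast('int32'),
>         'y': pc.round(ds.field('pickup_latitude')).cast('int32'),
>     },
> )
> # count(*) per (x, y); the result column is named 'count_all'
> pickup = tbl.group_by(['x', 'y']).aggregate([([], 'count_all')])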
>
