blongworth commented on issue #39912:
URL: https://github.com/apache/arrow/issues/39912#issuecomment-2437824207

   I'm seeing the same issue when filtering a ~400M row dataset to remove rows 
where a column is duplicated. I'm running R 4.4.1 with Arrow 17.0.0.1 on macOS. 
Is there another way to do this within Arrow that I'm missing? Here's the code 
that produces the error:
   
   ```r
   ds_filt <- ds_filt |>
     group_by(timestamp) |>
     mutate(duplicate = n()) |>
     filter(duplicate == 1) |>
     ungroup()
   ```
   
   Converting with `to_duckdb()` gets around this error, but then DuckDB errors 
out due to lack of memory. One option there is to let DuckDB work on disk, but 
that adds more time.
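   
   For reference, a sketch of the on-disk workaround I mean (assuming a 
file-backed DuckDB connection is enough to let it spill; the `dedup.duckdb` 
path is just an example):
   
   ```r
   library(dplyr)
   
   # Connect DuckDB to a file instead of memory so it can spill to disk
   con <- DBI::dbConnect(duckdb::duckdb(), dbdir = "dedup.duckdb")
   
   ds_filt <- ds_filt |>
     arrow::to_duckdb(con = con) |>
     group_by(timestamp) |>
     mutate(duplicate = n()) |>
     filter(duplicate == 1) |>
     ungroup()
   ```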


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
