[ https://issues.apache.org/jira/browse/ARROW-11413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17274546#comment-17274546 ]
Jonathan Keane commented on ARROW-11413:
----------------------------------------

Thank you for the report; there shouldn't be a limitation like that. We are trying to reproduce this on our end, but haven't been able to so far. Could you please tell us which version of R you're running, as well as some details about the system you're running it on (especially how much RAM you have available)?

> dplyr filter is not working for datasets
> -----------------------------------------
>
>                 Key: ARROW-11413
>                 URL: https://issues.apache.org/jira/browse/ARROW-11413
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 2.0.0, 3.0.0
>         Environment: i7, windows 10 laptop
>            Reporter: Zsolt Kegyes-Brassai
>            Priority: Minor
>
> I was trying to recreate the
> [vignette|https://arrow.apache.org/docs/r/articles/dataset.html] on datasets
> and dplyr on a Windows 10 machine. I downloaded the data for 2 consecutive
> years (2017, 2018) to my laptop.
> The filter works only for variables used for partitioning. When I filter on
> any other variable (like total_amount), the R/RStudio session hangs: no error
> message and, more interestingly, no detectable CPU load or disk usage (Task
> Manager) for many minutes.
> I experienced the same issue with both arrow 2.0.0 and 3.0.0 (I updated my
> R packages just this morning). Previously, I had already tried reinstalling
> the arrow 2.0.0 package.
> Did I misunderstand something in the vignette? Is there any OS limitation?
>
> {code:java}
> //
> > library(arrow)
> Attaching package: 'arrow'
> The following object is masked from 'package:utils':
>     timestamp
> > library(tidyverse)
> -- Attaching packages ---------------------------------------------------------------------- tidyverse 1.3.0 --
> v ggplot2 3.3.3     v purrr   0.3.4
> v tibble  3.0.5     v dplyr   1.0.3
> v tidyr   1.1.2     v stringr 1.4.0
> v readr   1.4.0     v forcats 0.5.1
> -- Conflicts ------------------------------------------------------------------------- tidyverse_conflicts() --
> x dplyr::filter() masks stats::filter()
> x dplyr::lag()    masks stats::lag()
> > arrow_available()
> [1] TRUE
> > arrow_info()
> Arrow package version: 3.0.0
>
> Capabilities:
>
> s3        TRUE
> snappy    TRUE
> gzip      TRUE
> brotli    FALSE
> zstd      TRUE
> lz4       TRUE
> lz4_frame TRUE
> lzo       FALSE
> bz2       FALSE
> jemalloc  FALSE
> mimalloc  TRUE
>
> Memory:
>
> Allocator mimalloc
> Current   0 bytes
> Max       0 bytes
>
> > ds <- open_dataset(taxidir, partitioning = c("year", "month"))
> > ds
> FileSystemDataset with 24 Parquet files
> vendor_id: string
> pickup_at: timestamp[us]
> dropoff_at: timestamp[us]
> passenger_count: int8
> trip_distance: float
> rate_code_id: string
> store_and_fwd_flag: string
> pickup_location_id: int32
> dropoff_location_id: int32
> payment_type: string
> fare_amount: float
> extra: float
> mta_tax: float
> tip_amount: float
> tolls_amount: float
> improvement_surcharge: float
> total_amount: float
> year: int32
> month: int32
>
> See $metadata for additional Schema metadata
> >
> > a <- ds %>%
> +   select(year, total_amount) %>% collect()
> >
> > b <- ds %>%
> +   filter(year == 2018) %>%
> +   select(year, total_amount) %>% collect()
> >
> > c <- ds %>%
> +   filter(total_amount > 100) %>%
> +   select(year, total_amount) %>% collect()
> {code}
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)