[ https://issues.apache.org/jira/browse/ARROW-14856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nicola Crane updated ARROW-14856:
---------------------------------
    Component/s: R

> [R] group by n() on partitioning variables counts files not rows
> ----------------------------------------------------------------
>
>                 Key: ARROW-14856
>                 URL: https://issues.apache.org/jira/browse/ARROW-14856
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>            Reporter: Brandon Bertelsen
>            Priority: Major
>
> It appears that when grouping by a partitioning variable, the summarize/tally
> and n() methods now count the number of files in a group rather than the
> number of rows.
> Using R package from CRAN 6.0.0.2
> {code:java}
> library(arrow)
> library(dplyr)
> set.seed(42)
> df <- data.frame(a = sample(1:1e6))
> df$letters <- sample(letters, replace = T, 1e6)
> write_dataset(df, path = "test", partitioning = "letters", hive_style = FALSE)
> r <- read_parquet("test/a/part-0.parquet")
> nrow(r)
> # 38389
> ds <- open_dataset("test", partitioning = 'letters')
> ds %>% select(letters) %>% group_by(letters) %>% tally() %>% collect()
> # # A tibble: 26 × 2
> #    letters     n
> #    <chr>   <int>
> #  1 c           1
> #  2 p           1
> #  3 a           1
> #  4 b           1
> #  5 e           1
> #  6 q           1
> #  7 d           1
> #  8 g           1
> #  9 r           1
> # 10 h           1
> # # … with 16 more rows
> file.copy("test/a/part-0.parquet", "test/a/part-1.parquet")
> ds <- open_dataset("test", partitioning = 'letters')
> ds %>% select(letters) %>% group_by(letters) %>% tally() %>% collect() %>%
> arrange(-n)
> # # A tibble: 26 × 2
> #    letters     n
> #    <chr>   <int>
> #  1 a           2
> #  2 d           1
> #  3 f           1
> #  4 x           1
> #  5 c           1
> #  6 b           1
> #  7 e           1
> #  8 g           1
> #  9 u           1
> # 10 k           1
> # # … with 16 more rows
> # What about with summarize n = n()?
> ds %>% select(letters) %>% group_by(letters) %>% summarize(n = n()) %>%
> collect() %>% arrange(-n)
> # # A tibble: 26 × 2
> #    letters     n
> #    <chr>   <int>
> #  1 a           2
> #  2 b           1
> #  3 h           1
> #  4 g           1
> #  5 c           1
> #  6 i           1
> #  7 j           1
> #  8 s           1
> #  9 k           1
> # 10 d           1
> # # … with 16 more rows
> {code}

--
This message was sent by Atlassian Jira
(v8.20.1#820001)