Brandon Bertelsen created ARROW-14856:
-----------------------------------------

             Summary: [R] group by n() on partitioning variables counts files 
not rows
                 Key: ARROW-14856
                 URL: https://issues.apache.org/jira/browse/ARROW-14856
             Project: Apache Arrow
          Issue Type: Bug
            Reporter: Brandon Bertelsen


It appears that when grouping by a partitioning variable, summarize(), 
tally(), and n() now count the number of files in a group rather than the 
number of rows.

Using the arrow R package, version 6.0.0.2, from CRAN:
{code:java}
library(arrow)
library(dplyr)
set.seed(42)

df <- data.frame(a = sample(1:1e6))
df$letters <- sample(letters, 1e6, replace = TRUE)

write_dataset(df, path = "test", partitioning = "letters", hive_style = FALSE)

r <- read_parquet("test/a/part-0.parquet")
nrow(r)
# 38389 

ds <- open_dataset("test", partitioning = 'letters')
ds %>% select(letters) %>% group_by(letters) %>% tally() %>% collect()

# # A tibble: 26 × 2
#    letters     n
#    <chr>   <int>
#  1 c           1
#  2 p           1
#  3 a           1
#  4 b           1
#  5 e           1
#  6 q           1
#  7 d           1
#  8 g           1
#  9 r           1
# 10 h           1
# # … with 16 more rows


file.copy("test/a/part-0.parquet", "test/a/part-1.parquet")

ds <- open_dataset("test", partitioning = 'letters')
ds %>% select(letters) %>% group_by(letters) %>% tally() %>% collect() %>% 
arrange(-n)
# # A tibble: 26 × 2
#    letters     n
#    <chr>   <int>
#  1 a           2
#  2 d           1
#  3 f           1
#  4 x           1
#  5 c           1
#  6 b           1
#  7 e           1
#  8 g           1
#  9 u           1
# 10 k           1
# # … with 16 more rows

# What about with summarize n = n()?

ds %>% select(letters) %>% group_by(letters) %>% summarize(n = n()) %>% 
collect() %>% arrange(-n)
# # A tibble: 26 × 2
#    letters     n
#    <chr>   <int>
#  1 a           2
#  2 b           1
#  3 h           1
#  4 g           1
#  5 c           1
#  6 i           1
#  7 j           1
#  8 s           1
#  9 k           1
# 10 d           1
# # … with 16 more rows
{code}
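
For comparison, a minimal sketch of a cross-check: collect every column into 
R first and let dplyr count in memory, then compare against the counts 
computed on the dataset. The column name grp, the 60/40 split, and the 
temporary path are made up for illustration, and the sketch assumes the data 
fits in memory. Whether this sidesteps the problem on an affected version has 
not been confirmed here.
{code:java}
library(arrow)
library(dplyr)

# Small example dataset partitioned by a hypothetical column `grp`
df <- data.frame(x = 1:100, grp = rep(c("a", "b"), times = c(60, 40)))
path <- file.path(tempdir(), "arrow_14856_check")
write_dataset(df, path = path, partitioning = "grp", hive_style = FALSE)

ds <- open_dataset(path, partitioning = "grp")

# Counts computed on the dataset (the code path this report is about)
arrow_counts <- ds %>%
  group_by(grp) %>%
  summarize(n = n()) %>%
  collect() %>%
  arrange(grp)

# Counts computed in memory by dplyr after collecting all columns
r_counts <- ds %>%
  collect() %>%
  count(grp) %>%
  arrange(grp)

print(arrow_counts)
print(r_counts)
{code}
If the two results disagree, the in-memory counts can also be compared with 
nrow(df) to see which side is off.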



--
This message was sent by Atlassian Jira
(v8.20.1#820001)