[ https://issues.apache.org/jira/browse/ARROW-11679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Weston Pace updated ARROW-11679: -------------------------------- Summary: [R] Optimal arrow queries for benchmarking (was: Optimal arrow queries for benchmarking) > [R] Optimal arrow queries for benchmarking > ------------------------------------------ > > Key: ARROW-11679 > URL: https://issues.apache.org/jira/browse/ARROW-11679 > Project: Apache Arrow > Issue Type: Task > Components: Benchmarking, R > Reporter: Jan Gorecki > Priority: Major > > Hi > Hello, > We are running a continuous benchmarking project > (https://h2oai.github.io/db-benchmark). In recent days we added Arrow project. > It uses R's dplyr and ArrowTable as backend. Queries have been written based > on arrow R package documentation. > There are 10 grouping queries: > ``` > # q1: sum v1 by id1 > AT %>% select(id1, v1) %>% group_by(id1) %>% collect() %>% > summarise(v1=sum(v1, na.rm=TRUE)) > # q2: sum v1 by id1:id2 > AT %>% select(id1, id2, v1) %>% group_by(id1, id2) %>% collect() %>% > summarise(v1=sum(v1, na.rm=TRUE)) > # q3: sum v1 mean v3 by id3 > AT %>% select(id3, v1, v3) %>% group_by(id3) %>% collect() %>% > summarise(v1=sum(v1, na.rm=TRUE), v3=mean(v3, na.rm=TRUE)) > # q4: mean v1:v3 by id4 > AT %>% select(id4, v1, v2, v3) %>% group_by(id4) %>% collect() %>% > summarise_at(.funs=\"mean\", .vars=c(\"v1\",\"v2\",\"v3\"), na.rm=TRUE) > # q5: sum v1:v3 by id6 > AT %>% select(id6, v1, v2, v3) %>% group_by(id6) %>% collect () %>% > summarise_at(.funs=\"sum\", .vars=c(\"v1\",\"v2\",\"v3\"), na.rm=TRUE) > # q6: median v3 sd v3 by id4 id5 > AT %>% select(id4, id5, v3) %>% group_by(id4, id5) %>% collect() %>% > summarise(median_v3=median(v3, na.rm=TRUE), sd_v3=sd(v3, na.rm=TRUE)) > # q7: max v1 - min v2 by id3 > AT %>% select(id3, v1, v2) %>% group_by(id3) %>% collect() %>% > summarise(range_v1_v2=max(v1, na.rm=TRUE)-min(v2, na.rm=TRUE)) > # q8: largest two v3 by id6 > AT %>% select(id6, largest2_v3=v3) %>% filter(!is.na(largest2_v3)) %>% > arrange(desc(largest2_v3)) %>% group_by(id6) %>% filter(row_number() <= 2L) > %>% compute() > # q9: regression v1 v2 by id2 id4 > AT %>% select(id2, id4, v1, v2) %>% group_by(id2, id4) %>% collect() %>% > summarise(r2=cor(v1, v2, use=\"na.or.complete\")^2) > # q10: sum v3 count by id1:id6 > AT %>% select(id1, id2, id3, id4, id5, id6, v3) %>% group_by(id1, id2, id3, > id4, id5, id6) %>% collect() %>% summarise(v3=sum(v3, na.rm=TRUE), count=n()) > ``` > Full benchmark script can be found at > https://github.com/h2oai/db-benchmark/blob/master/arrow/groupby-arrow.R > ---- > As per my understanding, all above queries (maybe excluding query 8) will not > utilize any arrow computation, as of now. It is because those operations are > not yet implemented in arrow, and they are falling back to dplyr > implementation. > According to Neal's presentation I watched recently, code written now will > over time get improved by improvements in arrow implementation. Continuous > benchmark I am working on upgrades software automatically, therefore I would > like to use the fact to write code now, and have it faster in future, as > arrow implementation progresses. I believe the mentioned queries will not > satisfy that, because of `collect()` call in the middle. AFAIU it needs a > `compute()` call at the end instead (like now in query 8). > Is there a way to write this code to be optimal now, and also optimal in > future. Similarly as presented by Neal in his presentation? -- This message was sent by Atlassian Jira (v8.3.4#803005)