[ 
https://issues.apache.org/jira/browse/ARROW-11679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17286117#comment-17286117
 ] 

Wes McKinney commented on ARROW-11679:
--------------------------------------

I would suggest waiting until these queries can be evaluated natively against 
Arrow data; otherwise you're largely looking at dplyr benchmarks. We're working 
on implementing the algorithms, so it could be anywhere from 3-6 months from 
now, maybe sooner. 

> [R] Optimal arrow queries for benchmarking
> ------------------------------------------
>
>                 Key: ARROW-11679
>                 URL: https://issues.apache.org/jira/browse/ARROW-11679
>             Project: Apache Arrow
>          Issue Type: Task
>          Components: Benchmarking, R
>            Reporter: Jan Gorecki
>            Priority: Major
>
> Hello,
> We are running a continuous benchmarking project 
> (https://h2oai.github.io/db-benchmark). In recent days we added Arrow project.
> It uses R's dplyr and ArrowTable as backend. Queries have been written based 
> on arrow R package documentation.
> There are 10 grouping queries:
> ```
> # q1: sum v1 by id1
> AT %>% select(id1, v1) %>% group_by(id1) %>% collect() %>% 
> summarise(v1=sum(v1, na.rm=TRUE))
> # q2: sum v1 by id1:id2
> AT %>% select(id1, id2, v1) %>% group_by(id1, id2) %>% collect() %>% 
> summarise(v1=sum(v1, na.rm=TRUE))
> # q3: sum v1 mean v3 by id3
> AT %>% select(id3, v1, v3) %>% group_by(id3) %>% collect() %>% 
> summarise(v1=sum(v1, na.rm=TRUE), v3=mean(v3, na.rm=TRUE))
> # q4: mean v1:v3 by id4
> AT %>% select(id4, v1, v2, v3) %>% group_by(id4) %>% collect() %>% 
> summarise_at(.funs="mean", .vars=c("v1","v2","v3"), na.rm=TRUE)
> # q5: sum v1:v3 by id6
> AT %>% select(id6, v1, v2, v3) %>% group_by(id6) %>% collect() %>% 
> summarise_at(.funs="sum", .vars=c("v1","v2","v3"), na.rm=TRUE)
> # q6: median v3 sd v3 by id4 id5
> AT %>% select(id4, id5, v3) %>% group_by(id4, id5) %>% collect() %>% 
> summarise(median_v3=median(v3, na.rm=TRUE), sd_v3=sd(v3, na.rm=TRUE))
> # q7: max v1 - min v2 by id3
> AT %>% select(id3, v1, v2) %>% group_by(id3) %>% collect() %>% 
> summarise(range_v1_v2=max(v1, na.rm=TRUE)-min(v2, na.rm=TRUE))
> # q8: largest two v3 by id6
> AT %>% select(id6, largest2_v3=v3) %>% filter(!is.na(largest2_v3)) %>% 
> arrange(desc(largest2_v3)) %>% group_by(id6) %>% filter(row_number() <= 2L) 
> %>% compute()
> # q9: regression v1 v2 by id2 id4
> AT %>% select(id2, id4, v1, v2) %>% group_by(id2, id4) %>% collect() %>% 
> summarise(r2=cor(v1, v2, use="na.or.complete")^2)
> # q10: sum v3 count by id1:id6
> AT %>% select(id1, id2, id3, id4, id5, id6, v3) %>% group_by(id1, id2, id3, 
> id4, id5, id6) %>% collect() %>% summarise(v3=sum(v3, na.rm=TRUE), count=n())
> ```
> Full benchmark script can be found at 
> https://github.com/h2oai/db-benchmark/blob/master/arrow/groupby-arrow.R
> ----
> As per my understanding, all of the above queries (except perhaps query 8) 
> will not utilize any arrow computation as of now, because those operations 
> are not yet implemented in arrow and fall back to the dplyr implementation.
> According to Neal's presentation, which I watched recently, code written now 
> will speed up over time as the arrow implementation improves. The continuous 
> benchmark I am working on upgrades software automatically, so I would like 
> to take advantage of that: write the code now and have it get faster in the 
> future as the arrow implementation progresses. I believe the queries above 
> will not satisfy that because of the `collect()` call in the middle. AFAIU, 
> they need a `compute()` call at the end instead (as query 8 has now).
> Is there a way to write this code so that it is optimal now and also optimal 
> in the future, similar to what Neal presented?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
