[GitHub] [arrow] dragosmg commented on a diff in pull request #13541: ARROW-15016: [R] `show_exec_plan` for an `arrow_dplyr_query`

GitBox Wed, 13 Jul 2022 03:12:24 -0700


dragosmg commented on code in PR #13541:
URL: https://github.com/apache/arrow/pull/13541#discussion_r919899240



##########
r/R/dplyr.R:
##########
@@ -219,6 +219,31 @@ tail.arrow_dplyr_query <- function(x, n = 6L, ...) {
   x
 }
 
+#' Show the details of an Arrow Execution Plan
+#'
+#' This is a function which gives more details about the Execution Plan 
(`ExecPlan`)
+#' of an `arrow_dplyr_query` object. It is similar to `dplyr::explain()`.
+#'
+#' @param x an `arrow_dplyr_query` to print the `ExecPlan` for.
+#'
+#' @return The argument, invisibly.
+#' @export
+#'
+#' @examplesIf arrow_with_dataset() & requireNamespace("dplyr", quietly = TRUE)
+#' library(dplyr)
+#' mtcars %>%
+#'   arrow_table() %>%
+#'   filter(mpg > 20) %>%
+#'   mutate(x = gear/carb) %>%
+#'   show_exec_plan()
+show_exec_plan <- function(x) {

Review Comment:
   Here's a link to the [design 
doc](https://docs.google.com/document/d/1Ep8aV4jDsNCCy9uv1bjWY_JF17nzHQogv0EnGJvraQI/edit),
 also linked in the PR description and the Jira ticket.
   
   In short, in my opinion, a _first step_ would be to have `show_exec_plan()` 
as a function that outputs exactly what we get from C++. That's what I 
interpreted the scope of the 
[ticket](https://issues.apache.org/jira/browse/ARROW-15016) to be.
   
   My proposal rests on this point-of-view: I think we need, for accessibility 
reasons, 2 separate ways of surfacing details regarding the ExecPlan, for 2 
separate audiences (one readable by the seasoned Arrow developer and one 
readable by the regular R user). 
   1. `show_exec_plan()` would cater to the first audience and, thus, focus on 
minimising cognitive load by keeping things the same across languages. 
   2. `explain()` should probably not aim to be both and would be targeted 
towards the second audience and do it well. By _well_ I understand here as in a 
way that makes sense to the regular R user, who expects arrow to just work. In 
this context, `explain()` would be a tool allowing them to inspect what is 
going on in a dplyr-like pipeline, but do so in a language they are familiar 
with. 
   
   Having an `explain()` method would absolutely be super useful, but I do not 
think that is fully scoped out yet. Probably, before we start the 
implementation of `explain()`, we should flesh out what the output of 
`explain()` will look like. I strongly believe we should start from the 
description of the generic which says: "gives more details about an object 
(...) and is more focused on _human readable_ output" (my underlining). 
   
   In this PR there are already some really good suggestions of stuff we could 
include in the output of a function like `explain()` (e.g. to cover 
`query_can_stream()`). 
   
   If we add to the above the time constraint of the impending release, I do 
not think extending this PR to cover  `explain()` would be the best course of 
action.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow] dragosmg commented on a diff in pull request #13541: ARROW-15016: [R] `show_exec_plan` for an `arrow_dplyr_query`

Reply via email to