vibhatha commented on PR #14295:
URL: https://github.com/apache/arrow/pull/14295#issuecomment-1269542129
> You are totally right...sorry for the indirection! That is indeed a
tempfile path on a Mac (I forgot to restart my session before reprexing that
one 😬 ).
>
> After fixing the issues you noted, I'm down to one failing test after
building on your branch! Again, it's an index error but I'm _pretty sure_ it's
not an off-by-one (or maybe you can help me spot my error!). What I _think_
this plan does is:
>
> * Reads a 1-column Parquet
> * Computes two new columns: a copy of the first column, and the first
column plus 10, then uses emit to only select the second and third columns of
the output.
> * Computes one new column: a copy of the second column, then uses emit to
only select the first column
>
> Both of these steps work in isolation, but together, they fail.
>
> ```r
> library(arrow, warn.conflicts = FALSE)
> #> Some features are not enabled in this build of Arrow. Run
`arrow_info()` for more information.
>
> plan_as_json <- '{
> "extensionUris": [
> {
> "extensionUriAnchor": 1,
> "uri":
"https://github.com/substrait-io/substrait/blob/main/extensions/functions_arithmetic.yaml"
> }
> ],
> "extensions": [
> {
> "extensionFunction": {
> "extensionUriReference": 1,
> "functionAnchor": 2,
> "name": "add"
> }
> }
> ],
> "relations": [
> {
> "rel": {
> "project": {
> "common": {"emit": {"outputMapping": [2]}},
> "input": {
> "project": {
> "common": {"emit": {"outputMapping": [1, 2]}},
> "input": {
> "read": {
> "baseSchema": {
> "names": ["int"],
> "struct": {"types": [{"i32": {}}]}
> },
> "localFiles": {
> "items": [
> {"uriFile": "file://THIS_IS_THE_TEMP_FILE",
"parquet": {}}
> ]
> }
> }
> },
> "expressions": [
> {"selection": {"directReference": {"structField":
{"field": 0}}}},
> {
> "scalarFunction": {
> "functionReference": 2,
> "outputType": {"i32": {}},
> "arguments": [
> {"enum": {"unspecified": {}}},
> {"value": {"selection": {"directReference":
{"structField": {"field": 0}}}}},
> {"value": {"literal": {"fp64": 10}}}
> ]
> }
> }
> ]
> }
> },
> "expressions": [
> {"selection": {"directReference": {"structField": {"field":
0}}}}
> ]
> }
> }
> }
> ]
> }'
>
> temp_parquet <- tempfile()
> write_parquet(data.frame(int = 1:5), temp_parquet)
> plan_as_json <- gsub("THIS_IS_THE_TEMP_FILE", temp_parquet, plan_as_json)
> arrow:::do_exec_plan_substrait(plan_as_json)
> #> Error: Invalid: No match for FieldRef.FieldPath(2) in FieldPath(1):
int32
> #> FieldPath(2): double
> #> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/type.h:1796
CheckNonEmpty(matches, root)
> #>
/Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/expression.cc:437
ref->FindOne(in)
> #>
/Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/project_node.cc:68
expr.Bind(*inputs[0]->output_schema(), plan->exec_context())
> #>
/Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/exec_plan.cc:535
MakeExecNode(this->factory_name, plan, std::move(inputs), *this->options,
registry)
> #>
/Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/exec_plan.cc:530
std::get<Declaration>(input).AddToPlan(plan, registry)
> #>
/Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/exec_plan.cc:530
std::get<Declaration>(input).AddToPlan(plan, registry)
> ```
>
> Created on 2022-10-04 by the [reprex
package](https://reprex.tidyverse.org) (v2.0.1)
>
> The more readable dplyr synatax is:
>
> ```r
> # remotes::install_github("paleolimbot/substrait-r@enable-arrow")
> # (requires libarrow built against #14295)
> library(substrait, warn.conflicts = FALSE)
> library(dplyr, warn.conflicts = FALSE)
>
> # Adds a column to the end
> data.frame(int = 1:5) |>
> arrow_substrait_compiler() |>
> mutate(int10 = int + 10) |>
> collect()
> #> # A tibble: 5 × 2
> #> int int10
> #> <int> <dbl>
> #> 1 1 11
> #> 2 2 12
> #> 3 3 13
> #> 4 4 14
> #> 5 5 15
>
> # Selects the first column
> data.frame(int = 1:5) |>
> arrow_substrait_compiler() |>
> select(int) |>
> collect()
> #> # A tibble: 5 × 1
> #> int
> #> <int>
> #> 1 1
> #> 2 2
> #> 3 3
> #> 4 4
> #> 5 5
>
> # Chaining together fails
> data.frame(int = 1:5) |>
> arrow_substrait_compiler() |>
> mutate(int10 = int + 10) |>
> select(int) |>
> collect()
> #> Error: Invalid: No match for FieldRef.FieldPath(2) in FieldPath(1):
int32
> #> FieldPath(2): double
> #> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/type.h:1796
CheckNonEmpty(matches, root)
> #>
/Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/expression.cc:437
ref->FindOne(in)
> #>
/Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/project_node.cc:68
expr.Bind(*inputs[0]->output_schema(), plan->exec_context())
> #>
/Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/exec_plan.cc:535
MakeExecNode(this->factory_name, plan, std::move(inputs), *this->options,
registry)
> #>
/Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/exec_plan.cc:530
std::get<Declaration>(input).AddToPlan(plan, registry)
> #>
/Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/exec_plan.cc:530
std::get<Declaration>(input).AddToPlan(plan, registry)
> ```
>
> Created on 2022-10-04 by the [reprex
package](https://reprex.tidyverse.org) (v2.0.1)
First of all thanks for finding out another issue. I added a fix for it.
Your understanding about the plan is accurate.
Flow of Operators: Read ==> Project -> Emit(Project) -> Project ->
Emit(Project)
Schema Field changing ==> 1 Field -> 3 Fields -> 2 Fields -> 3 Fields -> 1
Field
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]