vibhatha commented on PR #14295:
URL: https://github.com/apache/arrow/pull/14295#issuecomment-1269542129

   > You are totally right...sorry for the indirection! That is indeed a 
tempfile path on a Mac (I forgot to restart my session before reprexing that 
one 😬 ).
   > 
   > After fixing the issues you noted, I'm down to one failing test after 
building on your branch! Again, it's an index error but I'm _pretty sure_ it's 
not an off-by-one (or maybe you can help me spot my error!). What I _think_ 
this plan does is:
   > 
   > * Reads a 1-column Parquet
   > * Computes two new columns: a copy of the first column, and the first 
column plus 10, then uses emit to only select the second and third columns of 
the output.
   > * Computes one new column: a copy of the second column, then uses emit to 
only select the first column
   > 
   > Both of these steps work in isolation, but together, they fail.
   > 
   > ```r
   > library(arrow, warn.conflicts = FALSE)
   > #> Some features are not enabled in this build of Arrow. Run 
`arrow_info()` for more information.
   > 
   > plan_as_json <- '{
   >   "extensionUris": [
   >     {
   >       "extensionUriAnchor": 1,
   >       "uri": 
"https://github.com/substrait-io/substrait/blob/main/extensions/functions_arithmetic.yaml";
   >     }
   >   ],
   >   "extensions": [
   >     {
   >       "extensionFunction": {
   >         "extensionUriReference": 1,
   >         "functionAnchor": 2,
   >         "name": "add"
   >       }
   >     }
   >   ],
   >   "relations": [
   >     {
   >       "rel": {
   >         "project": {
   >           "common": {"emit": {"outputMapping": [2]}},
   >           "input": {
   >             "project": {
   >               "common": {"emit": {"outputMapping": [1, 2]}},
   >               "input": {
   >                 "read": {
   >                   "baseSchema": {
   >                     "names": ["int"],
   >                     "struct": {"types": [{"i32": {}}]}
   >                   },
   >                   "localFiles": {
   >                     "items": [
   >                       {"uriFile": "file://THIS_IS_THE_TEMP_FILE", 
"parquet": {}}
   >                     ]
   >                   }
   >                 }
   >               },
   >               "expressions": [
   >                 {"selection": {"directReference": {"structField": 
{"field": 0}}}},
   >                 {
   >                   "scalarFunction": {
   >                     "functionReference": 2,
   >                     "outputType": {"i32": {}},
   >                     "arguments": [
   >                       {"enum": {"unspecified": {}}},
   >                       {"value": {"selection": {"directReference": 
{"structField": {"field": 0}}}}},
   >                       {"value": {"literal": {"fp64": 10}}}
   >                     ]
   >                   }
   >                 }
   >               ]
   >             }
   >           },
   >           "expressions": [
   >             {"selection": {"directReference": {"structField": {"field": 
0}}}}
   >           ]
   >         }
   >       }
   >     }
   >   ]
   > }'
   > 
   > temp_parquet <- tempfile()
   > write_parquet(data.frame(int = 1:5), temp_parquet)
   > plan_as_json <- gsub("THIS_IS_THE_TEMP_FILE", temp_parquet, plan_as_json)
   > arrow:::do_exec_plan_substrait(plan_as_json)
   > #> Error: Invalid: No match for FieldRef.FieldPath(2) in FieldPath(1): 
int32
   > #> FieldPath(2): double
   > #> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/type.h:1796 
 CheckNonEmpty(matches, root)
   > #> 
/Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/expression.cc:437
  ref->FindOne(in)
   > #> 
/Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/project_node.cc:68
  expr.Bind(*inputs[0]->output_schema(), plan->exec_context())
   > #> 
/Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/exec_plan.cc:535
  MakeExecNode(this->factory_name, plan, std::move(inputs), *this->options, 
registry)
   > #> 
/Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/exec_plan.cc:530
  std::get<Declaration>(input).AddToPlan(plan, registry)
   > #> 
/Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/exec_plan.cc:530
  std::get<Declaration>(input).AddToPlan(plan, registry)
   > ```
   > 
   > Created on 2022-10-04 by the [reprex 
package](https://reprex.tidyverse.org) (v2.0.1)
   > 
   > The more readable dplyr synatax is:
   > 
   > ```r
   > # remotes::install_github("paleolimbot/substrait-r@enable-arrow")
   > # (requires libarrow built against #14295)
   > library(substrait, warn.conflicts = FALSE)
   > library(dplyr, warn.conflicts = FALSE)
   > 
   > # Adds a column to the end
   > data.frame(int = 1:5) |> 
   >   arrow_substrait_compiler() |> 
   >   mutate(int10 = int + 10) |> 
   >   collect()
   > #> # A tibble: 5 × 2
   > #>     int int10
   > #>   <int> <dbl>
   > #> 1     1    11
   > #> 2     2    12
   > #> 3     3    13
   > #> 4     4    14
   > #> 5     5    15
   > 
   > # Selects the first column
   > data.frame(int = 1:5) |> 
   >   arrow_substrait_compiler() |> 
   >   select(int) |> 
   >   collect()
   > #> # A tibble: 5 × 1
   > #>     int
   > #>   <int>
   > #> 1     1
   > #> 2     2
   > #> 3     3
   > #> 4     4
   > #> 5     5
   > 
   > # Chaining together fails
   > data.frame(int = 1:5) |> 
   >   arrow_substrait_compiler() |> 
   >   mutate(int10 = int + 10) |> 
   >   select(int) |> 
   >   collect()
   > #> Error: Invalid: No match for FieldRef.FieldPath(2) in FieldPath(1): 
int32
   > #> FieldPath(2): double
   > #> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/type.h:1796 
 CheckNonEmpty(matches, root)
   > #> 
/Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/expression.cc:437
  ref->FindOne(in)
   > #> 
/Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/project_node.cc:68
  expr.Bind(*inputs[0]->output_schema(), plan->exec_context())
   > #> 
/Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/exec_plan.cc:535
  MakeExecNode(this->factory_name, plan, std::move(inputs), *this->options, 
registry)
   > #> 
/Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/exec_plan.cc:530
  std::get<Declaration>(input).AddToPlan(plan, registry)
   > #> 
/Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/exec_plan.cc:530
  std::get<Declaration>(input).AddToPlan(plan, registry)
   > ```
   > 
   > Created on 2022-10-04 by the [reprex 
package](https://reprex.tidyverse.org) (v2.0.1)
   
   First of all thanks for finding out another issue. I added a fix for it. 
   
   Your understanding about the plan is accurate. 
   
   Flow of Operators: Read ==> Project -> Emit(Project) -> Project -> 
Emit(Project)
   Schema Field changing  ==>  1 Field -> 3 Fields -> 2 Fields -> 3 Fields -> 1 
Field


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to