[ https://issues.apache.org/jira/browse/ARROW-15590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530201#comment-17530201 ]
Weston Pace edited comment on ARROW-15590 at 4/29/22 7:16 PM: -------------------------------------------------------------- The Substrait spec (the website) doesn't always match the .proto yet. This is not a great thing but it's a work in progress. Feel free to open some PRs against the site if you want. In the meantime I find it easier to work with the proto: {noformat} message JoinRel { RelCommon common = 1; Rel left = 2; Rel right = 3; Expression expression = 4; Expression post_join_filter = 5; JoinType type = 6; ... } {noformat} The {{post_join_filter}} is not on the site today and should match {{HashJoinNodeOptions::filter}}. {{LeftInput}} and {{RightInput}} correspond to the inputs specified when adding a join to a plan and so they aren't in {{HashJoinNodeOptions}}: {noformat} MakeExecNode("hashjoin", plan.get(), {LeftInput, RightInput}, join_options)); {noformat} You are correct that we do not handle expressions in general for the join condition. So I think the best thing to do here initially is restrict the set of allowed plans. If the expression is not a call then reject it. If the expression is a call then it must be one of two functions, "equal" or "is_not_distinct_from". In either case the function has two arguments. Both arguments must be a {{FieldReference}}. We can convert from a Substrait {{FieldReference}} to an Arrow {{FieldRef}} and so that will give you left keys and right keys. There is an Arrow options {{HashJoinNodeOptions::key_cmp}}. If the Substrait function is "equal" then use {{JoinKeyCmp::Eq}}. If the Substrait function is "is_not_distinct_from" then use {{JoinKeyCmp::Is}}. With the above approach you will always have exactly one left key, one right key, and one join type. Later (could be in this PR or a follow-up) we can also handle expressions that are an and'ed set of equality expressions: {noformat} and(equal(field(3),field(5)), equal(field(1),field(7)), equal(field(2), field(12))) {noformat} In this case the number of keys/join types you have would depend on the number of equality expressions in the and (3 in the above example). was (Author: westonpace): The Substrait spec (the website) doesn't always match the .proto yet. This is not a great thing but it's a work in progress. Feel free to open some PRs against the site if you want. In the meantime I find it easier to work with the proto: ``` message JoinRel { RelCommon common = 1; Rel left = 2; Rel right = 3; Expression expression = 4; Expression post_join_filter = 5; JoinType type = 6; ... } ``` The {{post_join_filter}} is not on the site today and should match {{HashJoinNodeOptions::filter}}. {{LeftInput}} and {{RightInput}} correspond to the inputs specified when adding a join to a plan and so they aren't in {{HashJoinNodeOptions}}: ``` MakeExecNode("hashjoin", plan.get(), {LeftInput, RightInput}, join_options)); ``` You are correct that we do not handle expressions in general for the join condition. So I think the best thing to do here initially is restrict the set of allowed plans. If the expression is not a call then reject it. If the expression is a call then it must be one of two functions, "equal" or "is_not_distinct_from". In either case the function has two arguments. Both arguments must be a {{FieldReference}}. We can convert from a Substrait {{FieldReference}} to an Arrow {{FieldRef}} and so that will give you left keys and right keys. There is an Arrow options {{HashJoinNodeOptions::key_cmp}}. If the Substrait function is "equal" then use {{JoinKeyCmp::Eq}}. If the Substrait function is "is_not_distinct_from" then use {{JoinKeyCmp::Is}}. With the above approach you will always have exactly one left key, one right key, and one join type. Later (could be in this PR or a follow-up) we can also handle expressions that are an and'ed set of equality expressions: {noformat} and(equal(field(3),field(5)), equal(field(1),field(7)), equal(field(2), field(12))) {noformat} In this case the number of keys/join types you have would depend on the number of equality expressions in the and (3 in the above example). > [C++] Add support for joins to the Substrait consumer > ----------------------------------------------------- > > Key: ARROW-15590 > URL: https://issues.apache.org/jira/browse/ARROW-15590 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ > Reporter: Weston Pace > Assignee: Vibhatha Lakmal Abeykoon > Priority: Major > Labels: substrait > > The streaming execution engine supports joins. The Substrait consumer does > not currently consume joins. We should add support for this. We may want to > split this PR into subtasks as there are many different kinds of joins and we > may not support all of them immediately. -- This message was sent by Atlassian Jira (v8.20.7#820007)