Re: [PR] fix: handle duplicate WindowFunction expressions in Substrait consumer [datafusion]
Blizzara commented on PR #15211: URL: https://github.com/apache/datafusion/pull/15211#issuecomment-273805 > I merged this PR up from main to resolve a conflict and plan to merge it when the CI passes Thanks @alamb , should be good to go! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: handle duplicate WindowFunction expressions in Substrait consumer [datafusion]
Blizzara commented on code in PR #15211: URL: https://github.com/apache/datafusion/pull/15211#discussion_r2004258327 ## datafusion/substrait/src/logical_plan/consumer.rs: ## @@ -1074,6 +1074,9 @@ pub async fn from_project_rel( // leaving only explicit expressions. let mut explicit_exprs: Vec = vec![]; +// For WindowFunctions, we need to wrap them in a Window relation. If there are duplicates, +// we can do the window'ing only once, then the project will duplicate the result. +let mut window_exprs: HashSet = HashSet::new(); Review Comment: Actually no, the ordering we use here doesn't matter, since the LogicalPlanBuilder::window_plan calls `group_window_expr_by_sort_keys` which sorts the expressions anyways. At least I think so. So I just added a comment to note that: https://github.com/apache/datafusion/pull/15211/commits/696bf5d95980f5bbcdd3809d1d0944a63f18588b -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: handle duplicate WindowFunction expressions in Substrait consumer [datafusion]
alamb commented on PR #15211: URL: https://github.com/apache/datafusion/pull/15211#issuecomment-2737963386 I merged this PR up from main to resolve a conflict and plan to merge it when the CI passes -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: handle duplicate WindowFunction expressions in Substrait consumer [datafusion]
alamb commented on code in PR #15211: URL: https://github.com/apache/datafusion/pull/15211#discussion_r2004425310 ## datafusion/substrait/src/logical_plan/consumer.rs: ## @@ -1074,6 +1074,9 @@ pub async fn from_project_rel( // leaving only explicit expressions. let mut explicit_exprs: Vec = vec![]; +// For WindowFunctions, we need to wrap them in a Window relation. If there are duplicates, +// we can do the window'ing only once, then the project will duplicate the result. +let mut window_exprs: HashSet = HashSet::new(); Review Comment: Thanks @Blizzara -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: handle duplicate WindowFunction expressions in Substrait consumer [datafusion]
alamb merged PR #15211: URL: https://github.com/apache/datafusion/pull/15211 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: handle duplicate WindowFunction expressions in Substrait consumer [datafusion]
Blizzara commented on code in PR #15211: URL: https://github.com/apache/datafusion/pull/15211#discussion_r2004246503 ## datafusion/substrait/src/logical_plan/consumer.rs: ## @@ -1074,6 +1074,9 @@ pub async fn from_project_rel( // leaving only explicit expressions. let mut explicit_exprs: Vec = vec![]; +// For WindowFunctions, we need to wrap them in a Window relation. If there are duplicates, +// we can do the window'ing only once, then the project will duplicate the result. +let mut window_exprs: HashSet = HashSet::new(); Review Comment: Hm, I don't think it matters per se, since the project below then puts things into right places based on the names. However, for consistency it might be nice for the order to stay. Let me see if I can quickly do that! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: handle duplicate WindowFunction expressions in Substrait consumer [datafusion]
alamb commented on code in PR #15211: URL: https://github.com/apache/datafusion/pull/15211#discussion_r1999332505 ## datafusion/substrait/src/logical_plan/consumer.rs: ## @@ -1074,6 +1074,9 @@ pub async fn from_project_rel( // leaving only explicit expressions. let mut explicit_exprs: Vec = vec![]; +// For WindowFunctions, we need to wrap them in a Window relation. If there are duplicates, +// we can do the window'ing only once, then the project will duplicate the result. +let mut window_exprs: HashSet = HashSet::new(); Review Comment: It seems like an improvement over what is on main, so I think we could merge this PR as is, but this seems the potential reordering might cause potentially confusing intermittently variable output -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: handle duplicate WindowFunction expressions in Substrait consumer [datafusion]
alamb commented on code in PR #15211: URL: https://github.com/apache/datafusion/pull/15211#discussion_r1999329931 ## datafusion/substrait/src/logical_plan/consumer.rs: ## @@ -1074,6 +1074,9 @@ pub async fn from_project_rel( // leaving only explicit expressions. let mut explicit_exprs: Vec = vec![]; +// For WindowFunctions, we need to wrap them in a Window relation. If there are duplicates, +// we can do the window'ing only once, then the project will duplicate the result. +let mut window_exprs: HashSet = HashSet::new(); Review Comment: Doesn't order of the expressions matter? If so, I think you could use an `IndexSet` rather than an `HashSet` to preserve the input order too -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: handle duplicate WindowFunction expressions in Substrait consumer [datafusion]
westonpace commented on PR #15211: URL: https://github.com/apache/datafusion/pull/15211#issuecomment-2727463180 > By "serializer", do you mean "consumer" or "producer"? This PR is for the part that turns Substrait into DF plans, so as far as I know the plan is valid Substrait and this PR helps make it valid DF as well. Not sure if I understood the question though? It was my mistake. I misunderstood the PR. I see now this is a change in the consumer and it makes perfect sense now. For some reason I though this was a change to the producer and I was confused. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: handle duplicate WindowFunction expressions in Substrait consumer [datafusion]
Blizzara commented on PR #15211: URL: https://github.com/apache/datafusion/pull/15211#issuecomment-2725937469 > Does this error come from Datafusion or from Substrait? I wouldn't think Substrait would care about duplicate names (since it doesn't use names). DataFusion. Substrait indeed doesn't care. > If the error is coming from Datafusion then maybe it's not a valid logical plan? Exactly. It's valid Substrait (produced by Isthmus, and why would it not be?), but not valid DF since DF has extra conditions on the logical plan. > I'm not opposed to the change but do we want to have the Substrait serializer be robust to invalid input? That seems like it could be a slippery slope to more work. By "serializer", do you mean "consumer" or "producer"? This PR is for the part that turns Substrait into DF plans, so as far as I know the plan is valid Substrait and this PR helps make it valid DF as well. Not sure if I understood the question though? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: handle duplicate WindowFunction expressions in Substrait consumer [datafusion]
Blizzara commented on code in PR #15211: URL: https://github.com/apache/datafusion/pull/15211#discussion_r1994161768 ## datafusion/substrait/src/logical_plan/consumer.rs: ## @@ -1083,18 +1086,24 @@ pub async fn from_project_rel( // Adding the same expression here and in the project below // works because the project's builder uses columnize_expr(..) // to transform it into a column reference -input = input.window(vec![e.clone()])? +window_exprs.insert(e.clone()); } explicit_exprs.push(name_tracker.get_uniquely_named_expr(e)?); } +let input = if !window_exprs.is_empty() { +LogicalPlanBuilder::window_plan(input, window_exprs)? Review Comment: This has logic built in to separate the expressions by their windows, so it's kinda nice. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: handle duplicate WindowFunction expressions in Substrait consumer [datafusion]
westonpace commented on PR #15211: URL: https://github.com/apache/datafusion/pull/15211#issuecomment-2722699472 > Currently duplicate WindowFunction calls would fail due to duplicate unqualified name. Does this error come from Datafusion or from Substrait? I wouldn't think Substrait would care about duplicate names (since it doesn't use names). If the error is coming from Datafusion then maybe it's not a valid logical plan? I'm not opposed to the change but do we want to have the Substrait serializer be robust to invalid input? That seems like it could be a slippery slope to more work. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: handle duplicate WindowFunction expressions in Substrait consumer [datafusion]
Blizzara commented on code in PR #15211: URL: https://github.com/apache/datafusion/pull/15211#discussion_r1993981839 ## datafusion/expr/src/logical_plan/builder.rs: ## @@ -486,7 +486,7 @@ impl LogicalPlanBuilder { /// Wrap a plan in a window pub fn window_plan( input: LogicalPlan, -window_exprs: Vec, +window_exprs: impl IntoIterator, Review Comment: this allows feeding in a HashSet -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: handle duplicate WindowFunction expressions in Substrait consumer [datafusion]
alamb commented on PR #15211: URL: https://github.com/apache/datafusion/pull/15211#issuecomment-2722057494 > Fyi @vbarua @westonpace @alamb, don't remember now if there were others to tag :) this is the best list I know of ! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: handle duplicate WindowFunction expressions in Substrait consumer [datafusion]
Blizzara commented on PR #15211: URL: https://github.com/apache/datafusion/pull/15211#issuecomment-2722050131 Fyi @vbarua @westonpace @alamb, don't remember now if there were others to tag :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: handle duplicate WindowFunction expressions in Substrait consumer [datafusion]
Blizzara commented on code in PR #15211: URL: https://github.com/apache/datafusion/pull/15211#discussion_r1993980557 ## datafusion/substrait/tests/cases/logical_plans.rs: ## @@ -45,6 +45,10 @@ mod tests { "Projection: NOT DATA.D AS EXPR$0\ \n TableScan: DATA" ); + +// Trigger execution to ensure plan validity +DataFrame::new(ctx.state(), plan).show().await?; Review Comment: I just added these for all of the cases, since some of my iterations passed logical plan creation but failed at physical planning stage. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org