Re: [PR] fix: handle duplicate WindowFunction expressions in Substrait consumer [datafusion]

2025-03-24 Thread via GitHub


Blizzara commented on PR #15211:
URL: https://github.com/apache/datafusion/pull/15211#issuecomment-273805

   > I merged this PR up from main to resolve a conflict and plan to merge it 
when the CI passes
   
   Thanks @alamb , should be good to go!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: handle duplicate WindowFunction expressions in Substrait consumer [datafusion]

2025-03-19 Thread via GitHub


Blizzara commented on code in PR #15211:
URL: https://github.com/apache/datafusion/pull/15211#discussion_r2004258327


##
datafusion/substrait/src/logical_plan/consumer.rs:
##
@@ -1074,6 +1074,9 @@ pub async fn from_project_rel(
 // leaving only explicit expressions.
 
 let mut explicit_exprs: Vec = vec![];
+// For WindowFunctions, we need to wrap them in a Window relation. If 
there are duplicates,
+// we can do the window'ing only once, then the project will duplicate 
the result.
+let mut window_exprs: HashSet = HashSet::new();

Review Comment:
   Actually no, the ordering we use here doesn't matter, since the 
LogicalPlanBuilder::window_plan calls `group_window_expr_by_sort_keys` which 
sorts the expressions anyways. At least I think so. So I just added a comment 
to note that: 
https://github.com/apache/datafusion/pull/15211/commits/696bf5d95980f5bbcdd3809d1d0944a63f18588b



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: handle duplicate WindowFunction expressions in Substrait consumer [datafusion]

2025-03-19 Thread via GitHub


alamb commented on PR #15211:
URL: https://github.com/apache/datafusion/pull/15211#issuecomment-2737963386

   I merged this PR up from main to resolve a conflict and plan to merge it 
when the CI passes


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: handle duplicate WindowFunction expressions in Substrait consumer [datafusion]

2025-03-19 Thread via GitHub


alamb commented on code in PR #15211:
URL: https://github.com/apache/datafusion/pull/15211#discussion_r2004425310


##
datafusion/substrait/src/logical_plan/consumer.rs:
##
@@ -1074,6 +1074,9 @@ pub async fn from_project_rel(
 // leaving only explicit expressions.
 
 let mut explicit_exprs: Vec = vec![];
+// For WindowFunctions, we need to wrap them in a Window relation. If 
there are duplicates,
+// we can do the window'ing only once, then the project will duplicate 
the result.
+let mut window_exprs: HashSet = HashSet::new();

Review Comment:
   Thanks @Blizzara 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: handle duplicate WindowFunction expressions in Substrait consumer [datafusion]

2025-03-19 Thread via GitHub


alamb merged PR #15211:
URL: https://github.com/apache/datafusion/pull/15211


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: handle duplicate WindowFunction expressions in Substrait consumer [datafusion]

2025-03-19 Thread via GitHub


Blizzara commented on code in PR #15211:
URL: https://github.com/apache/datafusion/pull/15211#discussion_r2004246503


##
datafusion/substrait/src/logical_plan/consumer.rs:
##
@@ -1074,6 +1074,9 @@ pub async fn from_project_rel(
 // leaving only explicit expressions.
 
 let mut explicit_exprs: Vec = vec![];
+// For WindowFunctions, we need to wrap them in a Window relation. If 
there are duplicates,
+// we can do the window'ing only once, then the project will duplicate 
the result.
+let mut window_exprs: HashSet = HashSet::new();

Review Comment:
   Hm, I don't think it matters per se, since the project below then puts 
things into right places based on the names. However, for consistency it might 
be nice for the order to stay. Let me see if I can quickly do that!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: handle duplicate WindowFunction expressions in Substrait consumer [datafusion]

2025-03-17 Thread via GitHub


alamb commented on code in PR #15211:
URL: https://github.com/apache/datafusion/pull/15211#discussion_r1999332505


##
datafusion/substrait/src/logical_plan/consumer.rs:
##
@@ -1074,6 +1074,9 @@ pub async fn from_project_rel(
 // leaving only explicit expressions.
 
 let mut explicit_exprs: Vec = vec![];
+// For WindowFunctions, we need to wrap them in a Window relation. If 
there are duplicates,
+// we can do the window'ing only once, then the project will duplicate 
the result.
+let mut window_exprs: HashSet = HashSet::new();

Review Comment:
   It seems like an improvement over what is on main, so I think we could merge 
this PR as is, but this seems the potential reordering might cause potentially 
confusing intermittently variable output



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: handle duplicate WindowFunction expressions in Substrait consumer [datafusion]

2025-03-17 Thread via GitHub


alamb commented on code in PR #15211:
URL: https://github.com/apache/datafusion/pull/15211#discussion_r1999329931


##
datafusion/substrait/src/logical_plan/consumer.rs:
##
@@ -1074,6 +1074,9 @@ pub async fn from_project_rel(
 // leaving only explicit expressions.
 
 let mut explicit_exprs: Vec = vec![];
+// For WindowFunctions, we need to wrap them in a Window relation. If 
there are duplicates,
+// we can do the window'ing only once, then the project will duplicate 
the result.
+let mut window_exprs: HashSet = HashSet::new();

Review Comment:
   Doesn't order of the expressions matter? If so, I think you could use an 
`IndexSet` rather than an `HashSet` to preserve the input order too



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: handle duplicate WindowFunction expressions in Substrait consumer [datafusion]

2025-03-16 Thread via GitHub


westonpace commented on PR #15211:
URL: https://github.com/apache/datafusion/pull/15211#issuecomment-2727463180

   > By "serializer", do you mean "consumer" or "producer"? This PR is for the 
part that turns Substrait into DF plans, so as far as I know the plan is valid 
Substrait and this PR helps make it valid DF as well. Not sure if I understood 
the question though?
   
   It was my mistake.  I misunderstood the PR.  I see now this is a change in 
the consumer and it makes perfect sense now.
   
   For some reason I though this was a change to the producer and I was 
confused.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: handle duplicate WindowFunction expressions in Substrait consumer [datafusion]

2025-03-14 Thread via GitHub


Blizzara commented on PR #15211:
URL: https://github.com/apache/datafusion/pull/15211#issuecomment-2725937469

   > Does this error come from Datafusion or from Substrait? I wouldn't think 
Substrait would care about duplicate names (since it doesn't use names).
   
   DataFusion. Substrait indeed doesn't care.
   
   > If the error is coming from Datafusion then maybe it's not a valid logical 
plan?
   
   Exactly. It's valid Substrait (produced by Isthmus, and why would it not 
be?), but not valid DF since DF has extra conditions on the logical plan.
   
   > I'm not opposed to the change but do we want to have the Substrait 
serializer be robust to invalid input? That seems like it could be a slippery 
slope to more work.
   
   By "serializer", do you mean "consumer" or "producer"? This PR is for the 
part that turns Substrait into DF plans, so as far as I know the plan is valid 
Substrait and this PR helps make it valid DF as well. Not sure if I understood 
the question though?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: handle duplicate WindowFunction expressions in Substrait consumer [datafusion]

2025-03-13 Thread via GitHub


Blizzara commented on code in PR #15211:
URL: https://github.com/apache/datafusion/pull/15211#discussion_r1994161768


##
datafusion/substrait/src/logical_plan/consumer.rs:
##
@@ -1083,18 +1086,24 @@ pub async fn from_project_rel(
 // Adding the same expression here and in the project below
 // works because the project's builder uses columnize_expr(..)
 // to transform it into a column reference
-input = input.window(vec![e.clone()])?
+window_exprs.insert(e.clone());
 }
 explicit_exprs.push(name_tracker.get_uniquely_named_expr(e)?);
 }
 
+let input = if !window_exprs.is_empty() {
+LogicalPlanBuilder::window_plan(input, window_exprs)?

Review Comment:
   This has logic built in to separate the expressions by their windows, so 
it's kinda nice.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: handle duplicate WindowFunction expressions in Substrait consumer [datafusion]

2025-03-13 Thread via GitHub


westonpace commented on PR #15211:
URL: https://github.com/apache/datafusion/pull/15211#issuecomment-2722699472

   > Currently duplicate WindowFunction calls would fail due to duplicate 
unqualified name.
   
   Does this error come from Datafusion or from Substrait?  I wouldn't think 
Substrait would care about duplicate names (since it doesn't use names).
   
   If the error is coming from Datafusion then maybe it's not a valid logical 
plan?
   
   I'm not opposed to the change but do we want to have the Substrait 
serializer be robust to invalid input?  That seems like it could be a slippery 
slope to more work.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: handle duplicate WindowFunction expressions in Substrait consumer [datafusion]

2025-03-13 Thread via GitHub


Blizzara commented on code in PR #15211:
URL: https://github.com/apache/datafusion/pull/15211#discussion_r1993981839


##
datafusion/expr/src/logical_plan/builder.rs:
##
@@ -486,7 +486,7 @@ impl LogicalPlanBuilder {
 /// Wrap a plan in a window
 pub fn window_plan(
 input: LogicalPlan,
-window_exprs: Vec,
+window_exprs: impl IntoIterator,

Review Comment:
   this allows feeding in a HashSet



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: handle duplicate WindowFunction expressions in Substrait consumer [datafusion]

2025-03-13 Thread via GitHub


alamb commented on PR #15211:
URL: https://github.com/apache/datafusion/pull/15211#issuecomment-2722057494

   > Fyi @vbarua @westonpace @alamb, don't remember now if there were others to 
tag :)
   
   this is the best list I know of !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: handle duplicate WindowFunction expressions in Substrait consumer [datafusion]

2025-03-13 Thread via GitHub


Blizzara commented on PR #15211:
URL: https://github.com/apache/datafusion/pull/15211#issuecomment-2722050131

   Fyi @vbarua @westonpace @alamb, don't remember now if there were others to 
tag :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: handle duplicate WindowFunction expressions in Substrait consumer [datafusion]

2025-03-13 Thread via GitHub


Blizzara commented on code in PR #15211:
URL: https://github.com/apache/datafusion/pull/15211#discussion_r1993980557


##
datafusion/substrait/tests/cases/logical_plans.rs:
##
@@ -45,6 +45,10 @@ mod tests {
 "Projection: NOT DATA.D AS EXPR$0\
 \n  TableScan: DATA"
 );
+
+// Trigger execution to ensure plan validity
+DataFrame::new(ctx.state(), plan).show().await?;

Review Comment:
   I just added these for all of the cases, since some of my iterations passed 
logical plan creation but failed at physical planning stage.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org