Re: [PR] Add LogicalPlanSignature and use in the optimizer loop [datafusion]

2024-07-05 Thread via GitHub


findepi commented on code in PR #5623:
URL: https://github.com/apache/datafusion/pull/5623#discussion_r1667008120


##
datafusion/optimizer/src/optimizer.rs:
##
@@ -330,15 +324,14 @@ impl Optimizer {
 }
 log_plan(!("Optimized plan (pass {i})"), _plan);
 
-// TODO this is an expensive way to see if the optimizer did 
anything and
-// it would be better to change the OptimizerRule trait to return 
an Option
-// instead
-if old_plan.as_ref() == _plan {
+// HashSet::insert returns, whether the value was newly inserted.
+let plan_is_fresh =
+previous_plans.insert(LogicalPlanSignature::new(_plan));
+if !plan_is_fresh {

Review Comment:
   filed https://github.com/apache/datafusion/issues/11285 for this



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Add LogicalPlanSignature and use in the optimizer loop [datafusion]

2024-07-05 Thread via GitHub


alamb commented on code in PR #5623:
URL: https://github.com/apache/datafusion/pull/5623#discussion_r1666983056


##
datafusion/optimizer/src/optimizer.rs:
##
@@ -330,15 +324,14 @@ impl Optimizer {
 }
 log_plan(!("Optimized plan (pass {i})"), _plan);
 
-// TODO this is an expensive way to see if the optimizer did 
anything and
-// it would be better to change the OptimizerRule trait to return 
an Option
-// instead
-if old_plan.as_ref() == _plan {
+// HashSet::insert returns, whether the value was newly inserted.
+let plan_is_fresh =
+previous_plans.insert(LogicalPlanSignature::new(_plan));
+if !plan_is_fresh {

Review Comment:
   Thank you for bringing this up @mslapek  and @findepi 
   
   
   > Suboptimal plan can be orders of magnitude more expensive to execute, so 
allowing it to run may cause unavailability for others, but I see your point. 
It's especially difficult to transition from bug lenient treatment to more 
strict. It should be gradual. I like the idea of having this initially 
controlled with a flag. In tests and on CI this flag should be set to "fail". 
Then we can switch the flag on for runtime as well.
   
   
   My thoughts are that I can see the tradeoffs with both behavior:
   1. Fail fast (raise an error if cycle is detected) that @findepi increases 
the liklihood that such an error would be found and fixed quickly (as it would 
prevent the query in question from running)
   2. Lenient (ignore error)  user still gets their answer (though maybe woudl 
take much longer)
   
   One middle ground might be as @findepi  suggests and use a flag -- we could 
default to raising an error if a cycle was detected but have a way for users to 
ignore the error and proceed anyways. As long as the error message said how to 
work around the error I think it would be fine.
   
   In fact we have a similar setting already for failed optimizer rules, for 
many of the same reasons discussed, that we could model the behavior on 
`datafusion.optimizer.skip_failed_rules`: 
https://datafusion.apache.org/user-guide/configs.html
   
   
   datafusion.optimizer.skip_failed_rules | false | When set to true, the 
logical plan optimizer will produce warning messages if any optimization rules 
produce errors and then proceed to the next rule. When set to false, any rules 
that produce errors will cause the query to fail
   -- | -- | --
   
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Add LogicalPlanSignature and use in the optimizer loop [datafusion]

2024-07-03 Thread via GitHub


findepi commented on code in PR #5623:
URL: https://github.com/apache/datafusion/pull/5623#discussion_r1664596130


##
datafusion/optimizer/src/optimizer.rs:
##
@@ -330,15 +324,14 @@ impl Optimizer {
 }
 log_plan(!("Optimized plan (pass {i})"), _plan);
 
-// TODO this is an expensive way to see if the optimizer did 
anything and
-// it would be better to change the OptimizerRule trait to return 
an Option
-// instead
-if old_plan.as_ref() == _plan {
+// HashSet::insert returns, whether the value was newly inserted.
+let plan_is_fresh =
+previous_plans.insert(LogicalPlanSignature::new(_plan));
+if !plan_is_fresh {

Review Comment:
   @mslapek thanks for response!
   
   > What motivated you to review this merged (one year ago) PR?
   
   I find the source PR as an efficient way to ask a question about the code 
and reach people with context -- code author(s) and reviewers -- in one shot.
   
   > Yes, a cycle shows a logical error in the optimizer - because at each step 
the performance should be improved.
   
   Agreed!
   
   > At the same time, it does NOT imply a correctness error. The plan probably 
will be still correct (but not optimal).
   
   Agreed!
   
   > We should take the perspective of a user. If a data scientist does 
research, should we harm the availability of the database?
   
   Suboptimal plan can be orders of magnitude more expensive to execute, so 
allowing it to run may cause unavailability for others, but I see your point. 
It's especially difficult to transition from bug lenient treatment to more 
strict. It should be gradual. I like the idea of having this initially 
controlled with a flag. In tests and on CI this flag should be set to "fail". 
Then we can switch the flag on for runtime as well.
   
   @alamb @jackwener thoughts?
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Add LogicalPlanSignature and use in the optimizer loop [datafusion]

2024-07-03 Thread via GitHub


mslapek commented on code in PR #5623:
URL: https://github.com/apache/datafusion/pull/5623#discussion_r1664441817


##
datafusion/optimizer/src/optimizer.rs:
##
@@ -330,15 +324,14 @@ impl Optimizer {
 }
 log_plan(!("Optimized plan (pass {i})"), _plan);
 
-// TODO this is an expensive way to see if the optimizer did 
anything and
-// it would be better to change the OptimizerRule trait to return 
an Option
-// instead
-if old_plan.as_ref() == _plan {
+// HashSet::insert returns, whether the value was newly inserted.
+let plan_is_fresh =
+previous_plans.insert(LogicalPlanSignature::new(_plan));
+if !plan_is_fresh {

Review Comment:
   Yes, a cycle shows a logical error in the optimizer - because at each step 
the performance should be improved.
   
   At the same time, it does NOT imply a correctness error. The plan probably 
will be still correct (but not optimal).
   
   > IMO cycle should be error condition
   
   We should take the perspective of a user. If a data scientist does research, 
should we harm the availability of the database?
   
   IMO Maybe a log of this situation could be useful, with a configuration flag 
turning this into an error...
   
   @findepi Btw. What motivated you to review this merged (one year ago) PR?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Add LogicalPlanSignature and use in the optimizer loop [datafusion]

2024-07-03 Thread via GitHub


findepi commented on code in PR #5623:
URL: https://github.com/apache/datafusion/pull/5623#discussion_r1664186593


##
datafusion/optimizer/src/optimizer.rs:
##
@@ -330,15 +324,14 @@ impl Optimizer {
 }
 log_plan(!("Optimized plan (pass {i})"), _plan);
 
-// TODO this is an expensive way to see if the optimizer did 
anything and
-// it would be better to change the OptimizerRule trait to return 
an Option
-// instead
-if old_plan.as_ref() == _plan {
+// HashSet::insert returns, whether the value was newly inserted.
+let plan_is_fresh =
+previous_plans.insert(LogicalPlanSignature::new(_plan));
+if !plan_is_fresh {

Review Comment:
   At this point we detected optimization cycle. Cycles are bad, so we exit. 
Exiting isn't ideal because our plan is optimized yet. We simply stopped 
applying rules.
   
   IMO cycle should be error condition



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org