[GitHub] [spark] sigmod commented on a change in pull request #34470: [SPARK-37199][SQL] Add deterministic field to QueryPlan

GitBox Wed, 03 Nov 2021 17:34:09 -0700


sigmod commented on a change in pull request #34470:
URL: https://github.com/apache/spark/pull/34470#discussion_r742194450




##########
File path: sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala
##########
@@ -1931,18 +1931,29 @@ class SubquerySuite extends QueryTest with SharedSparkSession with AdaptiveSpark
         sql(
           """
             |SELECT c1, s, s * 10 FROM (
-            |  SELECT c1, (SELECT FIRST(c2) FROM t2 WHERE t1.c1 = t2.c1) s FROM t1)
+            |  SELECT c1, (SELECT MIN(c2) FROM t2 WHERE t1.c1 = t2.c1) s FROM t1)

Review comment:
       Can we keep the test query unchanged and assert the error instead?
   
   > Just a side note - I have been arguing, that first/last should be deterministic functions
   
   +1, even though FIRST/LAST are not truly deterministic during execution.
   
   The purpose of this field is to determine whether a plan is eligible for query rewrites. Postgres has a nice categorization of function volatility:
   https://www.postgresql.org/docs/8.3/xfunc-volatility.html
   
   SUM and AVG are not completely deterministic either (when running distributed), but we can still do query optimizations over them, and I think it'd be fine for LAST/FIRST to belong to the same category. In contrast, rand() has to be marked as non-deterministic because we don't want query rewrites to move, duplicate, or dedup it.
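   
   To make the distinction concrete, here is a minimal sketch of how such a flag could gate a rewrite. It is a toy expression tree, not Catalyst's actual classes: `Expr`, `Col`, `Rand`, `Mul`, and `RewriteEligibility.canInline` are all hypothetical names made up for illustration. The idea is that inlining an expression at several use sites (like `s` and `s * 10` in the test query) duplicates its evaluation, which is only safe if the expression is flagged deterministic.

```scala
// Toy model for illustration only; these are NOT Spark Catalyst classes.
sealed trait Expr { def deterministic: Boolean }

// A plain column reference always yields the same value for the same row.
case class Col(name: String) extends Expr { val deterministic = true }

// Stand-in for rand(): each evaluation may yield a different value.
case class Rand(seed: Long) extends Expr { val deterministic = false }

case class Mul(left: Expr, right: Expr) extends Expr {
  // A composite expression is deterministic only if all its children are.
  val deterministic: Boolean = left.deterministic && right.deterministic
}

object RewriteEligibility {
  // Inlining `e` at more than one use site evaluates it more than once,
  // so the rewrite is allowed only when `e` is flagged deterministic.
  def canInline(e: Expr, useSites: Int): Boolean =
    useSites <= 1 || e.deterministic
}
```

   Under this sketch, whether FIRST/LAST may be inlined twice depends only on how the flag is set for them, which is the categorization question above, not on whether execution order is reproducible.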
   







