[jira] [Commented] (SPARK-26639) The reuse subquery function maybe does not work in SPARK SQL
[ https://issues.apache.org/jira/browse/SPARK-26639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17512501#comment-17512501 ]

Stu commented on SPARK-26639:
-----------------------------

Ah, thanks for sharing that, [~petertoth]!

> The reuse subquery function maybe does not work in SPARK SQL
>
> Key: SPARK-26639
> URL: https://issues.apache.org/jira/browse/SPARK-26639
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: Ke Jia
> Priority: Major
>
> The subquery reuse feature was implemented in
> [https://github.com/apache/spark/pull/14548]
> In my test, I found that the visualized plan does show the subquery
> executing only once, but the stage for that same subquery may run more
> than once.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-26639) The reuse subquery function maybe does not work in SPARK SQL
[ https://issues.apache.org/jira/browse/SPARK-26639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17511358#comment-17511358 ]

Stu edited comment on SPARK-26639 at 3/24/22, 10:13 PM:
--------------------------------------------------------

Here's another example of this happening, in Spark 3.1.2. I'm running the following code:

{code:sql}
WITH t AS (SELECT random() AS a)
SELECT * FROM t
UNION
SELECT * FROM t
{code}

The CTE uses a non-deterministic function. If it were pre-calculated, the same random value would be chosen for `a` in both unioned queries, and the output would be deduplicated into a single record. That is not the case: the output is two records with different random values.

In our platform, some folks like to write complex CTEs and reference them multiple times. Recalculating these for every reference is computationally expensive, so we recommend creating separate tables in these cases, but we don't have any way to enforce this. Fixing this bug would save a good number of compute hours!
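For what it's worth, a minimal sketch of the workaround mentioned above (table and column names are hypothetical): materializing the CTE into a real table forces a single evaluation, so the union then deduplicates as expected.

{code:sql}
-- Materialize the non-deterministic CTE once (hypothetical names):
CREATE TABLE tmp_t AS SELECT random() AS a;

-- Both branches now read the same stored value, so UNION
-- deduplicates them into a single row:
SELECT * FROM tmp_t
UNION
SELECT * FROM tmp_t;

DROP TABLE tmp_t;
{code}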
[jira] [Commented] (SPARK-26639) The reuse subquery function maybe does not work in SPARK SQL
[ https://issues.apache.org/jira/browse/SPARK-26639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17511358#comment-17511358 ]

Stu commented on SPARK-26639:
-----------------------------

Here's another example of this happening, in Spark 3.1.2. I'm running the following code:

{code:sql}
WITH t AS (SELECT random() AS a)
SELECT * FROM t
UNION
SELECT * FROM t
{code}

The CTE uses a non-deterministic function. If it were pre-calculated, the same random value would be chosen for `a` in both unioned queries, and the output would be deduplicated into a single record. That is not the case: the output is two records with different random values.
[jira] [Commented] (SPARK-23599) The UUID() expression is too non-deterministic
[ https://issues.apache.org/jira/browse/SPARK-23599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17466961#comment-17466961 ]

Stu commented on SPARK-23599:
-----------------------------

We have encountered this problem with Spark 3.1.2, resulting in duplicate values in a situation where a Spark executor died. As suggested in the description, this error was hard to track down and difficult to replicate.

> The UUID() expression is too non-deterministic
>
> Key: SPARK-23599
> URL: https://issues.apache.org/jira/browse/SPARK-23599
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.0
> Reporter: Herman van Hövell
> Assignee: L. C. Hsieh
> Priority: Critical
> Fix For: 2.3.1, 2.4.0
>
> The current {{Uuid()}} expression uses {{java.util.UUID.randomUUID}} for UUID
> generation. There are a couple of major problems with this:
> - It is non-deterministic across task retries. This breaks Spark's processing
> model and will lead to very-hard-to-trace bugs, like non-deterministic
> shuffles, duplicates, and missing rows.
> - It uses a single secure random for UUID generation. This takes a single
> JVM-wide lock, which can lead to lock contention and other performance
> problems.
> We should move to something that is deterministic between retries. This can
> be done by using seeded PRNGs for which we set the seed during planning. It
> is important to use a PRNG that provides enough entropy to create a proper
> UUID.
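The fix direction described above (seed the PRNG during planning) is roughly how Spark SQL's {{rand}} already behaves; a rough illustration, assuming an arbitrary fixed seed of 42:

{code:sql}
-- rand(seed) is deterministic for a given seed and row position, so a
-- retried task regenerates the same values. The old uuid(), built on
-- java.util.UUID.randomUUID, had no such seed and produced fresh
-- values on every retry.
SELECT rand(42) AS stable_random;
{code}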
[jira] [Commented] (SPARK-33883) Can repeat "where" twice without error in spark sql
[ https://issues.apache.org/jira/browse/SPARK-33883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17255603#comment-17255603 ]

Stu commented on SPARK-33883:
-----------------------------

That makes so much sense, thanks!

> Can repeat "where" twice without error in spark sql
>
> Key: SPARK-33883
> URL: https://issues.apache.org/jira/browse/SPARK-33883
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.3
> Reporter: Stu
> Priority: Minor
> Attachments: image-2020-12-28-18-24-18-395.png,
> image-2020-12-28-18-32-25-960.png
>
> The following SQL code works, despite having bad syntax ("where" is mentioned
> twice):
> {code:sql}
> select * from table
> where where field is not null
> {code}
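For readers of the archive: my understanding (an assumption about the parser, not confirmed in this thread) is that {{where}} is a non-reserved keyword in Spark's default parsing mode, so the first {{where}} is consumed as a table alias and the query is parsed as if it were:

{code:sql}
-- Assumed equivalent parse: the first "where" becomes a table alias,
-- and only the second one starts the WHERE clause.
select * from table as where
where field is not null
{code}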
[jira] [Updated] (SPARK-33883) Can repeat "where" twice without error in spark sql
[ https://issues.apache.org/jira/browse/SPARK-33883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stu updated SPARK-33883:
------------------------
Description:
The following SQL code works, despite having bad syntax ("where" is mentioned twice):
{code:sql}
select * from table
where where field is not null
{code}
[jira] [Created] (SPARK-33883) Can repeat "where" twice without error in spark sql
Stu created SPARK-33883:
------------------------

Summary: Can repeat "where" twice without error in spark sql
Key: SPARK-33883
URL: https://issues.apache.org/jira/browse/SPARK-33883
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.4.3
Reporter: Stu

The following SQL code works, despite having bad syntax ("where" is mentioned twice):

{code:sql}
select * from table
where where field is not null
{code}