[jira] [Commented] (SPARK-26639) The reuse subquery function maybe does not work in SPARK SQL

2022-03-25 Thread Stu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17512501#comment-17512501
 ] 

Stu commented on SPARK-26639:
-

Ah, thanks for sharing that [~petertoth] !

> The reuse subquery function maybe does not work in SPARK SQL
> 
>
> Key: SPARK-26639
> URL: https://issues.apache.org/jira/browse/SPARK-26639
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Ke Jia
>Priority: Major
>
> The subquery reuse feature was implemented in 
> [https://github.com/apache/spark/pull/14548]
> In my test, the visualized plan shows the subquery being executed once, but 
> the stage for the same subquery may be executed more than once.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26639) The reuse subquery function maybe does not work in SPARK SQL

2022-03-24 Thread Stu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17511358#comment-17511358
 ] 

Stu edited comment on SPARK-26639 at 3/24/22, 10:13 PM:


Here's another example of this happening, in Spark 3.1.2. I'm running the 
following code:
{code:sql}
WITH t AS (
  SELECT random() AS a
)
SELECT * FROM t
UNION
SELECT * FROM t
{code}
The CTE calls a non-deterministic function. If the CTE were computed once and 
reused, both branches of the union would see the same random value for `a`, 
and UNION would deduplicate the output into a single record.

That is not what happens: the output is two records with different random 
values.
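
The difference between the two behaviours can be sketched outside Spark. The 
following is only a plain-Python analogy, not Spark code: `cte`, 
`union_inlined` and `union_reused` are hypothetical names, and a Python set 
stands in for the deduplication that UNION performs.

```python
import random


def cte(rng):
    """Stand-in for the CTE body `SELECT random() AS a`: returns one row."""
    return [(rng.random(),)]


def union_inlined(rng):
    # The CTE body is evaluated once per reference, which matches the
    # behaviour reported in this comment.
    return set(cte(rng)) | set(cte(rng))


def union_reused(rng):
    # The CTE body is evaluated once and both references read the same rows.
    rows = cte(rng)
    return set(rows) | set(rows)


# A fixed seed keeps the sketch reproducible; it plays no role in Spark.
inlined = union_inlined(random.Random(0))
reused = union_reused(random.Random(0))
print(len(inlined), len(reused))
```

With per-reference evaluation the union keeps two distinct rows; with a 
single evaluation the two branches are identical and collapse to one row.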

On our platform, some users write complex CTEs and reference them multiple 
times. Recomputing a CTE for every reference is computationally expensive, so 
we recommend creating separate tables in these cases, but we have no way to 
enforce this. Fixing this bug would save a good number of compute hours!






[jira] [Commented] (SPARK-26639) The reuse subquery function maybe does not work in SPARK SQL

2022-03-23 Thread Stu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17511358#comment-17511358
 ] 

Stu commented on SPARK-26639:
-

Here's another example of this happening, in Spark 3.1.2. I'm running the 
following code:
{code:sql}
WITH t AS (
  SELECT random() AS a
)
SELECT * FROM t
UNION
SELECT * FROM t
{code}
The CTE calls a non-deterministic function. If the CTE were computed once and 
reused, both branches of the union would see the same random value for `a`, 
and UNION would deduplicate the output into a single record.

That is not what happens: the output is two records with different random 
values.




[jira] [Commented] (SPARK-23599) The UUID() expression is too non-deterministic

2021-12-30 Thread Stu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17466961#comment-17466961
 ] 

Stu commented on SPARK-23599:
-

We encountered this problem with Spark 3.1.2: duplicate values appeared after 
a Spark executor died. As the description suggests, the error was hard to 
track down and difficult to reproduce.

> The UUID() expression is too non-deterministic
> --
>
> Key: SPARK-23599
> URL: https://issues.apache.org/jira/browse/SPARK-23599
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hövell
>Assignee: L. C. Hsieh
>Priority: Critical
> Fix For: 2.3.1, 2.4.0
>
>
> The current {{Uuid()}} expression uses {{java.util.UUID.randomUUID}} for UUID 
> generation. There are a couple of major problems with this:
> - It is non-deterministic across task retries. This breaks Spark's processing 
> model and will lead to very hard to trace bugs, like non-deterministic 
> shuffles, duplicates and missing rows.
> - It uses a single secure random for UUID generation. This uses a single JVM 
> wide lock, and this can lead to lock contention and other performance 
> problems.
> We should move to something that is deterministic between retries. This can 
> be done by using seeded PRNGs for which we set the seed during planning. It 
> is important here to use a PRNG that provides enough entropy for creating a 
> proper UUID.
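
The seeded-PRNG idea in the description can be illustrated with a minimal 
sketch. This is not the fix that went into Spark; it is a plain-Python 
illustration under assumptions: `partition_uuids`, `plan_seed` and the way the 
seed is combined with the partition index are all hypothetical, and Python's 
default generator stands in for a PRNG with enough entropy for production use.

```python
import random
import uuid


def partition_uuids(plan_seed, partition_index, num_rows):
    """Generate version-4-style UUIDs for one partition from a seeded PRNG."""
    # Hypothetical seeding scheme: a seed fixed at planning time is combined
    # with the partition index, so every partition gets its own stream.
    rng = random.Random(plan_seed * 1_000_003 + partition_index)
    # version=4 makes uuid.UUID overwrite the version and variant bits.
    return [uuid.UUID(int=rng.getrandbits(128), version=4)
            for _ in range(num_rows)]


first_attempt = partition_uuids(plan_seed=42, partition_index=3, num_rows=5)
retry = partition_uuids(plan_seed=42, partition_index=3, num_rows=5)
other_partition = partition_uuids(plan_seed=42, partition_index=4, num_rows=5)

print(first_attempt == retry)            # a retried task reproduces its values
print(first_attempt == other_partition)  # another partition gets other values
```

Because the values depend only on the seed and the partition index, a task 
that is re-run after an executor failure produces the same UUIDs as its first 
attempt, which is the property the description asks for.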






[jira] [Commented] (SPARK-33883) Can repeat "where" twice without error in spark sql

2020-12-28 Thread Stu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17255603#comment-17255603
 ] 

Stu commented on SPARK-33883:
-

that makes so much sense, thanks!

> Can repeat "where" twice without error in spark sql
> ---
>
> Key: SPARK-33883
> URL: https://issues.apache.org/jira/browse/SPARK-33883
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Stu
>Priority: Minor
> Attachments: image-2020-12-28-18-24-18-395.png, 
> image-2020-12-28-18-32-25-960.png
>
>
> The following SQL runs without error despite the invalid syntax ("where" 
> appears twice):
> {code:sql}
> select * from table
> where where field is not null
> {code}






[jira] [Updated] (SPARK-33883) Can repeat "where" twice without error in spark sql

2020-12-22 Thread Stu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stu updated SPARK-33883:

Description: 
The following SQL runs without error despite the invalid syntax ("where" 
appears twice):
{code:sql}
select * from table
where where field is not null
{code}

  was:
the following sql code works, despite having bad syntax (where is mentioned 
twice):
{code:java}
select * from table
where where field is not null{code}





[jira] [Created] (SPARK-33883) Can repeat "where" twice without error in spark sql

2020-12-22 Thread Stu (Jira)
Stu created SPARK-33883:
---

 Summary: Can repeat "where" twice without error in spark sql
 Key: SPARK-33883
 URL: https://issues.apache.org/jira/browse/SPARK-33883
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.3
Reporter: Stu


The following SQL runs without error despite the invalid syntax (where 
appears twice):
{code:sql}
select * from table
where where field is not null
{code}


