[jira] [Updated] (SPARK-45216) Fix non-deterministic seeded Dataset APIs

Peter Toth (Jira) Tue, 19 Sep 2023 06:50:46 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-45216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Peter Toth updated SPARK-45216:
-------------------------------
    Description: 
If we run the following example, the result is two equal columns, as expected:

{noformat}
val c = rand()
df.select(c, c)

+--------------------------+--------------------------+
|rand(-4522010140232537566)|rand(-4522010140232537566)|
+--------------------------+--------------------------+
|        0.4520819282997137|        0.4520819282997137|
+--------------------------+--------------------------+
{noformat}

 
But if we use other, similar APIs, the paired columns differ, which is incorrect:

{noformat}
val r1 = random()
val r2 = uuid()
val r3 = shuffle(col("x"))
val x = df.select(r1, r1, r2, r2, r3, r3)

+------------------+------------------+--------------------+--------------------+----------+----------+
|            rand()|            rand()|              uuid()|              uuid()|shuffle(x)|shuffle(x)|
+------------------+------------------+--------------------+--------------------+----------+----------+
|0.7407604956381952|0.7957319451135009|e55bc4b0-74e6-4b0...|a587163b-d06b-4bb...| [1, 2, 3]| [2, 1, 3]|
+------------------+------------------+--------------------+--------------------+----------+----------+
{noformat}
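
The column headers above hint at the difference: the first example prints rand(-4522010140232537566), so a seed is already attached to the column, while the second prints rand() with no seed. One plausible reading, which the description does not confirm, is that the seed in the failing APIs is chosen per reference rather than once per column object. The sketch below models that idea in plain Scala without Spark; SeededExpr, UnseededExpr and SeedDemo are hypothetical names used only for illustration and are not Spark classes.

```scala
import scala.util.Random

// Hypothetical stand-in for a column whose seed is fixed when it is created.
final case class SeededExpr(seed: Long) {
  // Each evaluation builds a fresh generator from the stored seed, so every
  // reference to the same expression produces the same values.
  def eval(rows: Int): Seq[Double] = {
    val rng = new Random(seed)
    Seq.fill(rows)(rng.nextDouble())
  }
}

// Hypothetical stand-in for a column that carries no seed of its own.
final class UnseededExpr {
  // Each evaluation draws a new seed from seedSource, so two references to
  // the same expression are free to produce different values.
  def eval(rows: Int, seedSource: Random): Seq[Double] = {
    val rng = new Random(seedSource.nextLong())
    Seq.fill(rows)(rng.nextDouble())
  }
}

object SeedDemo {
  def main(args: Array[String]): Unit = {
    val seedSource = new Random()

    // Analogue of `val c = rand(); df.select(c, c)`: one seed, shared by both references.
    val c = SeededExpr(seedSource.nextLong())
    println(s"seeded references equal:   ${c.eval(3) == c.eval(3)}")

    // Analogue of `val r1 = random(); df.select(r1, r1)` as reported in this issue.
    val r1 = new UnseededExpr
    println(s"unseeded references equal: ${r1.eval(3, seedSource) == r1.eval(3, seedSource)}")
  }
}
```

In this model the seeded comparison always holds, and the unseeded one fails except in the unlikely case that both references happen to draw the same seed. Under that assumption, a fix would attach the seed when the column is created, as rand() appears to do.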


> Fix non-deterministic seeded Dataset APIs
> -----------------------------------------
>
>                 Key: SPARK-45216
>                 URL: https://issues.apache.org/jira/browse/SPARK-45216
>             Project: Spark
>          Issue Type: Bug
>          Components: Connect, SQL
>    Affects Versions: 4.0.0
>            Reporter: Peter Toth
>            Priority: Major


