[jira] [Assigned] (SPARK-23599) The UUID() expression is too non-deterministic

Apache Spark (JIRA) Tue, 13 Mar 2018 22:50:11 -0700

     [ 
https://issues.apache.org/jira/browse/SPARK-23599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Apache Spark reassigned SPARK-23599:
------------------------------------

    Assignee: Apache Spark

> The UUID() expression is too non-deterministic
> ----------------------------------------------
>
>                 Key: SPARK-23599
>                 URL: https://issues.apache.org/jira/browse/SPARK-23599
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Herman van Hovell
>            Assignee: Apache Spark
>            Priority: Critical
>
> The current {{Uuid()}} expression uses {{java.util.UUID.randomUUID}} for UUID 
> generation. There are a couple of major problems with this:
> - It is non-deterministic across task retries. This breaks Spark's processing 
> model and will lead to bugs that are very hard to trace, such as 
> non-deterministic shuffles, duplicates and missing rows.
> - It uses a single secure random for UUID generation. This uses a single 
> JVM-wide lock, which can lead to lock contention and other performance 
> problems.
> We should move to something that is deterministic between retries. This can 
> be done by using seeded PRNGs for which we set the seed during planning. It 
> is important here to use a PRNG that provides enough entropy for creating a 
> proper UUID.
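
A minimal sketch of the seeded approach described above, not Spark's actual
implementation. The class name SeededUuidGenerator, the partitionIndex
parameter, and the choice of java.util.SplittableRandom are assumptions made
for illustration. The idea is that a seed fixed once (for example during
planning) makes a retried task regenerate the same UUID sequence, and a
per-task generator avoids the shared lock behind java.util.UUID.randomUUID.

```java
import java.util.SplittableRandom;
import java.util.UUID;

// Hypothetical helper for illustration only; not part of Spark.
final class SeededUuidGenerator {
    private final SplittableRandom rng;

    // seed: fixed once, e.g. at planning time (assumption).
    // partitionIndex: assumed per-task offset so partitions get distinct streams.
    SeededUuidGenerator(long seed, int partitionIndex) {
        this.rng = new SplittableRandom(seed + partitionIndex);
    }

    // Returns the next UUID in this generator's deterministic sequence.
    UUID next() {
        long msb = rng.nextLong();
        long lsb = rng.nextLong();
        // Set the version field (bits 12-15 of the high word) to 4.
        msb = (msb & 0xFFFFFFFFFFFF0FFFL) | 0x0000000000004000L;
        // Set the variant field (top two bits of the low word) to binary 10.
        lsb = (lsb & 0x3FFFFFFFFFFFFFFFL) | 0x8000000000000000L;
        return new UUID(msb, lsb);
    }
}
```

Two generators built with the same seed and partition index return the same
sequence, which is the property a task retry needs. SplittableRandom is seeded
from a single 64-bit value, so a real implementation would likely want a
generator with a larger internal state to address the entropy concern raised
above.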



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
