Herman van Hovell created SPARK-23599:
-----------------------------------------

             Summary: The UUID() expression is too non-deterministic
                 Key: SPARK-23599
                 URL: https://issues.apache.org/jira/browse/SPARK-23599
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.3.0
            Reporter: Herman van Hovell


The current {{Uuid()}} expression uses {{java.util.UUID.randomUUID}} for UUID 
generation. There are a couple of major problems with this:
- It is non-deterministic across task retries. This breaks Spark's processing 
model and will lead to bugs that are very hard to trace, such as 
non-deterministic shuffles, duplicates, and missing rows.
- It uses a single shared {{SecureRandom}} for UUID generation. That instance 
sits behind a single JVM-wide lock, which can lead to lock contention and other 
performance problems.

We should move to something that is deterministic between retries. This can be 
done by using seeded PRNGs for which we set the seed during planning. It is 
important here to use a PRNG that provides enough entropy for creating a proper 
UUID.
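
A minimal sketch of the idea, not the proposed implementation: a generator 
that derives version 4 UUIDs from a seeded PRNG, so the same seed always yields 
the same sequence. The class name {{SeededUuidGenerator}} is hypothetical, and 
{{java.util.SplittableRandom}} is only a stand-in PRNG. It is seeded with a 
single 64-bit value, so it probably does not meet the entropy requirement 
mentioned above; a real fix would need a stronger generator.

```java
import java.util.SplittableRandom;
import java.util.UUID;

// Hypothetical helper, for illustration only.
class SeededUuidGenerator {
    // Stand-in PRNG (assumption): deterministic for a given seed, no JVM-wide lock.
    private final SplittableRandom rng;

    SeededUuidGenerator(long seed) {
        this.rng = new SplittableRandom(seed);
    }

    UUID next() {
        long msb = rng.nextLong();
        long lsb = rng.nextLong();
        // Set the version field (bits 12-15 of the high word) to 4.
        msb = (msb & 0xFFFFFFFFFFFF0FFFL) | 0x0000000000004000L;
        // Set the variant field (top two bits of the low word) to binary 10.
        lsb = (lsb & 0x3FFFFFFFFFFFFFFFL) | 0x8000000000000000L;
        return new UUID(msb, lsb);
    }
}
```

With the seed fixed at planning time, a retried task that constructs the 
generator from the same seed replays the same UUIDs. How the seed would be 
varied per partition is left open here.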


