Is RAND() in SparkSQL deterministic when used on MySql data sources?

Gabriele Del Prete Thu, 12 Jan 2017 13:38:21 -0800

Hi all,

We need to use the rand(<seed>) function in Scala Spark SQL in our
application, but we discovered that its behavior is not deterministic: that
is, different invocations with the same <seed> can produce different
values. This is documented in some bug reports, for example
https://issues.apache.org/jira/browse/SPARK-13333, and it has to do with
partitioning.
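
To illustrate what I mean by "has to do with partitioning", here is a
minimal sketch in plain Scala (no Spark dependency). It assumes, as I
understand the implementation, that each partition gets its own generator
seeded from the user seed plus the partition index, and that rows draw
values in the order they are read. RandPartitionSketch and randColumn are
hypothetical names, and scala.util.Random is only a stand-in for whatever
generator Spark actually uses.

```scala
import scala.util.Random

object RandPartitionSketch {
  // Hypothetical model, not Spark code: one generator per partition,
  // seeded with (seed + partitionIndex); each row takes the next draw.
  def randColumn(partitions: Seq[Seq[String]], seed: Long): Map[String, Double] =
    partitions.zipWithIndex.flatMap { case (rows, partitionIndex) =>
      val rng = new Random(seed + partitionIndex)
      rows.map(row => row -> rng.nextDouble())
    }.toMap

  def main(args: Array[String]): Unit = {
    val rows = List("a", "b", "c", "d")
    val seed = 42L

    // Same rows, same seed, two different partition layouts.
    val onePartition  = randColumn(List(rows), seed)
    val twoPartitions = randColumn(rows.grouped(2).toList, seed)

    // Same layout and same row order: the values repeat exactly.
    println(onePartition == randColumn(List(rows), seed))

    // Row "c" is the third draw of partition 0 in one layout and the
    // first draw of partition 1 in the other, so its value is expected
    // to differ even though the seed is the same.
    println(onePartition("c") == twoPartitions("c"))
  }
}
```

Under this model the value a row receives depends on which partition it
lands in and on its position within that partition, so the same seed only
reproduces the same values if both of those stay fixed.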


So we refactored it by moving the rand() call from a query that uses Parquet
files on S3 as a datasource to another query that we run on MySQL (still
using the Spark SQL Scala API), assuming that MySQL queries do not get
parallelized. Can we indeed safely assume that rand(<seed>) will now be
deterministic, or does the source of the non-deterministic behavior lie in
the Spark SQL engine rather than in the specific datasource?

Gabriele




