Hey everyone,

Consider the following use of spark.sql.shuffle.partitions:

import sqlContext.implicits._  // needed for .toDF in the spark-shell

// Two random 9-digit string columns.
case class Data(
    A: String = f"${(math.random * 1e8).toLong}%09d",
    B: String = f"${(math.random * 1e8).toLong}%09d")

val dataFrame = (1 to 1000).map(_ => Data()).toDF
dataFrame.registerTempTable("data")

sqlContext.setConf("spark.sql.shuffle.partitions", "10")
val a = sqlContext.sql("SELECT * FROM data CLUSTER BY A")

sqlContext.setConf("spark.sql.shuffle.partitions", "20")
val b = sqlContext.sql("SELECT * FROM data CLUSTER BY A")

a.rdd.partitions.size
b.rdd.partitions.size

Expected result:
10
20
Actual result:
20
20

The problem is that the way a DataFrame is evaluated currently depends on
global mutable state: spark.sql.shuffle.partitions is only read when the
query is actually planned and executed, i.e. when an action runs, so
multiple actions on the same DataFrame may behave nondeterministically if
the setting changes in between.
Furthermore, imagine a job that contains multiple shuffles, each of which
might require a different setting. Due to laziness it is very hard to
determine at which point the configuration is actually read, and hence
where to place the setConf call.
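
To make the laziness point concrete (same spark-shell session as above):
even a setConf placed directly before the DataFrame definition is not the
one that ends up being used, because the shuffle is only planned once an
action runs.

sqlContext.setConf("spark.sql.shuffle.partitions", "10")
val clustered = sqlContext.sql("SELECT * FROM data CLUSTER BY A")

// ... unrelated code elsewhere in the job changes the global setting ...
sqlContext.setConf("spark.sql.shuffle.partitions", "500")

// The shuffle is planned here, so 500 is used, not the 10 that was in
// effect where `clustered` was defined.
clustered.rdd.partitions.size  // 500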

Instead, I propose that all relevant configuration parameters be captured
"in the lineage" at the point where a DataFrame is created, so that every
later action on that DataFrame deterministically uses the configuration
that was in effect when it was defined.
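
A minimal sketch of what that behaviour could look like, expressed with
today's API (the helper name is hypothetical, not an existing API, and
repartition(n) only pins the partition count rather than reproducing the
CLUSTER BY semantics):

// Hypothetical helper: read the setting eagerly at definition time and
// bake it into the DataFrame's own plan via an explicit repartition, so
// later setConf calls can no longer affect it.
def withCapturedShufflePartitions(query: String) = {
  val n = sqlContext.getConf("spark.sql.shuffle.partitions").toInt
  sqlContext.sql(query).repartition(n)
}

sqlContext.setConf("spark.sql.shuffle.partitions", "10")
val a2 = withCapturedShufflePartitions("SELECT * FROM data")

sqlContext.setConf("spark.sql.shuffle.partitions", "20")
val b2 = withCapturedShufflePartitions("SELECT * FROM data")

a2.rdd.partitions.size  // 10
b2.rdd.partitions.size  // 20

The proposal is essentially that the SQL shuffles themselves (CLUSTER BY,
GROUP BY, joins, ...) snapshot spark.sql.shuffle.partitions into the plan
in the same way, at the point where the DataFrame is constructed.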

Best regards,
Daniel