[ https://issues.apache.org/jira/browse/SPARK-21679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-21679:
------------------------------
    Priority: Minor  (was: Major)

As a general statement, it's hard to get deterministic behavior out of a distributed implementation. The order in which tasks execute can sometimes matter, and it's not possible to control every RNG used by every library. It might be possible in this particular case.

> KMeans Clustering is Not Deterministic
> --------------------------------------
>
>                 Key: SPARK-21679
>                 URL: https://issues.apache.org/jira/browse/SPARK-21679
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 2.1.0, 2.2.0
>            Reporter: Christoph Brücke
>            Priority: Minor
>
> I’m trying to figure out how to use KMeans in order to achieve reproducible
> results. I have found that running the same KMeans instance on the same data
> with different partitioning will produce different clusterings.
> A simple KMeans run with a fixed seed returns different results on the same
> training data if the training data is partitioned differently.
> Consider the following example. The same KMeans clustering setup is run on
> identical data. The only difference is the partitioning of the training data
> (one partition vs. four partitions).
> {noformat}
> import org.apache.spark.sql.DataFrame
> import org.apache.spark.sql.functions.rand
> import org.apache.spark.ml.clustering.KMeans
> import org.apache.spark.ml.feature.VectorAssembler
> // generate random data for clustering (columns "a" and "b" use fixed seeds)
> val randomData = spark.range(1, 1000).withColumn("a", rand(123)).withColumn("b", rand(321))
> // combine the two columns into a single "features" vector column
> val vecAssembler = new VectorAssembler().setInputCols(Array("a", "b")).setOutputCol("features")
> val data = vecAssembler.transform(randomData)
> // instantiate KMeans with fixed seed
> val kmeans = new KMeans().setK(10).setSeed(9876L)
> // train the model with different partitioning
> val dataWith1Partition = data.repartition(1)
> println("1 Partition: " + kmeans.fit(dataWith1Partition).computeCost(dataWith1Partition))
> val dataWith4Partition = data.repartition(4)
> println("4 Partition: " + kmeans.fit(dataWith4Partition).computeCost(dataWith4Partition))
> {noformat}
> I get the following costs:
> {noformat}
> 1 Partition: 16.028212597888057
> 4 Partition: 16.14758460544976
> {noformat}
> What I want to achieve is that repeated computations of the KMeans clustering
> yield identical results on identical training data, regardless of the
> partitioning.
> Looking through the Spark source code, I guess the cause is the initialization
> method of KMeans, which in turn uses the `takeSample` method, which does not
> seem to be partition-agnostic.
> Is this behaviour expected? Is there anything I could do to achieve
> reproducible results?
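
To illustrate the suspicion raised in the report, that a seeded sampler can still depend on partitioning, here is a minimal sketch in plain Scala with no Spark dependency. It is a simplified model and not Spark's actual `takeSample` code: the object `PartitionDependentSampling`, the helpers `sample` and `split`, and the per-partition seed derivation `seed + idx` are all hypothetical and chosen only for illustration.

```scala
import scala.util.Random

object PartitionDependentSampling {
  // Hypothetical partition-wise sampler: every partition gets its own RNG,
  // seeded from the global seed plus the partition index (an assumption made
  // for this sketch, not taken from the Spark source).
  def sample(partitions: Seq[Seq[Int]], fraction: Double, seed: Long): Seq[Int] =
    partitions.zipWithIndex.flatMap { case (part, idx) =>
      val rng = new Random(seed + idx)
      // Keep each element independently with probability `fraction`.
      part.filter(_ => rng.nextDouble() < fraction)
    }

  // Split the data into at most n contiguous chunks of roughly equal size.
  def split(data: Seq[Int], n: Int): Seq[Seq[Int]] = {
    val size = math.ceil(data.size.toDouble / n).toInt
    data.grouped(size).toVector
  }

  def main(args: Array[String]): Unit = {
    val data = 1 to 1000
    val seed = 9876L

    val onePartition = sample(split(data, 1), 0.5, seed)
    val fourPartitions = sample(split(data, 4), 0.5, seed)

    // Same data and same seed, but the layout changes which RNG draws
    // are paired with which elements.
    println(s"1 partition sample size:  ${onePartition.size}")
    println(s"4 partitions sample size: ${fourPartitions.size}")
    println(s"samples identical: ${onePartition == fourPartitions}")
  }
}
```

In this model the sample is reproducible only while the layout (number of partitions and element order within each) stays fixed. Whether pinning the layout is sufficient to make Spark's KMeans reproducible is not established in this thread.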