[ 
https://issues.apache.org/jira/browse/SPARK-21679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christoph Brücke updated SPARK-21679:
-------------------------------------
    Description: 
I’m trying to figure out how to use KMeans in order to achieve reproducible 
results. I have found that running the same kmeans instance on the same data, 
with different partitioning will produce different clusterings.

Given a simple KMeans run with fixed seed returns different results on the same
training data, if the training data is partitioned differently.

Consider the following example. The same KMeans clustering set up is run on
identical data. The only difference is the partitioning of the training data
(one partition vs. four partitions).

{{import org.apache.spark.sql.DataFrame
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.features.VectorAssembler

// generate random data for clustering
val randomData = spark.range(1, 1000).withColumn("a", 
rand(123)).withColumn("b", rand(321))

val vecAssembler = new VectorAssembler().setInputCols(Array("a", 
"b")).setOutputCol("features")

val data = vecAssembler.transform(randomData)

// instantiate KMeans with fixed seed
val kmeans = new KMeans().setK(10).setSeed(9876L)

// train the model with different partitioning
val dataWith1Partition = data.repartition(1)
println("1 Partition: " + 
kmeans.fit(dataWith1Partition).computeCost(dataWith1Partition))

val dataWith4Partition = data.repartition(4)
println("4 Partition: " + 
kmeans.fit(dataWith4Partition).computeCost(dataWith4Partition))}}

I get the following related cost


{{1 Partition: 16.028212597888057
4 Partition: 16.14758460544976}}


What I want to achieve is that repeated computations of the KMeans Clustering 
should yield identical result on identical training data, regardless of the 
partitioning.

Looking through the Spark source code, I guess the cause is the initialization 
method of KMeans which in turn uses the `takeSample` method, which does not 
seem to be partition agnostic.

Is this behaviour expected? Is there anything I could do to achieve 
reproducible results?

  was:
I’m trying to figure out how to use KMeans in order to achieve reproducible 
results. I have found that running the same kmeans instance on the same data, 
with different partitioning will produce different clusterings.

Given a simple KMeans run with fixed seed returns different results on the same
training data, if the training data is partitioned differently.

Consider the following example. The same KMeans clustering set up is run on
identical data. The only difference is the partitioning of the training data
(one partition vs. four partitions).

{{
import org.apache.spark.sql.DataFrame
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.features.VectorAssembler

// generate random data for clustering
val randomData = spark.range(1, 1000).withColumn("a", 
rand(123)).withColumn("b", rand(321))

val vecAssembler = new VectorAssembler().setInputCols(Array("a", 
"b")).setOutputCol("features")

val data = vecAssembler.transform(randomData)

// instantiate KMeans with fixed seed
val kmeans = new KMeans().setK(10).setSeed(9876L)

// train the model with different partitioning
val dataWith1Partition = data.repartition(1)
println("1 Partition: " + 
kmeans.fit(dataWith1Partition).computeCost(dataWith1Partition))

val dataWith4Partition = data.repartition(4)
println("4 Partition: " + 
kmeans.fit(dataWith4Partition).computeCost(dataWith4Partition))
}}

I get the following related cost


{{
1 Partition: 16.028212597888057
4 Partition: 16.14758460544976
}}


What I want to achieve is that repeated computations of the KMeans Clustering 
should yield identical result on identical training data, regardless of the 
partitioning.

Looking through the Spark source code, I guess the cause is the initialization 
method of KMeans which in turn uses the `takeSample` method, which does not 
seem to be partition agnostic.

Is this behaviour expected? Is there anything I could do to achieve 
reproducible results?


> KMeans Clustering is Not Deterministic
> --------------------------------------
>
>                 Key: SPARK-21679
>                 URL: https://issues.apache.org/jira/browse/SPARK-21679
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.1.0, 2.2.0
>            Reporter: Christoph Brücke
>
> I’m trying to figure out how to use KMeans in order to achieve reproducible 
> results. I have found that running the same kmeans instance on the same data, 
> with different partitioning will produce different clusterings.
> Given a simple KMeans run with fixed seed returns different results on the 
> same
> training data, if the training data is partitioned differently.
> Consider the following example. The same KMeans clustering set up is run on
> identical data. The only difference is the partitioning of the training data
> (one partition vs. four partitions).
> {{import org.apache.spark.sql.DataFrame
> import org.apache.spark.ml.clustering.KMeans
> import org.apache.spark.ml.features.VectorAssembler
> // generate random data for clustering
> val randomData = spark.range(1, 1000).withColumn("a", 
> rand(123)).withColumn("b", rand(321))
> val vecAssembler = new VectorAssembler().setInputCols(Array("a", 
> "b")).setOutputCol("features")
> val data = vecAssembler.transform(randomData)
> // instantiate KMeans with fixed seed
> val kmeans = new KMeans().setK(10).setSeed(9876L)
> // train the model with different partitioning
> val dataWith1Partition = data.repartition(1)
> println("1 Partition: " + 
> kmeans.fit(dataWith1Partition).computeCost(dataWith1Partition))
> val dataWith4Partition = data.repartition(4)
> println("4 Partition: " + 
> kmeans.fit(dataWith4Partition).computeCost(dataWith4Partition))}}
> I get the following related cost
> {{1 Partition: 16.028212597888057
> 4 Partition: 16.14758460544976}}
> What I want to achieve is that repeated computations of the KMeans Clustering 
> should yield identical result on identical training data, regardless of the 
> partitioning.
> Looking through the Spark source code, I guess the cause is the 
> initialization 
> method of KMeans which in turn uses the `takeSample` method, which does not 
> seem to be partition agnostic.
> Is this behaviour expected? Is there anything I could do to achieve 
> reproducible results?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to