[ https://issues.apache.org/jira/browse/SPARK-40232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean R. Owen resolved SPARK-40232.
----------------------------------
    Resolution: Not A Problem

No, initSteps controls an aspect of the initialization; I don't think you want to change it. I would expect potentially different results with different seeds and initializations. Maybe not really different results, but I don't know whether your maxIter is high enough or whether the comparison to sklearn is apples to apples. Too many variables.

> KMeans: high variability in results despite high initSteps parameter value
> --------------------------------------------------------------------------
>
>                 Key: SPARK-40232
>                 URL: https://issues.apache.org/jira/browse/SPARK-40232
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, PySpark
>    Affects Versions: 3.3.0
>            Reporter: Patryk Piekarski
>            Priority: Major
>         Attachments: sample_data.csv
>
> I'm running KMeans on a sample dataset using PySpark. I want the results to be fairly stable, so I experimented with the _initSteps_ parameter. My understanding is that the higher the number of steps for the k-means|| initialization mode, the more initializations the algorithm tries, selecting the best model at the end. That is the behavior I observe when running the sklearn implementation with _n_init_ >= 10. However, with the PySpark implementation, regardless of the number of partitions of the underlying data frame (tested with 1, 4, and 8 partitions), and even with _initSteps_ set to 10, 50, or 500, the results I get with different seeds differ, and the trainingCost value is sometimes far from the lowest.
> As a workaround, to force the algorithm to iterate and select the best model, I used a loop with a dynamic seed.
> Sklearn reaches a trainingCost near 276655 in every iteration.
> The PySpark implementation of KMeans gets there in the 2nd, 5th, and 6th iterations, but all the remaining iterations yield higher values.
> Does the _initSteps_ parameter work as expected? My findings suggest that something might be off here.
> Let me know where I could upload this sample dataset (2MB)
>
> {code:python}
> import pandas as pd
> from sklearn.cluster import KMeans as KMeansSKlearn
>
> df = pd.read_csv('sample_data.csv')
> for i in range(1, 10):
>     kmeans = KMeansSKlearn(init="k-means++", n_clusters=5, n_init=10, random_state=i)
>     model = kmeans.fit(df)
>     print(f'Sklearn iteration {i}: {round(model.inertia_)}')
>
> from pyspark.sql import SparkSession
> from pyspark.ml.clustering import KMeans
> from pyspark.ml.feature import VectorAssembler
>
> spark = SparkSession.builder \
>     .appName("kmeans-test") \
>     .config('spark.driver.memory', '2g') \
>     .master("local[2]") \
>     .getOrCreate()
> df1 = spark.createDataFrame(df)
> assemble = VectorAssembler(inputCols=df1.columns, outputCol='features')
> assembled_data = assemble.transform(df1)
> for i in range(1, 10):
>     kmeans = KMeans(featuresCol='features', k=5, initSteps=100, maxIter=300, seed=i, tol=0.0001)
>     model = kmeans.fit(assembled_data)
>     summary = model.summary
>     print(f'PySpark iteration {i}: {round(summary.trainingCost)}')
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
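The workaround the reporter describes — fit several times with different seeds and keep the lowest-cost model, i.e. the manual equivalent of sklearn's n_init restarts, which is distinct from initSteps (the number of k-means|| sampling rounds within a single initialization) — can be sketched without Spark. The toy 1-D `kmeans_1d` and `best_of_n` helpers below are hypothetical names in a minimal pure-Python stand-in, not Spark's or sklearn's implementation:

```python
import random

def kmeans_1d(points, k, seed, max_iter=50):
    """Toy 1-D Lloyd's k-means with randomly sampled initial centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(max_iter):
        # Assign each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: (p - centers[c]) ** 2)
            clusters[nearest].append(p)
        # Recompute centers; keep the old center if a cluster went empty.
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:  # converged
            break
        centers = new_centers
    cost = sum(min((p - c) ** 2 for c in centers) for p in points)
    return centers, cost

def best_of_n(points, k, n_restarts):
    # Manual restarts: run the whole fit once per seed and keep the
    # lowest-cost model -- what sklearn's n_init does internally, and
    # what the reporter's dynamic-seed loop does for PySpark.
    return min((kmeans_1d(points, k, seed) for seed in range(n_restarts)),
               key=lambda result: result[1])

data = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 10.0, 10.1, 10.2]
best_centers, best_cost = best_of_n(data, k=3, n_restarts=20)
```

By construction, the best-of-N cost is never worse than any single run's cost, which is why the loop stabilizes results where a single seeded fit does not.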