[ https://issues.apache.org/jira/browse/SPARK-40232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean R. Owen resolved SPARK-40232.
----------------------------------
    Resolution: Not A Problem

No, initSteps controls an aspect of the initialization; I don't think you want to change it. I would expect potentially different results with different seeds and initializations. Maybe not really different results, but I don't know whether your maxIter is high enough or whether the comparison to sklearn is apples to apples. Too many variables.

> KMeans: high variability in results despite high initSteps parameter value
> --------------------------------------------------------------------------
>
>                 Key: SPARK-40232
>                 URL: https://issues.apache.org/jira/browse/SPARK-40232
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, PySpark
>    Affects Versions: 3.3.0
>            Reporter: Patryk Piekarski
>            Priority: Major
>         Attachments: sample_data.csv
>
> I'm running KMeans on a sample dataset using PySpark. I want the results to be fairly stable, so I experimented with the _initSteps_ parameter. My understanding is that the higher the number of steps for the k-means|| initialization mode, the more initializations the algorithm tries, selecting the best model at the end. That is the behavior I observe when running the sklearn implementation with _n_init_ >= 10. However, with the PySpark implementation, regardless of the number of partitions of the underlying data frame (tested with 1, 4, and 8 partitions), and even with _initSteps_ set to 10, 50, or 500, the results I get with different seeds differ, and the trainingCost value is sometimes far from the lowest.
> As a workaround, to force the algorithm to iterate and select the best model, I used a loop with a dynamic seed.
> Sklearn reaches a trainingCost near 276655 in every iteration.
> The PySpark implementation of KMeans gets there in the 2nd, 5th, and 6th iterations, but all the remaining iterations yield higher values.
> Does the _initSteps_ parameter work as expected? My findings suggest that something might be off here.
> Let me know where I could upload this sample dataset (2MB)
>
> {code:python}
> import pandas as pd
> from sklearn.cluster import KMeans as KMeansSKlearn
>
> df = pd.read_csv('sample_data.csv')
> for i in range(1, 10):
>     kmeans = KMeansSKlearn(init="k-means++", n_clusters=5, n_init=10, random_state=i)
>     model = kmeans.fit(df)
>     print(f'Sklearn iteration {i}: {round(model.inertia_)}')
>
> from pyspark.sql import SparkSession
> from pyspark.ml.clustering import KMeans
> from pyspark.ml.feature import VectorAssembler
>
> spark = SparkSession.builder \
>     .appName("kmeans-test") \
>     .config('spark.driver.memory', '2g') \
>     .master("local[2]") \
>     .getOrCreate()
> df1 = spark.createDataFrame(df)
> assemble = VectorAssembler(inputCols=df1.columns, outputCol='features')
> assembled_data = assemble.transform(df1)
> for i in range(1, 10):
>     kmeans = KMeans(featuresCol='features', k=5, initSteps=100, maxIter=300, seed=i, tol=0.0001)
>     model = kmeans.fit(assembled_data)
>     summary = model.summary
>     print(f'PySpark iteration {i}: {round(summary.trainingCost)}')
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
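The workaround the reporter describes — fit several times with different seeds and keep the lowest-cost model, i.e. the manual equivalent of sklearn's n_init restarts, which is distinct from initSteps (the number of k-means|| sampling rounds within a single initialization) — can be sketched without Spark. The toy 1-D `kmeans_1d` and `best_of_n` helpers below are hypothetical names in a minimal pure-Python stand-in, not Spark's or sklearn's implementation:

```python
import random

def kmeans_1d(points, k, seed, max_iter=50):
    """Toy 1-D Lloyd's k-means with randomly sampled initial centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(max_iter):
        # Assign each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: (p - centers[c]) ** 2)
            clusters[nearest].append(p)
        # Recompute centers; keep the old center if a cluster went empty.
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:  # converged
            break
        centers = new_centers
    cost = sum(min((p - c) ** 2 for c in centers) for p in points)
    return centers, cost

def best_of_n(points, k, n_restarts):
    # Manual restarts: run the whole fit once per seed and keep the
    # lowest-cost model -- what sklearn's n_init does internally, and
    # what the reporter's dynamic-seed loop does for PySpark.
    return min((kmeans_1d(points, k, seed) for seed in range(n_restarts)),
               key=lambda result: result[1])

data = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 10.0, 10.1, 10.2]
best_centers, best_cost = best_of_n(data, k=3, n_restarts=20)
```

By construction, the best-of-N cost is never worse than any single run's cost, which is why the loop stabilizes results where a single seeded fit does not.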