[ https://issues.apache.org/jira/browse/SPARK-31299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Bryan Cutler resolved SPARK-31299.
----------------------------------
    Resolution: Not A Problem

> Pyspark.ml.clustering IllegalArgumentException with dataframe created from rows
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-31299
>                 URL: https://issues.apache.org/jira/browse/SPARK-31299
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, PySpark
>    Affects Versions: 2.4.0
>            Reporter: Lukas Thaler
>            Priority: Major
>
> I hope this is the right place and way to report a bug in (at least) the PySpark API.
> BisectingKMeans in the following example is only exemplary; the error occurs with all clustering algorithms:
> {code:java}
> from pyspark.sql import Row
> from pyspark.mllib.linalg import DenseVector
> from pyspark.ml.clustering import BisectingKMeans
>
> data = spark.createDataFrame(
>     [Row(test_features=DenseVector([43.0, 0.0, 200.0, 1.0, 1.0, 1.0, 0.0, 3.0])),
>      Row(test_features=DenseVector([44.0, 0.0, 250.0, 1.0, 1.0, 1.0, 0.0, 1.0])),
>      Row(test_features=DenseVector([23.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0])),
>      Row(test_features=DenseVector([25.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 2.0])),
>      Row(test_features=DenseVector([19.0, 0.0, 200.0, 1.0, 0.0, 1.0, 0.0, 1.0]))])
> kmeans = BisectingKMeans(featuresCol='test_features').setK(4).setSeed(1)
> model = kmeans.fit(data)
> {code}
> The .fit call in the last line fails with the following error:
> {code:java}
> Py4JJavaError: An error occurred while calling o51.fit.
> : java.lang.IllegalArgumentException: requirement failed: Column test_features must be of type equal to one of the following types: [struct<type:tinyint,size:int,indices:array<int>,values:array<double>>, array<double>, array<float>] but was actually of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>>.
> {code}
> As can be seen, the data type reported as passed to the function is the first data type in the list of allowed types, yet the call still fails because of it.
> See my [StackOverflow question|https://stackoverflow.com/questions/60884142/pyspark-py4j-illegalargumentexception-with-spark-createdataframe-and-pyspark-ml] for more context.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org