[ 
https://issues.apache.org/jira/browse/SPARK-31299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-31299.
----------------------------------
    Resolution: Not A Problem

> Pyspark.ml.clustering illegalArgumentException with dataframe created from 
> rows
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-31299
>                 URL: https://issues.apache.org/jira/browse/SPARK-31299
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, PySpark
>    Affects Versions: 2.4.0
>            Reporter: Lukas Thaler
>            Priority: Major
>
> I hope this is the right place and way to report a bug in (at least) the 
> PySpark API. BisectingKMeans in the following example is only illustrative; 
> the error occurs with all clustering algorithms:
> {code:java}
> from pyspark.sql import Row
> from pyspark.mllib.linalg import DenseVector
> from pyspark.ml.clustering import BisectingKMeans
>
> data = spark.createDataFrame([
>     Row(test_features=DenseVector([43.0, 0.0, 200.0, 1.0, 1.0, 1.0, 0.0, 3.0])),
>     Row(test_features=DenseVector([44.0, 0.0, 250.0, 1.0, 1.0, 1.0, 0.0, 1.0])),
>     Row(test_features=DenseVector([23.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0])),
>     Row(test_features=DenseVector([25.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 2.0])),
>     Row(test_features=DenseVector([19.0, 0.0, 200.0, 1.0, 0.0, 1.0, 0.0, 1.0]))])
> kmeans = BisectingKMeans(featuresCol='test_features').setK(4).setSeed(1)
> model = kmeans.fit(data)
> {code}
> The .fit call in the last line fails with the following error:
> {code:java}
> Py4JJavaError: An error occurred while calling o51.fit.
> : java.lang.IllegalArgumentException: requirement failed: Column 
> test_features must be of type equal to one of the following types: 
> [struct<type:tinyint,size:int,indices:array<int>,values:array<double>>, 
> array<double>, array<float>] but was actually of type 
> struct<type:tinyint,size:int,indices:array<int>,values:array<double>>.
> {code}
> As can be seen, the reported input type is the first entry in the list of 
> allowed types, yet the call still fails because of it.
> See my [StackOverflow 
> issue|https://stackoverflow.com/questions/60884142/pyspark-py4j-illegalargumentexception-with-spark-createdataframe-and-pyspark-ml]
>  for more context.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
