[ https://issues.apache.org/jira/browse/SPARK-31299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bryan Cutler updated SPARK-31299:
---------------------------------
    Description: 
I hope this is the right place and way to report a bug in (at least) the 
PySpark API:

BisectingKMeans in the following example is merely illustrative; the error occurs 
with all clustering algorithms:
{code:python}
from pyspark.sql import Row
from pyspark.mllib.linalg import DenseVector
from pyspark.ml.clustering import BisectingKMeans

data = spark.createDataFrame([Row(test_features=DenseVector([43.0, 0.0, 200.0, 
1.0, 1.0, 1.0, 0.0, 3.0])),
 Row(test_features=DenseVector([44.0, 0.0, 250.0, 1.0, 1.0, 1.0, 0.0, 1.0])),
 Row(test_features=DenseVector([23.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0])),
 Row(test_features=DenseVector([25.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 2.0])),
 Row(test_features=DenseVector([19.0, 0.0, 200.0, 1.0, 0.0, 1.0, 0.0, 1.0]))])

kmeans = BisectingKMeans(featuresCol='test_features').setK(4).setSeed(1)
model = kmeans.fit(data)
{code}
The {{fit()}} call in the last line fails with the following error:
{code:java}
Py4JJavaError: An error occurred while calling o51.fit.
: java.lang.IllegalArgumentException: requirement failed: Column test_features 
must be of type equal to one of the following types: 
[struct<type:tinyint,size:int,indices:array<int>,values:array<double>>, 
array<double>, array<float>] but was actually of type 
struct<type:tinyint,size:int,indices:array<int>,values:array<double>>.
{code}
Note that the actual data type reported for the column is character-for-character 
identical to the first entry in the list of allowed types, yet the call still 
fails because of it.
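A likely explanation (my reading, not stated in the report) is that the example imports {{DenseVector}} from the legacy RDD-based {{pyspark.mllib.linalg}} package while {{BisectingKMeans}} lives in the DataFrame-based {{pyspark.ml}} package. The two vector UDTs serialize to the same SQL struct, which would explain why the error message appears to list the "actual" type among the allowed ones:

{code:python}
# Sketch (assumes a working PySpark installation): the ml and mllib vector
# UDTs share one SQL schema, so the error message prints identical structs
# for the required and the actual type even though the Python UDT classes
# (and hence the type check) differ.
from pyspark.ml.linalg import VectorUDT as MLVectorUDT
from pyspark.mllib.linalg import VectorUDT as MLlibVectorUDT

print(MLVectorUDT().sqlType().simpleString())
print(MLlibVectorUDT().sqlType().simpleString())
# Both lines print the same struct<type:tinyint,...> schema, yet the two
# UDT classes are distinct, so pyspark.ml estimators reject mllib vectors.
{code}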

See my [StackOverflow issue|https://stackoverflow.com/questions/60884142/pyspark-py4j-illegalargumentexception-with-spark-createdataframe-and-pyspark-ml] for more context.
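If the {{pyspark.mllib}} import is indeed the culprit (an assumption on my part), switching the vector import to {{pyspark.ml.linalg}} should make the example fit cleanly:

{code:python}
# Sketch of the likely fix (assumption, not confirmed in this report):
# import DenseVector from pyspark.ml.linalg to match the pyspark.ml estimator.
from pyspark.sql import Row, SparkSession
from pyspark.ml.linalg import DenseVector  # ml, not mllib
from pyspark.ml.clustering import BisectingKMeans

spark = SparkSession.builder.master("local[1]").getOrCreate()

data = spark.createDataFrame([
    Row(test_features=DenseVector([43.0, 0.0, 200.0, 1.0, 1.0, 1.0, 0.0, 3.0])),
    Row(test_features=DenseVector([44.0, 0.0, 250.0, 1.0, 1.0, 1.0, 0.0, 1.0])),
    Row(test_features=DenseVector([23.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0])),
    Row(test_features=DenseVector([25.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 2.0])),
    Row(test_features=DenseVector([19.0, 0.0, 200.0, 1.0, 0.0, 1.0, 0.0, 1.0])),
])

kmeans = BisectingKMeans(featuresCol='test_features').setK(4).setSeed(1)
model = kmeans.fit(data)  # no longer raises IllegalArgumentException
{code}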

  was:
I hope this is the right place and way to report a bug in (at least) the 
PySpark API:

BisectingKMeans in the following example is merely illustrative; the error occurs 
with all clustering algorithms:
{code:java}
from pyspark.sql import Row
from pyspark.mllib.linalg import DenseVector
from pyspark.ml.clustering import BisectingKMeans

data = spark.createDataFrame([Row(test_features=DenseVector([43.0, 0.0, 200.0, 
1.0, 1.0, 1.0, 0.0, 3.0])),
 Row(test_features=DenseVector([44.0, 0.0, 250.0, 1.0, 1.0, 1.0, 0.0, 1.0])),
 Row(test_features=DenseVector([23.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0])),
 Row(test_features=DenseVector([25.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 2.0])),
 Row(test_features=DenseVector([19.0, 0.0, 200.0, 1.0, 0.0, 1.0, 0.0, 1.0]))])

kmeans = BisectingKMeans(featuresCol='test_features').setK(4).setSeed(1)
model = kmeans.fit(data)
{code}
The {{fit()}} call in the last line fails with the following error:
{code:java}
Py4JJavaError: An error occurred while calling o51.fit.
: java.lang.IllegalArgumentException: requirement failed: Column test_features 
must be of type equal to one of the following types: 
[struct<type:tinyint,size:int,indices:array<int>,values:array<double>>, 
array<double>, array<float>] but was actually of type 
struct<type:tinyint,size:int,indices:array<int>,values:array<double>>.
{code}
Note that the actual data type reported for the column is character-for-character 
identical to the first entry in the list of allowed types, yet the call still 
fails because of it.

See my [StackOverflow issue|https://stackoverflow.com/questions/60884142/pyspark-py4j-illegalargumentexception-with-spark-createdataframe-and-pyspark-ml] for more context.


> Pyspark.ml.clustering illegalArgumentException with dataframe created from 
> rows
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-31299
>                 URL: https://issues.apache.org/jira/browse/SPARK-31299
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, PySpark
>    Affects Versions: 2.4.0
>            Reporter: Lukas Thaler
>            Priority: Major
>
> I hope this is the right place and way to report a bug in (at least) the 
> PySpark API:
> BisectingKMeans in the following example is merely illustrative; the error 
> occurs with all clustering algorithms:
> {code:python}
> from pyspark.sql import Row
> from pyspark.mllib.linalg import DenseVector
> from pyspark.ml.clustering import BisectingKMeans
> data = spark.createDataFrame([Row(test_features=DenseVector([43.0, 0.0, 
> 200.0, 1.0, 1.0, 1.0, 0.0, 3.0])),
>  Row(test_features=DenseVector([44.0, 0.0, 250.0, 1.0, 1.0, 1.0, 0.0, 1.0])),
>  Row(test_features=DenseVector([23.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0])),
>  Row(test_features=DenseVector([25.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 2.0])),
>  Row(test_features=DenseVector([19.0, 0.0, 200.0, 1.0, 0.0, 1.0, 0.0, 1.0]))])
> kmeans = BisectingKMeans(featuresCol='test_features').setK(4).setSeed(1)
> model = kmeans.fit(data)
> {code}
> The {{fit()}} call in the last line fails with the following error:
> {code:java}
> Py4JJavaError: An error occurred while calling o51.fit.
> : java.lang.IllegalArgumentException: requirement failed: Column 
> test_features must be of type equal to one of the following types: 
> [struct<type:tinyint,size:int,indices:array<int>,values:array<double>>, 
> array<double>, array<float>] but was actually of type 
> struct<type:tinyint,size:int,indices:array<int>,values:array<double>>.
> {code}
> Note that the actual data type reported for the column is character-for-character 
> identical to the first entry in the list of allowed types, yet the call still 
> fails because of it.
> See my [StackOverflow issue|https://stackoverflow.com/questions/60884142/pyspark-py4j-illegalargumentexception-with-spark-createdataframe-and-pyspark-ml] for more context.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
