[ https://issues.apache.org/jira/browse/SPARK-31400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Junpei Zhou updated SPARK-31400:
--------------------------------
Description:
h2. Bug Description

The `catalogString` is not detailed enough to distinguish pyspark.ml.linalg.Vectors from pyspark.mllib.linalg.Vectors.

h2. How to reproduce the bug

[Here|https://spark.apache.org/docs/latest/ml-features#minmaxscaler] is an example from the official documentation (Python code). Keep all other lines untouched and change only the Vectors import line:
{code:java}
# from pyspark.ml.linalg import Vectors
from pyspark.mllib.linalg import Vectors
{code}
Or directly execute the following code snippet:
{code:java}
from pyspark.ml.feature import MinMaxScaler
# from pyspark.ml.linalg import Vectors
from pyspark.mllib.linalg import Vectors

dataFrame = spark.createDataFrame([
    (0, Vectors.dense([1.0, 0.1, -1.0]),),
    (1, Vectors.dense([2.0, 1.1, 1.0]),),
    (2, Vectors.dense([3.0, 10.1, 3.0]),)
], ["id", "features"])

scaler = MinMaxScaler(inputCol="features", outputCol="scaledFeatures")
scalerModel = scaler.fit(dataFrame)
{code}
It raises an error:
{code:java}
IllegalArgumentException: 'requirement failed: Column features must be of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>> but was actually struct<type:tinyint,size:int,indices:array<int>,values:array<double>>.'
{code}
However, the actual struct and the required struct render as exactly the same string, so the message gives the programmer no useful information. I would suggest making the catalogString distinguish pyspark.ml.linalg.Vectors from pyspark.mllib.linalg.Vectors. Thanks!
> The catalogString doesn't distinguish Vectors in ml and mllib
> -------------------------------------------------------------
>
> Key: SPARK-31400
> URL: https://issues.apache.org/jira/browse/SPARK-31400
> Project: Spark
> Issue Type: Bug
> Components: ML, MLlib
> Affects Versions: 2.4.5
> Environment: Ubuntu 16.04
> Reporter: Junpei Zhou
> Priority: Major
--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org