[ https://issues.apache.org/jira/browse/SPARK-31400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Junpei Zhou updated SPARK-31400:
Description:
h2. Bug Description
The `catalogString` is not detailed enough to distinguish
pyspark.ml.linalg.Vectors from pyspark.mllib.linalg.Vectors.
h2. How to reproduce the bug
[Here|https://spark.apache.org/docs/latest/ml-features#minmaxscaler] is an
example from the official documentation (Python code). Keep all other lines
untouched and change only the Vectors import line, i.e.:
{code:python}
# from pyspark.ml.linalg import Vectors
from pyspark.mllib.linalg import Vectors
{code}
Or you can directly execute the following self-contained snippet (in the
pyspark shell the SparkSession already exists; the builder call below just
makes it runnable as a plain script):
{code:python}
from pyspark.sql import SparkSession
from pyspark.ml.feature import MinMaxScaler
# from pyspark.ml.linalg import Vectors
from pyspark.mllib.linalg import Vectors

# Reuse the active session if there is one, otherwise create it.
spark = SparkSession.builder.getOrCreate()

dataFrame = spark.createDataFrame([
    (0, Vectors.dense([1.0, 0.1, -1.0]),),
    (1, Vectors.dense([2.0, 1.1, 1.0]),),
    (2, Vectors.dense([3.0, 10.1, 3.0]),)
], ["id", "features"])

scaler = MinMaxScaler(inputCol="features", outputCol="scaledFeatures")

# Fails because the column carries the mllib VectorUDT, not the ml one.
scalerModel = scaler.fit(dataFrame)
{code}
It will raise an error:
{code}
IllegalArgumentException: 'requirement failed: Column features must be of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>> but was actually struct<type:tinyint,size:int,indices:array<int>,values:array<double>>.'
{code}
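As an aside, the fit does succeed once the column is converted to the ml representation. A minimal workaround sketch, assuming the ml Vectors were actually intended:
{code:python}
from pyspark.mllib.util import MLUtils

# Convert the mllib-style vector column into the ml.linalg representation
# that MinMaxScaler expects, then fit on the converted DataFrame.
converted = MLUtils.convertVectorColumnsToML(dataFrame, "features")
scalerModel = scaler.fit(converted)
{code}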
The real issue, however, is that the required struct and the actual struct
render as exactly the same string, which gives the programmer no useful
information. I would suggest making the catalogString distinguish
pyspark.ml.linalg.Vectors from pyspark.mllib.linalg.Vectors.
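For reference, the two column types really do differ; they just render to the same catalogString. A small sketch showing this from PySpark (output abbreviated):
{code:python}
from pyspark.ml.linalg import VectorUDT as MLVectorUDT
from pyspark.mllib.linalg import VectorUDT as MLlibVectorUDT

# Both UDTs map to the same underlying sqlType, which is why their
# catalogString renders identically in the error above.
print(MLVectorUDT().sqlType() == MLlibVectorUDT().sqlType())  # True

# The JSON representation still records the backing classes, so the two
# types can be told apart there even though the catalogString cannot.
print(MLVectorUDT().json())     # ..."class":"org.apache.spark.ml.linalg.VectorUDT"...
print(MLlibVectorUDT().json())  # ..."class":"org.apache.spark.mllib.linalg.VectorUDT"...
{code}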
Thanks!
> The catalogString doesn't distinguish Vectors in ml and mllib
> -------------------------------------------------------------
>
> Key: SPARK-31400
> URL: https://issues.apache.org/jira/browse/SPARK-31400
> Project: Spark
> Issue Type: Bug
> Components: ML, MLlib
> Affects Versions: 2.4.5
> Environment: Ubuntu 16.04
> Reporter: Junpei Zhou
> Priority: Major
>