[ https://issues.apache.org/jira/browse/SPARK-31400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Junpei Zhou updated SPARK-31400:
--------------------------------
Description:
h2. Bug Description

The `catalogString` is not detailed enough to distinguish pyspark.ml.linalg.Vectors from pyspark.mllib.linalg.Vectors.

h2. How to reproduce the bug

[Here|https://spark.apache.org/docs/latest/ml-features#minmaxscaler] is an example from the official documentation (Python code). Keep all other lines untouched and change only the Vectors import line:
{code:java}
# from pyspark.ml.linalg import Vectors
from pyspark.mllib.linalg import Vectors
{code}
Or directly execute the following code snippet:
{code:java}
from pyspark.ml.feature import MinMaxScaler
# from pyspark.ml.linalg import Vectors
from pyspark.mllib.linalg import Vectors

dataFrame = spark.createDataFrame([
    (0, Vectors.dense([1.0, 0.1, -1.0]),),
    (1, Vectors.dense([2.0, 1.1, 1.0]),),
    (2, Vectors.dense([3.0, 10.1, 3.0]),)
], ["id", "features"])

scaler = MinMaxScaler(inputCol="features", outputCol="scaledFeatures")
scalerModel = scaler.fit(dataFrame)
{code}
It raises an error:
{code:java}
IllegalArgumentException: 'requirement failed: Column features must be of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>> but was actually struct<type:tinyint,size:int,indices:array<int>,values:array<double>>.'
{code}
However, the actual struct and the required struct render as exactly the same string, so the message gives the programmer no useful information. I would suggest making the catalogString distinguish pyspark.ml.linalg.Vectors from pyspark.mllib.linalg.Vectors. Thanks!
> The catalogString doesn't distinguish Vectors in ml and mllib
> -------------------------------------------------------------
>
> Key: SPARK-31400
> URL: https://issues.apache.org/jira/browse/SPARK-31400
> Project: Spark
> Issue Type: Bug
> Components: ML, MLlib
> Affects Versions: 2.4.5
> Environment: Ubuntu 16.04
> Reporter: Junpei Zhou
> Priority: Major
--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org