[ https://issues.apache.org/jira/browse/SPARK-23535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16380334#comment-16380334 ]

Marco Gaido commented on SPARK-23535:
-------------------------------------

I checked, and each tool handles this case (a constant column, where the minimum 
equals the maximum) in its own way. sklearn behaves as you described. RapidMiner 
returns the min value if the value is less than the max value and the max value 
otherwise (i.e. 0 if v < 1 else 1). MATLAB assumes this case does not occur; when 
it does, it performs no transformation.
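
Just to make the divergence concrete, here is a minimal Python sketch (hypothetical 
code, not taken from any of these tools; the function name and the constant_policy 
parameter are made up) of how a constant column could be rescaled under the 
conventions mentioned above:

{code:java}
# Hypothetical sketch only -- not any tool's actual implementation.
# Rescales a column to [lower, upper]; the interesting part is the
# degenerate case where the column is constant (vmin == vmax).
def rescale_column(values, lower=0.0, upper=1.0, constant_policy="sklearn"):
    vmin, vmax = min(values), max(values)
    if vmin != vmax:
        return [(v - vmin) / (vmax - vmin) * (upper - lower) + lower for v in values]
    if constant_policy == "sklearn":
        # sklearn-like: a zero range is treated as a scale of 1, so every
        # value collapses to the lower bound of the target range (0 by default).
        return [lower for _ in values]
    if constant_policy == "spark":
        # Spark's current behavior: the midpoint of the target range,
        # i.e. 0.5 * (max + min) per the MinMaxScaler docs.
        return [0.5 * (lower + upper) for _ in values]
    if constant_policy == "rapidminer":
        # RapidMiner-like (per the description above): values equal to the
        # column max map to the upper bound, so a constant column becomes all 1s.
        return [upper if v >= vmax else lower for v in values]
    # MATLAB-like: leave the column untransformed.
    return list(values)

print(rescale_column([0.0, 0.0, 0.0], constant_policy="sklearn"))     # [0.0, 0.0, 0.0]
print(rescale_column([0.0, 0.0, 0.0], constant_policy="spark"))       # [0.5, 0.5, 0.5]
print(rescale_column([0.0, 0.0, 0.0], constant_policy="rapidminer"))  # [1.0, 1.0, 1.0]
print(rescale_column([0.0, 0.0, 0.0], constant_policy="matlab"))      # [0.0, 0.0, 0.0]
{code}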

I am not sure whether Spark should strictly mirror sklearn's behavior, since this 
case is not handled in a standard way across tools. What do you think 
[~mlnick] [~srowen] [~josephkb]?

> MinMaxScaler returns 0.5 for an all-zero column
> -----------------------------------------------
>
>                 Key: SPARK-23535
>                 URL: https://issues.apache.org/jira/browse/SPARK-23535
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.0.0
>            Reporter: Yigal Weinberger
>            Priority: Major
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> When applying MinMaxScaler to a column that contains only 0, the output is 0.5 
> for the whole column. 
> This is inconsistent with the sklearn implementation.
>  
> Steps to reproduce:
>  
>  
> {code:java}
> from pyspark.ml.feature import MinMaxScaler
> from pyspark.ml.linalg import Vectors
>
> # The third feature is identically zero to trigger the issue.
> dataFrame = spark.createDataFrame([
>     (0, Vectors.dense([1.0, 0.1, 0.0]),),
>     (1, Vectors.dense([2.0, 1.1, 0.0]),),
>     (2, Vectors.dense([3.0, 10.1, 0.0]),)
> ], ["id", "features"])
> scaler = MinMaxScaler(inputCol="features", outputCol="scaledFeatures")
>
> # Compute summary statistics and generate the MinMaxScalerModel
> scalerModel = scaler.fit(dataFrame)
>
> # Rescale each feature to the range [min, max]
> scaledData = scalerModel.transform(dataFrame)
> print("Features scaled to range: [%f, %f]" % (scaler.getMin(), scaler.getMax()))
> scaledData.select("features", "scaledFeatures").show()
> {code}
> Features scaled to range: [0.000000, 1.000000]
> +--------------+---------------+
> |      features| scaledFeatures|
> +--------------+---------------+
> | [1.0,0.1,0.0]|[0.0,0.0,*0.5*]|
> | [2.0,1.1,0.0]|[0.5,0.1,*0.5*]|
> |[3.0,10.1,0.0]|[1.0,1.0,*0.5*]|
> +--------------+---------------+
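>
> For reference, Spark's MinMaxScaler documentation states that when E_max == E_min 
> the rescaled value is 0.5 * (max + min), i.e. the midpoint of the target range 
> (0.5 for the default [0, 1]), which is where the constant *0.5* column above 
> comes from.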
>  
> VS.
> {code:java}
> import numpy as np
> from sklearn.preprocessing import MinMaxScaler
>
> mms = MinMaxScaler(copy=False)
> test = np.array([[1.0, 0.1, 0], [2.0, 1.1, 0], [3.0, 10.1, 0]])
> print(mms.fit_transform(test))
> {code}
>  
> Output:
> [[ 0.   0.   *0.* ]
>  [ 0.5  0.1  *0.* ]
>  [ 1.   1.   *0.* ]]
>  
>  


