[jira] [Assigned] (SPARK-29144) Binarizer handle sparse vectors incorrectly with negative threshold

Sean Owen (Jira) Fri, 20 Sep 2019 17:24:41 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-29144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Sean Owen reassigned SPARK-29144:
---------------------------------

    Assignee: zhengruifeng

> Binarizer handle sparse vectors incorrectly with negative threshold
> -------------------------------------------------------------------
>
>                 Key: SPARK-29144
>                 URL: https://issues.apache.org/jira/browse/SPARK-29144
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0, 2.4.0
>            Reporter: zhengruifeng
>            Assignee: zhengruifeng
>            Priority: Minor
>             Fix For: 3.0.0
>
>
> Binarizer processes sparse vectors incorrectly when threshold < 0:
> {code:scala}
> scala> val data = Seq((0, Vectors.sparse(3, Array(1), Array(0.5))), (1, Vectors.dense(Array(0.0, 0.5, 0.0))))
> data: Seq[(Int, org.apache.spark.ml.linalg.Vector)] = List((0,(3,[1],[0.5])), (1,[0.0,0.5,0.0]))
> scala> val df = data.toDF("id", "feature")
> df: org.apache.spark.sql.DataFrame = [id: int, feature: vector]
> scala> val binarizer: Binarizer = new Binarizer().setInputCol("feature").setOutputCol("binarized_feature").setThreshold(-0.5)
> binarizer: org.apache.spark.ml.feature.Binarizer = binarizer_1c07ac2ae3c8
> scala> binarizer.transform(df).show()
> +---+-------------+-----------------+
> | id|      feature|binarized_feature|
> +---+-------------+-----------------+
> |  0|(3,[1],[0.5])|    [0.0,1.0,0.0]|
> |  1|[0.0,0.5,0.0]|    [1.0,1.0,1.0]|
> +---+-------------+-----------------+
> {code}
> The outputs for the two input vectors above should be identical, since both inputs represent the same vector [0.0, 0.5, 0.0].
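
The mismatch in the table above can be reproduced outside Spark with a small plain-Python sketch. It assumes (this is not stated in the report) that the sparse code path only visits the stored, active entries and leaves every implicit zero at 0.0. The helper names `binarize_dense` and `binarize_active_only` are hypothetical and are not Spark APIs.

```python
def binarize_dense(values, threshold):
    """Binarize every entry: 1.0 if strictly above the threshold, else 0.0."""
    return [1.0 if v > threshold else 0.0 for v in values]


def binarize_active_only(size, indices, values, threshold):
    """Assumed faulty sparse path: only stored (active) entries are compared.

    Implicit zeros are never compared against the threshold, so they stay
    0.0 even when 0.0 > threshold.
    """
    out = [0.0] * size
    for i, v in zip(indices, values):
        if v > threshold:
            out[i] = 1.0
    return out


# Same logical vector [0.0, 0.5, 0.0], in dense and sparse form.
dense_result = binarize_dense([0.0, 0.5, 0.0], -0.5)
sparse_result = binarize_active_only(3, [1], [0.5], -0.5)

print(dense_result)   # [1.0, 1.0, 1.0]
print(sparse_result)  # [0.0, 1.0, 0.0]
```

With a threshold of -0.5 the implicit zeros should map to 1.0, which is what the dense path produces and what the active-only path misses.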
>  
> To handle sparse vectors when threshold < 0, we have two options:
> 1. Return 1 for non-active entries, although this converts sparse vectors to dense ones.
> 2. Throw an exception, as Scikit-Learn's [Binarizer|https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Binarizer.html] does:
> {code:python}
> import numpy as np
> from scipy.sparse import csr_matrix
> from sklearn.preprocessing import Binarizer
> row = np.array([0, 0, 1, 2, 2, 2])
> col = np.array([0, 2, 2, 0, 1, 2])
> data = np.array([1, 2, 3, 4, 5, 6])
> a = csr_matrix((data, (row, col)), shape=(3, 3))
> binarizer = Binarizer(threshold=-1.0)
> binarizer.transform(a)
> Traceback (most recent call last):
>   File "<ipython-input-24-7e12ab26b3ed>", line 1, in <module>
>     binarizer.transform(a)
>   File "/home/zrf/Applications/anaconda3/lib/python3.7/site-packages/sklearn/preprocessing/data.py", line 1874, in transform
>     return binarize(X, threshold=self.threshold, copy=copy)
>   File "/home/zrf/Applications/anaconda3/lib/python3.7/site-packages/sklearn/preprocessing/data.py", line 1774, in binarize
>     raise ValueError('Cannot binarize a sparse matrix with threshold '
> ValueError: Cannot binarize a sparse matrix with threshold < 0
> {code}
>  
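
The two options listed in the description can be compared side by side in a short plain-Python sketch. This is only an illustration of the trade-off, not the fix that was committed to Spark; the function names `binarize_sparse_densify` and `binarize_sparse_strict` and the `(size, indices, values)` tuple layout are assumptions made for the example.

```python
def binarize_sparse_densify(size, indices, values, threshold):
    """Option 1: give non-active entries the value 0.0 would binarize to.

    When threshold < 0 every implicit zero becomes 1.0, so the result is
    effectively dense; it is returned as a plain list here.
    """
    default = 1.0 if 0.0 > threshold else 0.0
    out = [default] * size
    for i, v in zip(indices, values):
        out[i] = 1.0 if v > threshold else 0.0
    return out


def binarize_sparse_strict(size, indices, values, threshold):
    """Option 2: refuse negative thresholds, keeping the output sparse."""
    if threshold < 0:
        raise ValueError("Cannot binarize a sparse vector with threshold < 0")
    # With threshold >= 0, implicit zeros always map to 0.0, so only the
    # active entries above the threshold need to be stored.
    kept = [i for i, v in zip(indices, values) if v > threshold]
    return size, kept, [1.0] * len(kept)


print(binarize_sparse_densify(3, [1], [0.5], -0.5))  # [1.0, 1.0, 1.0]

try:
    binarize_sparse_strict(3, [1], [0.5], -0.5)
except ValueError as err:
    print(err)  # Cannot binarize a sparse vector with threshold < 0
```

Option 1 keeps sparse and dense inputs consistent at the cost of memory, while option 2 preserves sparsity but rejects inputs that the dense path accepts.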



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
