[ https://issues.apache.org/jira/browse/SPARK-29144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen reassigned SPARK-29144:
---------------------------------

    Assignee: zhengruifeng

> Binarizer handle sparse vectors incorrectly with negative threshold
> -------------------------------------------------------------------
>
>                 Key: SPARK-29144
>                 URL: https://issues.apache.org/jira/browse/SPARK-29144
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0, 2.4.0
>            Reporter: zhengruifeng
>            Assignee: zhengruifeng
>            Priority: Minor
>             Fix For: 3.0.0
>
>
> Sparse vectors are processed incorrectly if threshold < 0:
> {code:java}
> scala> val data = Seq((0, Vectors.sparse(3, Array(1), Array(0.5))), (1, Vectors.dense(Array(0.0, 0.5, 0.0))))
> data: Seq[(Int, org.apache.spark.ml.linalg.Vector)] = List((0,(3,[1],[0.5])), (1,[0.0,0.5,0.0]))
> scala> val df = data.toDF("id", "feature")
> df: org.apache.spark.sql.DataFrame = [id: int, feature: vector]
> scala> val binarizer: Binarizer = new Binarizer().setInputCol("feature").setOutputCol("binarized_feature").setThreshold(-0.5)
> binarizer: org.apache.spark.ml.feature.Binarizer = binarizer_1c07ac2ae3c8
> scala> binarizer.transform(df).show()
> +---+-------------+-----------------+
> | id|      feature|binarized_feature|
> +---+-------------+-----------------+
> |  0|(3,[1],[0.5])|    [0.0,1.0,0.0]|
> |  1|[0.0,0.5,0.0]|    [1.0,1.0,1.0]|
> +---+-------------+-----------------+
> {code}
> The two input vectors above hold the same values, so their expected outputs should be the same.
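The mismatch reported above can be modeled in a few lines of plain Python. This is only a sketch, not Spark's code: `binarize_dense` and `binarize_active_only` are hypothetical helpers, and the second one assumes the sparse path compares only the stored (active) entries with the threshold, which is consistent with the output shown in the report.

```python
def binarize_dense(values, threshold):
    """Reference behaviour: every entry is compared with the threshold."""
    return [1.0 if v > threshold else 0.0 for v in values]


def binarize_active_only(size, indices, values, threshold):
    """Assumed model of the bug: only stored entries are compared, and
    implicit zeros are always emitted as 0.0."""
    out = [0.0] * size
    for i, v in zip(indices, values):
        if v > threshold:
            out[i] = 1.0
    return out


# Same logical vector [0.0, 0.5, 0.0], once dense and once sparse.
dense_result = binarize_dense([0.0, 0.5, 0.0], threshold=-0.5)
sparse_result = binarize_active_only(3, [1], [0.5], threshold=-0.5)

print(dense_result)   # [1.0, 1.0, 1.0]
print(sparse_result)  # [0.0, 1.0, 0.0]
```

With a non-negative threshold the two helpers agree, because an implicit zero never exceeds the threshold. The difference only appears when threshold < 0.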
>
> To deal with sparse vectors with threshold < 0, we have two options:
> 1. Return 1 for non-active items, but this will convert sparse vectors to dense ones.
> 2. Throw an exception, as Scikit-Learn's [Binarizer|https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Binarizer.html] does:
> {code:java}
> import numpy as np
> from scipy.sparse import csr_matrix
> from sklearn.preprocessing import Binarizer
> row = np.array([0, 0, 1, 2, 2, 2])
> col = np.array([0, 2, 2, 0, 1, 2])
> data = np.array([1, 2, 3, 4, 5, 6])
> a = csr_matrix((data, (row, col)), shape=(3, 3))
> binarizer = Binarizer(threshold=-1.0)
> binarizer.transform(a)
> Traceback (most recent call last):
>   File "<ipython-input-24-7e12ab26b3ed>", line 1, in <module>
>     binarizer.transform(a)
>   File "/home/zrf/Applications/anaconda3/lib/python3.7/site-packages/sklearn/preprocessing/data.py", line 1874, in transform
>     return binarize(X, threshold=self.threshold, copy=copy)
>   File "/home/zrf/Applications/anaconda3/lib/python3.7/site-packages/sklearn/preprocessing/data.py", line 1774, in binarize
>     raise ValueError('Cannot binarize a sparse matrix with threshold '
> ValueError: Cannot binarize a sparse matrix with threshold < 0
> {code}
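The two options listed in the issue can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the fix that was merged: `binarize_sparse` and its `on_negative` switch are hypothetical names, the sparse vector is passed as `(size, indices, values)`, and the result is always returned as a dense list for simplicity.

```python
def binarize_sparse(size, indices, values, threshold, on_negative="densify"):
    """Binarize a sparse vector given as (size, indices, values).

    on_negative is a hypothetical switch between the two options:
    "densify" returns 1.0 for non-active entries (option 1),
    "raise" rejects the input (option 2).
    """
    if threshold < 0:
        if on_negative == "raise":
            raise ValueError("Cannot binarize a sparse vector with threshold < 0")
        # Option 1: implicit zeros exceed a negative threshold, so start from
        # all ones and clear only the stored entries that fail the comparison.
        out = [1.0] * size
        for i, v in zip(indices, values):
            if v <= threshold:
                out[i] = 0.0
        return out
    # threshold >= 0: implicit zeros map to 0.0, so only stored entries matter.
    out = [0.0] * size
    for i, v in zip(indices, values):
        if v > threshold:
            out[i] = 1.0
    return out


print(binarize_sparse(3, [1], [0.5], threshold=-0.5))  # [1.0, 1.0, 1.0]
```

Option 1 makes the sparse result match the dense one at the cost of a mostly non-zero output, which is the densification the issue warns about. Option 2 keeps outputs sparse but pushes the decision back to the caller.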