[ https://issues.apache.org/jira/browse/SPARK-20214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15956237#comment-15956237 ]
Liang-Chi Hsieh commented on SPARK-20214:
-----------------------------------------

Confirmed that dok_matrix.tocsc() doesn't guarantee sorted indices:

{code}
>>> from scipy.sparse import lil_matrix
>>> lil = lil_matrix((4, 1))
>>> lil[1, 0] = 1
>>> lil[3, 0] = 2
>>> dok = lil.todok()
>>> csc = dok.tocsc()
>>> csc.has_sorted_indices
0
>>> csc.indices
array([3, 1], dtype=int32)
{code}

I checked the scipy source code. The only conversions that guarantee sorted indices are {{csc_matrix.tocsr()}} and {{csr_matrix.tocsc()}}.

> pyspark.mllib SciPyTests test_serialize
> ---------------------------------------
>
>                 Key: SPARK-20214
>                 URL: https://issues.apache.org/jira/browse/SPARK-20214
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, MLlib, PySpark, Tests
>    Affects Versions: 2.0.2, 2.1.1, 2.2.0
>            Reporter: Joseph K. Bradley
>
> I've seen a few failures of this line:
> https://github.com/apache/spark/blame/402bf2a50ddd4039ff9f376b641bd18fffa54171/python/pyspark/mllib/tests.py#L847
> It converts a scipy.sparse.lil_matrix to a dok_matrix and then to a
> pyspark.mllib.linalg.Vector. The failure happens in the conversion to a
> vector and indicates that the dok_matrix is not returning its values in
> sorted order. (Actually, the failure is in _convert_to_vector, which converts
> the dok_matrix to a csc_matrix and then passes the CSC data to the MLlib
> Vector constructor.)
> Here's the stack trace:
> {code}
> Traceback (most recent call last):
>   File "/home/jenkins/workspace/python/pyspark/mllib/tests.py", line 847, in test_serialize
>     self.assertEqual(sv, _convert_to_vector(lil.todok()))
>   File "/home/jenkins/workspace/python/pyspark/mllib/linalg/__init__.py", line 78, in _convert_to_vector
>     return SparseVector(l.shape[0], csc.indices, csc.data)
>   File "/home/jenkins/workspace/python/pyspark/mllib/linalg/__init__.py", line 556, in __init__
>     % (self.indices[i], self.indices[i + 1]))
> TypeError: Indices 3 and 1 are not strictly increasing
> {code}
> This seems like a bug in _convert_to_vector, where we really should check
> {{csc_matrix.has_sorted_indices}} first.
> I haven't seen this bug in pyspark.ml.linalg, but it probably exists there
> too.
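
For reference, the check suggested in the issue description could be sketched roughly as below. This is only an illustration under assumptions, not the actual fix to _convert_to_vector: the helper name sorted_column_entries is hypothetical, and it returns plain arrays rather than building a SparseVector so that it runs with scipy alone. It relies only on scipy's has_sorted_indices attribute and the in-place sort_indices() method.

```python
from scipy.sparse import lil_matrix


def sorted_column_entries(mat):
    """Hypothetical helper: return (size, indices, values) for an (n, 1)
    scipy sparse matrix, with indices in increasing order."""
    if mat.shape[1] != 1:
        raise ValueError("expected a single-column matrix")
    csc = mat.tocsc()
    # tocsc() on a dok_matrix may leave the indices unsorted, so check
    # and sort explicitly before handing the arrays to anything that
    # requires strictly increasing indices.
    # Note: if mat is already a csc_matrix, tocsc() may return the same
    # object, in which case sort_indices() sorts the input in place.
    if not csc.has_sorted_indices:
        csc.sort_indices()
    return csc.shape[0], csc.indices, csc.data


# Same matrix as in the reproduction above.
lil = lil_matrix((4, 1))
lil[1, 0] = 1
lil[3, 0] = 2
size, indices, values = sorted_column_entries(lil.todok())
```

With the arrays sorted this way, they could presumably be passed to the SparseVector constructor in place of csc.indices and csc.data, which is where the TypeError in the stack trace is raised.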