Joseph K. Bradley created SPARK-20214:
-----------------------------------------

             Summary: pyspark.mllib SciPyTests test_serialize
                 Key: SPARK-20214
                 URL: https://issues.apache.org/jira/browse/SPARK-20214
             Project: Spark
          Issue Type: Bug
          Components: ML, MLlib, PySpark, Tests
    Affects Versions: 2.0.2, 2.1.1, 2.2.0
            Reporter: Joseph K. Bradley


I've seen a few failures of this line: 
https://github.com/apache/spark/blame/402bf2a50ddd4039ff9f376b641bd18fffa54171/python/pyspark/mllib/tests.py#L847

The test converts a scipy.sparse.lil_matrix to a dok_matrix and then to a 
pyspark.mllib.linalg.Vector.  The failure happens in the second step, inside 
_convert_to_vector, which converts the dok_matrix to a csc_matrix and then 
passes the CSC indices and data to the MLlib SparseVector constructor.  The 
error indicates that the dok_matrix is not returning its entries in sorted 
order, so the CSC indices reach the constructor unsorted.  Here's the stack 
trace:
{code}
Traceback (most recent call last):
  File "/home/jenkins/workspace/python/pyspark/mllib/tests.py", line 847, in test_serialize
    self.assertEqual(sv, _convert_to_vector(lil.todok()))
  File "/home/jenkins/workspace/python/pyspark/mllib/linalg/__init__.py", line 78, in _convert_to_vector
    return SparseVector(l.shape[0], csc.indices, csc.data)
  File "/home/jenkins/workspace/python/pyspark/mllib/linalg/__init__.py", line 556, in __init__
    % (self.indices[i], self.indices[i + 1]))
TypeError: Indices 3 and 1 are not strictly increasing
{code}

This seems like a bug in _convert_to_vector: we really should check 
{{csc_matrix.has_sorted_indices}} first and sort the indices if they are not 
already sorted, before passing them to SparseVector.
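
As a rough illustration of the idea, the sketch below reorders index/value 
pairs when the indices are not strictly increasing.  It is plain Python and is 
not the actual _convert_to_vector code; the helper names 
has_strictly_increasing and sorted_sparse_entries and the sample values are 
made up for illustration.  In the real code path, the csc_matrix could 
presumably be sorted in place with scipy's sort_indices() instead.

```python
def has_strictly_increasing(indices):
    """Return True if every index is smaller than the one after it."""
    return all(a < b for a, b in zip(indices, indices[1:]))


def sorted_sparse_entries(indices, values):
    """Return (indices, values) reordered so the indices are ascending.

    Hypothetical helper: entries are only re-sorted when they are out of
    order, and each value stays paired with its index.
    """
    indices = list(indices)
    values = list(values)
    if len(indices) != len(values):
        raise ValueError("indices and values must have the same length")
    if has_strictly_increasing(indices):
        return indices, values
    # Sort the (index, value) pairs together so the pairing is preserved.
    pairs = sorted(zip(indices, values))
    return [i for i, _ in pairs], [v for _, v in pairs]


# Unsorted indices like the ones in the error message (values are made up).
idx, vals = sorted_sparse_entries([3, 1], [4.0, 2.0])
print(idx, vals)
```

Sorting only when needed keeps the common case cheap.  Duplicate indices would 
still fail a strictly-increasing check after sorting, so this sketch does not 
cover that case.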

I haven't seen this bug in pyspark.ml.linalg, but it probably exists there too.


