Anuj Nagpall created SPARK-24693:
------------------------------------

             Summary: Row order preservation for operations on MLlib IndexedRowMatrix
                 Key: SPARK-24693
                 URL: https://issues.apache.org/jira/browse/SPARK-24693
             Project: Spark
          Issue Type: Bug
          Components: MLlib
            Reporter: Anuj Nagpall


Both IndexedRowMatrix.computeSVD and IndexedRowMatrix.multiply drop the row 
indices before calling the corresponding methods of RowMatrix. For example, for 
IndexedRowMatrix.computeSVD:

   val svd = toRowMatrix().computeSVD(k, computeU, rCond)

and for IndexedRowMatrix.multiply:

   val mat = toRowMatrix().multiply(B)

After these results are computed, they are zipped with the original indices, 
e.g. for IndexedRowMatrix.computeSVD:

   val indexedRows = indices.zip(svd.U.rows).map { case (i, v) =>
      IndexedRow(i, v)
   }

and for IndexedRowMatrix.multiply:
   
   val indexedRows = rows.map(_.index).zip(mat.rows).map { case (i, v) =>
      IndexedRow(i, v)
   }

I have observed that, for IndexedRowMatrix.computeSVD().U and 
IndexedRowMatrix.multiply() (which both depend on RowMatrix.multiply), row 
indices can get mixed up when Spark jobs run on multiple executors/machines: 
the vectors and indices in the result no longer seem to correspond.
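
A minimal sketch of how such a mismatch could be detected, assuming a running 
SparkContext. The object MismatchCheck and the method countMismatchedRows are 
hypothetical names, not part of MLlib. Each row stores its own index in the 
first vector entry and the matrix is multiplied by the identity, so any result 
row whose first entry differs from its index has been paired with the wrong 
vector:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.{DenseMatrix, Vectors}
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

// Hypothetical helper for illustration; not part of MLlib.
object MismatchCheck {
  // Returns the number of result rows whose vector no longer matches its index.
  def countMismatchedRows(sc: SparkContext, n: Int, numPartitions: Int): Long = {
    // Each row carries its own index as the first vector entry.
    val rows = sc.parallelize(0 until n, numPartitions).map { i =>
      IndexedRow(i.toLong, Vectors.dense(i.toDouble, 1.0))
    }
    val mat = new IndexedRowMatrix(rows)
    // Multiplying by the 2x2 identity should leave every row unchanged.
    val result = mat.multiply(DenseMatrix.eye(2))
    result.rows.filter(r => r.vector(0) != r.index.toDouble).count()
  }
}
```

A non-zero count would indicate that indices and vectors were paired 
incorrectly. A local run may well not show the problem, since I have only seen 
it with multiple executors/machines.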

It looks to me as though this is caused by zipping RDDs whose elements are not 
in the same order.

For IndexedRowMatrix.multiply I have observed that the ordering within 
partitions is preserved, but that the partitions themselves seem to get mixed 
up. For example, for the input:

part1Index1 part1Vector1
part1Index2 part1Vector2
part2Index1 part2Vector1
part2Index2 part2Vector2

I got:

part2Index1 part1Vector1
part2Index2 part1Vector2
part1Index1 part2Vector1
part1Index2 part2Vector2
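
The effect can be modelled in plain Scala without Spark. This is only an 
illustrative sketch: each inner Seq stands for one partition, and the reversed 
partition order on the index side is an assumption chosen to reproduce the 
listing above, not a claim about what Spark does internally. The object name 
ZipOrderModel is hypothetical.

```scala
object ZipOrderModel {
  // Each inner Seq stands for one partition.
  val indexPartitions: Seq[Seq[String]] =
    Seq(Seq("part1Index1", "part1Index2"), Seq("part2Index1", "part2Index2"))
  val vectorPartitions: Seq[Seq[String]] =
    Seq(Seq("part1Vector1", "part1Vector2"), Seq("part2Vector1", "part2Vector2"))

  // Assumption for illustration: the index side is traversed with its
  // partitions in reverse order, while the vector side keeps its order.
  val zipped: Seq[(String, String)] =
    indexPartitions.reverse.flatten.zip(vectorPartitions.flatten)

  // If each index travels together with its vector, reordering the
  // partitions afterwards cannot separate them.
  val paired: Seq[Seq[(String, String)]] =
    indexPartitions.zip(vectorPartitions).map { case (is, vs) => is.zip(vs) }
  val reordered: Seq[(String, String)] = paired.reverse.flatten

  def main(args: Array[String]): Unit = {
    println("Zipped after reordering one side:")
    zipped.foreach { case (i, v) => println(s"$i $v") }
    println("Paired before reordering:")
    reordered.foreach { case (i, v) => println(s"$i $v") }
  }
}
```

The first listing printed matches the mixed-up result above; in the second, 
every index stays next to its own vector even though the partitions change 
places.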

Another observation is that the mapPartitions call in RowMatrix.multiply:

   val AB = rows.mapPartitions { iter =>

had a "preservesPartitioning = true" argument in version 1.0, but this is no 
longer there.
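
One possible way to avoid the zip altogether, sketched here under assumptions 
and not taken from the MLlib source, is to keep each index attached to its row 
during the multiplication. The object IndexPreservingMultiply and the method 
multiplyKeepingIndices are hypothetical names. The sketch only covers multiply 
with a local matrix B, always produces dense result rows, and does not address 
computeSVD:

```scala
import org.apache.spark.mllib.linalg.Matrix
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

// Hypothetical helper for illustration; not part of MLlib.
object IndexPreservingMultiply {
  def multiplyKeepingIndices(mat: IndexedRowMatrix, B: Matrix): IndexedRowMatrix = {
    require(mat.numCols() == B.numRows,
      s"Dimension mismatch: ${mat.numCols()} columns vs ${B.numRows} rows")
    // Ship the transposed local matrix to the executors once.
    val bT = mat.rows.context.broadcast(B.transpose)
    val rows = mat.rows.map { case IndexedRow(i, v) =>
      // row * B equals (B^T * row^T)^T, so B^T times the row vector
      // yields the result row; the index i never leaves its row.
      IndexedRow(i, bT.value.multiply(v))
    }
    new IndexedRowMatrix(rows, mat.numRows(), B.numCols)
  }
}
```

Because index and vector are transformed in the same map, the result does not 
depend on two RDDs being traversed in the same order.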









