Repository: spark
Updated Branches:
  refs/heads/branch-1.6 2ddd10486 -> bfcc8cfee


[SPARK-11497][MLLIB][PYTHON] PySpark RowMatrix Constructor Has Type Erasure 
Issue

As noted in PR #9441, implementing `tallSkinnyQR` uncovered a bug in our 
PySpark `RowMatrix` constructor.  As discussed on the dev list 
[here](http://apache-spark-developers-list.1001551.n3.nabble.com/K-Means-And-Class-Tags-td10038.html),
there appears to be a type erasure issue with RDDs coming from Java, and 
by extension from PySpark.  Although we are attempting to construct a 
`RowMatrix` from an `RDD[Vector]` in 
[PythonMLLibAPI](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala#L1115),
the `Vector` type is erased, resulting in an `RDD[Object]`.  Thus, when 
calling Scala's `tallSkinnyQR` from PySpark, we get a Java `ClassCastException` 
in which an `Object` cannot be cast to a Spark `Vector`.  As noted in the 
aforementioned dev list thread, this issue was also encountered with 
`DecisionTrees`, and the fix involved an explicit `retag` of the RDD with a 
`Vector` type.  `IndexedRowMatrix` and `CoordinateMatrix` do not appear to 
have this issue, likely because their related helper functions in 
`PythonMLLibAPI` create the RDDs explicitly from DataFrames with pattern 
matching, thus preserving the types.

This PR applies that retagging fix to the `createRowMatrix` helper function 
in `PythonMLLibAPI`.  This PR blocks #9441, so once this is merged, the 
other can be rebased.
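
The failure mode and the retagging fix described above can be illustrated without Spark. The following is a minimal sketch, not Spark's implementation: `Vec`, `TaggedSeq`, `fromErasedSource`, and `countRows` are hypothetical names introduced only for illustration, with `TaggedSeq` standing in for an RDD that carries a `ClassTag`. It assumes that the wrapper hands over elements under a placeholder `Object` tag, which is one way such a `ClassCastException` can arise.

```scala
import scala.reflect.ClassTag
import scala.util.Try

object RetagSketch {
  // Hypothetical stand-in for org.apache.spark.mllib.linalg.Vector.
  final case class Vec(values: Seq[Double])

  // Toy stand-in for an RDD: holds elements plus the ClassTag used to
  // allocate arrays of them. This is an assumption-laden model, not Spark code.
  final class TaggedSeq[T](val items: Seq[T])(implicit val tag: ClassTag[T]) {
    // The runtime array class is chosen by the tag, not by the static type T.
    def collect(): Array[T] = items.toArray[T](tag)

    // Same elements, but with a tag rebuilt from the given runtime class.
    def retag(cls: Class[T]): TaggedSeq[T] =
      new TaggedSeq[T](items)(ClassTag[T](cls))
  }

  // Simulates data arriving through a Java/Python wrapper: the static type
  // says Vec, but the only tag available is the one for Object.
  def fromErasedSource(items: Seq[Vec]): TaggedSeq[Vec] =
    new TaggedSeq[Vec](items)(ClassTag.AnyRef.asInstanceOf[ClassTag[Vec]])

  // Forces the collected array to be used as an Array[Vec]; this fails with a
  // ClassCastException when the array is really an Object[] at runtime.
  def countRows(rows: TaggedSeq[Vec]): Try[Int] = Try {
    val arr: Array[Vec] = rows.collect()
    arr.length
  }

  def main(args: Array[String]): Unit = {
    val erased = fromErasedSource(Seq(Vec(Seq(1.0, 2.0)), Vec(Seq(3.0, 4.0))))
    println(s"erased tag fails:   ${countRows(erased).isFailure}")
    println(s"retagged succeeds: ${countRows(erased.retag(classOf[Vec])).isSuccess}")
  }
}
```

In this toy model the elements themselves never change; only the tag used to allocate arrays does, which is why a one-line `retag` is enough to make the cast succeed.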

cc holdenk

Author: Mike Dusenberry <mwdus...@us.ibm.com>

Closes #9458 from 
dusenberrymw/SPARK-11497_PySpark_RowMatrix_Constructor_Has_Type_Erasure_Issue.

(cherry picked from commit 1b8220387e6903564f765fabb54be0420c3e99d7)
Signed-off-by: Joseph K. Bradley <jos...@databricks.com>


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/bfcc8cfe
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/bfcc8cfe
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/bfcc8cfe

Branch: refs/heads/branch-1.6
Commit: bfcc8cfee7219e63d2f53fc36627f95dc60428eb
Parents: 2ddd104
Author: Mike Dusenberry <mwdus...@us.ibm.com>
Authored: Fri Dec 11 14:21:33 2015 -0800
Committer: Joseph K. Bradley <jos...@databricks.com>
Committed: Fri Dec 11 14:21:48 2015 -0800

----------------------------------------------------------------------
 .../scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala   | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/bfcc8cfe/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala
----------------------------------------------------------------------
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala b/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala
index 2aa6aec..8d546e3 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala
@@ -1143,7 +1143,7 @@ private[python] class PythonMLLibAPI extends Serializable {
    * Wrapper around RowMatrix constructor.
    */
   def createRowMatrix(rows: JavaRDD[Vector], numRows: Long, numCols: Int): RowMatrix = {
-    new RowMatrix(rows.rdd, numRows, numCols)
+    new RowMatrix(rows.rdd.retag(classOf[Vector]), numRows, numCols)
   }
 
   /**

