[ 
https://issues.apache.org/jira/browse/SPARK-11497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14995990#comment-14995990
 ] 

Mike Dusenberry commented on SPARK-11497:
-----------------------------------------

So far, I've only encountered it as an issue with the tallSkinnyQR wrapper.  We 
can just postpone this fix and the wrapper for 1.7.

> PySpark RowMatrix Constructor Has Type Erasure Issue
> ----------------------------------------------------
>
>                 Key: SPARK-11497
>                 URL: https://issues.apache.org/jira/browse/SPARK-11497
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib, PySpark
>    Affects Versions: 1.5.0, 1.5.1, 1.6.0
>            Reporter: Mike Dusenberry
>            Assignee: Mike Dusenberry
>            Priority: Minor
>
> Implementing tallSkinnyQR in SPARK-9656 uncovered a bug with our PySpark 
> RowMatrix constructor. As discussed on the dev list 
> [here|http://apache-spark-developers-list.1001551.n3.nabble.com/K-Means-And-Class-Tags-td10038.html],
>  there appears to be an issue with type erasure with RDDs coming from Java, 
> and by extension from PySpark. Although we are attempting to construct a 
> RowMatrix from an RDD[Vector] in 
> [PythonMLlibAPI|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala#L1115],
>  the Vector type is erased, resulting in an RDD[Object]. Thus, when calling 
> Scala's tallSkinnyQR from PySpark, we get a Java ClassCastException in which 
> an Object cannot be cast to a Spark Vector. As noted in the aforementioned 
> dev list thread, this issue was also encountered with DecisionTrees, and the 
> fix involved an explicit retag of the RDD with a Vector type. Thus, this PR 
> will apply that fix to the createRowMatrix helper function in PythonMLlibAPI. 
> IndexedRowMatrix and CoordinateMatrix do not appear to have this issue likely 
> due to their related helper functions in PythonMLlibAPI creating the RDDs 
> explicitly from DataFrames with pattern matching, thus preserving the types. 
> The following reproduces this issue on the latest Git head, 1.5.1, and 1.5.0:
> {code}
> from pyspark.mllib.linalg.distributed import RowMatrix
> rows = sc.parallelize([[3, -6], [4, -8], [0, 1]])
> mat = RowMatrix(rows)
> mat._java_matrix_wrapper.call("tallSkinnyQR", True)
> {code}
> Should result in the following exception:
> {code}
> java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to 
> [Lorg.apache.spark.mllib.linalg.Vector;
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to