[ https://issues.apache.org/jira/browse/SPARK-11497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14995990#comment-14995990 ]
Mike Dusenberry commented on SPARK-11497:
-----------------------------------------

So far, I've only encountered it as an issue with the tallSkinnyQR wrapper. We can just postpone this fix and the wrapper for 1.7.

> PySpark RowMatrix Constructor Has Type Erasure Issue
> ----------------------------------------------------
>
>                 Key: SPARK-11497
>                 URL: https://issues.apache.org/jira/browse/SPARK-11497
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib, PySpark
>    Affects Versions: 1.5.0, 1.5.1, 1.6.0
>            Reporter: Mike Dusenberry
>            Assignee: Mike Dusenberry
>            Priority: Minor
>
> Implementing tallSkinnyQR in SPARK-9656 uncovered a bug with our PySpark
> RowMatrix constructor. As discussed on the dev list
> [here|http://apache-spark-developers-list.1001551.n3.nabble.com/K-Means-And-Class-Tags-td10038.html],
> there appears to be an issue with type erasure with RDDs coming from Java,
> and by extension from PySpark. Although we are attempting to construct a
> RowMatrix from an RDD[Vector] in
> [PythonMLlibAPI|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala#L1115],
> the Vector type is erased, resulting in an RDD[Object]. Thus, when calling
> Scala's tallSkinnyQR from PySpark, we get a Java ClassCastException in which
> an Object cannot be cast to a Spark Vector. As noted in the aforementioned
> dev list thread, this issue was also encountered with DecisionTrees, and the
> fix involved an explicit retag of the RDD with a Vector type. Thus, this PR
> will apply that fix to the createRowMatrix helper function in PythonMLlibAPI.
> IndexedRowMatrix and CoordinateMatrix do not appear to have this issue likely
> due to their related helper functions in PythonMLlibAPI creating the RDDs
> explicitly from DataFrames with pattern matching, thus preserving the types.
> The following reproduces this issue on the latest Git head, 1.5.1, and 1.5.0:
> {code}
> from pyspark.mllib.linalg.distributed import RowMatrix
> rows = sc.parallelize([[3, -6], [4, -8], [0, 1]])
> mat = RowMatrix(rows)
> mat._java_matrix_wrapper.call("tallSkinnyQR", True)
> {code}
> Should result in the following exception:
> {code}
> java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to [Lorg.apache.spark.mllib.linalg.Vector;
> {code}
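
To check whether a particular build is affected, the reproduction above can be wrapped in a small helper. This is only a sketch under stated assumptions: `hits_type_erasure_bug` is a hypothetical name and not part of PySpark, and it assumes the Java exception text shows up in the message of the Python exception (in practice a py4j Py4JJavaError). It catches a plain Exception so the helper itself needs no Spark imports.

```python
def hits_type_erasure_bug(mat):
    """Return True if tallSkinnyQR on `mat` fails with the ClassCastException
    described in this issue, False if the call succeeds.

    `mat` is expected to be a pyspark.mllib.linalg.distributed.RowMatrix.
    Hypothetical helper for illustration only.
    """
    try:
        # Same private call as in the reproduction from the issue description.
        mat._java_matrix_wrapper.call("tallSkinnyQR", True)
    except Exception as exc:  # typically py4j.protocol.Py4JJavaError
        # Assumption: the Java exception class name appears in the message.
        if "ClassCastException" in str(exc):
            return True
        # Any other failure is unrelated to this issue, so let it propagate.
        raise
    return False


# Usage in a PySpark shell, where `sc` is the active SparkContext:
#   from pyspark.mllib.linalg.distributed import RowMatrix
#   mat = RowMatrix(sc.parallelize([[3, -6], [4, -8], [0, 1]]))
#   hits_type_erasure_bug(mat)
```

Unrelated errors are re-raised instead of being reported as "not affected", so a misconfigured cluster is not mistaken for a fixed build.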