Henry Davidge created SPARK-28140:
-------------------------------------

             Summary: Pyspark API to create spark.mllib RowMatrix from DataFrame
                 Key: SPARK-28140
                 URL: https://issues.apache.org/jira/browse/SPARK-28140
             Project: Spark
          Issue Type: Improvement
          Components: MLlib, PySpark
    Affects Versions: 3.0.0
            Reporter: Henry Davidge


Since many functions are only implemented in spark.mllib, it is often necessary 
to convert DataFrames of spark.ml vectors to spark.mllib distributed matrix 
formats. The first step, converting the spark.ml vectors to their spark.mllib 
equivalents, is straightforward. However, to the best of my knowledge, it is not 
possible to convert the resulting DataFrame to a RowMatrix without using a 
Python lambda function, which can incur a significant performance penalty. In my 
recent use case, SVD took 3.5 minutes using the Scala API but 12 minutes using 
Python.

To avoid this performance penalty, I propose adding a constructor to the 
PySpark RowMatrix class that accepts a DataFrame with a single column of 
spark.mllib vectors. I'd be happy to add an equivalent API for IndexedRowMatrix 
if there is demand.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
