[ https://issues.apache.org/jira/browse/SPARK-28140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-28140:
------------------------------
    Priority: Minor  (was: Major)

> Pyspark API to create spark.mllib RowMatrix from DataFrame
> ----------------------------------------------------------
>
>                 Key: SPARK-28140
>                 URL: https://issues.apache.org/jira/browse/SPARK-28140
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib, PySpark
>    Affects Versions: 3.0.0
>            Reporter: Henry Davidge
>            Priority: Minor
>
> Since many functions are implemented only in spark.mllib, it is often 
> necessary to convert DataFrames of spark.ml vectors to spark.mllib 
> distributed matrix formats. The first step, converting the spark.ml vectors 
> to their spark.mllib equivalents, is straightforward. However, to the best of 
> my knowledge, it is not possible to convert the resulting DataFrame to a 
> RowMatrix without using a Python lambda function, which can carry a 
> significant performance cost. In my recent use case, SVD took 3.5 minutes 
> using the Scala API but 12 minutes using Python.
> To get around this performance hit, I propose adding a constructor to the 
> PySpark RowMatrix class that accepts a DataFrame with a single column of 
> spark.mllib vectors. I'd be happy to add an equivalent API for 
> IndexedRowMatrix if there is demand.
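
For readers unfamiliar with the workaround the description refers to, the following is a minimal sketch of the lambda-based conversion, not code taken from the issue. The helper name `df_to_row_matrix` and the column name `features` are hypothetical and chosen only for illustration. `MLUtils.convertVectorColumnsFromML` and the RDD-based `RowMatrix` constructor are existing PySpark calls. The `rdd.map` step is the one the reporter identifies as costly, since every row passes through a Python function.

```python
def df_to_row_matrix(df, col="features"):
    """Build a spark.mllib RowMatrix from a DataFrame column of spark.ml vectors.

    `df_to_row_matrix` and the default column name are illustrative
    assumptions, not part of the PySpark API.
    """
    # Imported lazily so this module can be loaded without a Spark installation.
    from pyspark.mllib.linalg.distributed import RowMatrix
    from pyspark.mllib.util import MLUtils

    # Step 1: convert spark.ml vectors in `col` to spark.mllib vectors.
    converted = MLUtils.convertVectorColumnsFromML(df, col)

    # Step 2: unwrap each Row into a bare vector with a Python lambda.
    # This per-row Python call is the overhead described in the issue.
    rows = converted.select(col).rdd.map(lambda row: row[0])

    # Step 3: wrap the RDD of vectors in a distributed RowMatrix.
    return RowMatrix(rows)
```

Assuming `df` holds a `features` column of spark.ml vectors, `df_to_row_matrix(df).computeSVD(k)` would then run the SVD on the resulting matrix, where `k` is the number of singular values to keep.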



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

