[ https://issues.apache.org/jira/browse/SPARK-28140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen updated SPARK-28140:
------------------------------
    Priority: Minor  (was: Major)

> Pyspark API to create spark.mllib RowMatrix from DataFrame
> ----------------------------------------------------------
>
>                 Key: SPARK-28140
>                 URL: https://issues.apache.org/jira/browse/SPARK-28140
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib, PySpark
>    Affects Versions: 3.0.0
>            Reporter: Henry Davidge
>            Priority: Minor
>
> Since many functions are only implemented in spark.mllib, it is often
> necessary to convert DataFrames of spark.ml vectors to spark.mllib
> distributed matrix formats. The first step, converting the spark.ml vectors
> to the spark.mllib equivalent, is straightforward. However, to the best of my
> knowledge it's not possible to convert the resulting DataFrame to a RowMatrix
> without using a python lambda function, which can have a significant
> performance hit. In my recent use case, SVD took 3.5m using the Scala API,
> but 12m using Python.
> To get around this performance hit, I propose adding a constructor to the
> Pyspark RowMatrix class that accepts a DataFrame with a single column of
> spark.mllib vectors. I'd be happy to add an equivalent API for
> IndexedRowMatrix if there is demand.
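
For context, the two-step workaround the reporter describes might look roughly like the sketch below. It is an illustration only, not code from the issue: the helper name `dataframe_to_row_matrix` and the default column name `features` are assumptions, and the PySpark imports are placed inside the function so the definition loads even where Spark is not installed.

```python
def dataframe_to_row_matrix(df, vector_col="features"):
    """Build a spark.mllib RowMatrix from a DataFrame column of spark.ml vectors.

    Hypothetical helper sketching the workaround described in the issue;
    `vector_col` is an assumed column name.
    """
    # Imported lazily so this module can be loaded without a Spark installation.
    from pyspark.mllib.linalg.distributed import RowMatrix
    from pyspark.mllib.util import MLUtils

    # Step 1: convert the spark.ml vector column to spark.mllib vectors.
    converted = MLUtils.convertVectorColumnsFromML(df, vector_col)

    # Step 2: pull the vector out of each Row with a Python lambda.
    # This is the per-row Python step the reporter identifies as the
    # source of the performance hit.
    rows = converted.select(vector_col).rdd.map(lambda row: row[0])

    return RowMatrix(rows)
```

With an existing DataFrame `df`, something like `dataframe_to_row_matrix(df).computeSVD(k)` would then run the SVD mentioned in the report, where `k` is the number of singular values to keep. The proposal in the issue is to let the RowMatrix constructor take the converted DataFrame directly, so the lambda in step 2 is no longer needed.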