[ https://issues.apache.org/jira/browse/SPARK-4285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joseph K. Bradley reassigned SPARK-4285:
----------------------------------------

    Assignee:     (was: Joseph K. Bradley)

> Transpose RDD[Vector] to column store for ML
> --------------------------------------------
>
>                 Key: SPARK-4285
>                 URL: https://issues.apache.org/jira/browse/SPARK-4285
>             Project: Spark
>          Issue Type: Sub-task
>          Components: MLlib
>            Reporter: Joseph K. Bradley
>            Priority: Minor
>
> For certain ML algorithms, a column store is more efficient than a row store
> (which is currently used everywhere). E.g., deep decision trees can be
> faster to train when partitioning by features.
>
> Proposal: Provide a method with the following API (probably in util/):
> ```
> def rowToColumnStore(data: RDD[Vector]): RDD[(Int, Vector)]
> ```
> The input Vectors will be data rows/instances, and the output Vectors will be
> columns/features paired with column/feature indices.
>
> **Question**: Is it important to maintain matrix structure? That is, should
> output Vectors in the same partition be adjacent columns in the matrix?
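
To make the proposed signature concrete, here is a minimal sketch of one naive way such a transpose could be written. It is not the implementation attached to this issue. The enclosing object name `ColumnStore` is hypothetical, and the sketch assumes dense rows of equal length and that each column fits in memory on a single executor, since it relies on `groupByKey`.

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// Hypothetical holder object; only the method signature comes from the issue.
object ColumnStore {

  /** Naive transpose: one output Vector per feature, keyed by feature index. */
  def rowToColumnStore(data: RDD[Vector]): RDD[(Int, Vector)] = {
    // Tag each row with a stable row index so column entries can be ordered later.
    val indexedRows = data.zipWithIndex()

    // Emit (columnIndex, (rowIndex, value)) for every entry of every row.
    val entries = indexedRows.flatMap { case (row, rowIdx) =>
      row.toArray.iterator.zipWithIndex.map { case (value, colIdx) =>
        (colIdx, (rowIdx, value))
      }
    }

    // Gather all entries of a column (assumed to fit in memory), restore
    // the original row order, and pack the values into a dense Vector.
    entries.groupByKey().mapValues { cells =>
      Vectors.dense(cells.toArray.sortBy(_._1).map(_._2))
    }
  }
}
```

This version shuffles every matrix entry and makes no guarantee about which columns share a partition, so it does not answer the question above. If adjacent columns should be co-located, one option would be to follow it with `sortByKey()`, which range-partitions the result by column index.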