[ https://issues.apache.org/jira/browse/SPARK-4285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joseph K. Bradley updated SPARK-4285:
-------------------------------------
    Target Version/s: 1.6.0  (was: 1.5.0)

> Transpose RDD[Vector] to column store for ML
> --------------------------------------------
>
>                 Key: SPARK-4285
>                 URL: https://issues.apache.org/jira/browse/SPARK-4285
>             Project: Spark
>          Issue Type: Sub-task
>          Components: MLlib
>            Reporter: Joseph K. Bradley
>            Assignee: Joseph K. Bradley
>            Priority: Minor
>
> For certain ML algorithms, a column store is more efficient than a row store
> (which is currently used everywhere).  E.g., deep decision trees can be
> faster to train when partitioning by features.
> Proposal: Provide a method with the following API (probably in util/):
> ```
> def rowToColumnStore(data: RDD[Vector]): RDD[(Int, Vector)]
> ```
> The input Vectors will be data rows/instances, and the output Vectors will be
> columns/features paired with column/feature indices.
> **Question**: Is it important to maintain matrix structure?  That is, should
> output Vectors in the same partition be adjacent columns in the matrix?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
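
For illustration only, here is a minimal sketch of how the proposed
rowToColumnStore signature could be satisfied with existing RDD operations.
It is not the implementation attached to this issue. The enclosing object
name ColumnStore is hypothetical, and the sketch assumes dense output, rows
of equal length, and a row count that fits in an Int. It shuffles every
matrix entry, so it is likely far from the efficient version the issue has
in mind.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Hypothetical container object; the issue only suggests "probably in util/".
object ColumnStore {

  // Naive transpose: one (columnIndex, columnVector) pair per feature.
  def rowToColumnStore(data: RDD[Vector]): RDD[(Int, Vector)] = {
    // Each output column has one slot per input row.
    val numRows = data.count()
    require(numRows <= Int.MaxValue, "a column must fit in a single Vector")

    data
      .zipWithIndex() // attach a row index to every row
      .flatMap { case (row, rowIdx) =>
        // Emit (columnIndex, (rowIndex, value)) for every entry of the row.
        row.toArray.iterator.zipWithIndex.map { case (value, colIdx) =>
          (colIdx, (rowIdx.toInt, value))
        }
      }
      .groupByKey() // gather all entries that belong to the same column
      .mapValues { entries =>
        // Place each value at its row position to rebuild the column.
        val column = new Array[Double](numRows.toInt)
        entries.foreach { case (rowIdx, value) => column(rowIdx) = value }
        Vectors.dense(column)
      }
  }
}
```

On the question of matrix structure: groupByKey with the default hash
partitioner gives no guarantee that adjacent columns share a partition. If
that property turns out to matter, one option would be to follow this with
sortByKey(), which range-partitions by column index, at the cost of another
shuffle. On Spark releases before 1.3 the pair-RDD methods used here also
need `import org.apache.spark.SparkContext._`.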