[ https://issues.apache.org/jira/browse/SYSTEMML-2418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
LI Guobao updated SYSTEMML-2418:
--------------------------------
    Description:
In the context of ML, the training data is usually too large to fit in the Spark driver node, so partitioning such enormous data in CP is no longer feasible. This task aims to do the data partitioning in a distributed way: each worker receives its split of the training data and partitions it locally according to the different schemes. All the data is then grouped by the given key (i.e., the worker id) and finally written to separate HDFS files.

  (was: In the context of ps, the training data will be partitioned according to the different schemes. This conversion is executed in the driver node, and the partitioned data should be distributed to the workers via broadcast. Due to the 2 GB limit of Spark broadcast, we could leverage the _PartitionedBroadcast_ class to do this conversion. Afterwards, the partitioned broadcast object can be passed to the workers to launch their jobs.)

> Spark data partitioner
> ----------------------
>
>                 Key: SYSTEMML-2418
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2418
>             Project: SystemML
>          Issue Type: Sub-task
>            Reporter: LI Guobao
>            Assignee: LI Guobao
>            Priority: Major
>
> In the context of ML, the training data is usually too large to fit in the
> Spark driver node, so partitioning such enormous data in CP is no longer
> feasible. This task aims to do the data partitioning in a distributed way:
> each worker receives its split of the training data and partitions it
> locally according to the different schemes. All the data is then grouped by
> the given key (i.e., the worker id) and finally written to separate HDFS
> files.
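
The three stages described above (local partitioning of each split, grouping by worker id, one output file per worker) can be sketched roughly as follows. This is only an illustration in plain Python, not the SystemML implementation: the function names are hypothetical, the round-robin scheme is an assumed example of a partitioning scheme, a local directory stands in for HDFS, and the loop over splits stands in for work that Spark would run in parallel on the executors.

```python
import os
import tempfile
from collections import defaultdict


def partition_split_locally(split, num_workers):
    """Stage 1: runs over one split of the training data.

    Tags each (row_id, row) record with a target worker id. Round-robin on
    the row id is an assumed scheme, used here only for illustration.
    """
    return [(row_id % num_workers, (row_id, row)) for row_id, row in split]


def group_by_worker(tagged_records):
    """Stage 2: stand-in for the shuffle that groups records by worker id."""
    groups = defaultdict(list)
    for worker_id, record in tagged_records:
        groups[worker_id].append(record)
    return groups


def write_partitions(groups, out_dir):
    """Stage 3: write one file per worker (local files stand in for HDFS)."""
    paths = {}
    for worker_id, records in sorted(groups.items()):
        path = os.path.join(out_dir, f"worker_{worker_id}.csv")
        with open(path, "w") as f:
            # Sort by row id so the file content is deterministic.
            for row_id, row in sorted(records, key=lambda r: r[0]):
                f.write(f"{row_id}," + ",".join(map(str, row)) + "\n")
        paths[worker_id] = path
    return paths


def run_partitioner(splits, num_workers, out_dir):
    """Chain the three stages; returns a dict of worker id -> file path."""
    tagged = []
    for split in splits:  # in Spark, each split would be handled in parallel
        tagged.extend(partition_split_locally(split, num_workers))
    return write_partitions(group_by_worker(tagged), out_dir)


if __name__ == "__main__":
    # Two splits of a tiny 4-row data set, partitioned for 2 workers.
    demo_splits = [
        [(0, [1.0, 2.0]), (1, [3.0, 4.0])],
        [(2, [5.0, 6.0]), (3, [7.0, 8.0])],
    ]
    with tempfile.TemporaryDirectory() as tmp:
        for wid, p in run_partitioner(demo_splits, 2, tmp).items():
            print(wid, os.path.basename(p))
```

The point of the design is that no single node ever holds the full data set: each split is tagged where it already lives, and only the grouping step moves records between nodes.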