[jira] [Updated] (SYSTEMML-2418) Spark data partitioner

LI Guobao (JIRA) Wed, 27 Jun 2018 10:03:50 -0700


     [ 
https://issues.apache.org/jira/browse/SYSTEMML-2418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


LI Guobao updated SYSTEMML-2418:
--------------------------------
    Description: In the context of ML, the training data is usually too large
to fit in the Spark driver node, so partitioning such a large dataset in CP is
no longer feasible. This task aims to do the data partitioning in a distributed
way: each worker receives its split of the training data and partitions it
locally according to the chosen scheme. All the data is then grouped by the
given key (i.e., the worker id) and finally written to separate HDFS files in
the scratch space.  (was: In the context of ML, the training data is usually
too large to fit in the Spark driver node, so partitioning such a large dataset
in CP is no longer feasible. This task aims to do the data partitioning in a
distributed way: each worker receives its split of the training data and
partitions it locally according to the chosen scheme. All the data is then
grouped by the given key (i.e., the worker id) and finally written to separate
HDFS files.)

> Spark data partitioner
> ----------------------
>
>                 Key: SYSTEMML-2418
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2418
>             Project: SystemML
>          Issue Type: Sub-task
>            Reporter: LI Guobao
>            Assignee: LI Guobao
>            Priority: Major
>
> In the context of ML, the training data is usually too large to fit in the
> Spark driver node, so partitioning such a large dataset in CP is no longer
> feasible. This task aims to do the data partitioning in a distributed way:
> each worker receives its split of the training data and partitions it
> locally according to the chosen scheme. All the data is then grouped by the
> given key (i.e., the worker id) and finally written to separate HDFS files
> in the scratch space.
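
The flow described above (local partitioning on each worker, grouping by
worker id, one output file per worker) can be sketched roughly as follows.
This is a plain-Python simulation, not the SystemML or Spark implementation:
the function names, the round-robin scheme, and the use of a local temporary
directory in place of HDFS scratch space are all assumptions made for
illustration.

```python
import os
import tempfile
from collections import defaultdict


def partition_split(split, num_workers):
    """Runs on one worker: tag each (row_id, row) with a target worker id.

    Round-robin on the global row id is an assumed scheme, used only as an
    example of "different schemes".
    """
    return [(row_id % num_workers, (row_id, row)) for row_id, row in split]


def group_by_worker(tagged_splits):
    """Shuffle step: collect all records that share the same worker id."""
    groups = defaultdict(list)
    for tagged in tagged_splits:
        for worker_id, record in tagged:
            groups[worker_id].append(record)
    return groups


def write_partitions(groups, scratch_dir):
    """Write one file per worker id (local directory stands in for HDFS)."""
    paths = {}
    for worker_id, records in sorted(groups.items()):
        path = os.path.join(scratch_dir, f"worker_{worker_id}.csv")
        with open(path, "w") as f:
            for row_id, row in sorted(records):
                f.write(f"{row_id}," + ",".join(str(v) for v in row) + "\n")
        paths[worker_id] = path
    return paths


def run_demo(num_rows=6, num_workers=2):
    """Hypothetical driver: two input splits, two target workers."""
    data = [(i, [i * 1.0, i * 2.0]) for i in range(num_rows)]
    splits = [data[: num_rows // 2], data[num_rows // 2:]]
    # Each split is tagged independently, as it would be on its own worker.
    tagged = [partition_split(s, num_workers) for s in splits]
    groups = group_by_worker(tagged)
    scratch = tempfile.mkdtemp(prefix="scratch_")
    return groups, write_partitions(groups, scratch)
```

In an actual Spark job the tagging step would presumably run inside a map
over the distributed data and the grouping would be a shuffle by key; the
sketch only shows the shape of the data at each stage.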



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
