[ 
https://issues.apache.org/jira/browse/MADLIB-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16968621#comment-16968621
 ] 

Nikhil Kak commented on MADLIB-1378:
------------------------------------

Closed via https://github.com/apache/madlib/pull/439

> Preprocessor should evenly distribute data on an arbitrary number of segments
> -----------------------------------------------------------------------------
>
>                 Key: MADLIB-1378
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1378
>             Project: Apache MADlib
>          Issue Type: New Feature
>          Components: Deep Learning
>            Reporter: Yuhao Zhang
>            Priority: Major
>             Fix For: v1.17
>
>
> We need to implement a feature for the preprocessor to generate distribution 
> keys that ensure gpdb will distribute the data in a controlled way.
> We want to assign distribution keys such that all rows on a given segment 
> share a single distribution key, and that key is unique to that segment.
> Currently, `training_preprocessor_dl` and `validation_preprocessor_dl` 
> don't guarantee even distribution of the data among segments, especially 
> when the number of buffers is not much larger than the number of segments.
> We should fix the preprocessor so that it always distributes the data as 
> evenly as possible among the segments.
> Another problem is that training with too many segments often results in 
> slower accuracy convergence; the optimal number of segments will not 
> usually match the total number of segments in the cluster.  For this 
> reason or others, a user may wish to use only a subset of the segments 
> available.
> We should add a `num_segments` option to both preprocessors, and ensure that 
> data is distributed evenly among those segments.  It should throw an error if 
> the number of segments passed in is larger than the total number of segments 
> in the cluster, and default to using all segments when the option is omitted.
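
The behavior requested in the issue description above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code merged in the pull request: the function name `assign_buffers_to_segments` and its parameters `num_buffers` and `total_segments` are hypothetical, and the sketch assigns buffers to segment indexes round-robin. It does not show how a distribution key is chosen so that gpdb actually places a row on the intended segment.

```python
def assign_buffers_to_segments(num_buffers, total_segments, num_segments=None):
    """Return a target segment index for each buffer (hypothetical helper).

    num_buffers    -- number of packed buffers produced by the preprocessor
    total_segments -- number of segments in the cluster (assumed to be known)
    num_segments   -- number of segments to use; None means use all of them
    """
    if num_segments is None:
        # Default requested in the issue: use every segment.
        num_segments = total_segments
    if num_segments < 1 or num_segments > total_segments:
        # Error case requested in the issue: more segments than the cluster has.
        raise ValueError(
            "num_segments must be between 1 and %d, got %d"
            % (total_segments, num_segments)
        )
    # Round-robin assignment: per-segment buffer counts differ by at most one.
    return [buffer_id % num_segments for buffer_id in range(num_buffers)]


if __name__ == "__main__":
    print(assign_buffers_to_segments(7, 4))
    print(assign_buffers_to_segments(7, 4, num_segments=2))
```

With round-robin assignment the busiest segment holds at most one buffer more than the least busy one, which is the "as evenly as possible" property the issue asks for even when the number of buffers is close to the number of segments.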



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
