Yuhao Zhang created MADLIB-1378:
-----------------------------------

             Summary: Preprocessor should evenly distribute data on arbitrary number of segments
                 Key: MADLIB-1378
                 URL: https://issues.apache.org/jira/browse/MADLIB-1378
             Project: Apache MADlib
          Issue Type: New Feature
          Components: Deep Learning
            Reporter: Yuhao Zhang
             Fix For: v1.17


We need to implement a feature for the preprocessor to generate distribution keys that ensure GPDB distributes the data in a controlled way.

We want to assign distribution keys such that all rows placed on a given segment share a single distribution key, and that key is unique to the segment.
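
One way to obtain such keys, sketched below, is to probe GPDB's own hash distribution: load a range of candidate integers into a throwaway table distributed by that column, then read back one value per `gp_segment_id` (a real GPDB system column). The table and column names here are made up for illustration.

```sql
-- Hypothetical sketch: find one integer key per segment by probing
-- GPDB's hash distribution; table/column names are made up.
CREATE TEMP TABLE candidate_keys (key INT) DISTRIBUTED BY (key);
INSERT INTO candidate_keys SELECT generate_series(0, 9999);

-- gp_segment_id records where each row landed; keeping one key per
-- segment yields keys that hash to distinct segments.
CREATE TEMP TABLE seg_keys AS
SELECT DISTINCT ON (gp_segment_id) gp_segment_id AS seg_id, key
FROM candidate_keys
ORDER BY gp_segment_id, key;
```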

Currently, `training_preprocessor_dl` and `validation_preprocessor_dl` don't guarantee even distribution of the data among segments, especially when the number of buffers is not much larger than the number of segments.

We should fix the preprocessor so that it always distributes the data as evenly 
as possible among the segments.
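
A minimal sketch of the assignment step, assuming the per-segment keys from the probe above live in `seg_keys(seg_id, key)` and the packed buffers in `buffers(buffer_id, data)` (all hypothetical names): deal buffers round-robin across the keys, so no segment receives more than one buffer above any other.

```sql
-- Hypothetical sketch: round-robin buffers across per-segment keys.
CREATE TABLE packed_output AS
SELECT b.buffer_id,
       b.data,
       k.key AS __dist_key__
FROM (
    SELECT buffer_id, data,
           (row_number() OVER (ORDER BY buffer_id) - 1)
               % (SELECT count(*) FROM seg_keys) AS slot
    FROM buffers
) b
JOIN seg_keys k ON k.seg_id = b.slot
DISTRIBUTED BY (__dist_key__);
```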

Another problem is that training with too many segments often results in slower accuracy convergence; the optimal number of segments will not usually match the total number of segments in the cluster. For this reason or others, a user may wish to use only a subset of the available segments.

We should add a `num_segments` option to both preprocessors and ensure that data is distributed evenly among those segments. The option should default to using all segments, and should raise an error if the number of segments passed in is larger than the total number of segments in the cluster.
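
For illustration, a call with the proposed option might look like the sketch below. The first seven arguments follow the existing `training_preprocessor_dl` interface; the position of `num_segments` and the example table names are assumptions, not a final design.

```sql
-- Hypothetical usage sketch: num_segments as a new optional parameter.
SELECT madlib.training_preprocessor_dl(
    'cifar10_train',         -- source table (example name)
    'cifar10_train_packed',  -- output table (example name)
    'y',                     -- dependent variable
    'x',                     -- independent variable
    NULL,                    -- buffer_size: let the preprocessor choose
    255.0,                   -- normalizing_const
    NULL,                    -- num_classes: infer from the data
    8                        -- num_segments (proposed): pack onto 8 segments
);

-- Proposed validation: raise an error when num_segments exceeds the
-- number of primary segments, e.g. the count from
--   SELECT count(*) FROM gp_segment_configuration
--   WHERE role = 'p' AND content >= 0;
```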



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
