[
https://issues.apache.org/jira/browse/MADLIB-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Nikhil Kak closed MADLIB-1378.
------------------------------
Resolution: Fixed
> Preprocessor should evenly distribute data on an arbitrary number of segments
> -----------------------------------------------------------------------------
>
> Key: MADLIB-1378
> URL: https://issues.apache.org/jira/browse/MADLIB-1378
> Project: Apache MADlib
> Issue Type: New Feature
> Components: Deep Learning
> Reporter: Yuhao Zhang
> Priority: Major
> Fix For: v1.17
>
>
> We need to implement a feature for the preprocessor to generate distribution
> keys that ensure gpdb will distribute the data in a controlled way.
> We want to assign the distribution key such that all data on each segment has
> a unique distribution key common for all rows in that segment.
> Currently, `training_preprocessor_dl` and `validation_preprocessor_dl`
> do not guarantee an even distribution of the data among segments, especially
> when the number of buffers is not much larger than the number of segments.
> We should fix the preprocessor so that it always distributes the data as
> evenly as possible among the segments.
> Another problem is that training on too many segments often slows accuracy
> convergence: the optimal number of segments will usually not match the total
> number of segments in the cluster. For this reason or others, a user may
> wish to use only a subset of the segments available.
> We should add a `num_segments` option to both preprocessors and ensure that
> data is distributed evenly among those segments. The option should throw an
> error if the number of segments passed in is larger than the total number of
> segments in the cluster, and default to using all segments when omitted.
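The proposed `num_segments` behavior could be validated as sketched below (hypothetical helper names; the actual option handling would live in the preprocessor code):

```python
# Hypothetical sketch of the proposed num_segments option handling:
# default to all segments, reject values above the cluster's total.
def resolve_num_segments(num_segments, total_segments):
    """Return the number of segments the preprocessor should use."""
    if num_segments is None:
        # Option omitted: default to using every segment in the cluster.
        return total_segments
    if num_segments > total_segments:
        raise ValueError(
            "num_segments (%d) exceeds the %d segments in the cluster"
            % (num_segments, total_segments))
    return num_segments
```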
--
This message was sent by Atlassian Jira
(v8.3.4#803005)