To my understanding, this is not about fault tolerance, i.e., restarting when a worker/server fails, right?
I can help to review. Ping @Mu for advice.

2017-12-05 13:13 GMT-08:00 CodingCat <coding...@apache.org>:

> ping
>
> On Sat, Dec 2, 2017 at 10:04 AM, CodingCat <coding...@apache.org> wrote:
>
> > ping
> >
> > On Fri, Dec 1, 2017 at 12:18 AM, Nan Zhu <zhunanmcg...@gmail.com> wrote:
> >
> >> Hi, all
> >>
> >> I have been working on integrating MXNet with Spark in a more
> >> full-fledged manner.
> >>
> >> One of the most critical pre-conditions is to make parameter server in
> >> mxnet support multiple workers per process. I created the PR in
> >> https://github.com/dmlc/ps-lite/pull/121 (OK, sorry for being late....I
> >> should have finished it earlier)
> >>
> >> This PR includes some refactoring of those too long methods, to highlight
> >> the changes
> >>
> >> 1. https://github.com/dmlc/ps-lite/pull/112 includes the changes related
> >> to refactoring
> >>
> >> 2. https://github.com/CodingCat/ps-lite/pull/3/files includes the
> >> changes related to the key functionality
> >>
> >> 3. https://github.com/dmlc/ps-lite/pull/121 contains everything (Please
> >> review this one)
> >>
> >>
> >> I am not sure who is the current owner of ps-lite, please help to share
> >> your thoughts on the implementation. Only after this PR is merged and
> >> ps-lite version is synced in mxnet repo, I can file the successive PRs in
> >> mxnet
> >>
> >> Thank you very much!
> >>
> >> Nan
> >>
> >
>

-- 
Yizhi Liu
DMLC member
Amazon Web Services
Vancouver, Canada