[GitHub] [incubator-mxnet] anandj91 commented on issue #15124: [MXNET-1294] Priority-based parameter propagation for improved data parallel training throughput
anandj91 commented on issue #15124: [MXNET-1294] Priority-based parameter propagation for improved data parallel training throughput URL: https://github.com/apache/incubator-mxnet/pull/15124#issuecomment-542440290 I'm facing some design-level challenges in properly implementing priority-based update (P3) on top of the PushPull API. MXNet does simple load balancing before pushing or pulling key-values by splitting NDArrays equally across the parameter servers (PS). P3 requires a round-robin style parameter distribution, which means slicing a large NDArray into thousands of smaller ones. This is much more granular than the current default distribution strategy, and each PS would get more than one slice. The way MXNet and ps-lite are designed right now, ps-lite assumes that a single ZPush/ZPull/ZPushPull belongs to a single layer/NDArray. It also assumes that one slice belongs to only one PS. These assumptions need to be broken to implement P3. What I have done for now is add a round-robin (RR) distribution strategy alongside the default one, with a boolean flag to switch between the two. When the user chooses RR, KVStore treats each slice as a separate key-value pair. Otherwise it falls back to the default mode. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
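The difference between the two distribution strategies described in the comment above can be illustrated with a minimal sketch. This is not MXNet or ps-lite code: the function names, the `(server, start, end)` tuple layout, and the `slice_size` parameter are all hypothetical and chosen only for illustration.

```python
# Illustrative sketch only; not the MXNet/ps-lite implementation.
# All names here (split_equal, split_round_robin, slice_size) are hypothetical.

def split_equal(size, num_servers):
    """Default-style split: one contiguous slice per server, as equal as possible.

    Returns a list of (server, start, end) tuples with end exclusive.
    """
    base, extra = divmod(size, num_servers)
    slices = []
    start = 0
    for server in range(num_servers):
        # The first `extra` servers take one additional element.
        length = base + (1 if server < extra else 0)
        if length:
            slices.append((server, start, start + length))
        start += length
    return slices


def split_round_robin(size, num_servers, slice_size):
    """Round-robin split: fixed-size slices handed to servers in rotation.

    Each server may receive many slices, and each slice is treated as its
    own key-value pair.
    """
    slices = []
    for index, start in enumerate(range(0, size, slice_size)):
        end = min(start + slice_size, size)
        slices.append((index % num_servers, start, end))
    return slices


if __name__ == "__main__":
    print(split_equal(10, 3))
    print(split_round_robin(10, 3, 2))
```

With 10 elements and 3 servers, the equal split yields one slice per server, while the round-robin split with a slice size of 2 yields five slices, so servers 0 and 1 each hold two of them. That is the case in which the one-slice-per-server assumption no longer holds.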
anandj91 commented on issue #15124: [MXNET-1294] Priority-based parameter propagation for improved data parallel training throughput URL: https://github.com/apache/incubator-mxnet/pull/15124#issuecomment-511931055 This PR is waiting on https://github.com/apache/incubator-mxnet/pull/15559 for the PushPull API in KVStore.
anandj91 commented on issue #15124: [MXNET-1294] Priority-based parameter propagation for improved data parallel training throughput URL: https://github.com/apache/incubator-mxnet/pull/15124#issuecomment-509320881 @roywei The current implementation uses multiple `ThreadVar`s to specify the push/pull dependencies between slices. After some benchmarking on large models like VGG-19, I found that this causes a large overhead and training performance drops to 50%. Instead, I'm planning to introduce a new pushpull API that combines the push and pull of one slice. I had an offline discussion with @eric-haibin-lin and he is fine with this approach.
anandj91 commented on issue #15124: [MXNET-1294] Priority-based parameter propagation for improved data parallel training throughput URL: https://github.com/apache/incubator-mxnet/pull/15124#issuecomment-503795371 Modified the code to address the review comments. Sorry for the delay. I have added a new flag in config.mk to enable priority-based update (USE_PRIORITY_UPDATE). The flag is disabled by default. @eric-haibin-lin Can you please tell me how to add a new test case that builds the code with this flag turned on and runs the dist_kvstore test cases?
anandj91 commented on issue #15124: [MXNET-1294] Priority-based parameter propagation for improved data parallel training throughput URL: https://github.com/apache/incubator-mxnet/pull/15124#issuecomment-499566737 Is there a way to rerun the test cases without doing a git push?