[GitHub] [incubator-mxnet] anandj91 commented on issue #15124: [MXNET-1294] Priority-based parameter propagation for improved data parallel training throughput

2019-10-15 Thread GitBox
anandj91 commented on issue #15124: [MXNET-1294] Priority-based parameter 
propagation for improved data parallel training throughput
URL: https://github.com/apache/incubator-mxnet/pull/15124#issuecomment-542440290
 
 
   I'm facing some design-level challenges in properly implementing priority-based 
update (P3) on top of the PushPull API. MXNet does simple load balancing before 
pushing or pulling key-values by splitting NDArrays equally across the parameter 
servers. P3 requires a round-robin style parameter distribution, which means 
slicing a large NDArray into thousands of smaller ones. This is much more 
granular than the current default distribution strategy, and each PS would get 
more than one slice.
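
   To make the difference concrete, here is a minimal sketch of the two 
distribution styles in plain Python. It is only an illustration and not the 
actual MXNet or ps-lite code: the function names, the `slice_size` parameter, 
and the `(server, start, end)` tuple layout are all assumptions made for this 
example.

```python
def equal_split(length, num_servers):
    """Default-style split: one contiguous chunk per server (hypothetical helper)."""
    base, extra = divmod(length, num_servers)
    chunks, start = [], 0
    for server in range(num_servers):
        # The first `extra` servers take one additional element each.
        size = base + (1 if server < extra else 0)
        chunks.append((server, start, start + size))
        start += size
    return chunks


def round_robin_split(length, num_servers, slice_size):
    """RR-style split: many small slices dealt out to servers in turn (hypothetical helper)."""
    slices = []
    for index, start in enumerate(range(0, length, slice_size)):
        end = min(start + slice_size, length)
        # Slice i goes to server i mod num_servers, so a server can own several slices.
        slices.append((index % num_servers, start, end))
    return slices
```

   With 10 elements and 3 servers, `equal_split` yields three chunks, one per 
server, while `round_robin_split` with `slice_size=2` yields five slices and 
servers 0 and 1 each receive two of them.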
   
   With the way MXNet and ps-lite are designed right now, ps-lite assumes that a 
single ZPush/ZPull/ZPushPull belongs to a single layer/NDArray. It also assumes 
that one slice belongs to only one PS. These assumptions need to be broken to 
implement P3. What I have done so far is add a round-robin (RR) distribution 
strategy alongside the default one and use a boolean flag to switch between the 
two. When the user chooses RR, KVStore considers each slice a separate 
key-value pair. Otherwise it falls back to the default mode.
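
   A rough sketch of what "each slice as a separate key-value pair" could look 
like, again in plain Python rather than the real KVStore code. The function 
name `expand_keys`, the `use_round_robin` flag name, the `(key, index)` subkey 
scheme, and the default `slice_size` are hypothetical and chosen only for 
illustration.

```python
def expand_keys(key, length, use_round_robin=False, slice_size=1024):
    """Map one logical key to the key-value pairs that get pushed or pulled.

    Returns a dict of {subkey: (start, end)} element ranges. Illustrative only.
    """
    if not use_round_robin:
        # Default mode: the whole array stays under its single original key.
        return {key: (0, length)}
    pairs = {}
    for index, start in enumerate(range(0, length, slice_size)):
        # RR mode: every slice gets its own subkey and is handled independently.
        pairs[(key, index)] = (start, min(start + slice_size, length))
    return pairs
```

   In this sketch the flag only changes how many pairs a key expands into; the 
rest of the push/pull path would treat each pair the same way.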


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-mxnet] anandj91 commented on issue #15124: [MXNET-1294] Priority-based parameter propagation for improved data parallel training throughput

2019-07-16 Thread GitBox
anandj91 commented on issue #15124: [MXNET-1294] Priority-based parameter 
propagation for improved data parallel training throughput
URL: https://github.com/apache/incubator-mxnet/pull/15124#issuecomment-511931055
 
 
   This PR is waiting on https://github.com/apache/incubator-mxnet/pull/15559 
for the PushPull API in KVStore.




[GitHub] [incubator-mxnet] anandj91 commented on issue #15124: [MXNET-1294] Priority-based parameter propagation for improved data parallel training throughput

2019-07-08 Thread GitBox
anandj91 commented on issue #15124: [MXNET-1294] Priority-based parameter 
propagation for improved data parallel training throughput
URL: https://github.com/apache/incubator-mxnet/pull/15124#issuecomment-509320881
 
 
   @roywei The current implementation uses multiple `ThreadVar`s to specify the 
dependencies between the push and pull of each slice. After some benchmarking 
on large models like VGG-19, I found that this causes a large overhead and 
training performance drops to 50%.
   
   Instead, I'm planning to introduce a new pushpull API that combines the push 
and pull of one slice. I had an offline discussion with @eric-haibin-lin and he 
is fine with this approach.




[GitHub] [incubator-mxnet] anandj91 commented on issue #15124: [MXNET-1294] Priority-based parameter propagation for improved data parallel training throughput

2019-06-19 Thread GitBox
anandj91 commented on issue #15124: [MXNET-1294] Priority-based parameter 
propagation for improved data parallel training throughput
URL: https://github.com/apache/incubator-mxnet/pull/15124#issuecomment-503795371
 
 
   Modified the code to address the review comments. Sorry for the delay.
   I have added a new flag in config.mk to enable priority-based update 
(USE_PRIORITY_UPDATE). The flag is disabled by default. @eric-haibin-lin Can 
you please tell me how to add a new test case that builds the code with this 
flag turned on and runs the dist_kvstore test cases?




[GitHub] [incubator-mxnet] anandj91 commented on issue #15124: [MXNET-1294] Priority-based parameter propagation for improved data parallel training throughput

2019-06-06 Thread GitBox
anandj91 commented on issue #15124: [MXNET-1294] Priority-based parameter 
propagation for improved data parallel training throughput
URL: https://github.com/apache/incubator-mxnet/pull/15124#issuecomment-499566737
 
 
   Is there a way to rerun the test cases without doing a git push?

