[GitHub] zhreshold commented on issue #11872: "socket.error: [Errno 111] Connection refused" while training with multiple workers

2018-08-16 Thread GitBox
zhreshold commented on issue #11872: "socket.error: [Errno 111] Connection 
refused" while training with multiple workers
URL: 
https://github.com/apache/incubator-mxnet/issues/11872#issuecomment-413576178
 
 
   That should be sized according to the input batch_size, data shape, and 
number of workers. Several GB is usually recommended for multi-GPU training.
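
   As a rough illustration of how those three factors combine, here is a 
back-of-the-envelope estimate. The function name, the float32 element size, 
and the "two batches in flight per worker" factor are assumptions made for 
this sketch, not values taken from MXNet.

```python
from math import prod


def estimate_shm_bytes(batch_size, sample_shape, num_workers,
                       bytes_per_element=4, batches_in_flight_per_worker=2):
    """Rough upper bound on shared memory held by in-flight batches.

    bytes_per_element=4 assumes float32 data; batches_in_flight_per_worker
    is a hypothetical prefetch depth, not an MXNet setting.
    """
    batch_bytes = batch_size * prod(sample_shape) * bytes_per_element
    return batch_bytes * num_workers * batches_in_flight_per_worker


# Example: batches of 128 float32 images shaped 3x224x224, with 8 workers.
needed = estimate_shm_bytes(128, (3, 224, 224), 8)
print(f"estimated shared memory: {needed / 2**30:.2f} GiB")
```

   Under these assumptions a common ImageNet-style setup already needs more 
than 1 GiB, which is in line with the "several GB" rule of thumb.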


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] zhreshold commented on issue #11872: "socket.error: [Errno 111] Connection refused" while training with multiple workers

2018-08-15 Thread GitBox
zhreshold commented on issue #11872: "socket.error: [Errno 111] Connection 
refused" while training with multiple workers
URL: 
https://github.com/apache/incubator-mxnet/issues/11872#issuecomment-413429662
 
 
   @ifeherva Use `docker run --shm-size xxx`. If it is not specified, Docker 
gives the container only a small default `/dev/shm` (64 MB).




[GitHub] zhreshold commented on issue #11872: "socket.error: [Errno 111] Connection refused" while training with multiple workers

2018-08-13 Thread GitBox
zhreshold commented on issue #11872: "socket.error: [Errno 111] Connection 
refused" while training with multiple workers
URL: 
https://github.com/apache/incubator-mxnet/issues/11872#issuecomment-412667553
 
 
   With #11908 merged, I am closing this for now. Feel free to ping me if 
the problem persists.




[GitHub] zhreshold commented on issue #11872: "socket.error: [Errno 111] Connection refused" while training with multiple workers

2018-07-27 Thread GitBox
zhreshold commented on issue #11872: "socket.error: [Errno 111] Connection 
refused" while training with multiple workers
URL: 
https://github.com/apache/incubator-mxnet/issues/11872#issuecomment-408510384
 
 
   I have figured out that the pre-fetch strategy of the data loader is too 
aggressive, which might cause the related issue with shared memory.
   The fix is included in https://github.com/apache/incubator-mxnet/pull/11908
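
   To illustrate what a less aggressive pre-fetch looks like, here is a 
generic sketch that caps the number of batches requested ahead of the 
consumer. It uses threads from the standard library rather than MXNet worker 
processes, and `bounded_prefetch`, `load_batch`, and `max_in_flight` are 
hypothetical names. It is not the change made in the pull request above.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def bounded_prefetch(load_batch, num_batches, max_in_flight=2, num_workers=2):
    """Yield load_batch(i) for i in range(num_batches), in order.

    At most max_in_flight batches are pending or buffered at any time,
    so memory use stays bounded no matter how slow the consumer is.
    """
    if max_in_flight < 1:
        raise ValueError("max_in_flight must be at least 1")
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        pending = deque()
        next_index = 0
        while next_index < num_batches or pending:
            # Top up the queue, but never beyond the cap.
            while next_index < num_batches and len(pending) < max_in_flight:
                pending.append(pool.submit(load_batch, next_index))
                next_index += 1
            # Hand over the oldest batch before requesting any more.
            yield pending.popleft().result()


# Toy loader: each "batch" is a short list filled with its own index.
batches = list(bounded_prefetch(lambda i: [i] * 4, num_batches=5))
print(batches)
```

   The trade-off is throughput against memory: a smaller cap keeps fewer 
batches alive at once but gives the workers less room to run ahead.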
   
   
   




[GitHub] zhreshold commented on issue #11872: "socket.error: [Errno 111] Connection refused" while training with multiple workers

2018-07-26 Thread GitBox
zhreshold commented on issue #11872: "socket.error: [Errno 111] Connection 
refused" while training with multiple workers
URL: 
https://github.com/apache/incubator-mxnet/issues/11872#issuecomment-408217525
 
 
   Temporary solutions:
   
   1. Increase shared memory if it's too small. You can use `df -h /dev/shm` 
to check the shared memory size and usage. To raise the limit, edit 
`/etc/sysctl.conf` and add or edit a line such as 
`kernel.shmmax = 4294967296` to allow up to 4 GB of shared memory.
   2. Reduce `num_workers`. If you set `num_workers = 0`, no multiprocessing 
workers are used, but data loading is slower.
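
   Both workarounds can be combined in a small pre-flight check. This is a 
minimal sketch, assuming a Linux host with `/dev/shm` mounted. The helper 
names and the 256 MiB per-worker budget are hypothetical and should be 
replaced with an estimate for your own data.

```python
import os
import shutil


def free_shm_bytes(path="/dev/shm"):
    """Free space on the shared-memory mount, or None if it is absent."""
    if not os.path.isdir(path):
        return None
    return shutil.disk_usage(path).free


def pick_num_workers(requested, free_bytes, bytes_per_worker):
    """Largest worker count <= requested whose budget fits in free_bytes.

    A result of 0 means: load data in the main process (slower, but it
    does not rely on worker processes sharing memory).
    """
    if bytes_per_worker <= 0:
        raise ValueError("bytes_per_worker must be positive")
    return max(0, min(requested, free_bytes // bytes_per_worker))


free = free_shm_bytes()
if free is not None:
    # 256 MiB per worker is an illustrative budget, not an MXNet constant.
    workers = pick_num_workers(requested=8, free_bytes=free,
                               bytes_per_worker=256 * 2**20)
    print(f"/dev/shm free: {free / 2**30:.2f} GiB -> num_workers = {workers}")
```

   The resulting value can then be passed as `num_workers` when the data 
loader is constructed.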




[GitHub] zhreshold commented on issue #11872: "socket.error: [Errno 111] Connection refused" while training with multiple workers

2018-07-24 Thread GitBox
zhreshold commented on issue #11872: "socket.error: [Errno 111] Connection 
refused" while training with multiple workers
URL: 
https://github.com/apache/incubator-mxnet/issues/11872#issuecomment-407532503
 
 
   This is related to a recent change in which we switched from shared memory 
to file descriptors on Linux for inter-process communication. We are still 
investigating solutions.
   Of course, we can add an option to enable either mechanism as a fallback.

