[GitHub] zhreshold commented on issue #11872: "socket.error: [Errno 111] Connection refused" while training with multiple workers
zhreshold commented on issue #11872: "socket.error: [Errno 111] Connection refused" while training with multiple workers
URL: https://github.com/apache/incubator-mxnet/issues/11872#issuecomment-413576178

The shared memory size should be chosen to match the input batch size, data shape, and number of workers. Several GB is usually recommended for multi-GPU training.

This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at: us...@infra.apache.org

With regards,
Apache Git Services
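As a rough illustration of the sizing advice in that comment, the sketch below estimates how much shared memory a multi-worker data loader might hold at once. The function name, the `prefetch_per_worker` parameter, and the idea that each worker keeps a fixed number of batches in flight are all assumptions made for this example. They do not describe MXNet's actual accounting.

```python
import math

def estimate_shm_bytes(batch_size, sample_shape, num_workers,
                       bytes_per_element=4, prefetch_per_worker=2):
    """Rough estimate of shared memory held by in-flight batches.

    Hypothetical model: each worker keeps `prefetch_per_worker` batches
    in shared memory at once. This is not MXNet's real accounting.
    """
    elements_per_sample = math.prod(sample_shape)
    batch_bytes = batch_size * elements_per_sample * bytes_per_element
    # With zero workers the estimate is zero: no batches cross processes.
    return batch_bytes * num_workers * prefetch_per_worker

# Example: batches of 128 float32 images of shape 3x224x224, 8 workers.
needed = estimate_shm_bytes(128, (3, 224, 224), num_workers=8)
print(f"estimated shared memory: {needed / 2**30:.2f} GiB")
```

Under these assumptions the estimate grows linearly with batch size, sample size, and worker count, which is consistent with the recommendation of several GB for multi-GPU jobs.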
zhreshold commented on issue #11872: "socket.error: [Errno 111] Connection refused" while training with multiple workers
URL: https://github.com/apache/incubator-mxnet/issues/11872#issuecomment-413429662

@ifeherva Use `docker run --shm-size xxx`. If it is not specified, Docker provides only a small default amount of shared memory (64 MB).
zhreshold commented on issue #11872: "socket.error: [Errno 111] Connection refused" while training with multiple workers
URL: https://github.com/apache/incubator-mxnet/issues/11872#issuecomment-412667553

With #11908 merged, I am closing this for now. Feel free to ping me if the problem persists.
zhreshold commented on issue #11872: "socket.error: [Errno 111] Connection refused" while training with multiple workers
URL: https://github.com/apache/incubator-mxnet/issues/11872#issuecomment-408510384

I have figured out that the pre-fetch strategy of the data loader is too aggressive, which may be what triggers this shared memory issue. The fix is included in https://github.com/apache/incubator-mxnet/pull/11908
zhreshold commented on issue #11872: "socket.error: [Errno 111] Connection refused" while training with multiple workers
URL: https://github.com/apache/incubator-mxnet/issues/11872#issuecomment-408217525

Temporary solutions:
1. Increase shared memory if it is too small. Use `df -h /dev/shm` to check the shared memory size and usage. To raise the limit, edit `/etc/sysctl.conf` and add or edit a line such as `kernel.shmmax = 4294967296` to allow a maximum of 4 GB of shared memory.
2. Reduce `num_workers`. If you set `num_workers = 0`, no multiprocess workers are used, but data loading is slower.
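The two workarounds above can be combined in a small pre-flight check. This is a minimal sketch, not part of MXNet: `shm_free_bytes`, `pick_num_workers`, and the per-worker footprint of 256 MiB are hypothetical names and values chosen for illustration. Only `shutil.disk_usage` and `os.path.isdir` are standard-library calls.

```python
import os
import shutil

def shm_free_bytes(path="/dev/shm"):
    """Return free bytes on the shared-memory mount, or None if it is absent."""
    if not os.path.isdir(path):
        return None
    return shutil.disk_usage(path).free

def pick_num_workers(requested, free_bytes, bytes_per_worker):
    """Cap the worker count so the estimated footprint fits in free_bytes."""
    if free_bytes is None or bytes_per_worker <= 0:
        return requested  # nothing to check against; keep the requested value
    return max(0, min(requested, free_bytes // bytes_per_worker))

free = shm_free_bytes()
# 256 MiB per worker is an illustrative guess, not a measured value.
workers = pick_num_workers(requested=8, free_bytes=free,
                           bytes_per_worker=256 * 2**20)
print(f"free shared memory: {free}, num_workers to use: {workers}")
```

The resulting value could then be passed as `num_workers` to the data loader. A result of 0 corresponds to the slower single-process fallback described in point 2.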
zhreshold commented on issue #11872: "socket.error: [Errno 111] Connection refused" while training with multiple workers
URL: https://github.com/apache/incubator-mxnet/issues/11872#issuecomment-407532503

This is related to a recent change in which we switched from shared memory to file descriptors on Linux for inter-process communication. We are still investigating solutions. We could also add an option to enable either approach as a fallback.