[GitHub] zhreshold commented on issue #11872: "socket.error: [Errno 111] Connection refused" while training with multiple workers

2018-08-16 Thread GitBox
URL: https://github.com/apache/incubator-mxnet/issues/11872#issuecomment-413576178 That should align with the (input batch_size, data shape, worker number), usually

[GitHub] zhreshold commented on issue #11872: "socket.error: [Errno 111] Connection refused" while training with multiple workers

2018-08-15 Thread GitBox
URL: https://github.com/apache/incubator-mxnet/issues/11872#issuecomment-413429662 @ifeherva `docker run --shm-size xxx`, if not specified, docker has no shared

[GitHub] zhreshold commented on issue #11872: "socket.error: [Errno 111] Connection refused" while training with multiple workers

2018-08-13 Thread GitBox
URL: https://github.com/apache/incubator-mxnet/issues/11872#issuecomment-412667553 With #11908 merged, I am closing this for now. Feel free to ping me if it still

[GitHub] zhreshold commented on issue #11872: "socket.error: [Errno 111] Connection refused" while training with multiple workers

2018-07-27 Thread GitBox
URL: https://github.com/apache/incubator-mxnet/issues/11872#issuecomment-408510384 I have figured out that the pre-fetch strategy for the data loader is too aggressive

[GitHub] zhreshold commented on issue #11872: "socket.error: [Errno 111] Connection refused" while training with multiple workers

2018-07-26 Thread GitBox
URL: https://github.com/apache/incubator-mxnet/issues/11872#issuecomment-408217525 Temporary solutions: 1. Increase shared memory if it's too small; you can use

[GitHub] zhreshold commented on issue #11872: "socket.error: [Errno 111] Connection refused" while training with multiple workers

2018-07-24 Thread GitBox
URL: https://github.com/apache/incubator-mxnet/issues/11872#issuecomment-407532503 This is related to a recent change in which we switched from shared memory to file