zhreshold commented on issue #11872: "socket.error: [Errno 111] Connection
refused" while training with multiple workers
URL:
https://github.com/apache/incubator-mxnet/issues/11872#issuecomment-413576178
That should align with the (input batch_size, data shape, worker number),
usually
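
A rough sketch of how such a sizing could be estimated, assuming the quantity being sized is the shared-memory budget (the comment above is truncated, so this is an assumption). The function name, the `itemsize` default, and the `batches_per_worker` factor are hypothetical and not taken from MXNet:

```python
import math


def estimate_shm_bytes(batch_size, data_shape, num_workers,
                       itemsize=4, batches_per_worker=2):
    """Rough upper bound on bytes of batch data held in shared memory.

    Assumptions (not from the original thread): each worker keeps
    `batches_per_worker` batches in flight, and every element takes
    `itemsize` bytes (4 for float32).
    """
    elements_per_sample = math.prod(data_shape)
    bytes_per_batch = batch_size * elements_per_sample * itemsize
    return bytes_per_batch * num_workers * batches_per_worker


if __name__ == "__main__":
    need = estimate_shm_bytes(32, (3, 224, 224), num_workers=4)
    print(f"estimated shared memory: {need / 2**20:.0f} MiB")
```

Labels and any per-sample overhead are ignored, so the result is best read as an order-of-magnitude guide.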
zhreshold commented on issue #11872: "socket.error: [Errno 111] Connection
refused" while training with multiple workers
URL:
https://github.com/apache/incubator-mxnet/issues/11872#issuecomment-413429662
@ifeherva `docker run --shm-size xxx`; if not specified, Docker has no
shared
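
To see how much shared memory a container actually has, a minimal check could look like the sketch below. It assumes a Linux system where shared memory is mounted at `/dev/shm`; the helper name `shm_usage` is hypothetical:

```python
import os
import shutil


def shm_usage(path="/dev/shm"):
    """Return (total, free) in bytes for the given mount, or None if absent."""
    if not os.path.isdir(path):
        return None
    usage = shutil.disk_usage(path)
    return usage.total, usage.free


if __name__ == "__main__":
    info = shm_usage()
    if info is None:
        print("no /dev/shm mount found")
    else:
        total, free = info
        print(f"/dev/shm: total={total / 2**20:.0f} MiB, "
              f"free={free / 2**20:.0f} MiB")
```

Running this inside the container before starting training shows whether the `--shm-size` value took effect.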
zhreshold commented on issue #11872: "socket.error: [Errno 111] Connection
refused" while training with multiple workers
URL:
https://github.com/apache/incubator-mxnet/issues/11872#issuecomment-412667553
With #11908 merged, I am closing this for now. Feel free to ping me if
it still
zhreshold commented on issue #11872: "socket.error: [Errno 111] Connection
refused" while training with multiple workers
URL:
https://github.com/apache/incubator-mxnet/issues/11872#issuecomment-408510384
I have figured out that the pre-fetch strategy for the data loader is too
aggressive
zhreshold commented on issue #11872: "socket.error: [Errno 111] Connection
refused" while training with multiple workers
URL:
https://github.com/apache/incubator-mxnet/issues/11872#issuecomment-408217525
Temporary solutions:
1. Increase shared memory if it's too small; you can use
zhreshold commented on issue #11872: "socket.error: [Errno 111] Connection
refused" while training with multiple workers
URL:
https://github.com/apache/incubator-mxnet/issues/11872#issuecomment-407532503
This is related to a recent change in which we switched from shared memory to file