ThomasDelteil commented on a change in pull request #10628: [MXNET-342] Fix the multi worker Dataloader
URL: https://github.com/apache/incubator-mxnet/pull/10628#discussion_r183161704
########## File path: python/mxnet/gluon/data/dataset.py ##########

```diff
@@ -173,8 +173,15 @@ class RecordFileDataset(Dataset):
         Path to rec file.
     """
     def __init__(self, filename):
-        idx_file = os.path.splitext(filename)[0] + '.idx'
-        self._record = recordio.MXIndexedRecordIO(idx_file, filename, 'r')
+        self._filename = filename
+        self.reload_recordfile()
+
+    def reload_recordfile(self):
+        """
+        Reload the record file.
+        """
+        idx_file = os.path.splitext(self._filename)[0] + '.idx'
+        self._record = recordio.MXIndexedRecordIO(idx_file, self._filename, 'r')
```

Review comment:

OK, digging a bit more: it seems that the `multiprocessing` package does not close file descriptors, since it simply calls `os.fork()`. I have updated the description of the PR to reflect the issue.

tl;dr: a *file description* keeps track of the current byte offset within the file. When forking, each child process gets a duplicate of the original *file descriptor*, but all the duplicates refer to the same underlying file description. When the workers concurrently move that shared offset, they cause a crash.
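The shared-offset behavior described above can be demonstrated with a minimal sketch (not part of the PR; the scratch-file setup is illustrative). A descriptor duplicated by `os.fork()` shares one open file description with the parent, so a read in the child moves the parent's offset too:

```python
import os
import tempfile

# Write a small scratch file to read from.
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(b"0123456789")
tmp.close()

# Use an unbuffered OS-level descriptor so no Python read-ahead
# obscures the underlying offset.
fd = os.open(tmp.name, os.O_RDONLY)

pid = os.fork()
if pid == 0:
    # Child: reading 4 bytes advances the *shared* file description's offset.
    os.read(fd, 4)
    os._exit(0)

os.waitpid(pid, 0)

# Parent never read, yet its offset moved, because the duplicated
# descriptor points at the same open file description as the child's.
offset = os.lseek(fd, 0, os.SEEK_CUR)
print(offset)  # → 4

os.close(fd)
os.unlink(tmp.name)
```

This is why the PR reopens the record file per worker: each process that calls `reload_recordfile()` gets its own file description with an independent offset, so concurrent seeks no longer race.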