[GitHub] [incubator-mxnet] TristonC edited a comment on issue #20959: GPU memory leak when using gluon.data.DataLoader with num_workers>0
TristonC edited a comment on issue #20959: URL: https://github.com/apache/incubator-mxnet/issues/20959#issuecomment-1074580675

I wonder whether you might be interested in digging a little deeper, @ann-qin-lu. It seems the current gluon DataLoader uses fork to start its worker processes in the multiprocessing.Pool(..) call (as it is the default on Unix-like systems). That might be a problem for this issue, as the child process inherits everything from its parent process. It might be a good idea to use spawn instead of fork in this function. Unfortunately, I ran into an issue that blocks my test of multiprocessing.get_context('spawn').Pool(...).

```bash
Traceback (most recent call last):
  File "", line 1, in
  File "/usr/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/usr/lib/python3.8/multiprocessing/spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
  File "/opt/mxnet/python/mxnet/gluon/data/dataloader.py", line 58, in rebuild_ndarray
    return nd.NDArray(nd.ndarray._new_from_shared_mem(pid, fd, shape, dtype))
  File "/opt/mxnet/python/mxnet/ndarray/ndarray.py", line 193, in _new_from_shared_mem
    check_call(_LIB.MXNDArrayCreateFromSharedMemEx(
  File "/opt/mxnet/python/mxnet/base.py", line 246, in check_call
    raise get_last_ffi_error()
mxnet.base.MXNetError: Traceback (most recent call last):
  File "../src/storage/./cpu_shared_storage_manager.h", line 179
MXNetError: Check failed: ptr != ((void *) -1) (0x vs. 0x) : Failed to map shared memory. mmap failed with error Permission denied
```

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@mxnet.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@mxnet.apache.org For additional commands, e-mail: issues-h...@mxnet.apache.org
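For readers unfamiliar with the start methods mentioned in the comment above, here is a minimal sketch using only the Python standard library. It does not involve MXNet and does not reproduce or work around the shared-memory error in the traceback; it only shows how a pool can be created from an explicit context rather than the platform default. The helper name `pick_context` is hypothetical.

```python
import multiprocessing as mp


def pick_context(preferred="spawn"):
    """Return a context for `preferred`, or the platform default if unavailable.

    `pick_context` is a hypothetical helper for illustration only.
    """
    if preferred in mp.get_all_start_methods():
        return mp.get_context(preferred)
    return mp.get_context()


if __name__ == "__main__":
    # With "spawn", each worker starts from a fresh interpreter instead of
    # inheriting a copy of the parent's memory as it would with "fork".
    ctx = pick_context("spawn")
    with ctx.Pool(processes=2) as pool:
        # A builtin is used as the task so it can be pickled by reference.
        results = pool.map(abs, [-2, -1, 0, 1])
    print(ctx.get_start_method(), results)
```

The `if __name__ == "__main__":` guard matters with spawn, because each worker re-imports the main module and must not start a pool of its own.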
[GitHub] [incubator-mxnet] DickJC123 commented on issue #20967: Unittest failures seen in test_dnnl
DickJC123 commented on issue #20967: URL: https://github.com/apache/incubator-mxnet/issues/20967#issuecomment-1074161900

Thank you very much for this thorough response! I look forward to smoother CI pipeline runs once these fixes are merged.

On an unrelated point, I notice the dnnl tests push the complete Python3-onednn-cpu job runtime out to 2hr, 15min. For example, there are these tests:

```
[2022-03-17T23:13:35.468Z] 900.98s call tests/python/dnnl/test_quantization_dnnl.py::test_quantize_gluon_with_forward
[2022-03-17T23:13:35.468Z] 404.96s call tests/python/dnnl/test_dnnl.py::test_Deconvolution
[2022-03-17T23:13:35.468Z] 232.17s call tests/python/dnnl/test_dnnl.py::test_convolution
[2022-03-17T23:13:35.468Z] 220.28s call tests/python/dnnl/test_dnnl.py::test_pooling
```

Is it possible to trim down the runtime of these tests, or perhaps run the tests with xdist on multiple workers (each with fewer cores, presumably)?
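The xdist suggestion above could look roughly like the following sketch. It only builds a command line and an environment and does not run anything. The `-n` option comes from the pytest-xdist plugin and `--durations` is a standard pytest option; limiting threads per worker through `OMP_NUM_THREADS` is an assumption about how cores would be divided, and the helper name `xdist_command` is hypothetical, not part of the MXNet CI scripts.

```python
import os


def xdist_command(test_path, workers, total_cores=None):
    """Build (env, cmd) for a pytest-xdist run with cores split across workers.

    Hypothetical helper: it assumes OMP_NUM_THREADS is what limits the
    cores used by each worker process.
    """
    total = total_cores or os.cpu_count() or 1
    threads_per_worker = max(1, total // workers)
    env = {"OMP_NUM_THREADS": str(threads_per_worker)}
    cmd = ["python", "-m", "pytest", "-n", str(workers),
           "--durations=10", test_path]
    return env, cmd


if __name__ == "__main__":
    env, cmd = xdist_command("tests/python/dnnl", workers=4, total_cores=16)
    print(env, " ".join(cmd))
```

To launch it, the returned environment would be merged into `os.environ` and passed to `subprocess.run(cmd, env=...)`.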
[GitHub] [incubator-mxnet] TristonC edited a comment on issue #20959: GPU memory leak when using gluon.data.DataLoader with num_workers>0
TristonC edited a comment on issue #20959: URL: https://github.com/apache/incubator-mxnet/issues/20959#issuecomment-1074140831

Thanks @ann-qin-lu for your update. I will raise this issue with the MXNet team soon.
[GitHub] [incubator-mxnet] bartekkuncer edited a comment on issue #20967: Unittest failures seen in test_dnnl
bartekkuncer edited a comment on issue #20967: URL: https://github.com/apache/incubator-mxnet/issues/20967#issuecomment-1073966623

Hi @DickJC123,

1. https://jenkins.mxnet-ci.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-20965/5/pipeline/288/ test_fc_subgraph.py::test_fc_transpose[mxnet.numpy-int8-True-data_shape2] - Not related to the oneDNN upgrade. @anko-intel analyzed it, and here is the PR with the fix: https://github.com/apache/incubator-mxnet/pull/20969
2. https://jenkins.mxnet-ci.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-20965/2/pipeline/ (test_conv_subgraph.py::test_pos_conv_act_add[True-gelu-True-data_shape1]) AND https://jenkins.mxnet-ci.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-20965/3/pipeline/288 (test_conv_subgraph.py::test_pos_conv_act_add[True-leakyrelu-True-data_shape1]) - These two suffer from reading data from a register that was neither updated nor zeroed, during convolution with fewer than 4 input channels and blocked weights (https://github.com/apache/incubator-mxnet/issues/20826). The full fix will arrive in v2.6 of oneDNN; until then we will force plain format (at least on this axis) so that the convolution works properly and the tests pass. PR: https://github.com/apache/incubator-mxnet/pull/20970.
[GitHub] [incubator-mxnet] bartekkuncer commented on issue #20826: tests/python/dnnl/subgraphs/test_conv_subgraph.py::test_pos_conv_...[...-data_shape1] fail with oneDNN v2.4+
bartekkuncer commented on issue #20826: URL: https://github.com/apache/incubator-mxnet/issues/20826#issuecomment-1073960486

After further investigation, it turned out that the real problem was reading from a register that had been neither updated nor cleared, during convolution with weights stored in blocked format. The problem does not occur when the number of input channels is 4 or higher, because the full register is then overwritten. A fix for this issue is coming in v2.6 of oneDNN. The original shapes have been restored in the test.

There was a temporary fix in place (https://github.com/apache/incubator-mxnet/pull/20847) that forced blocking the weights by 8 instead of 16, but it turned out only to make the problem occur less often. Therefore, until v2.6 of oneDNN arrives, weights will not be blocked for convolutions with fewer than 4 input channels: https://github.com/apache/incubator-mxnet/pull/20970.
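The failure mode described above can be illustrated with a toy model in plain Python. This is not oneDNN's kernel code: the 4-lane "register", the helper names `load_channels` and `dot`, and the all-ones weights are assumptions chosen only to show why a partially written register gives a wrong result with fewer than 4 input channels and a correct one with 4.

```python
LANES = 4  # hypothetical register width for this toy model


def load_channels(register, values, clear_first):
    """Write `values` into the leading lanes, optionally zeroing all lanes first."""
    if clear_first:
        register[:] = [0.0] * LANES
    register[:len(values)] = values
    return register


def dot(register, weights):
    """Multiply-accumulate over every lane, as a full-width vector op would."""
    return sum(r * w for r, w in zip(register, weights))


stale = [9.0, 9.0, 9.0, 9.0]    # leftovers from an earlier computation
weights = [1.0, 1.0, 1.0, 1.0]

# Three input channels: the fourth lane keeps its stale value unless cleared.
wrong = dot(load_channels(list(stale), [1.0, 2.0, 3.0], clear_first=False), weights)
right = dot(load_channels(list(stale), [1.0, 2.0, 3.0], clear_first=True), weights)

# Four input channels: every lane is overwritten, so no clearing is needed.
full = dot(load_channels(list(stale), [1.0, 2.0, 3.0, 4.0], clear_first=False), weights)

print(wrong, right, full)  # 15.0 6.0 10.0
```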
[GitHub] [incubator-mxnet] pribadihcr commented on issue #13146: Feature request: mx.init.TruncatedNormal
pribadihcr commented on issue #13146: URL: https://github.com/apache/incubator-mxnet/issues/13146#issuecomment-1073676825 +1