[GitHub] [incubator-mxnet] TristonC edited a comment on issue #20959: GPU memory leak when using gluon.data.DataLoader with num_workers>0

2022-03-21 Thread GitBox


TristonC edited a comment on issue #20959:
URL: 
https://github.com/apache/incubator-mxnet/issues/20959#issuecomment-1074580675


   I wonder if you might be interested in digging a little deeper, @ann-qin-lu. 
It seems the current gluon DataLoader uses fork to start its worker processes in 
the multiprocessing.Pool(..) call (fork is the default start method on Linux). 
That might be a problem for this issue, as the child process inherits everything 
from its parent process. It might be a good idea to use spawn instead of fork in 
this function. Unfortunately, I ran into an issue that blocks my test of 
multiprocessing.get_context('spawn').Pool(...):
   ```bash
   Traceback (most recent call last):
     File "", line 1, in
     File "/usr/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
       exitcode = _main(fd, parent_sentinel)
     File "/usr/lib/python3.8/multiprocessing/spawn.py", line 126, in _main
       self = reduction.pickle.load(from_parent)
     File "/opt/mxnet/python/mxnet/gluon/data/dataloader.py", line 58, in rebuild_ndarray
       return nd.NDArray(nd.ndarray._new_from_shared_mem(pid, fd, shape, dtype))
     File "/opt/mxnet/python/mxnet/ndarray/ndarray.py", line 193, in _new_from_shared_mem
       check_call(_LIB.MXNDArrayCreateFromSharedMemEx(
     File "/opt/mxnet/python/mxnet/base.py", line 246, in check_call
       raise get_last_ffi_error()
   mxnet.base.MXNetError: Traceback (most recent call last):
     File "../src/storage/./cpu_shared_storage_manager.h", line 179
   MXNetError: Check failed: ptr != ((void *) -1) (0x vs. 0x) : Failed to map shared memory. mmap failed with error Permission denied
   ```
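
For readers unfamiliar with the two start methods, here is a minimal sketch of how a worker pool can be created from an explicit multiprocessing context. It uses only the Python standard library and is not the gluon DataLoader code; the helper names `pool_context`, `run_in_pool` and `square` are hypothetical and exist only for this illustration.

```python
import multiprocessing


def square(x):
    # Defined at module level so that a spawned child can import it by name.
    return x * x


def pool_context(start_method="spawn"):
    """Return a multiprocessing context for the requested start method.

    Hypothetical helper for illustration; raises ValueError if the platform
    does not offer the requested method.
    """
    if start_method not in multiprocessing.get_all_start_methods():
        raise ValueError(f"unsupported start method: {start_method!r}")
    return multiprocessing.get_context(start_method)


def run_in_pool(values, num_workers=2, start_method="spawn"):
    """Map `square` over `values` in a pool created with `start_method`.

    With "fork" the workers start as copies of the parent process; with
    "spawn" each worker is a fresh interpreter that re-imports this module.
    """
    ctx = pool_context(start_method)
    with ctx.Pool(num_workers) as pool:
        return pool.map(square, values)
```

With spawn, `run_in_pool([1, 2, 3])` has to be called from a script under an `if __name__ == "__main__":` guard, because every worker re-imports the main module. Anything sent to a spawned worker must also be picklable, which is the step that fails in the traceback above.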


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@mxnet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@mxnet.apache.org
For additional commands, e-mail: issues-h...@mxnet.apache.org




[GitHub] [incubator-mxnet] DickJC123 commented on issue #20967: Unittest failures seen in test_dnnl

2022-03-21 Thread GitBox


DickJC123 commented on issue #20967:
URL: 
https://github.com/apache/incubator-mxnet/issues/20967#issuecomment-1074161900


   Thank you very much for this thorough response!  I look forward to 
smoother CI pipeline runs once these fixes are merged.
   
   On an unrelated point, I notice the dnnl tests push out the complete 
Python3-onednn-cpu job runtime to 2hr, 15min.  For example, there are these 
tests:
   ```
   [2022-03-17T23:13:35.468Z] 900.98s call tests/python/dnnl/test_quantization_dnnl.py::test_quantize_gluon_with_forward
   [2022-03-17T23:13:35.468Z] 404.96s call tests/python/dnnl/test_dnnl.py::test_Deconvolution
   [2022-03-17T23:13:35.468Z] 232.17s call tests/python/dnnl/test_dnnl.py::test_convolution
   [2022-03-17T23:13:35.468Z] 220.28s call tests/python/dnnl/test_dnnl.py::test_pooling
   ```
   Is it possible to trim down the runtime of these tests, or perhaps run the 
tests with xdist on multiple workers (each with fewer cores presumably)?
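
To put a number on how much these four tests contribute, the log lines quoted above can be parsed and summed. This is only a sketch: the regular expression assumes the `[timestamp] <seconds>s call <test id>` layout shown in the excerpt, and the names `parse_durations` and `total_seconds` are hypothetical.

```python
import re

# The four log lines quoted in the comment above.
DURATION_LINES = """\
[2022-03-17T23:13:35.468Z] 900.98s call tests/python/dnnl/test_quantization_dnnl.py::test_quantize_gluon_with_forward
[2022-03-17T23:13:35.468Z] 404.96s call tests/python/dnnl/test_dnnl.py::test_Deconvolution
[2022-03-17T23:13:35.468Z] 232.17s call tests/python/dnnl/test_dnnl.py::test_convolution
[2022-03-17T23:13:35.468Z] 220.28s call tests/python/dnnl/test_dnnl.py::test_pooling
"""

# Assumed layout: "[timestamp] <seconds>s call <test id>".
LINE_RE = re.compile(r"\]\s+(?P<seconds>\d+(?:\.\d+)?)s\s+call\s+(?P<test>\S+)")


def parse_durations(text):
    """Return a list of (test id, seconds) pairs found in the log text."""
    results = []
    for line in text.splitlines():
        match = LINE_RE.search(line)
        if match:
            results.append((match.group("test"), float(match.group("seconds"))))
    return results


def total_seconds(text):
    """Sum the call durations of all tests found in the log text."""
    return sum(seconds for _, seconds in parse_durations(text))
```

For the excerpt above, `total_seconds(DURATION_LINES)` comes to about 1758 s, roughly 29 minutes of the 2 hr 15 min job. Long single tests like these are the ones that would benefit most from being spread across workers with pytest-xdist's `-n` option.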





[GitHub] [incubator-mxnet] TristonC edited a comment on issue #20959: GPU memory leak when using gluon.data.DataLoader with num_workers>0

2022-03-21 Thread GitBox


TristonC edited a comment on issue #20959:
URL: 
https://github.com/apache/incubator-mxnet/issues/20959#issuecomment-1074140831


   Thanks @ann-qin-lu for your update.  I will address this issue with the 
MXNet team soon.





[GitHub] [incubator-mxnet] TristonC commented on issue #20959: GPU memory leak when using gluon.data.DataLoader with num_workers>0

2022-03-21 Thread GitBox


TristonC commented on issue #20959:
URL: 
https://github.com/apache/incubator-mxnet/issues/20959#issuecomment-1074140831


   Thanks @ann-qin-lu for your update. 





[GitHub] [incubator-mxnet] bartekkuncer edited a comment on issue #20967: Unittest failures seen in test_dnnl

2022-03-21 Thread GitBox


bartekkuncer edited a comment on issue #20967:
URL: 
https://github.com/apache/incubator-mxnet/issues/20967#issuecomment-1073966623


   Hi @DickJC123,
   1. 
https://jenkins.mxnet-ci.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-20965/5/pipeline/288/
 (test_fc_subgraph.py::test_fc_transpose[mxnet.numpy-int8-True-data_shape2])
   - Not related to the oneDNN upgrade. @anko-intel analyzed it, and here is the 
PR with the fix: https://github.com/apache/incubator-mxnet/pull/20969
   2. 
https://jenkins.mxnet-ci.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-20965/2/pipeline/
 (test_conv_subgraph.py::test_pos_conv_act_add[True-gelu-True-data_shape1])
   AND
   
https://jenkins.mxnet-ci.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-20965/3/pipeline/288
 (test_conv_subgraph.py::test_pos_conv_act_add[True-leakyrelu-True-data_shape1])
   - These two tests fail because the convolution reads data from a register 
that was neither updated nor zeroed. This happens when the number of input 
channels is lower than 4 and the weights are blocked 
(https://github.com/apache/incubator-mxnet/issues/20826). The full fix will 
arrive in v2.6 of oneDNN; until then, we will force the plain format (at least 
on this axis) so that the convolution works properly and the tests pass. PR: 
https://github.com/apache/incubator-mxnet/pull/20970





[GitHub] [incubator-mxnet] bartekkuncer commented on issue #20826: tests/python/dnnl/subgraphs/test_conv_subgraph.py::test_pos_conv_...[...-data_shape1] fail with oneDNN v2.4+

2022-03-21 Thread GitBox


bartekkuncer commented on issue #20826:
URL: 
https://github.com/apache/incubator-mxnet/issues/20826#issuecomment-1073960486


   After further investigation, it turned out that the real problem was reading 
from a register that was neither updated nor cleared during convolution with 
weights stored in blocked format. This problem does not occur when the number 
of input channels is 4 or higher, as the full register is then overwritten. A 
fix for this issue is coming in v2.6 of oneDNN.
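
As a rough illustration of the failure mode described above, here is a toy Python model. It is not oneDNN's kernel: the 4-lane register width is an assumption taken only from the "4 and higher" threshold mentioned above, and `load_channels` and `dot_all_lanes` are hypothetical names.

```python
LANES = 4  # assumed register width, for illustration only


def load_channels(register, values, clear_first=False):
    """Write `values` into the leading lanes and return the new register.

    Lanes that are not written keep their old contents unless
    `clear_first` is set, in which case they are zeroed.
    """
    lanes = [0.0] * LANES if clear_first else list(register)
    for index, value in enumerate(values[:LANES]):
        lanes[index] = value
    return lanes


def dot_all_lanes(register, weights):
    """Multiply-accumulate over every lane, including any stale ones."""
    return sum(r * w for r, w in zip(register, weights))


stale = [9.0, 9.0, 9.0, 9.0]    # leftovers from an earlier computation
weights = [1.0, 1.0, 1.0, 1.0]

# Three input channels: the fourth lane still holds stale data.
wrong = dot_all_lanes(load_channels(stale, [1.0, 2.0, 3.0]), weights)
right = dot_all_lanes(load_channels(stale, [1.0, 2.0, 3.0], clear_first=True), weights)

# Four input channels: every lane is overwritten, so nothing stale remains.
full = dot_all_lanes(load_channels(stale, [1.0, 2.0, 3.0, 4.0]), weights)
```

In this model `wrong` picks up the stale lane while `right` and `full` do not. The same idea explains why such a bug can look intermittent: the result depends on whatever happened to be in the register beforehand.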
   
   The original shapes have been restored in the test. There was a temporary 
fix in place (https://github.com/apache/incubator-mxnet/pull/20847) that forced 
the weights to be blocked by 8 instead of 16, but it turned out only to make 
the problem occur less often. Therefore, until v2.6 of oneDNN arrives, the 
weights will not be blocked for convolutions with fewer than 4 input channels: 
https://github.com/apache/incubator-mxnet/pull/20970.
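
The workaround amounts to a simple rule, sketched below in Python. This is not the code from the PR: the function name, the `"plain"` and `"blocked"` labels, and the version tuple argument are hypothetical and only restate the condition described above.

```python
def choose_weight_layout(in_channels, onednn_version=(2, 5)):
    """Pick a weight layout for a convolution (hypothetical helper).

    Before oneDNN v2.6, convolutions with fewer than 4 input channels
    fall back to the plain layout; otherwise blocked weights are used.
    """
    fix_available = onednn_version >= (2, 6)
    if in_channels < 4 and not fix_available:
        return "plain"
    return "blocked"
```

For example, `choose_weight_layout(3)` falls back to the plain layout, while `choose_weight_layout(4)` and `choose_weight_layout(3, (2, 6))` keep blocked weights.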





[GitHub] [incubator-mxnet] pribadihcr commented on issue #13146: Feature request: mx.init.TruncatedNormal

2022-03-21 Thread GitBox


pribadihcr commented on issue #13146:
URL: 
https://github.com/apache/incubator-mxnet/issues/13146#issuecomment-1073676825


   +1

