mseth10 opened a new issue #19688:
URL: https://github.com/apache/incubator-mxnet/issues/19688


   ## Problem statement
   MXNet CI is running out of memory (OOM) [1] while building MXNet binaries for the unix-cpu and unix-gpu stages. This is an intermittent failure, and the current workaround is to re-trigger CI a few times. The issue is caused by some of the numpy .cc files being too large, which makes gcc use too much memory during compilation. The problem was not pronounced with gcc7, but since the recent update to gcc8 [2] for CI builds, we have started to see this OOM error.
   
   The fix is to refactor the numpy .cc files into smaller files so that compiling each translation unit requires less memory. Here is the list of the largest objects (>10MB in size) currently generated by the Mac CPU build:
   ```
    11M ./operator/numpy/linalg/np_norm_backward.cc.o
    11M ./operator/numpy/np_kron.cc.o
    11M ./operator/numpy/random/np_location_scale_op.cc.o
    12M ./operator/numpy/np_insert_op_slice.cc.o
    12M ./operator/numpy/np_insert_op_tensor.cc.o
    13M ./operator/numpy/np_elemwise_broadcast_op_extended_sec.cc.o
    13M ./operator/numpy/np_elemwise_unary_op_basic.cc.o
    13M ./operator/numpy/np_percentile_op.cc.o
    14M ./operator/numpy/np_matrix_op.cc.o
    14M ./operator/numpy/np_moments_op.cc.o
    14M ./operator/numpy/np_where_op.cc.o
    15M ./operator/numpy/np_einsum_op.cc.o
    16M ./operator/numpy/np_elemwise_broadcast_op_extended.cc.o
    21M ./operator/numpy/np_broadcast_reduce_op_value.cc.o
    22M ./operator/numpy/linalg/np_norm_forward.cc.o
    24M ./operator/numpy/np_elemwise_broadcast_op.cc.o
    34M ./operator/numpy/np_elemwise_broadcast_logic_op.cc.o
   ```
   The .cc files corresponding to the objects above contain more than 210 operator registrations, and refactoring them into smaller files will require considerable time and effort from the community. At a rate of 5 operators per day, that is more than 40 days of developer effort.
   
   ## Proposed solutions
   Option 1: We keep using gcc8 for CI builds and start working on refactoring these numpy .cc files. This would mean the community has to live with the intermittent CI failures for roughly 40 days (possibly less if more community members contribute).
   
   Option 2: We go back to using gcc7 for CI builds, which would potentially solve the CI problem immediately, while we work on refactoring the numpy files. Reverting to gcc7 would take about 2 days, and the refactoring would then take another 40 days.
   
   I would personally prefer Option 2 because it saves contributors time in getting their PRs merged quickly and also saves CI resources. I would like to request community feedback on this.
   
   Going forward, we also need to add a check to MXNet CI for build-time memory usage. Any ideas for this would be highly appreciated; one possible approach is sketched below.
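   
   As one possible approach (a sketch only, not a concrete proposal): CI could point CXX at a small wrapper that runs the real compiler and reports the peak resident memory of each translation unit, warning or failing when a per-file budget is exceeded. The wrapper name, the 3000MB budget, and the choice to fail rather than just log are all assumptions to be tuned.
   ```python
   #!/usr/bin/env python3
   """Hypothetical compiler wrapper for a CI build-memory check.
   Assumed usage: CXX="python3 cxx_mem_check.py g++-8" cmake .. && ninja
   """
   import resource
   import subprocess
   import sys

   MAX_RSS_MB = 3000  # assumed per-translation-unit budget; tune for the CI hosts

   def main() -> int:
       # Run the real compiler with all arguments forwarded unchanged.
       proc = subprocess.run(sys.argv[1:])
       # On Linux, ru_maxrss is the peak resident set size in kilobytes of the
       # waited-on child process tree, i.e. the compiler invocation above.
       peak_mb = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss / 1024
       if peak_mb > MAX_RSS_MB:
           print(f"[mem-check] compile peaked at {peak_mb:.0f} MB "
                 f"(budget {MAX_RSS_MB} MB): {' '.join(sys.argv[1:])}",
                 file=sys.stderr)
           return 1  # could also just log instead of failing the build
       return proc.returncode

   if __name__ == "__main__":
       sys.exit(main())
   ```
   A per-file budget like this would surface regressions in a single PR instead of letting them accumulate until CI hosts start running out of memory.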
   
   ## References
   - [1] CI failure: 
https://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-19588/12/pipeline/
   - [2] CI builds gcc update commit: 
https://github.com/apache/incubator-mxnet/commit/afc76b0f82839dfe00f7cab04be6de3df94564fe

