leezu opened a new pull request #18408:
URL: https://github.com/apache/incubator-mxnet/pull/18408


   The first pipeline that fails with illegal instruction errors is 
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/restricted-mxnet-cd%2Fmxnet-cd-release-job/detail/mxnet-cd-release-job/1145/pipeline/305
   The last working one is  
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/restricted-mxnet-cd%2Fmxnet-cd-release-job/detail/mxnet-cd-release-job/1142/pipeline/483
   
   As commit
   https://github.com/apache/incubator-mxnet/commit/47b0bdd00e7c5e1c9a448809b02e68c0e4b72e96
   was merged between the two runs, one hypothesis is that the OpenBLAS build
   on CD includes instructions that are only available on the CPU architecture
   used for the build. That shouldn't happen, as OpenBLAS is built with
   `DYNAMIC_ARCH=1`
   (https://github.com/apache/incubator-mxnet/blob/2219f1ad77b685d4e615fb8cd7f1992e9764ca7c/tools/dependencies/openblas.sh#L36),
   but it turns out there is an OpenBLAS bug that causes this.
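To make the hypothesis concrete, here is a minimal sketch (not part of this PR) that compares the CPU feature flags of a build host and a target host. It assumes the Linux `/proc/cpuinfo` text format. The helper names `parse_cpu_flags` and `flags_missing_on_target` are hypothetical, and the flag lists in the example are illustrative, not copied from real c5 or c1 instances.

```python
def parse_cpu_flags(cpuinfo_text):
    """Return the feature flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def flags_missing_on_target(build_cpuinfo, target_cpuinfo):
    """Return the flags the build host advertises but the target host lacks."""
    return sorted(parse_cpu_flags(build_cpuinfo) - parse_cpu_flags(target_cpuinfo))


if __name__ == "__main__":
    # Illustrative, heavily shortened flag lists (assumption, not real host data).
    build_host = "processor\t: 0\nflags\t\t: fpu sse2 avx avx2 avx512f\n"
    target_host = "processor\t: 0\nflags\t\t: fpu sse2\n"
    print(flags_missing_on_target(build_host, target_host))
```

Any instruction set that shows up in the difference is one that a binary tuned only for the build host could use and that would raise an illegal instruction error on the target.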
   
   I reproduced the issue locally by building libopenblas.so and libmxnet.so
   via the staticbuild script on a c5 instance, changing the instance type to
   c1, and then running the onnx unittests (which are shown as segfaulting in
   the CD log). Looking at the coredump, I find that the illegal instruction
   occurs in `sgemm_kernel_direct`, called from the `cblas_sgemm` OpenBLAS
   function:
   
   ```
   #0  raise (sig=<optimized out>) at ../sysdeps/unix/sysv/linux/raise.c:51
   #1  <signal handler called>
   #2  0x00007f91d4252ecf in sgemm_kernel_direct () from /home/ubuntu/src/mxnet-master/python/mxnet/../../lib/libopenblas.so.0
   #3  0x00007f91d2a6247c in cblas_sgemm () from /home/ubuntu/src/mxnet-master/python/mxnet/../../lib/libopenblas.so.0
   #4  0x00007f91d6b5eaa0 in void linalg_batch_gemm<mshadow::cpu, float>(mshadow::Tensor<mshadow::cpu, 3, float> const&, mshadow::Tensor<mshadow::cpu, 3, float> const&, mshadow::Tensor<mshadow::cpu, 3, float> const&, float, float, bool, bool, mshadow::Stream<mshadow::cpu>*) () from /home/ubuntu/src/mxnet-master/python/mxnet/../../lib/libmxnet.so
   #5  0x00007f91daacd2bc in void mxnet::op::LaOpGemmForward<mshadow::cpu, 2, 2, 2, 1, mxnet::op::gemm2>(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&) () from /home/ubuntu/src/mxnet-master/python/mxnet/../../lib/libmxnet.so
   #6  0x00007f91d5d3bf4b in mxnet::exec::FComputeExecutor::Run(mxnet::RunContext, bool) () from /home/ubuntu/src/mxnet-master/python/mxnet/../../lib/libmxnet.so
   #7  0x00007f91d5d498ed in ?? () from /home/ubuntu/src/mxnet-master/python/mxnet/../../lib/libmxnet.so
   #8  0x00007f91d5d4998f in ?? () from /home/ubuntu/src/mxnet-master/python/mxnet/../../lib/libmxnet.so
   #9  0x00007f91d5d2801c in mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*) () from /home/ubuntu/src/mxnet-master/python/mxnet/../../lib/libmxnet.so
   #10 0x00007f91d5d288e7 in std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#1}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&) () from /home/ubuntu/src/mxnet-master/python/mxnet/../../lib/libmxnet.so
   #11 0x00007f91d5d24eaa in std::thread::_State_impl<std::thread::_Invoker<std::tuple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)>, std::shared_ptr<dmlc::ManualEvent> > > >::_M_run() () from /home/ubuntu/src/mxnet-master/python/mxnet/../../lib/libmxnet.so
   #12 0x00007f91dc5ca1ff in ?? () from /home/ubuntu/src/mxnet-master/python/mxnet/../../lib/libmxnet.so
   #13 0x00007f91f8bac6db in start_thread (arg=0x7f91ad7b6700) at pthread_create.c:463
   #14 0x00007f91f813088f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
   ```
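For readers triaging similar coredumps, the frame of interest is the first one below `<signal handler called>`. The following is a minimal sketch of picking it out of gdb `bt` output. The helper name `faulting_frame` and the regular expression are my own assumptions about the output format, and only the first four frames of the backtrace above are reused as sample input.

```python
import re

# Matches "#N  [0xADDR in ]function (" at the start of a gdb backtrace line.
FRAME_RE = re.compile(r"^\s*#(\d+)\s+(?:0x[0-9a-f]+ in )?(.+?) \(")


def faulting_frame(backtrace_text):
    """Return (frame_number, function) for the first frame below the signal handler."""
    seen_handler = False
    for line in backtrace_text.splitlines():
        if "<signal handler called>" in line:
            seen_handler = True
            continue
        match = FRAME_RE.match(line)
        if seen_handler and match:
            return int(match.group(1)), match.group(2)
    return None


if __name__ == "__main__":
    sample = """\
#0  raise (sig=<optimized out>) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  <signal handler called>
#2  0x00007f91d4252ecf in sgemm_kernel_direct () from /home/ubuntu/src/mxnet-master/python/mxnet/../../lib/libopenblas.so.0
#3  0x00007f91d2a6247c in cblas_sgemm () from /home/ubuntu/src/mxnet-master/python/mxnet/../../lib/libopenblas.so.0
"""
    print(faulting_frame(sample))
```

On the sample this points at `sgemm_kernel_direct` in frame 2, with `cblas_sgemm` as its caller.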
   
   This is a longstanding issue upstream and was fortunately fixed a few weeks 
ago (though no release containing the fix exists yet): 
https://github.com/xianyi/OpenBLAS/pull/2533
   
   
   Backporting the fix to the latest stable release and updating our static
   build scripts to make use of it fixes the issue in my local c5-build,
   c1-test setup.
   
   Further, I add the `DYNAMIC_OLDER=1` flag to the OpenBLAS build to support
   the dynamic architecture selection feature in OpenBLAS on older CPUs.
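As a sanity check on a test host, one can ask a `DYNAMIC_ARCH` build which core it selected at runtime. This is a minimal sketch, not part of this PR, that calls OpenBLAS's `openblas_get_corename()` through `ctypes`. The wrapper name `openblas_corename` and the library path in the example are assumptions; adjust the path to wherever the static build script places `libopenblas.so.0`.

```python
import ctypes


def openblas_corename(lib_path):
    """Return the core name OpenBLAS reports at runtime, or None if unavailable."""
    try:
        lib = ctypes.CDLL(lib_path)
        get_corename = lib.openblas_get_corename
    except (OSError, AttributeError):
        # Library not found, or the symbol is not exported by this library.
        return None
    get_corename.restype = ctypes.c_char_p
    get_corename.argtypes = []
    name = get_corename()
    return name.decode() if name else None


if __name__ == "__main__":
    # Hypothetical path; prints None if the library is not present there.
    print(openblas_corename("lib/libopenblas.so.0"))
```

Running this on both the build host and the test host should show whether the runtime dispatcher picks a different core on each. It only reports the dispatcher's choice, so it does not by itself prove that every kernel goes through the dispatcher.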


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

