leezu opened a new pull request #18408: URL: https://github.com/apache/incubator-mxnet/pull/18408
The first pipeline that fails with illegal instruction errors is http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/restricted-mxnet-cd%2Fmxnet-cd-release-job/detail/mxnet-cd-release-job/1145/pipeline/305

The last working one is http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/restricted-mxnet-cd%2Fmxnet-cd-release-job/detail/mxnet-cd-release-job/1142/pipeline/483

As https://github.com/apache/incubator-mxnet/commit/47b0bdd00e7c5e1c9a448809b02e68c0e4b72e96 was merged between the two runs, one hypothesis is that the OpenBLAS build on CD includes instructions that are only available on the CPU architecture used for the build. That shouldn't happen, as OpenBLAS is built with `DYNAMIC_ARCH=1` https://github.com/apache/incubator-mxnet/blob/2219f1ad77b685d4e615fb8cd7f1992e9764ca7c/tools/dependencies/openblas.sh#L36 but it turns out that an OpenBLAS bug causes exactly this.

I reproduced the issue locally by building libopenblas.so and libmxnet.so via the staticbuild script on a c5 instance, then changing the instance type to c1 and running the onnx unittests (which are shown as segfaulting in the CD log).
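To sanity-check a build-host/test-host mismatch like the c5/c1 setup above, one can compare CPU feature flags on the two machines. The following is only a minimal sketch, not part of this PR: the helper names and the feature set in `ASSUMED_BUILD_HOST_FEATURES` are hypothetical, and it assumes a Linux host exposing `/proc/cpuinfo`.

```python
from pathlib import Path

# Assumption (not from the PR): features the build host supports and that a
# wrongly selected kernel might use. Adjust to the instruction sets you care about.
ASSUMED_BUILD_HOST_FEATURES = {"avx512f", "avx2", "fma"}


def parse_cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags found in /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def missing_features(required, cpuinfo_text):
    """Return the sorted list of required features this CPU does not report."""
    return sorted(set(required) - parse_cpu_flags(cpuinfo_text))


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        missing = missing_features(ASSUMED_BUILD_HOST_FEATURES, cpuinfo.read_text())
        print("features missing on this host:", missing)
```

Running this on the test host and seeing a non-empty list only shows that the two CPUs differ; it does not by itself prove which instruction triggered the fault.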
Looking at the coredump, I find that the illegal instruction occurs inside the OpenBLAS `cblas_sgemm` function:

```
#0  raise (sig=<optimized out>) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  <signal handler called>
#2  0x00007f91d4252ecf in sgemm_kernel_direct () from /home/ubuntu/src/mxnet-master/python/mxnet/../../lib/libopenblas.so.0
#3  0x00007f91d2a6247c in cblas_sgemm () from /home/ubuntu/src/mxnet-master/python/mxnet/../../lib/libopenblas.so.0
#4  0x00007f91d6b5eaa0 in void linalg_batch_gemm<mshadow::cpu, float>(mshadow::Tensor<mshadow::cpu, 3, float> const&, mshadow::Tensor<mshadow::cpu, 3, float> const&, mshadow::Tensor<mshadow::cpu, 3, float> const&, float, float, bool, bool, mshadow::Stream<mshadow::cpu>*) () from /home/ubuntu/src/mxnet-master/python/mxnet/../../lib/libmxnet.so
#5  0x00007f91daacd2bc in void mxnet::op::LaOpGemmForward<mshadow::cpu, 2, 2, 2, 1, mxnet::op::gemm2>(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&) () from /home/ubuntu/src/mxnet-master/python/mxnet/../../lib/libmxnet.so
#6  0x00007f91d5d3bf4b in mxnet::exec::FComputeExecutor::Run(mxnet::RunContext, bool) () from /home/ubuntu/src/mxnet-master/python/mxnet/../../lib/libmxnet.so
#7  0x00007f91d5d498ed in ?? () from /home/ubuntu/src/mxnet-master/python/mxnet/../../lib/libmxnet.so
#8  0x00007f91d5d4998f in ?? () from /home/ubuntu/src/mxnet-master/python/mxnet/../../lib/libmxnet.so
#9  0x00007f91d5d2801c in mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*) () from /home/ubuntu/src/mxnet-master/python/mxnet/../../lib/libmxnet.so
#10 0x00007f91d5d288e7 in std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#1}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&) () from /home/ubuntu/src/mxnet-master/python/mxnet/../../lib/libmxnet.so
#11 0x00007f91d5d24eaa in std::thread::_State_impl<std::thread::_Invoker<std::tuple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)>, std::shared_ptr<dmlc::ManualEvent> > > >::_M_run() () from /home/ubuntu/src/mxnet-master/python/mxnet/../../lib/libmxnet.so
#12 0x00007f91dc5ca1ff in ?? () from /home/ubuntu/src/mxnet-master/python/mxnet/../../lib/libmxnet.so
#13 0x00007f91f8bac6db in start_thread (arg=0x7f91ad7b6700) at pthread_create.c:463
#14 0x00007f91f813088f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
```

This is a longstanding issue upstream and was fortunately fixed a few weeks ago (though no release containing the fix exists yet): https://github.com/xianyi/OpenBLAS/pull/2533

Backporting the fix to the latest stable release and updating our static build scripts to use it fixes the issue in my local c5-build, c1-test setup.

Further, I add the `DYNAMIC_OLDER=1` flag to the OpenBLAS build, so that OpenBLAS's dynamic architecture selection also supports older CPUs.

----------------------------------------------------------------
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org