Hi MXNet devs,
I'd like some feedback on the following proposal before I start implementing it.

Context: I am working on migrating a classification product that currently uses Caffe to MXNet. Along the way I'm encountering some issues loading and augmenting the image dataset. It seems my dataset contains some technically invalid images. When they are loaded with mx.io.ImageRecordIter (from a Python script), they eventually get passed to the OpenCV library, which throws a C++ exception. MXNet currently doesn't catch those, so my script aborts with a not very clear error message:

"
terminate called after throwing an instance of 'cv::Exception'
  what(): OpenCV(3.4.3) /home/lgo/dev/opencv-3.4.3/modules/imgproc/src/resize.cpp:4044: error: (-215:Assertion failed) !ssize.empty() in function 'resize'
Aborted (core dumped)
"

These types of issues have been reported before, and I see a high-level action plan has been documented in the wiki:
https://cwiki.apache.org/confluence/display/MXNET/Improved+Exception+Handling+in+MXNet+-+Phase+2

See also my previous pull request, which prevents OpenCV assertions by re-implementing the same checks in MXNet code:
https://github.com/apache/incubator-mxnet/pull/12999

As I'm focused now on data loading and OpenCV, I would like to propose the following implementation steps:

1. Catch cv::Exception in all calls to OpenCV functions that can raise one (cv::resize, cv::imdecode, cv::addWeighted, cv::mean, cv::copyMakeBorder, cv::warpAffine ..)
   => a new macro CHECK_CV_NO_ASSERT
2. Create a new mxnet::Error class for OpenCV exceptions. Map the cv::Exception fields to this new Error class: code, err, file, func, line, msg, what. Make the CHECK_CV_NO_ASSERT macro throw this new mxnet::Error.
   => struct OpenCVError: public dmlc::Error
3. Add unit tests where possible.

Scope: There are many calls to OpenCV functions in different parts of the MXNet code. I plan to focus on:
- src/io/image_*
- src/ndarray/ndarray.cc
- plugin/opencv/cv_api.cc

The other modules (R-package, cpp-package, example, julia, tools, plugin/sframe) are related to programming languages I don't use. The sframe plugin is not documented at all, so it's not clear what it does (or why you'd keep it in the repo).

Is include/mxnet/base.h a good place to define the new macro and Error struct? I'm not sure which include file is visible in all places where OpenCV calls are currently used.

Some assumptions:
- The public API may contain references to the 3rd party library OpenCV
- There is some value in knowing whether an Error is the result of a call to the OpenCV library. If not, I might as well wrap std::exception in a more generic way.

If I just make these changes, the main process will still abort, but now at least with a clear error message + stack trace(*). Updating all processing code to handle OpenCVErrors correctly is a next step, outside the scope of this proposal.

regards,
Lieven

(*) Example stack trace:
[23:31:30] src/io/iter_image_recordio_2.cc:172: ImageRecordIOParser2: ./train.txt.rec, use 1 threads for decoding..
[23:31:34] src/io/iter_image_recordio_2.cc:172: ImageRecordIOParser2: ./val.txt.rec, use 1 threads for decoding..
Traceback (most recent call last):
  File "./test_train_carmodel_resnet.py", line 126, in <module>
    for i, batch in enumerate(train_data):
  File "/home/lgo/dev/incubator-mxnet/python/mxnet/io/io.py", line 228, in __next__
    return self.next()
  File "/home/lgo/dev/incubator-mxnet/python/mxnet/io/io.py", line 856, in next
    check_call(_LIB.MXDataIterNext(self.handle, ctypes.byref(next_res)))
  File "/home/lgo/dev/incubator-mxnet/python/mxnet/base.py", line 252, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [23:31:34] src/io/image_aug_default.cc:413: OpenCV exception caught: OpenCV(3.4.3) /home/lgo/dev/opencv-3.4.3/modules/imgproc/src/resize.cpp:4044: error: (-215:Assertion failed) !ssize.empty() in function 'resize'

Stack trace returned 10 entries:
[bt] (0) /home/lgo/dev/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x53) [0x7f84af55b4f3]
[bt] (1) /home/lgo/dev/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x39) [0x7f84af55bd69]
[bt] (2) /home/lgo/dev/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::io::DefaultImageAugmenter::Process(cv::Mat const&, std::vector<float, std::allocator<float> >*, std::mersenne_twister_engine<unsigned long, 32ul, 624ul, 397ul, 31ul, 2567483615ul, 11ul, 4294967295ul, 7ul, 2636928640ul, 15ul, 4022730752ul, 18ul, 1812433253ul>*)+0x2941) [0x7f84b224ed11]
[bt] (3) /home/lgo/dev/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::io::ImageRecordIOParser2<float>::ParseChunk(float*, float*, unsigned long, dmlc::InputSplit::Blob*)::{lambda()#1}::operator()() const+0x512) [0x7f84b22c3862]
[bt] (4) /home/lgo/dev/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(+0x3808eee) [0x7f84b22c4eee]
[bt] (5) /usr/lib/x86_64-linux-gnu/libgomp.so.1(GOMP_parallel+0x3f) [0x7f8487f63ecf]
[bt] (6) /home/lgo/dev/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::io::ImageRecordIOParser2<float>::ParseChunk(float*, float*, unsigned long, dmlc::InputSplit::Blob*)+0x1a7) [0x7f84b22c5e97]
[bt] (7) /home/lgo/dev/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::io::ImageRecordIOParser2<float>::ParseNext(mxnet::DataBatch*)+0x1f9) [0x7f84b22ca199]
[bt] (8) /home/lgo/dev/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(std::thread::_State_impl<std::thread::_Invoker<std::tuple<dmlc::ThreadedIter<mxnet::DataBatch>::Init(std::function<bool (mxnet::DataBatch**)>, std::function<void ()>)::{lambda()#1}> > >::_M_run()+0x1f6) [0x7f84b2260676]
[bt] (9) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd57f) [0x7f84e13f457f]
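
PS: for discussion, steps 1 and 2 of the proposal could be sketched roughly as below. This is only an illustration under assumptions, not the final implementation. To keep it compilable without OpenCV or dmlc-core, it uses stand-in types: cvstub::Exception mimics the public fields of cv::Exception (code, err, func, file, line, msg), and dmlcstub::Error mimics dmlc::Error. The helpers fake_resize and describe_resize are hypothetical and exist only to trigger and observe the error path. In MXNet the stand-ins would be replaced by the real cv::Exception and dmlc::Error.

```cpp
#include <exception>
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>

// Stand-in for cv::Exception (assumption: same public field names as OpenCV).
namespace cvstub {
struct Exception : public std::exception {
  Exception(int code_, std::string err_, std::string func_,
            std::string file_, int line_)
      : code(code_), err(std::move(err_)), func(std::move(func_)),
        file(std::move(file_)), line(line_) {
    std::ostringstream os;
    os << file << ":" << line << ": error: (" << code << ") " << err
       << " in function '" << func << "'";
    msg = os.str();
  }
  const char* what() const noexcept override { return msg.c_str(); }
  int code;
  std::string err, func, file;
  int line;
  std::string msg;
};
}  // namespace cvstub

// Stand-in for dmlc::Error, which derives from std::runtime_error.
namespace dmlcstub {
struct Error : public std::runtime_error {
  explicit Error(const std::string& s) : std::runtime_error(s) {}
};
}  // namespace dmlcstub

// Step 2: error type that carries the OpenCV exception fields along.
struct OpenCVError : public dmlcstub::Error {
  explicit OpenCVError(const cvstub::Exception& e)
      : dmlcstub::Error(std::string("OpenCV exception caught: ") + e.what()),
        code(e.code), err(e.err), file(e.file), func(e.func),
        line(e.line), msg(e.msg) {}
  int code;
  std::string err, file, func;
  int line;
  std::string msg;
};

// Step 1: wrap an OpenCV call and rethrow its exception as OpenCVError.
// Variadic so that commas inside the wrapped expression are accepted.
#define CHECK_CV_NO_ASSERT(...)               \
  do {                                        \
    try {                                     \
      __VA_ARGS__;                            \
    } catch (const cvstub::Exception& e) {    \
      throw OpenCVError(e);                   \
    }                                         \
  } while (0)

// Hypothetical stand-in for cv::resize: rejects an empty source size.
inline void fake_resize(int src_width, int src_height) {
  if (src_width <= 0 || src_height <= 0) {
    throw cvstub::Exception(-215, "!ssize.empty()", "resize",
                            "resize.cpp", 4044);
  }
}

// Hypothetical caller: returns "ok" or the message of the mapped error.
inline std::string describe_resize(int w, int h) {
  try {
    CHECK_CV_NO_ASSERT(fake_resize(w, h));
    return "ok";
  } catch (const OpenCVError& e) {
    return e.what();
  }
}
```

With this shape, a caller such as describe_resize(0, 0) sees an OpenCVError whose message starts with "OpenCV exception caught:", and can still read code, func and line separately. Whether keeping those fields is worth a dedicated error type is exactly the second assumption above that I'd like feedback on.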