Hi MXNet devs,

I'd like some feedback on the following proposal before I start
implementing it.

Context:
I am working on migrating a classification product currently using Caffe to
MXNet. Along the way I'm encountering some issues loading and augmenting
the images dataset.

Basically, it seems my dataset contains some technically invalid images.
When I load them using mx.io.ImageRecordIter (from a Python script), they
are eventually passed to the OpenCV library, which throws a C++
exception. MXNet currently doesn't catch those, so my script aborts with
a rather unclear error message:

"
terminate called after throwing an instance of 'cv::Exception'
  what():  OpenCV(3.4.3) /home/lgo/dev/opencv-3.4.3/modules/imgproc/src/resize.cpp:4044: error: (-215:Assertion failed) !ssize.empty() in function 'resize'
Aborted (core dumped)
"

This type of issue has been reported before, and I see that a high-level
action plan has been documented in the wiki:
https://cwiki.apache.org/confluence/display/MXNET/Improved+Exception+Handling+in+MXNet+-+Phase+2

See also my previous pull request, which prevents OpenCV assertions by
re-implementing the same checks in MXNet code:
https://github.com/apache/incubator-mxnet/pull/12999


As I'm focused now on data loading and OpenCV, I would like to propose the
following implementation steps:
1. Catch cv::Exception in all calls to OpenCV functions that can raise one
(cv::resize, cv::imdecode, cv::addWeighted, cv::mean, cv::copyMakeBorder,
cv::warpAffine, ...)
=> a new macro CHECK_CV_NO_ASSERT

2. Create a new mxnet::Error class for OpenCV exceptions. Map the
cv::Exception members to this new Error class: code, err, file, func, line,
msg, what().
Make the CHECK_CV_NO_ASSERT macro throw this new mxnet::Error.
=> struct OpenCVError: public dmlc::Error

3. Add unit tests where possible.
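
To make steps 1 and 2 concrete, here is a rough sketch of what the macro and
the error struct could look like. It is only an illustration under some
assumptions: cv_stub::Exception and dmlc_stub::Error are stand-ins I made up
so the snippet compiles without OpenCV or dmlc-core (the real code would use
cv::Exception and dmlc::Error), and resize_stub and try_resize are
hypothetical helpers, not MXNet functions. The macro is variadic so that
commas in the wrapped call do not split it into several macro arguments.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <utility>

// Stand-in for cv::Exception (assumption: lets the sketch build without
// OpenCV). It mirrors the members listed in step 2.
namespace cv_stub {
struct Exception : public std::exception {
  Exception(int code_, std::string err_, std::string func_,
            std::string file_, int line_)
      : code(code_), err(std::move(err_)), func(std::move(func_)),
        file(std::move(file_)), line(line_) {
    msg = file + ":" + std::to_string(line) + ": error: (" +
          std::to_string(code) + ") " + err + " in function '" + func + "'";
  }
  const char* what() const noexcept override { return msg.c_str(); }

  std::string msg;
  int code;
  std::string err;
  std::string func;
  std::string file;
  int line;
};
}  // namespace cv_stub

// Stand-in for dmlc::Error (assumption: a std::runtime_error subclass).
namespace dmlc_stub {
struct Error : public std::runtime_error {
  explicit Error(const std::string& s) : std::runtime_error(s) {}
};
}  // namespace dmlc_stub

// Step 2: error type that keeps the OpenCV fields available to callers.
struct OpenCVError : public dmlc_stub::Error {
  explicit OpenCVError(const cv_stub::Exception& e)
      : dmlc_stub::Error(std::string("OpenCV exception caught: ") + e.what()),
        code(e.code), err(e.err), file(e.file), func(e.func), line(e.line) {}
  int code;
  std::string err;
  std::string file;
  std::string func;
  int line;
};

// Step 1: run the wrapped statement and translate an OpenCV exception.
#define CHECK_CV_NO_ASSERT(...)               \
  do {                                        \
    try {                                     \
      __VA_ARGS__;                            \
    } catch (const cv_stub::Exception& e) {   \
      throw OpenCVError(e);                   \
    }                                         \
  } while (0)

// Hypothetical stand-in for cv::resize: rejects an empty source size.
void resize_stub(int src_width, int src_height) {
  if (src_width <= 0 || src_height <= 0) {
    throw cv_stub::Exception(-215, "!ssize.empty()", "resize",
                             "resize.cpp", 4044);
  }
}

// Returns "ok", or the message of the translated error.
std::string try_resize(int width, int height) {
  try {
    CHECK_CV_NO_ASSERT(resize_stub(width, height));
    return "ok";
  } catch (const OpenCVError& e) {
    return e.what();
  }
}
```

With something like this, a call site would change from cv::resize(...) to
CHECK_CV_NO_ASSERT(cv::resize(...)), and the caller would see an OpenCVError
instead of the process terminating on an uncaught cv::Exception.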

Scope: There are many calls to OpenCV functions in different parts of the
MXNet code. I plan to focus on:
- src/io/image_*
- src/ndarray/ndarray.cc
- plugin/opencv/cv_api.cc

The other modules (R-package, cpp-package, example, julia, tools,
plugin/sframe) are related to programming languages I don't use. The sframe
plugin is not documented at all so it's not clear what it does (or why
you'd keep it in the repo).

Is include/mxnet/base.h a good place to define the new macro and Error
struct? I'm not sure which include file is visible in all places where
OpenCV calls are currently used.

Some assumptions:
- The public API may contain references to 3rd party library OpenCV
- There is some value in knowing if an Error is the result of a call to the
OpenCV library. If not, I might as well wrap std::exception in a more
generic way.
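
For comparison, the more generic alternative from the second assumption
could look roughly like the sketch below. Since cv::Exception derives from
std::exception, one wrapper would cover OpenCV as well as other libraries,
at the cost of losing the OpenCV-specific fields (code, file, line, ...).
WrappedError, CallWrapped and origin_of_failure are hypothetical names used
only for this illustration.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical generic error type; 'origin' is a free-form tag
// such as "OpenCV" that records which library raised the exception.
struct WrappedError : public std::runtime_error {
  WrappedError(const std::string& origin_, const std::string& what_)
      : std::runtime_error("[" + origin_ + "] " + what_), origin(origin_) {}
  std::string origin;
};

// Runs fn and converts any std::exception into a WrappedError.
template <typename Fn>
void CallWrapped(const std::string& origin, Fn&& fn) {
  try {
    fn();
  } catch (const WrappedError&) {
    throw;  // already wrapped: keep the innermost origin
  } catch (const std::exception& e) {
    throw WrappedError(origin, e.what());
  }
}

// Returns the origin tag of the error raised by a failing call,
// or an empty string if the call succeeds.
std::string origin_of_failure(bool fail) {
  try {
    CallWrapped("OpenCV", [fail] {
      if (fail) throw std::invalid_argument("empty image");
    });
  } catch (const WrappedError& e) {
    return e.origin;
  }
  return "";
}
```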

If I make only these changes, the main process will still abort, but at
least with a clear error message and stack trace(*). Updating all the
processing code to handle OpenCVError correctly is a next step, outside
the scope of this proposal.

regards,

Lieven


(*) Example stack trace:

[23:31:30] src/io/iter_image_recordio_2.cc:172: ImageRecordIOParser2: ./train.txt.rec, use 1 threads for decoding..
[23:31:34] src/io/iter_image_recordio_2.cc:172: ImageRecordIOParser2: ./val.txt.rec, use 1 threads for decoding..
Traceback (most recent call last):
  File "./test_train_carmodel_resnet.py", line 126, in <module>
    for i, batch in enumerate(train_data):
  File "/home/lgo/dev/incubator-mxnet/python/mxnet/io/io.py", line 228, in __next__
    return self.next()
  File "/home/lgo/dev/incubator-mxnet/python/mxnet/io/io.py", line 856, in next
    check_call(_LIB.MXDataIterNext(self.handle, ctypes.byref(next_res)))
  File "/home/lgo/dev/incubator-mxnet/python/mxnet/base.py", line 252, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [23:31:34] src/io/image_aug_default.cc:413: OpenCV exception caught:
OpenCV(3.4.3) /home/lgo/dev/opencv-3.4.3/modules/imgproc/src/resize.cpp:4044: error: (-215:Assertion failed) !ssize.empty() in function 'resize'



Stack trace returned 10 entries:

[bt] (0)
/home/lgo/dev/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x53)
[0x7f84af55b4f3]

[bt] (1)
/home/lgo/dev/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x39)
[0x7f84af55bd69]

[bt] (2)
/home/lgo/dev/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::io::DefaultImageAugmenter::Process(cv::Mat
const&, std::vector<float, std::allocator<float> >*,
std::mersenne_twister_engine<unsigned long, 32ul, 624ul, 397ul, 31ul,
2567483615ul, 11ul, 4294967295ul, 7ul, 2636928640ul, 15ul, 4022730752ul,
18ul, 1812433253ul>*)+0x2941) [0x7f84b224ed11]

[bt] (3)
/home/lgo/dev/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::io::ImageRecordIOParser2<float>::ParseChunk(float*,
float*, unsigned long, dmlc::InputSplit::Blob*)::{lambda()#1}::operator()()
const+0x512) [0x7f84b22c3862]

[bt] (4)
/home/lgo/dev/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(+0x3808eee)
[0x7f84b22c4eee]

[bt] (5) /usr/lib/x86_64-linux-gnu/libgomp.so.1(GOMP_parallel+0x3f)
[0x7f8487f63ecf]

[bt] (6)
/home/lgo/dev/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::io::ImageRecordIOParser2<float>::ParseChunk(float*,
float*, unsigned long, dmlc::InputSplit::Blob*)+0x1a7) [0x7f84b22c5e97]

[bt] (7)
/home/lgo/dev/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::io::ImageRecordIOParser2<float>::ParseNext(mxnet::DataBatch*)+0x1f9)
[0x7f84b22ca199]

[bt] (8)
/home/lgo/dev/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(std::thread::_State_impl<std::thread::_Invoker<std::tuple<dmlc::ThreadedIter<mxnet::DataBatch>::Init(std::function<bool
(mxnet::DataBatch**)>, std::function<void ()>)::{lambda()#1}> >
>::_M_run()+0x1f6) [0x7f84b2260676]

[bt] (9) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd57f) [0x7f84e13f457f]
