I think the failing tests are really getting an issue. We now got roughly
50 test failure related issues [1], leading to a average failure rate of
50%. Considering the costs in terms of money and time per run, this is
adding up quite significantly.

Didn't we just remove MKLML from our codebase to replace it with MKLDNN? I
think removing something and marking the replacement as experimental could
be difficult from a user perspective. Personally, I don't really feel
comfortable solving the problem of known issues by marking something as
experimental. We're basically shifting the responsibility to our users that
way.

I don't think we should stop testing MKLDNN in our CI. We already had the
situation a few months ago where the solution to failed tests was to
disable them. We shouldn't go back to that.

-Marco

[1]:
https://github.com/apache/incubator-mxnet/issues?q=is%3Aopen+is%3Aissue+label%3ATest

On Thu, May 3, 2018 at 8:46 PM, Naveen Swamy <mnnav...@gmail.com> wrote:

> USE_MKLDNN is set to ON in the cmake file by default, since its
> experimental can we turn OFF  so there is some determinism when users build
> and test.
>
> https://github.com/apache/incubator-mxnet/blob/
> 60641ef1183bb4584c9356e84b6ca6d5fce58d6d/CMakeLists.txt#L23
>
>
>
>
>
>
> On a separate note, since MKLDNN is experimental can we stop building on
> master and cause PR's to queue up.
>
>
> On Thu, May 3, 2018 at 11:33 AM, Anirudh <anirudh2...@gmail.com> wrote:
>
> > Correction: I was able to reproduce the issue with MKLDNN enabled on
> > master, but not on 1.2 branch.
> >
> > On Thu, May 3, 2018 at 11:33 AM, Anirudh <anirudh2...@gmail.com> wrote:
> >
> > > Hi Pedro and Naveen,
> > >
> > > I am unable to reproduce this issue with MKLDNN on the master but not
> on
> > > the 1.2.RC2 branch.
> > >
> > > Did the following on 1.2.RC2 branch:
> > >
> > > make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas USE_DIST_KVSTORE=0
> > > USE_CUDA=0 USE_CUDNN=0 USE_MKLDNN=1
> > > export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
> > > export MXNET_TEST_SEED=11
> > > export MXNET_MODULE_SEED=812478194
> > > export MXNET_TEST_COUNT=10000
> > > nosetests-2.7 -v tests/python/unittest/test_
> > module.py:test_forward_reshape
> > >
> > > Was able to do the 10k runs successfully.
> > >
> > > Anirudh
> > >
> > > On Thu, May 3, 2018 at 8:46 AM, Anirudh <anirudh2...@gmail.com> wrote:
> > >
> > >> Hi Pedro and Naveen,
> > >>
> > >> Is this issue reproducible when MXNet is built with USE_MKLDNN=0?
> > >> Also, there are a bunch of MKLDNN fixes that didn't go into the
> release
> > >> branch. Is this issue reproducible on the release branch ?
> > >> In my opinion, since we have marked MKLDNN as experimental feature for
> > >> the release, if it is confirmed to be a MKLDNN issue
> > >> we don't need to block the release on it.
> > >>
> > >> Anirudh
> > >>
> > >> On Thu, May 3, 2018 at 6:58 AM, Naveen Swamy <mnnav...@gmail.com>
> > wrote:
> > >>
> > >>> Thanks for raising this issue Pedro.
> > >>>
> > >>> -1(binding)
> > >>>
> > >>> We were in a similar state for a while a year ago, a lot of effort
> went
> > >>> to
> > >>> stabilize the tests and the CI. I have seen the PR builds are
> > >>> non-deterministic and you have to retry over and over (wasting
> > resources
> > >>> and time) and hope you get lucky.
> > >>>
> > >>> Look at the dashboard for master build
> > >>> http://jenkins.mxnet-ci.amazon-ml.com/job/incubator-
> mxnet/job/master/
> > >>>
> > >>> -Naveen
> > >>>
> > >>> On Thu, May 3, 2018 at 5:11 AM, Pedro Larroy <
> > >>> pedro.larroy.li...@gmail.com>
> > >>> wrote:
> > >>>
> > >>> > -1  nondeterminisitc failures on CI master:
> > >>> > https://issues.apache.org/jira/browse/MXNET-396
> > >>> >
> > >>> > Was able to reproduce once in a fresh p3 instance with DLAMI  can't
> > >>> > reproduce consistently.
> > >>> >
> > >>> > On Wed, May 2, 2018 at 9:51 PM, Anirudh <anirudh2...@gmail.com>
> > wrote:
> > >>> >
> > >>> > > Hi all,
> > >>> > >
> > >>> > > As part of RC2 release, we have addressed bugs and some concerns
> > that
> > >>> > were
> > >>> > > raised.
> > >>> > >
> > >>> > > I would like to propose a vote to release Apache MXNet
> (incubating)
> > >>> > version
> > >>> > > 1.2.0.RC2. Voting will start now (Wednesday, May 2nd) and end at
> > >>> 12:50 PM
> > >>> > > PDT, Sunday, May 6th.
> > >>> > >
> > >>> > > Link to release notes:
> > >>> > > https://cwiki.apache.org/confluence/display/MXNET/
> > >>> > > Apache+MXNet+%28incubating%29+1.2.0+Release+Notes
> > >>> > >
> > >>> > > Link to release candidate 1.2.0.rc2:
> > >>> > > https://github.com/apache/incubator-mxnet/releases/tag/1.2.0.rc2
> > >>> > >
> > >>> > > Voting results for 1.2.0.rc2:
> > >>> > > https://lists.apache.org/thread.html/
> > ebe561c609a8e32351dfe4aafc8876
> > >>> > > 199560336472726b58c3455e85@%3Cdev.mxnet.apache.org%3E
> > >>> > >
> > >>> > > View this page, click on "Build from Source", and use the source
> > code
> > >>> > > obtained from 1.2.0.rc2 tag:
> > >>> > > https://mxnet.incubator.apache.org/install/index.html
> > >>> > >
> > >>> > > (Note: The README.md points to the 1.2.0 tag and does not work at
> > the
> > >>> > > moment.)
> > >>> > >
> > >>> > > Please remember to test first before voting accordingly:
> > >>> > >
> > >>> > > +1 = approve
> > >>> > > +0 = no opinion
> > >>> > > -1 = disapprove (provide reason)
> > >>> > >
> > >>> > > Anirudh
> > >>> > >
> > >>> >
> > >>>
> > >>
> > >>
> > >
> >
>

Reply via email to