Hi Marco,

Thanks a lot for triggering and checking on the tests!
Anirudh

On Sat, May 5, 2018 at 8:37 AM, Marco de Abreu <marco.g.ab...@googlemail.com> wrote:

> We had 4 out of 20 runs fail:
> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/v1.2.0/26
> - already tracked at https://github.com/apache/incubator-mxnet/issues/10280 since 03/27
> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/v1.2.0/28
> - already tracked at https://github.com/apache/incubator-mxnet/issues/9853 since 02/21
> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/v1.2.0/31
> - S3 timeout
> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/v1.2.0/32
> - already tracked at https://github.com/apache/incubator-mxnet/issues/10376 since 04/03
>
> Best regards,
> Marco
>
> On Sat, May 5, 2018 at 12:12 PM, Pedro Larroy <pedro.larroy.li...@gmail.com> wrote:
>
> > Actually I have a linking problem on my Ubuntu desktop that is fixed in master:
> >
> > lc::ThreadedIter<std::vector<dmlc::data::RowBlockContainer<unsigned int>, std::allocator<dmlc::data::RowBlockContainer<unsigned int> > > >::Init(std::function<bool (std::vector<dmlc::data::RowBlockContainer<unsigned int>, std::allocator<dmlc::data::RowBlockContainer<unsigned int> > >**)>, std::function<void ()>)::{lambda()#1}&)':
> > /usr/include/c++/5/thread:137: undefined reference to `pthread_create'
> > 3rdparty/dmlc-core/libdmlc.a(data.cc.o): In function `std::thread::thread<dmlc::ThreadedIter<std::vector<dmlc::data::RowBlockContainer<unsigned long>, std::allocator<dmlc::data::RowBlockContainer<unsigned long> > > >::Init(std::function<bool (std::vector<dmlc::data::RowBlockContainer<unsigned long>, std::allocator<dmlc::data::RowBlockContainer<unsigned long> > >**)>, std::function<void ()>)::{lambda()#1}&>(dmlc::ThreadedIter<std::vector<dmlc::data::RowBlockContainer<unsigned long>, std::allocator<dmlc::data::RowBlockContainer<unsigned long> > > >::Init(std::function<bool (std::vector<dmlc::data::RowBlockContainer<unsigned long>, std::allocator<dmlc::data::RowBlockContainer<unsigned long> > >**)>, std::function<void ()>)::{lambda()#1}&)':
> > /usr/include/c++/5/thread:137: undefined reference to `pthread_create'
> > 3rdparty/dmlc-core/libdmlc.a(data.cc.o): In function `std::thread::thread<dmlc::ThreadedIter<dmlc::data::RowBlockContainer<unsigned int> >::Init(std::function<bool (dmlc::data::RowBlockContainer<unsigned int>**)>, std::function<void ()>)::{lambda()#1}&>(dmlc::ThreadedIter<dmlc::data::RowBlockContainer<unsigned int> >::Init(std::function<bool (dmlc::data::RowBlockContainer<unsigned int>**)>, std::function<void ()>)::{lambda()#1}&)':
> > /usr/include/c++/5/thread:137: undefined reference to `pthread_create'
> > 3rdparty/dmlc-core/libdmlc.a(data.cc.o): In function `std::thread::thread<dmlc::ThreadedIter<dmlc::data::RowBlockContainer<unsigned long> >::Init(std::function<bool (dmlc::data::RowBlockContainer<unsigned long>**)>, std::function<void ()>)::{lambda()#1}&>(dmlc::ThreadedIter<dmlc::data::RowBlockContainer<unsigned long> >::Init(std::function<bool (dmlc::data::RowBlockContainer<unsigned long>**)>, std::function<void ()>)::{lambda()#1}&)':
> > /usr/include/c++/5/thread:137: undefined reference to `pthread_create'
> > 3rdparty/dmlc-core/libdmlc.a(io.cc.o): In function `std::thread::thread<dmlc::ThreadedIter<dmlc::io::InputSplitBase::Chunk>::Init(std::function<bool (dmlc::io::InputSplitBase::Chunk**)>, std::function<void ()>)::{lambda()#1}&>(dmlc::ThreadedIter<dmlc::io::InputSplitBase::Chunk>::Init(std::function<bool (dmlc::io::InputSplitBase::Chunk**)>, std::function<void ()>)::{lambda()#1}&)':
> > /usr/include/c++/5/thread:137: undefined reference to `pthread_create'
> > collect2: error: ld returned 1 exit status
> > ninja: build stopped: subcommand failed.
> >
> > Can we update dmlc-core on the release branch? This was recently fixed:
> > https://github.com/dmlc/dmlc-core/commit/b744643f386660ddc39467a04e3a98853a7419b9
> >
> > On Sat, May 5, 2018 at 11:59 AM, Pedro Larroy <pedro.larroy.li...@gmail.com> wrote:
> >
> > > Hi
> > >
> > > Looks like only gluon test lambda is failing intermittently, but it looks like a minor numerical issue.
> > >
> > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/v1.2.0/20/pipeline
> > >
> > > I triggered a few builds yesterday and they all passed. I think Anirudh is right.
> > >
> > > Changing my vote to +1 (non-binding).
> > >
> > > Pedro.
> > >
> > > On Sat, May 5, 2018 at 12:10 AM, Jun Wu <wujun....@gmail.com> wrote:
> > >
> > >> +1
> > >> I built from source and ran all the model quantization examples successfully.
> > >>
> > >> On Fri, May 4, 2018 at 3:05 PM, Anirudh <anirudh2...@gmail.com> wrote:
> > >>
> > >> > Hi Pedro, Haibin, Indhu,
> > >> >
> > >> > Thank you for your inputs on the release. I ran the test
> > >> > `test_module.py:test_forward_reshape` 250k times with different seeds.
> > >> > I was unable to reproduce the issue on the release branch.
> > >> > If everything goes well with the CI tests Pedro is running till Sunday, I think
> > >> > we should move forward with the release (given that we have enough +1s).
> > >> > Is it possible to trigger the CI on the 1.2 branch repeatedly or at a fixed
> > >> > schedule till Sunday?
> > >> >
> > >> > Anirudh
> > >> >
> > >> > On Fri, May 4, 2018 at 11:56 AM, Indhu <indhubhara...@gmail.com> wrote:
> > >> >
> > >> > > +1
> > >> > >
> > >> > > I've been using the CUDA build from this branch (built from source) on Ubuntu
> > >> > > for a couple of days now and I haven't seen any issues.
> > >> > >
> > >> > > The flaky tests need to be fixed but this release need not be blocked for that.
> > >> > >
> > >> > > On Fri, May 4, 2018 at 11:32 AM, Haibin Lin <haibin.lin....@gmail.com> wrote:
> > >> > >
> > >> > > > I agree with Anirudh that the focus of the discussion should be limited to
> > >> > > > the release branch, not the master branch. Anything that breaks on master
> > >> > > > but works on the release branch should not block the release itself.
> > >> > > >
> > >> > > > Best,
> > >> > > >
> > >> > > > Haibin
> > >> > > >
> > >> > > > On Fri, May 4, 2018 at 10:58 AM, Pedro Larroy <pedro.larroy.li...@gmail.com> wrote:
> > >> > > >
> > >> > > > > I see your point.
> > >> > > > >
> > >> > > > > I checked the failures on the v1.2.0 branch and I don't see segfaults, just
> > >> > > > > minor failures due to flaky tests.
> > >> > > > >
> > >> > > > > I will trigger it repeatedly a few times until Sunday and change
> > >> > > > > my vote accordingly.
> > >> > > > >
> > >> > > > > http://jenkins.mxnet-ci.amazon-ml.com/job/incubator-mxnet/job/v1.2.0/
> > >> > > > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/v1.2.0/17/pipeline
> > >> > > > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/v1.2.0/15/pipeline/
> > >> > > > >
> > >> > > > > Pedro.
> > >> > > > >
> > >> > > > > On Fri, May 4, 2018 at 7:16 PM, Anirudh <anirudh2...@gmail.com> wrote:
> > >> > > > >
> > >> > > > > > Hi Pedro,
> > >> > > > > >
> > >> > > > > > Thank you for the suggestions. I will try to reproduce this without fixed
> > >> > > > > > seeds and also run it for a longer time duration.
> > >> > > > > > Having said that, running unit tests over and over for a couple of days
> > >> > > > > > will likely cause problems because there are around 42 open issues for flaky tests:
> > >> > > > > > https://github.com/apache/incubator-mxnet/issues?q=is%3Aopen+is%3Aissue+label%3AFlaky
> > >> > > > > > Also, the release branch diverged from master around 3 weeks back and
> > >> > > > > > it doesn't have many of the changes merged to master.
> > >> > > > > > So, my question essentially is, what will be your benchmark to accept the release?
> > >> > > > > > Is it that we run the test which you provided on 1.2 without fixed seeds
> > >> > > > > > and for a longer duration without failures?
> > >> > > > > > Or is it that all unit tests should pass over a period of 2 days without
> > >> > > > > > issues? This may require fixing all of the flaky tests, which would delay
> > >> > > > > > the release by a considerable amount of time.
> > >> > > > > > Or is it something else?
> > >> > > > > >
> > >> > > > > > Anirudh
> > >> > > > > >
> > >> > > > > > On Fri, May 4, 2018 at 4:49 AM, Pedro Larroy <pedro.larroy.li...@gmail.com> wrote:
> > >> > > > > >
> > >> > > > > > > Could you remove the fixed seeds and run it for a couple of hours with an
> > >> > > > > > > additional loop? Also I would suggest running the unit tests over and over
> > >> > > > > > > for a couple of days if possible.
> > >> > > > > > >
> > >> > > > > > > Pedro.
> > >> > > > > > >
> > >> > > > > > > On Thu, May 3, 2018 at 8:33 PM, Anirudh <anirudh2...@gmail.com> wrote:
> > >> > > > > > >
> > >> > > > > > > > Hi Pedro and Naveen,
> > >> > > > > > > >
> > >> > > > > > > > I am able to reproduce this issue with MKLDNN on master but not on
> > >> > > > > > > > the 1.2.RC2 branch.
> > >> > > > > > > >
> > >> > > > > > > > Did the following on the 1.2.RC2 branch:
> > >> > > > > > > >
> > >> > > > > > > > make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas USE_DIST_KVSTORE=0 USE_CUDA=0 USE_CUDNN=0 USE_MKLDNN=1
> > >> > > > > > > > export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
> > >> > > > > > > > export MXNET_TEST_SEED=11
> > >> > > > > > > > export MXNET_MODULE_SEED=812478194
> > >> > > > > > > > export MXNET_TEST_COUNT=10000
> > >> > > > > > > > nosetests-2.7 -v tests/python/unittest/test_module.py:test_forward_reshape
> > >> > > > > > > >
> > >> > > > > > > > Was able to do the 10k runs successfully.
> > >> > > > > > > >
> > >> > > > > > > > Anirudh
> > >> > > > > > > >
> > >> > > > > > > > On Thu, May 3, 2018 at 8:46 AM, Anirudh <anirudh2...@gmail.com> wrote:
> > >> > > > > > > >
> > >> > > > > > > > > Hi Pedro and Naveen,
> > >> > > > > > > > >
> > >> > > > > > > > > Is this issue reproducible when MXNet is built with USE_MKLDNN=0?
> > >> > > > > > > > > Also, there are a bunch of MKLDNN fixes that didn't go into the release
> > >> > > > > > > > > branch. Is this issue reproducible on the release branch?
> > >> > > > > > > > > In my opinion, since we have marked MKLDNN as an experimental feature for
> > >> > > > > > > > > the release, if it is confirmed to be an MKLDNN issue
> > >> > > > > > > > > we don't need to block the release on it.
> > >> > > > > > > > >
> > >> > > > > > > > > Anirudh
> > >> > > > > > > > >
> > >> > > > > > > > > On Thu, May 3, 2018 at 6:58 AM, Naveen Swamy <mnnav...@gmail.com> wrote:
> > >> > > > > > > > >
> > >> > > > > > > > >> Thanks for raising this issue Pedro.
> > >> > > > > > > > >>
> > >> > > > > > > > >> -1 (binding)
> > >> > > > > > > > >>
> > >> > > > > > > > >> We were in a similar state for a while a year ago; a lot of effort went
> > >> > > > > > > > >> into stabilizing the tests and the CI. I have seen that the PR builds are
> > >> > > > > > > > >> non-deterministic and you have to retry over and over (wasting resources
> > >> > > > > > > > >> and time) and hope you get lucky.
> > >> > > > > > > > >>
> > >> > > > > > > > >> Look at the dashboard for the master build:
> > >> > > > > > > > >> http://jenkins.mxnet-ci.amazon-ml.com/job/incubator-mxnet/job/master/
> > >> > > > > > > > >>
> > >> > > > > > > > >> -Naveen
> > >> > > > > > > > >>
> > >> > > > > > > > >> On Thu, May 3, 2018 at 5:11 AM, Pedro Larroy <pedro.larroy.li...@gmail.com> wrote:
> > >> > > > > > > > >>
> > >> > > > > > > > >> > -1 nondeterministic failures on CI master:
> > >> > > > > > > > >> > https://issues.apache.org/jira/browse/MXNET-396
> > >> > > > > > > > >> >
> > >> > > > > > > > >> > Was able to reproduce once in a fresh p3 instance with DLAMI; can't
> > >> > > > > > > > >> > reproduce consistently.
> > >> > > > > > > > >> >
> > >> > > > > > > > >> > On Wed, May 2, 2018 at 9:51 PM, Anirudh <anirudh2...@gmail.com> wrote:
> > >> > > > > > > > >> >
> > >> > > > > > > > >> > > Hi all,
> > >> > > > > > > > >> > >
> > >> > > > > > > > >> > > As part of RC2 release, we have addressed bugs and some concerns that were
> > >> > > > > > > > >> > > raised.
> > >> > > > > > > > >> > >
> > >> > > > > > > > >> > > I would like to propose a vote to release Apache MXNet (incubating) version
> > >> > > > > > > > >> > > 1.2.0.RC2. Voting will start now (Wednesday, May 2nd) and end at 12:50 PM
> > >> > > > > > > > >> > > PDT, Sunday, May 6th.
> > >> > > > > > > > >> > >
> > >> > > > > > > > >> > > Link to release notes:
> > >> > > > > > > > >> > > https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.2.0+Release+Notes
> > >> > > > > > > > >> > >
> > >> > > > > > > > >> > > Link to release candidate 1.2.0.rc2:
> > >> > > > > > > > >> > > https://github.com/apache/incubator-mxnet/releases/tag/1.2.0.rc2
> > >> > > > > > > > >> > >
> > >> > > > > > > > >> > > Voting results for 1.2.0.rc2:
> > >> > > > > > > > >> > > https://lists.apache.org/thread.html/ebe561c609a8e32351dfe4aafc8876199560336472726b58c3455e85@%3Cdev.mxnet.apache.org%3E
> > >> > > > > > > > >> > >
> > >> > > > > > > > >> > > View this page, click on "Build from Source", and use the source code
> > >> > > > > > > > >> > > obtained from 1.2.0.rc2 tag:
> > >> > > > > > > > >> > > https://mxnet.incubator.apache.org/install/index.html
> > >> > > > > > > > >> > >
> > >> > > > > > > > >> > > (Note: The README.md points to the 1.2.0 tag and does not work at the
> > >> > > > > > > > >> > > moment.)
> > >> > > > > > > > >> > >
> > >> > > > > > > > >> > > Please remember to test first before voting accordingly:
> > >> > > > > > > > >> > >
> > >> > > > > > > > >> > > +1 = approve
> > >> > > > > > > > >> > > +0 = no opinion
> > >> > > > > > > > >> > > -1 = disapprove (provide reason)
> > >> > > > > > > > >> > >
> > >> > > > > > > > >> > > Anirudh