Re: [VOTE] Release Apache MXNet (incubating) version 1.5.0.rc0

Pedro Larroy Thu, 13 Jun 2019 15:06:41 -0700

I will reach out to you privately, as the model is not public. We should be
able to see this problem in a public model using LSTM, I think.
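
For anyone trying to reproduce the numbers in this thread on a public model, a
minimal timing sketch follows. It uses only the Python standard library; the
callable passed to median_latency is a placeholder (an assumption, not code
from the private model). With MXNet, that callable would typically run one
forward pass of a network hybridized with static_alloc=True and
static_shape=True and then call mx.nd.waitall(), since MXNet executes
asynchronously.

```python
import statistics
import time


def median_latency(fn, warmup=3, repeats=20):
    """Return the median wall-clock time in seconds of calling fn()."""
    for _ in range(warmup):
        fn()  # warm-up calls are not timed (caches, lazy initialization)
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def slowdown(baseline_s, candidate_s):
    """Latency ratio; a value above 1.0 means the candidate is slower."""
    return candidate_s / baseline_s


def percent_slower(baseline_s, candidate_s):
    """Percent increase in latency of the candidate over the baseline."""
    return (candidate_s - baseline_s) / baseline_s * 100.0
```

Measuring the same callable once per build (for example 1.4.1 and master) and
passing the two medians to slowdown or percent_slower gives figures comparable
to the "2.6x slower" and "15% slower" quoted below.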



On Thu, Jun 13, 2019 at 11:15 AM Junru Shao <junrushao1...@gmail.com> wrote:
>
> Hi Pedro,
>
> Thanks for bringing this up!
>
> Could you provide your model so that we can dig into this?
>
> Thanks,
> Junru
>
> On Thu, Jun 13, 2019 at 10:33 Pedro Larroy <pedro.larroy.li...@gmail.com>
> wrote:
>
> > I have isolated some of the commits that are causing performance
> > regressions in WaveNet-like models:
> >
> > Title: 83d2c2d0e:[MXNET-1324] Add NaiveRunGraph to imperative utils
> > (#14192)
> >
> > Causes a regression making hybridize with static slower using GPU
> > inference.
> >
> > [0f63659be5070af218095a6a460427d2a1b67aba] add a compiler flag to use
> > int64 as tensor size (#14570)
> >
> > Causes overall regressions in CPU inference.
> >
> >
> > Pedro.
> >
> > On Wed, Jun 12, 2019 at 11:52 AM Lai Wei <roywei...@gmail.com> wrote:
> > >
> > > Hi @dev,
> > >
> > > > I am canceling the vote as the issue Lin discovered requires a fix [1] and
> > > the solution is not ready yet.
> > > It's a general problem when building from source with MXNet, not only
> > > impacting horovod use cases.  Any help is appreciated.
> > >
> > > Other issues we are tracking:
> > > 1. Regression on hybridize with static_alloc. (not a blocker for now)
> > > > 2. Scala doc issue [2], already merged in master, need to backport to 1.5.x
> > > >
> > > > Thanks for everyone's help! Please let us know if there is any other issue
> > > > with 1.5.0
> > >
> > > [1] https://github.com/apache/incubator-mxnet/pull/15213
> > > [2] https://github.com/apache/incubator-mxnet/pull/15216
> > >
> > >
> > >
> > > Best Regards
> > >
> > > Lai
> > >
> > >
> > > > On Tue, Jun 11, 2019 at 5:04 PM Pedro Larroy <pedro.larroy.li...@gmail.com>
> > > > wrote:
> > >
> > > > > Tested with CPU: 2.6x slower when comparing master vs 1.4.1.
> > > >
> > > > Looks like a general regression.
> > > >
> > > >
> > > > On Tue, Jun 11, 2019 at 2:31 PM Lai Wei <roywei...@gmail.com> wrote:
> > > > >
> > > > > Hi guys,
> > > > >
> > > > > > Thanks for the updates. Currently, we are able to confirm Lin's issue with
> > > > > > Horovod, and there is a fix pending. [1]
> > > > > > Will update later today to see if we need to cancel this vote for the fix.
> > > > > >
> > > > > > As for the hybridize with static alloc performance regression, IMO it does
> > > > > > not need to be a blocker if we have the following speed order:
> > > > > > 1.5.0 w/o static > 1.5.0 w/ static > 1.4.1 w/ static > 1.4.1 w/o static
> > > > > > It would be great to know the following to better decide
> > > > > > whether this should block the release:
> > > > > > 1) whether this is a model-specific or a general regression.
> > > > > > 2) whether this is platform-specific or general (w/ or w/o CUDA, w/ or w/o
> > > > > > MKLDNN)
> > > > >
> > > > >
> > > > > [1]https://github.com/apache/incubator-mxnet/pull/15213
> > > > >
> > > > >
> > > > > Thanks
> > > > >
> > > > > Best Regards
> > > > >
> > > > > Lai
> > > > >
> > > > >
> > > > > > On Tue, Jun 11, 2019 at 1:46 PM Zhi Zhang <zhresh...@apache.org> wrote:
> > > > >
> > > > > >
> > > > > >
> > > > > > > On 2019/06/11 18:53:56, Pedro Larroy <pedro.larroy.li...@gmail.com>
> > > > > > > wrote:
> > > > > > > > The stack trace doesn't seem to come from MXNet, do you have more info?
> > > > > > > >
> > > > > > > > On Tue, Jun 11, 2019 at 11:46 AM Zhi Zhang <zhresh...@apache.org> wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > On 2019/06/11 17:36:09, Pedro Larroy <pedro.larroy.li...@gmail.com> wrote:
> > > > > > > > > A bit more background into this:
> > > > > > > > >
> > > > > > > > > > While tuning a model using LSTM and convolutions we find that using
> > > > > > > > > > hybridize with static_alloc and static_shape is 15% slower in the
> > > > > > > > > > latest revision vs in version 1.4.1, in which using hybridize with
> > > > > > > > > > static_alloc and static_shape is 10% faster than without.
> > > > > > > > > >
> > > > > > > > > > Overall we are still 33% faster when comparing master to 1.5.
> > > > > > > > > Let me know if you think this is a release blocker or not.
> > > > > > > > >
> > > > > > > > > Pedro.
> > > > > > > > >
> > > > > > > > > On Mon, Jun 10, 2019 at 4:51 PM Pedro Larroy
> > > > > > > > > <pedro.larroy.li...@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > -1
> > > > > > > > > >
> > > > > > > > > > > We found a performance regression vs 1.4 related to CachedOp which
> > > > > > > > > > > affects Hybrid forward, which we are looking into.
> > > > > > > > > >
> > > > > > > > > > Pedro.
> > > > > > > > > >
> > > > > > > > > > > On Mon, Jun 10, 2019 at 4:33 PM Lin Yuan <apefor...@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > -1 (Tentatively until resolved)
> > > > > > > > > > >
> > > > > > > > > > > > I tried to build MXNet 1.5.0 from source and pip install horovod but got
> > > > > > > > > > > > the following error:
> > > > > > > > > > >
> > > > > > > > > > > Reproduce:
> > > > > > > > > > > 1) cp make/config.mk .
> > > > > > > > > > > 2) turn on USE_CUDA, USE_CUDNN, USE_NCCL
> > > > > > > > > > > 3) make -j
> > > > > > > > > > >
> > > > > > > > > > > MXNet can build successfully.
> > > > > > > > > > >
> > > > > > > > > > > 4) pip install horovod
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > >
> > > >
> > /home/ubuntu/src/incubator-mxnet/python/mxnet/../../include/mkldnn/mkldnn.h:55:28:
> > > > > > > > > > > fatal error: mkldnn_version.h: No such file or directory
> > > > > > > > > > >     compilation terminated.
> > > > > > > > > > >     INFO: Unable to build MXNet plugin, will skip it.
> > > > > > > > > > >
> > > > > > > > > > > > I did not change any setting of MKLDNN in my config.mk. I am building on
> > > > > > > > > > > > DLAMI base 18.0 which is Ubuntu 16.04 and CUDA 10.0
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > >
> > > > > > > > > > > Lin
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > On Sat, Jun 8, 2019 at 5:39 PM shiwen hu <yajiedes...@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > +1
> > > > > > > > > > > >
> > > > > > > > > > > > > Lai Wei <roywei...@gmail.com> wrote on Sun, Jun 9, 2019 at 4:12 AM:
> > > > > > > > > > > >
> > > > > > > > > > > > > Dear MXNet community,
> > > > > > > > > > > > >
> > > > > > > > > > > > > > This is the 3-day vote to release Apache MXNet (incubating) version 1.5.0.
> > > > > > > > > > > > > > Voting on dev@ will start June 8, 23:59:59 (PST) and close on June 11,
> > > > > > > > > > > > > > 23:59:59.
> > > > > > > > > > > > >
> > > > > > > > > > > > > 1) Link to release notes:
> > > > > > > > > > > > >
> > > > > >
> > https://cwiki.apache.org/confluence/display/MXNET/1.5.0+Release+Notes
> > > > > > > > > > > > >
> > > > > > > > > > > > > 2) Link to release candidate:
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > https://github.com/apache/incubator-mxnet/releases/tag/1.5.0.rc0
> > > > > > > > > > > > >
> > > > > > > > > > > > > > 3) Link to source and signatures on apache dist server:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > https://dist.apache.org/repos/dist/dev/incubator/mxnet/1.5.0.rc0/
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Please remember to TEST first before voting accordingly:
> > > > > > > > > > > > > +1 = approve
> > > > > > > > > > > > > +0 = no opinion
> > > > > > > > > > > > > -1 = disapprove (provide reason)
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Best Regards
> > > > > > > > > > > > >
> > > > > > > > > > > > > Lai
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > > > > -1. Built from source; importing mxnet in Python causes a segfault.
> > > > > > > >
> > > > > > > > back trace:
> > > > > > > >
> > > > > > > > Thread 1 "python3" received signal SIGSEGV, Segmentation fault.
> > > > > > > > 0x00007fff3e8a9f20 in ?? ()
> > > > > > > > (gdb) bt
> > > > > > > > #0  0x00007fff3e8a9f20 in ?? ()
> > > > > > > > > #1  0x00007fffebbf440c in ReadConfigFile(Configuration&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool const&, unsigned int const&) () from /usr/lib/x86_64-linux-gnu/libapt-pkg.so.5.0
> > > > > > > > > #2  0x00007fffebbf3d97 in ReadConfigDir(Configuration&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool const&, unsigned int const&) () from /usr/lib/x86_64-linux-gnu/libapt-pkg.so.5.0
> > > > > > > > > #3  0x00007fffebc5e9aa in pkgInitConfig(Configuration&) () from /usr/lib/x86_64-linux-gnu/libapt-pkg.so.5.0
> > > > > > > > > #4  0x00007ffff29d5c48 in ?? () from /usr/lib/python3/dist-packages/apt_pkg.cpython-35m-x86_64-linux-gnu.so
> > > > > > > > #5  0x00000000004ea10f in PyCFunction_Call ()
> > > > > > > > #6  0x0000000000536d94 in PyEval_EvalFrameEx ()
> > > > > > > > #7  0x000000000053fc97 in ?? ()
> > > > > > > > #8  0x00000000005409bf in PyEval_EvalCode ()
> > > > > > > > #9  0x000000000054a328 in ?? ()
> > > > > > > > #10 0x00000000004ea1c6 in PyCFunction_Call ()
> > > > > > > > #11 0x000000000053d353 in PyEval_EvalFrameEx ()
> > > > > > > > #12 0x000000000053fc97 in ?? ()
> > > > > > > > #13 0x000000000053bc93 in PyEval_EvalFrameEx ()
> > > > > > > > #14 0x000000000053b294 in PyEval_EvalFrameEx ()
> > > > > > > > #15 0x000000000053b294 in PyEval_EvalFrameEx ()
> > > > > > > > #16 0x000000000053b294 in PyEval_EvalFrameEx ()
> > > > > > > > #17 0x0000000000540b0b in PyEval_EvalCodeEx ()
> > > > > > > > #18 0x00000000004ec2e3 in ?? ()
> > > > > > > > #19 0x00000000005c20e7 in PyObject_Call ()
> > > > > > > >
> > > > > > > > > I was using fresh DLAMI ubuntu 18.0 and CUDA 10.0, built with USE_CUDA=1,
> > > > > > > > > USE_CUDNN=1, the rest are default values.
> > > > > > > >
> > > > > > > > -Zhi
> > > > > > >
> > > > > >
> > > > > > > Change to +1, I figured out that it was due to the dependencies. I still
> > > > > > > have an issue using the DL Base AMI with python3, but I will not regard it
> > > > > > > as a blocker for the 1.5 release.
> > > > > > > Tested Gluon-CV training and it works fine.
> > > > > >
> > > > > > -Zhi
> > > > > >
> > > >
> >
