With the PR https://github.com/apache/incubator-mxnet/pull/15213 I could verify that building Horovod is successful with MXNet built from source. So I will remove my previous -1 vote.
Best,
Lin

On Tue, Jun 18, 2019 at 2:10 PM Junru Shao <junrushao1...@gmail.com> wrote:

> Dear community,
>
> I am happy to share some results with regard to commit 83d2c2d0e (PR
> #14192, link: https://github.com/apache/incubator-mxnet/pull/14192), which
> Pedro mentioned as causing a regression.
>
> First, using the exact model that Pedro provides, we did rigorous profiling
> and found out that the PR #14192 slows it down by 7.26 ms (from 235.65 ms
> to 242.91 ms).
>
> Then, we submitted a follow-up PR #15262 (link:
> https://github.com/apache/incubator-mxnet/pull/15262) to fix the
> regression. By applying the patch to commit 83d2c2d0e, we could verify that
> we get comparable performance. Please refer to the PR if you are interested
> in our experiment.
>
> That is to say, the regression caused by commit 83d2c2d0e should have been
> addressed. Please let me know if there are any further issues.
>
> Thank you so much,
> Junru
>
> On Thu, Jun 13, 2019 at 3:05 PM Pedro Larroy <pedro.larroy.li...@gmail.com>
> wrote:
>
> > I reach you in private, the model is not public. We should be able to
> > see this problem in a public model using LSTM I think.
> >
> > On Thu, Jun 13, 2019 at 11:15 AM Junru Shao <junrushao1...@gmail.com>
> > wrote:
> > >
> > > Hi Pedro,
> > >
> > > Thanks for bringing this up!
> > >
> > > Could you provide your model so that we can dig into this?
> > >
> > > Thanks,
> > > Junru
> > >
> > > On Thu, Jun 13, 2019 at 10:33 Pedro Larroy <pedro.larroy.li...@gmail.com>
> > > wrote:
> > >
> > > > I have isolated some of the commits that are causing performance
> > > > regressions in wavenet like models:
> > > >
> > > > Title: 83d2c2d0e:[MXNET-1324] Add NaiveRunGraph to imperative utils
> > > > (#14192)
> > > >
> > > > Causes a regression making hybridize with static slower using GPU
> > > > inference.
> > > >
> > > > [0f63659be5070af218095a6a460427d2a1b67aba] add a compiler flag to use
> > > > int64 as tensor size (#14570)
> > > >
> > > > Causes overall regressions in CPU inference.
> > > >
> > > > Pedro.
> > > >
> > > > On Wed, Jun 12, 2019 at 11:52 AM Lai Wei <roywei...@gmail.com> wrote:
> > > > >
> > > > > Hi @dev,
> > > > >
> > > > > I am canceling the vote as the issue Lin discovered requires a fix [1]
> > > > > and the solution is not ready yet.
> > > > > It's a general problem when building from source with MXNet, not only
> > > > > impacting horovod use cases. Any help is appreciated.
> > > > >
> > > > > Other issues we are tracking:
> > > > > 1. Regression on hybridize with static_alloc. (not a blocker for now)
> > > > > 2. Scala doc issue [2], already merged in master, need to backport to
> > > > > 1.5.x
> > > > >
> > > > > Thanks for everyone's help! Please let us know if there is any other
> > > > > issue with 1.5.0
> > > > >
> > > > > [1] https://github.com/apache/incubator-mxnet/pull/15213
> > > > > [2] https://github.com/apache/incubator-mxnet/pull/15216
> > > > >
> > > > > Best Regards
> > > > >
> > > > > Lai
> > > > >
> > > > > On Tue, Jun 11, 2019 at 5:04 PM Pedro Larroy
> > > > > <pedro.larroy.li...@gmail.com> wrote:
> > > > >
> > > > > > Tested with CPU, 2.6x slower comparing master vs 1.4.1.
> > > > > >
> > > > > > Looks like a general regression.
> > > > > >
> > > > > > On Tue, Jun 11, 2019 at 2:31 PM Lai Wei <roywei...@gmail.com> wrote:
> > > > > > >
> > > > > > > Hi guys,
> > > > > > >
> > > > > > > Thanks for the updates. Currently, we are able to confirm Lin's
> > > > > > > issue with Horovod, and there is a fix pending. [1]
> > > > > > > Will update later today to see if we need to cancel this vote for
> > > > > > > the fix.
> > > > > > >
> > > > > > > As for the hybridize with static alloc performance regression.
> > > > > > > IMO it does not need to be a blocker if we have the following speed
> > > > > > > order.
> > > > > > > 1.5.0 w/o static > 1.5.0 w/ static > 1.4.1 w/ static > 1.4.1 w/o static
> > > > > > > and it will be great to know the following to better make a decision
> > > > > > > on whether this should block the release.
> > > > > > > 1) if this is a model specific or a general regression.
> > > > > > > 2) if this is platform specific or general (w/ or w/o CUDA, w/ or w/o
> > > > > > > MKLDNN)
> > > > > > >
> > > > > > > [1] https://github.com/apache/incubator-mxnet/pull/15213
> > > > > > >
> > > > > > > Thanks
> > > > > > >
> > > > > > > Best Regards
> > > > > > >
> > > > > > > Lai
> > > > > > >
> > > > > > > On Tue, Jun 11, 2019 at 1:46 PM Zhi Zhang <zhresh...@apache.org>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > On 2019/06/11 18:53:56, Pedro Larroy <pedro.larroy.li...@gmail.com>
> > > > > > > > wrote:
> > > > > > > > > The stack trace doesn't seem to come from MXNet, do you have more
> > > > > > > > > info?
> > > > > > > > >
> > > > > > > > > On Tue, Jun 11, 2019 at 11:46 AM Zhi Zhang <zhresh...@apache.org>
> > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > On 2019/06/11 17:36:09, Pedro Larroy
> > > > > > > > > > <pedro.larroy.li...@gmail.com> wrote:
> > > > > > > > > > > A bit more background into this:
> > > > > > > > > > >
> > > > > > > > > > > While tuning a model using LSTM and convolutions we find that
> > > > > > > > > > > using hybridize with static_alloc and static_shape is 15% slower
> > > > > > > > > > > in the latest revision vs in version 1.4.1 in which using
> > > > > > > > > > > hybridize with static_alloc and static_shape is 10% faster than
> > > > > > > > > > > without.
> > > > > > > > > > >
> > > > > > > > > > > Overall we are still 33% faster when comparing master to 1.5.
> > > > > > > > > > >
> > > > > > > > > > > Let me know if you think this is a release blocker or not.
> > > > > > > > > > >
> > > > > > > > > > > Pedro.
> > > > > > > > > > >
> > > > > > > > > > > On Mon, Jun 10, 2019 at 4:51 PM Pedro Larroy
> > > > > > > > > > > <pedro.larroy.li...@gmail.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > -1
> > > > > > > > > > > >
> > > > > > > > > > > > We found a performance regression vs 1.4 related to CachedOp
> > > > > > > > > > > > which affects Hybrid forward, which we are looking into.
> > > > > > > > > > > >
> > > > > > > > > > > > Pedro.
> > > > > > > > > > > >
> > > > > > > > > > > > On Mon, Jun 10, 2019 at 4:33 PM Lin Yuan <apefor...@gmail.com>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > -1 (Tentatively until resolved)
> > > > > > > > > > > > >
> > > > > > > > > > > > > I tried to build MXNet 1.5.0 from source and pip install horovod
> > > > > > > > > > > > > but got the following error:
> > > > > > > > > > > > >
> > > > > > > > > > > > > Reproduce:
> > > > > > > > > > > > > 1) cp make/config.mk .
> > > > > > > > > > > > > 2) turn on USE_CUDA, USE_CUDNN, USE_NCCL
> > > > > > > > > > > > > 3) make -j
> > > > > > > > > > > > >
> > > > > > > > > > > > > MXNet can build successfully.
> > > > > > > > > > > > >
> > > > > > > > > > > > > 4) pip install horovod
> > > > > > > > > > > > >
> > > > > > > > > > > > > /home/ubuntu/src/incubator-mxnet/python/mxnet/../../include/mkldnn/mkldnn.h:55:28:
> > > > > > > > > > > > > fatal error: mkldnn_version.h: No such file or directory
> > > > > > > > > > > > > compilation terminated.
> > > > > > > > > > > > > INFO: Unable to build MXNet plugin, will skip it.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I did not change any setting of MKLDNN in my config.mk. I am
> > > > > > > > > > > > > building on DLAMI base 18.0 which is Ubuntu 16.04 and CUDA 10.0
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Lin
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Sat, Jun 8, 2019 at 5:39 PM shiwen hu <yajiedes...@gmail.com>
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > +1
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Lai Wei <roywei...@gmail.com> wrote on Sun, Jun 9, 2019 at 4:12 AM:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Dear MXNet community,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > This is the 3-day vote to release Apache MXNet (incubating)
> > > > > > > > > > > > > > > version 1.5.0.
> > > > > > > > > > > > > > > Voting on dev@ will start June 8, 23:59:59(PST) and close on
> > > > > > > > > > > > > > > June 11, 23:59:59.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 1) Link to release notes:
> > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/MXNET/1.5.0+Release+Notes
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 2) Link to release candidate:
> > > > > > > > > > > > > > > https://github.com/apache/incubator-mxnet/releases/tag/1.5.0.rc0
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 3) Link to source and signatures on apache dist server:
> > > > > > > > > > > > > > > https://dist.apache.org/repos/dist/dev/incubator/mxnet/1.5.0.rc0/
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Please remember to TEST first before voting accordingly:
> > > > > > > > > > > > > > > +1 = approve
> > > > > > > > > > > > > > > +0 = no opinion
> > > > > > > > > > > > > > > -1 = disapprove (provide reason)
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Best Regards
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Lai
> > > > > > > > > >
> > > > > > > > > > -1. Built from source, import mxnet in python cause Segfault.
> > > > > > > > > >
> > > > > > > > > > back trace:
> > > > > > > > > >
> > > > > > > > > > Thread 1 "python3" received signal SIGSEGV, Segmentation fault.
> > > > > > > > > > 0x00007fff3e8a9f20 in ?? ()
> > > > > > > > > > (gdb) bt
> > > > > > > > > > #0 0x00007fff3e8a9f20 in ?? ()
> > > > > > > > > > #1 0x00007fffebbf440c in ReadConfigFile(Configuration&,
> > > > > > > > > > std::__cxx11::basic_string<char, std::char_traits<char>,
> > > > > > > > > > std::allocator<char> > const&, bool const&, unsigned int const&) ()
> > > > > > > > > > from /usr/lib/x86_64-linux-gnu/libapt-pkg.so.5.0
> > > > > > > > > > #2 0x00007fffebbf3d97 in ReadConfigDir(Configuration&,
> > > > > > > > > > std::__cxx11::basic_string<char, std::char_traits<char>,
> > > > > > > > > > std::allocator<char> > const&, bool const&, unsigned int const&) ()
> > > > > > > > > > from /usr/lib/x86_64-linux-gnu/libapt-pkg.so.5.0
> > > > > > > > > > #3 0x00007fffebc5e9aa in pkgInitConfig(Configuration&) () from
> > > > > > > > > > /usr/lib/x86_64-linux-gnu/libapt-pkg.so.5.0
> > > > > > > > > > #4 0x00007ffff29d5c48 in ?? () from
> > > > > > > > > > /usr/lib/python3/dist-packages/apt_pkg.cpython-35m-x86_64-linux-gnu.so
> > > > > > > > > > #5 0x00000000004ea10f in PyCFunction_Call ()
> > > > > > > > > > #6 0x0000000000536d94 in PyEval_EvalFrameEx ()
> > > > > > > > > > #7 0x000000000053fc97 in ?? ()
> > > > > > > > > > #8 0x00000000005409bf in PyEval_EvalCode ()
> > > > > > > > > > #9 0x000000000054a328 in ?? ()
> > > > > > > > > > #10 0x00000000004ea1c6 in PyCFunction_Call ()
> > > > > > > > > > #11 0x000000000053d353 in PyEval_EvalFrameEx ()
> > > > > > > > > > #12 0x000000000053fc97 in ?? ()
> > > > > > > > > > #13 0x000000000053bc93 in PyEval_EvalFrameEx ()
> > > > > > > > > > #14 0x000000000053b294 in PyEval_EvalFrameEx ()
> > > > > > > > > > #15 0x000000000053b294 in PyEval_EvalFrameEx ()
> > > > > > > > > > #16 0x000000000053b294 in PyEval_EvalFrameEx ()
> > > > > > > > > > #17 0x0000000000540b0b in PyEval_EvalCodeEx ()
> > > > > > > > > > #18 0x00000000004ec2e3 in ?? ()
> > > > > > > > > > #19 0x00000000005c20e7 in PyObject_Call ()
> > > > > > > > > >
> > > > > > > > > > I was using fresh DLAMI ubuntu 18.0 and CUDA 10.0, built with
> > > > > > > > > > USE_CUDA=1, USE_CUDNN=1, the rest are default values.
> > > > > > > > > >
> > > > > > > > > > -Zhi
> > > > > > > >
> > > > > > > > Change to +1, I figured out that it was due to the dependencies. I
> > > > > > > > still have issue using DL base AMI with python3, but I will not
> > > > > > > > regard it as a blocker to 1.5 release.
> > > > > > > > Tested Gluon-CV training and works fine.
> > > > > > > >
> > > > > > > > -Zhi