This vote has been closed. We will make another tag and start the vote again.

-sz

> On Jun 18, 2019, at 5:24 PM, Lin Yuan <apefor...@gmail.com> wrote:
> 
> With PR https://github.com/apache/incubator-mxnet/pull/15213 I could
> verify that Horovod builds successfully with MXNet built from source, so
> I will remove my previous -1 vote.
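> 
> For anyone who wants to double-check the combination locally, here is a
> minimal, hypothetical sanity check (it only assumes Horovod's public
> MXNet bindings; the tensor shape is arbitrary):
> 
>     import mxnet as mx
>     import horovod.mxnet as hvd
> 
>     # Initialize Horovod and report process placement.
>     hvd.init()
>     print("rank %d of %d" % (hvd.rank(), hvd.size()))
> 
>     # All-reduce a small tensor to confirm MXNet and Horovod are wired
>     # together correctly.
>     x = mx.nd.ones((2, 2)) * hvd.rank()
>     y = hvd.allreduce(x, average=True)
>     print(y.asnumpy())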
> 
> Best,
> 
> Lin
> 
>> On Tue, Jun 18, 2019 at 2:10 PM Junru Shao <junrushao1...@gmail.com> wrote:
>> 
>> Dear community,
>> 
>> I am happy to share some results regarding commit 83d2c2d0e (PR
>> #14192, link: https://github.com/apache/incubator-mxnet/pull/14192),
>> which Pedro mentioned causes a regression.
>> 
>> First, using the exact model that Pedro provided, we did rigorous profiling
>> and found that PR #14192 slows it down by 7.26 ms (from 235.65 ms
>> to 242.91 ms).
>> 
>> Then, we submitted a follow-up PR #15262 (link:
>> https://github.com/apache/incubator-mxnet/pull/15262) to fix the
>> regression. By applying the patch on top of commit 83d2c2d0e, we verified
>> that we get comparable performance. Please refer to the PR if you are
>> interested in our experiments.
>> 
>> That is to say, the regression caused by commit 83d2c2d0e should now be
>> addressed. Please let me know if there are any further issues.
>> 
>> Thank you so much,
>> Junru
>> 
>> On Thu, Jun 13, 2019 at 3:05 PM Pedro Larroy <pedro.larroy.li...@gmail.com
>>> 
>> wrote:
>> 
>>> I will reach out to you in private; the model is not public. I think we
>>> should be able to reproduce this problem with a public model using LSTM.
>>> 
>>> 
>>> On Thu, Jun 13, 2019 at 11:15 AM Junru Shao <junrushao1...@gmail.com>
>>> wrote:
>>>> 
>>>> Hi Pedro,
>>>> 
>>>> Thanks for bringing this up!
>>>> 
>>>> Could you provide your model so that we can dig into this?
>>>> 
>>>> Thanks,
>>>> Junru
>>>> 
>>>> On Thu, Jun 13, 2019 at 10:33 Pedro Larroy <
>> pedro.larroy.li...@gmail.com
>>>> 
>>>> wrote:
>>>> 
>>>>> I have isolated some of the commits that are causing performance
>>>>> regressions in WaveNet-like models:
>>>>> 
>>>>> Title: 83d2c2d0e:[MXNET-1324] Add NaiveRunGraph to imperative utils
>>>>> (#14192)
>>>>> 
>>>>> Causes a regression that makes hybridize with static_alloc slower for
>>>>> GPU inference.
>>>>> 
>>>>> [0f63659be5070af218095a6a460427d2a1b67aba] add a compiler flag to use
>>>>> int64 as tensor size (#14570)
>>>>> 
>>>>> Causes overall regressions in CPU inference.
>>>>> 
>>>>> 
>>>>> Pedro.
>>>>> 
>>>>> On Wed, Jun 12, 2019 at 11:52 AM Lai Wei <roywei...@gmail.com>
>> wrote:
>>>>>> 
>>>>>> Hi @dev,
>>>>>> 
>>>>>> I am canceling the vote as the issue Lin discovered requires a fix [1]
>>>>>> and the solution is not ready yet.
>>>>>> It's a general problem when building MXNet from source, not only
>>>>>> impacting Horovod use cases. Any help is appreciated.
>>>>>> 
>>>>>> Other issues we are tracking:
>>>>>> 1. Regression on hybridize with static_alloc (not a blocker for now).
>>>>>> 2. Scala doc issue [2]; the fix is already merged in master and needs
>>>>>> to be backported to 1.5.x.
>>>>>> 
>>>>>> Thanks for everyone's help! Please let us know if there are any other
>>>>>> issues with 1.5.0.
>>>>>> 
>>>>>> [1] https://github.com/apache/incubator-mxnet/pull/15213
>>>>>> [2] https://github.com/apache/incubator-mxnet/pull/15216
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Best Regards
>>>>>> 
>>>>>> Lai
>>>>>> 
>>>>>> 
>>>>>> On Tue, Jun 11, 2019 at 5:04 PM Pedro Larroy <
>>>>> pedro.larroy.li...@gmail.com>
>>>>>> wrote:
>>>>>> 
>>>>>>> Tested with CPU: 2.6x slower comparing master vs 1.4.1.
>>>>>>> 
>>>>>>> Looks like a general regression.
>>>>>>> 
>>>>>>> 
>>>>>>> On Tue, Jun 11, 2019 at 2:31 PM Lai Wei <roywei...@gmail.com>
>>> wrote:
>>>>>>>> 
>>>>>>>> Hi guys,
>>>>>>>> 
>>>>>>>> Thanks for the updates. Currently, we are able to confirm Lin's issue
>>>>>>>> with Horovod, and there is a fix pending [1].
>>>>>>>> We will update later today on whether we need to cancel this vote for
>>>>>>>> the fix.
>>>>>>>> 
>>>>>>>> As for the hybridize with static_alloc performance regression, IMO it
>>>>>>>> does not need to be a blocker if we have the following speed order:
>>>>>>>> 1.5.0 w/o static > 1.5.0 w/ static > 1.4.1 w/ static > 1.4.1 w/o static
>>>>>>>> It would be great to know the following to better decide whether this
>>>>>>>> should block the release:
>>>>>>>> 1) whether this is a model-specific or a general regression;
>>>>>>>> 2) whether this is platform-specific or general (w/ or w/o CUDA, w/ or
>>>>>>>> w/o MKLDNN).
>>>>>>>> 
>>>>>>>> 
>>>>>>>> [1]https://github.com/apache/incubator-mxnet/pull/15213
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Thanks
>>>>>>>> 
>>>>>>>> Best Regards
>>>>>>>> 
>>>>>>>> Lai
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Tue, Jun 11, 2019 at 1:46 PM Zhi Zhang <
>> zhresh...@apache.org>
>>>>> wrote:
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On 2019/06/11 18:53:56, Pedro Larroy <
>>> pedro.larroy.li...@gmail.com
>>>>>> 
>>>>>>>>> wrote:
>>>>>>>>>> The stack trace doesn't seem to come from MXNet, do you have
>>>>>>>>>> more info?
>>>>>>>>>> 
>>>>>>>>>> On Tue, Jun 11, 2019 at 11:46 AM Zhi Zhang <
>>> zhresh...@apache.org
>>>>>> 
>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On 2019/06/11 17:36:09, Pedro Larroy <
>>>>> pedro.larroy.li...@gmail.com
>>>>>>>> 
>>>>>>>>> wrote:
>>>>>>>>>>>> A bit more background into this:
>>>>>>>>>>>> 
>>>>>>>>>>>> While tuning a model using LSTM and convolutions, we found that
>>>>>>>>>>>> hybridize with static_alloc and static_shape is 15% slower in the
>>>>>>>>>>>> latest revision than in version 1.4.1, where hybridize with
>>>>>>>>>>>> static_alloc and static_shape is 10% faster than without.
>>>>>>>>>>>> 
>>>>>>>>>>>> Overall, we are still 33% faster when comparing master to 1.5.
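>>>>>>>>>>>> 
>>>>>>>>>>>> For anyone who wants to run a similar comparison, a minimal sketch
>>>>>>>>>>>> follows (the toy network, shapes, and iteration count below are
>>>>>>>>>>>> made up; the real model is private):
>>>>>>>>>>>> 
>>>>>>>>>>>>     import time
>>>>>>>>>>>>     from mxnet import gluon, nd
>>>>>>>>>>>> 
>>>>>>>>>>>>     # Toy stand-in for the real (private) LSTM/convolution model.
>>>>>>>>>>>>     net = gluon.nn.HybridSequential()
>>>>>>>>>>>>     net.add(gluon.nn.Conv1D(32, kernel_size=3),
>>>>>>>>>>>>             gluon.rnn.LSTM(64),
>>>>>>>>>>>>             gluon.nn.Dense(10))
>>>>>>>>>>>>     net.initialize()
>>>>>>>>>>>>     x = nd.random.uniform(shape=(8, 16, 100))
>>>>>>>>>>>> 
>>>>>>>>>>>>     def bench(label):
>>>>>>>>>>>>         net(x).wait_to_read()  # warm-up, triggers graph build
>>>>>>>>>>>>         start = time.time()
>>>>>>>>>>>>         for _ in range(100):
>>>>>>>>>>>>             net(x)
>>>>>>>>>>>>         nd.waitall()
>>>>>>>>>>>>         print(label, (time.time() - start) * 10.0, "ms/iter")
>>>>>>>>>>>> 
>>>>>>>>>>>>     net.hybridize()  # dynamic allocation
>>>>>>>>>>>>     bench("hybridize")
>>>>>>>>>>>>     net.hybridize(static_alloc=True, static_shape=True)
>>>>>>>>>>>>     bench("hybridize + static_alloc/static_shape")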
>>>>>>>>>>>> 
>>>>>>>>>>>> Let me know if you think this is a release blocker or
>>> not.
>>>>>>>>>>>> 
>>>>>>>>>>>> Pedro.
>>>>>>>>>>>> 
>>>>>>>>>>>> On Mon, Jun 10, 2019 at 4:51 PM Pedro Larroy
>>>>>>>>>>>> <pedro.larroy.li...@gmail.com> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> -1
>>>>>>>>>>>>> 
>>>>>>>>>>>>> We found a performance regression vs 1.4 related to
>>>>> CachedOp
>>>>>>> which
>>>>>>>>>>>>> affects Hybrid forward, which we are looking into.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Pedro.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Mon, Jun 10, 2019 at 4:33 PM Lin Yuan <
>>>>> apefor...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> -1 (Tentatively until resolved)
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I tried to build MXNet 1.5.0 from source and pip install horovod
>>>>>>>>>>>>>> but got the following error:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Reproduce:
>>>>>>>>>>>>>> 1) cp make/config.mk .
>>>>>>>>>>>>>> 2) turn on USE_CUDA, USE_CUDNN, USE_NCCL
>>>>>>>>>>>>>> 3) make -j
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> MXNet can build successfully.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 4) pip install horovod
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> /home/ubuntu/src/incubator-mxnet/python/mxnet/../../include/mkldnn/mkldnn.h:55:28:
>>>>>>>>>>>>>> fatal error: mkldnn_version.h: No such file or directory
>>>>>>>>>>>>>>     compilation terminated.
>>>>>>>>>>>>>>     INFO: Unable to build MXNet plugin, will skip it.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I did not change any setting of MKLDNN in my config.mk. I am
>>>>>>>>>>>>>> building on DLAMI Base 18.0, which is Ubuntu 16.04 with CUDA 10.0.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Lin
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Sat, Jun 8, 2019 at 5:39 PM shiwen hu <
>>>>>>> yajiedes...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> +1
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Lai Wei <roywei...@gmail.com> wrote on Sunday, June 9, 2019,
>>>>>>>>>>>>>>> at 4:12 AM:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Dear MXNet community,
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> This is the 3-day vote to release Apache MXNet (incubating)
>>>>>>>>>>>>>>>> version 1.5.0.
>>>>>>>>>>>>>>>> Voting on dev@ will start June 8, 23:59:59 (PST) and close on
>>>>>>>>>>>>>>>> June 11, 23:59:59.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 1) Link to release notes:
>>>>>>>>>>>>>>>> 
>>>>>>>>> 
>>>>> 
>> https://cwiki.apache.org/confluence/display/MXNET/1.5.0+Release+Notes
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 2) Link to release candidate:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>> 
>>> https://github.com/apache/incubator-mxnet/releases/tag/1.5.0.rc0
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 3) Link to source and signatures on apache dist
>>>>> server:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>> 
>>> https://dist.apache.org/repos/dist/dev/incubator/mxnet/1.5.0.rc0/
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Please remember to TEST first before voting
>>>>> accordingly:
>>>>>>>>>>>>>>>> +1 = approve
>>>>>>>>>>>>>>>> +0 = no opinion
>>>>>>>>>>>>>>>> -1 = disapprove (provide reason)
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Best Regards
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Lai
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> -1. Built from source; importing mxnet in Python causes a segfault.
>>>>>>>>>>> 
>>>>>>>>>>> back trace:
>>>>>>>>>>> 
>>>>>>>>>>> Thread 1 "python3" received signal SIGSEGV, Segmentation
>>> fault.
>>>>>>>>>>> 0x00007fff3e8a9f20 in ?? ()
>>>>>>>>>>> (gdb) bt
>>>>>>>>>>> #0  0x00007fff3e8a9f20 in ?? ()
>>>>>>>>>>> #1  0x00007fffebbf440c in ReadConfigFile(Configuration&,
>>>>>>>>>>> std::__cxx11::basic_string<char, std::char_traits<char>,
>>>>>>>>>>> std::allocator<char> > const&, bool const&, unsigned int
>>>>> const&) ()
>>>>>>>>> from
>>>>>>>>>>> /usr/lib/x86_64-linux-gnu/libapt-pkg.so.5.0
>>>>>>>>>>> #2  0x00007fffebbf3d97 in ReadConfigDir(Configuration&,
>>>>>>>>>>> std::__cxx11::basic_string<char, std::char_traits<char>,
>>>>>>>>>>> std::allocator<char> > const&, bool const&, unsigned int
>>>>> const&) ()
>>>>>>>>> from
>>>>>>>>>>> /usr/lib/x86_64-linux-gnu/libapt-pkg.so.5.0
>>>>>>>>>>> #3  0x00007fffebc5e9aa in pkgInitConfig(Configuration&)
>> ()
>>> from
>>>>>>>>>>> /usr/lib/x86_64-linux-gnu/libapt-pkg.so.5.0
>>>>>>>>>>> #4  0x00007ffff29d5c48 in ?? () from
>>>>>>> /usr/lib/python3/dist-packages/
>>>>>>>>>>> apt_pkg.cpython-35m-x86_64-linux-gnu.so
>>>>>>>>>>> #5  0x00000000004ea10f in PyCFunction_Call ()
>>>>>>>>>>> #6  0x0000000000536d94 in PyEval_EvalFrameEx ()
>>>>>>>>>>> #7  0x000000000053fc97 in ?? ()
>>>>>>>>>>> #8  0x00000000005409bf in PyEval_EvalCode ()
>>>>>>>>>>> #9  0x000000000054a328 in ?? ()
>>>>>>>>>>> #10 0x00000000004ea1c6 in PyCFunction_Call ()
>>>>>>>>>>> #11 0x000000000053d353 in PyEval_EvalFrameEx ()
>>>>>>>>>>> #12 0x000000000053fc97 in ?? ()
>>>>>>>>>>> #13 0x000000000053bc93 in PyEval_EvalFrameEx ()
>>>>>>>>>>> #14 0x000000000053b294 in PyEval_EvalFrameEx ()
>>>>>>>>>>> #15 0x000000000053b294 in PyEval_EvalFrameEx ()
>>>>>>>>>>> #16 0x000000000053b294 in PyEval_EvalFrameEx ()
>>>>>>>>>>> #17 0x0000000000540b0b in PyEval_EvalCodeEx ()
>>>>>>>>>>> #18 0x00000000004ec2e3 in ?? ()
>>>>>>>>>>> #19 0x00000000005c20e7 in PyObject_Call ()
>>>>>>>>>>> 
>>>>>>>>>>> I was using a fresh DLAMI Ubuntu 18.0 and CUDA 10.0, built with
>>>>>>>>>>> USE_CUDA=1 and USE_CUDNN=1; the rest are default values.
>>>>>>>>>>> 
>>>>>>>>>>> -Zhi
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Changing to +1; I figured out that it was due to the dependencies.
>>>>>>>>> I still have an issue using the DL Base AMI with Python 3, but I will
>>>>>>>>> not regard it as a blocker for the 1.5 release.
>>>>>>>>> Tested Gluon-CV training and it works fine.
>>>>>>>>> 
>>>>>>>>> -Zhi
>>>>>>>>> 
>>>>>>> 
>>>>> 
>>> 
>> 
