Re: [VOTE] Release Apache MXNet(incubating) version 1.2.0.RC2

2018-05-04 Thread Jun Wu
+1
I built from source and ran all the model quantization examples
successfully.
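
For reference, a minimal sketch of this kind of check, reusing the CPU build
flags and the quantization test path that appear later in this thread (the
specific quantization example scripts Jun ran are not named here, so the
nosetests target below is only a stand-in):

# build MXNet from source with the CPU flags quoted later in this thread
make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas USE_CUDA=0 USE_CUDNN=0
# run the quantization test suite as a rough stand-in for the examples
nosetests-2.7 --verbose tests/python/quantization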

On Fri, May 4, 2018 at 3:05 PM, Anirudh  wrote:

> Hi Pedro, Haibin, Indhu,
>
> Thank you for your inputs on the release. I ran the test
> `test_module.py:test_forward_reshape` 250k times with different seeds.
> I was unable to reproduce the issue on the release branch.
> If everything goes well with CI tests by Pedro running till Sunday, I think
> we should move forward with the release (given that we have enough +1s).
> Is it possible to trigger the CI on the 1.2 branch repeatedly or at a fixed
> schedule till Sunday?
>
> Anirudh
>
> On Fri, May 4, 2018 at 11:56 AM, Indhu  wrote:
>
> > +1
> >
> > I've been using a CUDA build from this branch (built from source) on
> > Ubuntu for a couple of days now and I haven't seen any issues.
> >
> > The flaky tests need to be fixed but this release need not be blocked for
> > that.
> >
> >
> > On Fri, May 4, 2018 at 11:32 AM, Haibin Lin 
> > wrote:
> >
> > > I agree with Anirudh that the focus of the discussion should be limited
> > to
> > > the release branch, not the master branch. Anything that breaks on
> master
> > > but works on release branch should not block the release itself.
> > >
> > >
> > > Best,
> > >
> > > Haibin
> > >
> > > On Fri, May 4, 2018 at 10:58 AM, Pedro Larroy <
> > > pedro.larroy.li...@gmail.com>
> > > wrote:
> > >
> > > > I see your point.
> > > >
> > > > I checked the failures on the v1.2.0 branch and I don't see
> segfaults,
> > > just
> > > > minor failures due to flaky tests.
> > > >
> > > > I will trigger it repeatedly a few times until Sunday and change my
> > > > vote accordingly.
> > > >
> > > > http://jenkins.mxnet-ci.amazon-ml.com/job/incubator-
> mxnet/job/v1.2.0/
> > > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/
> > > > incubator-mxnet/detail/v1.2.0/17/pipeline
> > > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/
> > > > incubator-mxnet/detail/v1.2.0/15/pipeline/
> > > >
> > > >
> > > > Pedro.
> > > >
> > > > On Fri, May 4, 2018 at 7:16 PM, Anirudh 
> wrote:
> > > >
> > > > > Hi Pedro,
> > > > >
> > > > > Thank you for the suggestions. I will try to reproduce this without
> > > fixed
> > > > > seeds and also run it for a longer time duration.
> > > > > Having said that, running unit tests over and over for a couple of
> > days
> > > > > will likely cause problems, because there are around 42 open issues
> > > > > for flaky tests:
> > > > > https://github.com/apache/incubator-mxnet/issues?q=is%
> > > > > 3Aopen+is%3Aissue+label%3AFlaky
> > > > > Also, the release branch has diverged from master around 3 weeks
> back
> > > and
> > > > > it doesn't have many of the changes merged to the master.
> > > > > So, my question essentially is, what will be your benchmark to
> accept
> > > the
> > > > > release ?
> > > > > Is it that we run the test which you provided on 1.2 without fixed
> > > seeds
> > > > > and for a longer duration without failures ?
> > > > > Or is it that all unit tests should pass over a period of 2 days
> > > without
> > > > > issues? This may require fixing all of the flaky tests, which would
> > > > > delay the release by a considerable amount of time.
> > > > > Or is it something else ?
> > > > >
> > > > > Anirudh
> > > > >
> > > > >
> > > > > On Fri, May 4, 2018 at 4:49 AM, Pedro Larroy <
> > > > pedro.larroy.li...@gmail.com
> > > > > >
> > > > > wrote:
> > > > >
> > > > > > Could you remove the fixed seeds and run it for a couple of hours
> > > with
> > > > an
> > > > > > additional loop?  Also I would suggest running the unit tests
> over
> > > and
> > > > > over
> > > > > > for a couple of days if possible.
> > > > > >
> > > > > >
> > > > > > Pedro.
> > > > > >
> > > > > > On Thu, May 3, 2018 at 8:33 PM, Anirudh 
> > > wrote:
> > > > > >
> > > > > > > Hi Pedro and Naveen,
> > > > > > >
> > > > > > > I was able to reproduce this issue with MKLDNN on the master but
> > > > > > > not on the 1.2.RC2 branch.
> > > > > > >
> > > > > > > Did the following on 1.2.RC2 branch:
> > > > > > >
> > > > > > > make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas
> > USE_DIST_KVSTORE=0
> > > > > > > USE_CUDA=0 USE_CUDNN=0 USE_MKLDNN=1
> > > > > > > export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
> > > > > > > export MXNET_TEST_SEED=11
> > > > > > > export MXNET_MODULE_SEED=812478194
> > > > > > > export MXNET_TEST_COUNT=1
> > > > > > > nosetests-2.7 -v tests/python/unittest/test_
> > > > > > module.py:test_forward_reshape
> > > > > > >
> > > > > > > Was able to do the 10k runs successfully.
> > > > > > >
> > > > > > > Anirudh
> > > > > > >
> > > > > > > On Thu, May 3, 2018 at 8:46 AM, Anirudh  >
> > > > wrote:
> > > > > > >
> > > > > > > > Hi Pedro and Naveen,
> > > > > > > >
> > > > > > > > Is this issue reproducible when MXNet 

Re: [VOTE] Release Apache MXNet(incubating) version 1.2.0.RC2

2018-05-04 Thread Anirudh
Hi Pedro, Haibin, Indhu,

Thank you for your inputs on the release. I ran the test
`test_module.py:test_forward_reshape` 250k times with different seeds.
I was unable to reproduce the issue on the release branch.
If everything goes well with CI tests by Pedro running till Sunday, I think
we should move forward with the release (given that we have enough +1s).
Is it possible to trigger the CI on the 1.2 branch repeatedly or at a fixed
schedule till Sunday?
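
A minimal sketch of how such a repeated run can be driven with the test hooks
used elsewhere in this thread (MXNET_TEST_COUNT repeats a decorated test
inside a single nosetests invocation; leaving MXNET_TEST_SEED and
MXNET_MODULE_SEED unset lets every repetition draw fresh random seeds). The
exact command used for the 250k runs is not shown, so this is only an
approximation:

# repeat the single test many times, each time with new random seeds
unset MXNET_TEST_SEED MXNET_MODULE_SEED
export MXNET_TEST_COUNT=250000
nosetests-2.7 --verbose tests/python/unittest/test_module.py:test_forward_reshape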

Anirudh

On Fri, May 4, 2018 at 11:56 AM, Indhu  wrote:

> +1
>
> I've been using a CUDA build from this branch (built from source) on Ubuntu
> for a couple of days now and I haven't seen any issues.
>
> The flaky tests need to be fixed but this release need not be blocked for
> that.
>
>
> On Fri, May 4, 2018 at 11:32 AM, Haibin Lin 
> wrote:
>
> > I agree with Anirudh that the focus of the discussion should be limited
> to
> > the release branch, not the master branch. Anything that breaks on master
> > but works on release branch should not block the release itself.
> >
> >
> > Best,
> >
> > Haibin
> >
> > On Fri, May 4, 2018 at 10:58 AM, Pedro Larroy <
> > pedro.larroy.li...@gmail.com>
> > wrote:
> >
> > > I see your point.
> > >
> > > I checked the failures on the v1.2.0 branch and I don't see segfaults,
> > just
> > > minor failures due to flaky tests.
> > >
> > > I will trigger it repeatedly a few times until Sunday and change my
> > > vote accordingly.
> > >
> > > http://jenkins.mxnet-ci.amazon-ml.com/job/incubator-mxnet/job/v1.2.0/
> > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/
> > > incubator-mxnet/detail/v1.2.0/17/pipeline
> > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/
> > > incubator-mxnet/detail/v1.2.0/15/pipeline/
> > >
> > >
> > > Pedro.
> > >
> > > On Fri, May 4, 2018 at 7:16 PM, Anirudh  wrote:
> > >
> > > > Hi Pedro,
> > > >
> > > > Thank you for the suggestions. I will try to reproduce this without
> > fixed
> > > > seeds and also run it for a longer time duration.
> > > > Having said that, running unit tests over and over for a couple of
> days
> > > > will likely cause problems, because there are around 42 open issues
> > > > for flaky tests:
> > > > https://github.com/apache/incubator-mxnet/issues?q=is%
> > > > 3Aopen+is%3Aissue+label%3AFlaky
> > > > Also, the release branch has diverged from master around 3 weeks back
> > and
> > > > it doesn't have many of the changes merged to the master.
> > > > So, my question essentially is, what will be your benchmark to accept
> > the
> > > > release ?
> > > > Is it that we run the test which you provided on 1.2 without fixed
> > seeds
> > > > and for a longer duration without failures ?
> > > > Or is it that all unit tests should pass over a period of 2 days
> > without
> > > > issues? This may require fixing all of the flaky tests, which would
> > > > delay the release by a considerable amount of time.
> > > > Or is it something else ?
> > > >
> > > > Anirudh
> > > >
> > > >
> > > > On Fri, May 4, 2018 at 4:49 AM, Pedro Larroy <
> > > pedro.larroy.li...@gmail.com
> > > > >
> > > > wrote:
> > > >
> > > > > Could you remove the fixed seeds and run it for a couple of hours
> > with
> > > an
> > > > > additional loop?  Also I would suggest running the unit tests over
> > and
> > > > over
> > > > > for a couple of days if possible.
> > > > >
> > > > >
> > > > > Pedro.
> > > > >
> > > > > On Thu, May 3, 2018 at 8:33 PM, Anirudh 
> > wrote:
> > > > >
> > > > > > Hi Pedro and Naveen,
> > > > > >
> > > > > > I was able to reproduce this issue with MKLDNN on the master but
> > > > > > not on the 1.2.RC2 branch.
> > > > > >
> > > > > > Did the following on 1.2.RC2 branch:
> > > > > >
> > > > > > make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas
> USE_DIST_KVSTORE=0
> > > > > > USE_CUDA=0 USE_CUDNN=0 USE_MKLDNN=1
> > > > > > export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
> > > > > > export MXNET_TEST_SEED=11
> > > > > > export MXNET_MODULE_SEED=812478194
> > > > > > export MXNET_TEST_COUNT=1
> > > > > > nosetests-2.7 -v tests/python/unittest/test_
> > > > > module.py:test_forward_reshape
> > > > > >
> > > > > > Was able to do the 10k runs successfully.
> > > > > >
> > > > > > Anirudh
> > > > > >
> > > > > > On Thu, May 3, 2018 at 8:46 AM, Anirudh 
> > > wrote:
> > > > > >
> > > > > > > Hi Pedro and Naveen,
> > > > > > >
> > > > > > > Is this issue reproducible when MXNet is built with
> USE_MKLDNN=0?
> > > > > > > Also, there are a bunch of MKLDNN fixes that didn't go into the
> > > > release
> > > > > > > branch. Is this issue reproducible on the release branch ?
> > > > > > > In my opinion, since we have marked MKLDNN as experimental
> > feature
> > > > for
> > > > > > the
> > > > > > > release, if it is confirmed to be a MKLDNN issue
> > > > > > > we don't need to block the release on it.
> > > > > > >
> > > 

Re: segmentation fault in master using mkldnn

2018-05-04 Thread Da Zheng
I have come up with a temporary solution for this memory error:
https://github.com/apache/incubator-mxnet/pull/10812
I tested it with Anirudh's command, and it works fine.

I call it a temporary solution because it only fixes the segfault. It
seems to me that the race condition can potentially corrupt data in
the input array even without MKLDNN. Please see the description in my
PR for more details.
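
For anyone who wants to verify the fix locally, a rough sketch (assuming the
MKLDNN build flags and test command quoted earlier in this thread; the PR
branch is fetched through GitHub's pull refs):

# fetch and check out the PR branch, then rebuild with MKLDNN enabled
git fetch https://github.com/apache/incubator-mxnet pull/10812/head:pr-10812
git checkout pr-10812 && git submodule update --init --recursive
make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas USE_CUDA=0 USE_CUDNN=0 USE_MKLDNN=1
# re-run the test that used to segfault
nosetests-2.7 --verbose tests/python/unittest/test_module.py:test_forward_reshape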

Best,
Da

On Fri, May 4, 2018 at 12:14 PM, Zheng, Da  wrote:
> Hello Pedro,
>
> I did exactly what you said in your previous email.
>
> I edited ci/docker/runtime_functions.sh based on your patch, and here is
> the shell history of the commands I ran:
>  2004  vim ci/docker/runtime_functions.sh
>  2005  ci/docker/runtime_functions.sh clean_repo
>  2006  ci/build.py -p ubuntu_cpu /work/runtime_functions.sh 
> build_ubuntu_cpu_mkldnn && ci/build.py --platform ubuntu_cpu 
> /work/runtime_functions.sh unittest_ubuntu_python2_cpu
>
> Best,
> Da
>
> On 5/4/18, 4:32 AM, "Pedro Larroy"  wrote:
>
> Hi Da. I ran it both on my Ubuntu 16.04 workstation and in a p3 instance
> with the DLAMI. I'm pretty confident it runs in most Linux environments.
>
> Can you post the exact commands that you ran? It's not clear to me from
> your paste what the problem is. Please make sure your repo and all your
> subrepos are clean before starting the docker build.
>
> ci/docker/runtime_functions.sh clean_repo
>
> Pedro.
>
> On Thu, May 3, 2018 at 7:17 PM, Zheng, Da  wrote:
>
> > Hello Pedro,
> >
> > I tried your instructions. It seems I can't run the docker in EC2
> > instances.
> > Where did you reproduce the error?
> >
> > Thanks,
> > Da
> >
> > + echo 'deb http://cran.rstudio.com/bin/linux/ubuntu trusty/'
> > + gpg --keyserver keyserver.ubuntu.com --recv-key E084DAB9
> > gpg: directory `/root/.gnupg' created
> > gpg: new configuration file `/root/.gnupg/gpg.conf' created
> > gpg: WARNING: options in `/root/.gnupg/gpg.conf' are not yet active 
> during
> > this run
> > gpg: keyring `/root/.gnupg/secring.gpg' created
> > gpg: keyring `/root/.gnupg/pubring.gpg' created
> > gpg: requesting key E084DAB9 from hkp server keyserver.ubuntu.com
> > gpg: keyserver timed out
> > gpg: keyserver receive failed: keyserver error
> > The command '/bin/sh -c /work/ubuntu_r.sh' returned a non-zero code: 2
> > Traceback (most recent call last):
> >   File "ci/build.py", line 263, in 
> > sys.exit(main())
> >   File "ci/build.py", line 197, in main
> > build_docker(platform, docker_binary)
> >   File "ci/build.py", line 73, in build_docker
> > check_call(cmd)
> >   File "/usr/lib/python3.5/subprocess.py", line 581, in check_call
> > raise CalledProcessError(retcode, cmd)
> > subprocess.CalledProcessError: Command '['docker', 'build', '-f',
> > 'docker/Dockerfile.build.ubuntu_cpu', '--build-arg', 'USER_ID=1000',
> > '-t', 'mxnet/build.ubuntu_cpu', 'docker']' returned non-zero exit 
> status 2
> >
> >
> > On 5/3/18, 8:01 AM, "Pedro Larroy"  wrote:
> >
> > Hi Da
> >
> > Reproduction instructions:
> >
> > On the host:
> >
> > Adjust core pattern:
> >
> > $ echo '/tmp/core.%h.%e.%t' > /proc/sys/kernel/core_pattern
> >
> >
> > Use the following patch:
> >
> > ===
> >
> > diff --git a/3rdparty/mkldnn b/3rdparty/mkldnn
> > --- a/3rdparty/mkldnn
> > +++ b/3rdparty/mkldnn
> > @@ -1 +1 @@
> > -Subproject commit b4137dfc88e3bf5c6b62e833121802eb8c6696da
> > +Subproject commit b4137dfc88e3bf5c6b62e833121802eb8c6696da-dirty
> > diff --git a/ci/docker/runtime_functions.sh
> > b/ci/docker/runtime_functions.sh
> > index 027e287..62649c9 100755
> > --- a/ci/docker/runtime_functions.sh
> > +++ b/ci/docker/runtime_functions.sh
> > @@ -360,9 +360,15 @@ unittest_ubuntu_python2_cpu() {
> >  # https://github.com/apache/incubator-mxnet/issues/10026
> >  #export MXNET_MKLDNN_DEBUG=1  # Ignored if not present
> >  export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
> > -nosetests-2.7 --verbose tests/python/unittest
> > -nosetests-2.7 --verbose tests/python/train
> > -nosetests-2.7 --verbose tests/python/quantization
> > +export MXNET_TEST_SEED=11
> > +export MXNET_MODULE_SEED=812478194
> > +pwd
> > +export MXNET_TEST_COUNT=1
> > +ulimit -c unlimited
> > +ulimit -c
> > +while nosetests-2.7 --verbose
> > tests/python/unittest/test_module.py:test_forward_reshape; do echo
> > round;
> > done
> > +#nosetests-2.7 --verbose tests/python/train
>

Re: segmentation fault in master using mkldnn

2018-05-04 Thread Zheng, Da
Hello Pedro,

I did exactly what you said in your previous email.

I edited ci/docker/runtime_functions.sh based on your patch, and here is the
shell history of the commands I ran:
 2004  vim ci/docker/runtime_functions.sh 
 2005  ci/docker/runtime_functions.sh clean_repo
 2006  ci/build.py -p ubuntu_cpu /work/runtime_functions.sh 
build_ubuntu_cpu_mkldnn && ci/build.py --platform ubuntu_cpu 
/work/runtime_functions.sh unittest_ubuntu_python2_cpu

Best,
Da

On 5/4/18, 4:32 AM, "Pedro Larroy"  wrote:

Hi Da. I ran it both on my Ubuntu 16.04 workstation and in a p3 instance with
the DLAMI. I'm pretty confident it runs in most Linux environments.

Can you post the exact commands that you ran? It's not clear to me from your
paste what the problem is. Please make sure your repo and all your subrepos
are clean before starting the docker build.

ci/docker/runtime_functions.sh clean_repo

Pedro.

On Thu, May 3, 2018 at 7:17 PM, Zheng, Da  wrote:

> Hello Pedro,
>
> I tried your instructions. It seems I can't run the docker in EC2
> instances.
> Where did you reproduce the error?
>
> Thanks,
> Da
>
> + echo 'deb http://cran.rstudio.com/bin/linux/ubuntu trusty/'
> + gpg --keyserver keyserver.ubuntu.com --recv-key E084DAB9
> gpg: directory `/root/.gnupg' created
> gpg: new configuration file `/root/.gnupg/gpg.conf' created
> gpg: WARNING: options in `/root/.gnupg/gpg.conf' are not yet active during
> this run
> gpg: keyring `/root/.gnupg/secring.gpg' created
> gpg: keyring `/root/.gnupg/pubring.gpg' created
> gpg: requesting key E084DAB9 from hkp server keyserver.ubuntu.com
> gpg: keyserver timed out
> gpg: keyserver receive failed: keyserver error
> The command '/bin/sh -c /work/ubuntu_r.sh' returned a non-zero code: 2
> Traceback (most recent call last):
>   File "ci/build.py", line 263, in 
> sys.exit(main())
>   File "ci/build.py", line 197, in main
> build_docker(platform, docker_binary)
>   File "ci/build.py", line 73, in build_docker
> check_call(cmd)
>   File "/usr/lib/python3.5/subprocess.py", line 581, in check_call
> raise CalledProcessError(retcode, cmd)
> subprocess.CalledProcessError: Command '['docker', 'build', '-f',
> 'docker/Dockerfile.build.ubuntu_cpu', '--build-arg', 'USER_ID=1000',
> '-t', 'mxnet/build.ubuntu_cpu', 'docker']' returned non-zero exit status 2
>
>
> On 5/3/18, 8:01 AM, "Pedro Larroy"  wrote:
>
> Hi Da
>
> Reproduction instructions:
>
> On the host:
>
> Adjust core pattern:
>
> $ echo '/tmp/core.%h.%e.%t' > /proc/sys/kernel/core_pattern
>
>
> Use the following patch:
>
> ===
>
> diff --git a/3rdparty/mkldnn b/3rdparty/mkldnn
> --- a/3rdparty/mkldnn
> +++ b/3rdparty/mkldnn
> @@ -1 +1 @@
> -Subproject commit b4137dfc88e3bf5c6b62e833121802eb8c6696da
> +Subproject commit b4137dfc88e3bf5c6b62e833121802eb8c6696da-dirty
> diff --git a/ci/docker/runtime_functions.sh
> b/ci/docker/runtime_functions.sh
> index 027e287..62649c9 100755
> --- a/ci/docker/runtime_functions.sh
> +++ b/ci/docker/runtime_functions.sh
> @@ -360,9 +360,15 @@ unittest_ubuntu_python2_cpu() {
>  # https://github.com/apache/incubator-mxnet/issues/10026
>  #export MXNET_MKLDNN_DEBUG=1  # Ignored if not present
>  export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
> -nosetests-2.7 --verbose tests/python/unittest
> -nosetests-2.7 --verbose tests/python/train
> -nosetests-2.7 --verbose tests/python/quantization
> +export MXNET_TEST_SEED=11
> +export MXNET_MODULE_SEED=812478194
> +pwd
> +export MXNET_TEST_COUNT=1
> +ulimit -c unlimited
> +ulimit -c
> +while nosetests-2.7 --verbose
> tests/python/unittest/test_module.py:test_forward_reshape; do echo
> round;
> done
> +#nosetests-2.7 --verbose tests/python/train
> +#nosetests-2.7 --verbose tests/python/quantization
>  }
>
>  unittest_ubuntu_python3_cpu() {
>
>
>
> ==
>
> Build and execute the test, make sure the repo is clean
>
> $ ci/docker/runtime_functions.sh clean_repo
>
> $ ci/build.py -p ubuntu_cpu /work/runtime_functions.sh
> build_ubuntu_cpu_mkldnn && ci/build.py --platform ubuntu_cpu
> /work/runtime_functions.sh unittest_ubuntu_python2_cpu
>
>
> Once it crashes it will stop.
>
> Then go in the container:
>
>
> $ ci/build.py -p ubuntu_cpu --into-container 

Re: [VOTE] Release Apache MXNet(incubating) version 1.2.0.RC2

2018-05-04 Thread Indhu
+1

I've been using a CUDA build from this branch (built from source) on Ubuntu
for a couple of days now and I haven't seen any issues.

The flaky tests need to be fixed but this release need not be blocked for
that.


On Fri, May 4, 2018 at 11:32 AM, Haibin Lin 
wrote:

> I agree with Anirudh that the focus of the discussion should be limited to
> the release branch, not the master branch. Anything that breaks on master
> but works on release branch should not block the release itself.
>
>
> Best,
>
> Haibin
>
> On Fri, May 4, 2018 at 10:58 AM, Pedro Larroy <
> pedro.larroy.li...@gmail.com>
> wrote:
>
> > I see your point.
> >
> > I checked the failures on the v1.2.0 branch and I don't see segfaults,
> just
> > minor failures due to flaky tests.
> >
> > I will trigger it repeatedly a few times until Sunday and change my
> > vote accordingly.
> >
> > http://jenkins.mxnet-ci.amazon-ml.com/job/incubator-mxnet/job/v1.2.0/
> > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/
> > incubator-mxnet/detail/v1.2.0/17/pipeline
> > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/
> > incubator-mxnet/detail/v1.2.0/15/pipeline/
> >
> >
> > Pedro.
> >
> > On Fri, May 4, 2018 at 7:16 PM, Anirudh  wrote:
> >
> > > Hi Pedro,
> > >
> > > Thank you for the suggestions. I will try to reproduce this without
> fixed
> > > seeds and also run it for a longer time duration.
> > > Having said that, running unit tests over and over for a couple of days
> > > will likely cause problems, because there are around 42 open issues for
> > > flaky tests:
> > > https://github.com/apache/incubator-mxnet/issues?q=is%
> > > 3Aopen+is%3Aissue+label%3AFlaky
> > > Also, the release branch has diverged from master around 3 weeks back
> and
> > > it doesn't have many of the changes merged to the master.
> > > So, my question essentially is, what will be your benchmark to accept
> the
> > > release ?
> > > Is it that we run the test which you provided on 1.2 without fixed
> seeds
> > > and for a longer duration without failures ?
> > > Or is it that all unit tests should pass over a period of 2 days
> without
> > > issues? This may require fixing all of the flaky tests, which would
> > > delay the release by a considerable amount of time.
> > > Or is it something else ?
> > >
> > > Anirudh
> > >
> > >
> > > On Fri, May 4, 2018 at 4:49 AM, Pedro Larroy <
> > pedro.larroy.li...@gmail.com
> > > >
> > > wrote:
> > >
> > > > Could you remove the fixed seeds and run it for a couple of hours
> with
> > an
> > > > additional loop?  Also I would suggest running the unit tests over
> and
> > > over
> > > > for a couple of days if possible.
> > > >
> > > >
> > > > Pedro.
> > > >
> > > > On Thu, May 3, 2018 at 8:33 PM, Anirudh 
> wrote:
> > > >
> > > > > Hi Pedro and Naveen,
> > > > >
> > > > > I was able to reproduce this issue with MKLDNN on the master but
> > > > > not on the 1.2.RC2 branch.
> > > > >
> > > > > Did the following on 1.2.RC2 branch:
> > > > >
> > > > > make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas USE_DIST_KVSTORE=0
> > > > > USE_CUDA=0 USE_CUDNN=0 USE_MKLDNN=1
> > > > > export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
> > > > > export MXNET_TEST_SEED=11
> > > > > export MXNET_MODULE_SEED=812478194
> > > > > export MXNET_TEST_COUNT=1
> > > > > nosetests-2.7 -v tests/python/unittest/test_
> > > > module.py:test_forward_reshape
> > > > >
> > > > > Was able to do the 10k runs successfully.
> > > > >
> > > > > Anirudh
> > > > >
> > > > > On Thu, May 3, 2018 at 8:46 AM, Anirudh 
> > wrote:
> > > > >
> > > > > > Hi Pedro and Naveen,
> > > > > >
> > > > > > Is this issue reproducible when MXNet is built with USE_MKLDNN=0?
> > > > > > Also, there are a bunch of MKLDNN fixes that didn't go into the
> > > release
> > > > > > branch. Is this issue reproducible on the release branch ?
> > > > > > In my opinion, since we have marked MKLDNN as experimental
> feature
> > > for
> > > > > the
> > > > > > release, if it is confirmed to be a MKLDNN issue
> > > > > > we don't need to block the release on it.
> > > > > >
> > > > > > Anirudh
> > > > > >
> > > > > > On Thu, May 3, 2018 at 6:58 AM, Naveen Swamy  >
> > > > wrote:
> > > > > >
> > > > > >> Thanks for raising this issue Pedro.
> > > > > >>
> > > > > >> -1(binding)
> > > > > >>
> > > > > >> We were in a similar state for a while a year ago, a lot of
> effort
> > > > went
> > > > > to
> > > > > >> stabilize the tests and the CI. I have seen the PR builds are
> > > > > >> non-deterministic and you have to retry over and over (wasting
> > > > resources
> > > > > >> and time) and hope you get lucky.
> > > > > >>
> > > > > >> Look at the dashboard for master build
> > > > > >> http://jenkins.mxnet-ci.amazon-ml.com/job/incubator-
> > > mxnet/job/master/
> > > > > >>
> > > > > >> -Naveen
> > > > > >>
> > > > > >> On Thu, May 3, 2018 at 5:11 AM, Pedro 

Re: [VOTE] Release Apache MXNet(incubating) version 1.2.0.RC2

2018-05-04 Thread Haibin Lin
I agree with Anirudh that the focus of the discussion should be limited to
the release branch, not the master branch. Anything that breaks on master
but works on release branch should not block the release itself.


Best,

Haibin

On Fri, May 4, 2018 at 10:58 AM, Pedro Larroy 
wrote:

> I see your point.
>
> I checked the failures on the v1.2.0 branch and I don't see segfaults, just
> minor failures due to flaky tests.
>
> I will trigger it repeatedly a few times until Sunday and change my vote
> accordingly.
>
> http://jenkins.mxnet-ci.amazon-ml.com/job/incubator-mxnet/job/v1.2.0/
> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/
> incubator-mxnet/detail/v1.2.0/17/pipeline
> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/
> incubator-mxnet/detail/v1.2.0/15/pipeline/
>
>
> Pedro.
>
> On Fri, May 4, 2018 at 7:16 PM, Anirudh  wrote:
>
> > Hi Pedro,
> >
> > Thank you for the suggestions. I will try to reproduce this without fixed
> > seeds and also run it for a longer time duration.
> > Having said that, running unit tests over and over for a couple of days
> > will likely cause problems, because there are around 42 open issues for
> > flaky tests:
> > https://github.com/apache/incubator-mxnet/issues?q=is%
> > 3Aopen+is%3Aissue+label%3AFlaky
> > Also, the release branch has diverged from master around 3 weeks back and
> > it doesn't have many of the changes merged to the master.
> > So, my question essentially is, what will be your benchmark to accept the
> > release ?
> > Is it that we run the test which you provided on 1.2 without fixed seeds
> > and for a longer duration without failures ?
> > Or is it that all unit tests should pass over a period of 2 days without
> > issues? This may require fixing all of the flaky tests, which would delay
> > the release by a considerable amount of time.
> > Or is it something else ?
> >
> > Anirudh
> >
> >
> > On Fri, May 4, 2018 at 4:49 AM, Pedro Larroy <
> pedro.larroy.li...@gmail.com
> > >
> > wrote:
> >
> > > Could you remove the fixed seeds and run it for a couple of hours with
> an
> > > additional loop?  Also I would suggest running the unit tests over and
> > over
> > > for a couple of days if possible.
> > >
> > >
> > > Pedro.
> > >
> > > On Thu, May 3, 2018 at 8:33 PM, Anirudh  wrote:
> > >
> > > > Hi Pedro and Naveen,
> > > >
> > > > I was able to reproduce this issue with MKLDNN on the master but not
> > > > on the 1.2.RC2 branch.
> > > >
> > > > Did the following on 1.2.RC2 branch:
> > > >
> > > > make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas USE_DIST_KVSTORE=0
> > > > USE_CUDA=0 USE_CUDNN=0 USE_MKLDNN=1
> > > > export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
> > > > export MXNET_TEST_SEED=11
> > > > export MXNET_MODULE_SEED=812478194
> > > > export MXNET_TEST_COUNT=1
> > > > nosetests-2.7 -v tests/python/unittest/test_
> > > module.py:test_forward_reshape
> > > >
> > > > Was able to do the 10k runs successfully.
> > > >
> > > > Anirudh
> > > >
> > > > On Thu, May 3, 2018 at 8:46 AM, Anirudh 
> wrote:
> > > >
> > > > > Hi Pedro and Naveen,
> > > > >
> > > > > Is this issue reproducible when MXNet is built with USE_MKLDNN=0?
> > > > > Also, there are a bunch of MKLDNN fixes that didn't go into the
> > release
> > > > > branch. Is this issue reproducible on the release branch ?
> > > > > In my opinion, since we have marked MKLDNN as experimental feature
> > for
> > > > the
> > > > > release, if it is confirmed to be a MKLDNN issue
> > > > > we don't need to block the release on it.
> > > > >
> > > > > Anirudh
> > > > >
> > > > > On Thu, May 3, 2018 at 6:58 AM, Naveen Swamy 
> > > wrote:
> > > > >
> > > > >> Thanks for raising this issue Pedro.
> > > > >>
> > > > >> -1(binding)
> > > > >>
> > > > >> We were in a similar state for a while a year ago, a lot of effort
> > > went
> > > > to
> > > > >> stabilize the tests and the CI. I have seen the PR builds are
> > > > >> non-deterministic and you have to retry over and over (wasting
> > > resources
> > > > >> and time) and hope you get lucky.
> > > > >>
> > > > >> Look at the dashboard for master build
> > > > >> http://jenkins.mxnet-ci.amazon-ml.com/job/incubator-
> > mxnet/job/master/
> > > > >>
> > > > >> -Naveen
> > > > >>
> > > > >> On Thu, May 3, 2018 at 5:11 AM, Pedro Larroy <
> > > > >> pedro.larroy.li...@gmail.com>
> > > > >> wrote:
> > > > >>
> > > > >> > -1 nondeterministic failures on CI master:
> > > > >> > https://issues.apache.org/jira/browse/MXNET-396
> > > > >> >
> > > > >> > Was able to reproduce once in a fresh p3 instance with DLAMI;
> > > > >> > can't reproduce it consistently.
> > > > >> >
> > > > >> > On Wed, May 2, 2018 at 9:51 PM, Anirudh 
> > > > wrote:
> > > > >> >
> > > > >> > > Hi all,
> > > > >> > >
> > > > >> > > As part of RC2 release, we have addressed bugs and some
> concerns
> > > > that
> 

Re: [VOTE] Release Apache MXNet(incubating) version 1.2.0.RC2

2018-05-04 Thread Pedro Larroy
I see your point.

I checked the failures on the v1.2.0 branch and I don't see segfaults, just
minor failures due to flaky tests.

I will trigger it repeatedly a few times until Sunday and change my vote
accordingly.

http://jenkins.mxnet-ci.amazon-ml.com/job/incubator-mxnet/job/v1.2.0/
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/v1.2.0/17/pipeline
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/v1.2.0/15/pipeline/


Pedro.

On Fri, May 4, 2018 at 7:16 PM, Anirudh  wrote:

> Hi Pedro,
>
> Thank you for the suggestions. I will try to reproduce this without fixed
> seeds and also run it for a longer time duration.
> Having said that, running unit tests over and over for a couple of days
> will likely cause problems, because there are around 42 open issues for
> flaky tests:
> https://github.com/apache/incubator-mxnet/issues?q=is%
> 3Aopen+is%3Aissue+label%3AFlaky
> Also, the release branch has diverged from master around 3 weeks back and
> it doesn't have many of the changes merged to the master.
> So, my question essentially is, what will be your benchmark to accept the
> release ?
> Is it that we run the test which you provided on 1.2 without fixed seeds
> and for a longer duration without failures ?
> Or is it that all unit tests should pass over a period of 2 days without
> issues? This may require fixing all of the flaky tests, which would delay
> the release by a considerable amount of time.
> Or is it something else ?
>
> Anirudh
>
>
> On Fri, May 4, 2018 at 4:49 AM, Pedro Larroy  >
> wrote:
>
> > Could you remove the fixed seeds and run it for a couple of hours with an
> > additional loop?  Also I would suggest running the unit tests over and
> over
> > for a couple of days if possible.
> >
> >
> > Pedro.
> >
> > On Thu, May 3, 2018 at 8:33 PM, Anirudh  wrote:
> >
> > > Hi Pedro and Naveen,
> > >
> > > I was able to reproduce this issue with MKLDNN on the master but not
> > > on the 1.2.RC2 branch.
> > >
> > > Did the following on 1.2.RC2 branch:
> > >
> > > make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas USE_DIST_KVSTORE=0
> > > USE_CUDA=0 USE_CUDNN=0 USE_MKLDNN=1
> > > export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
> > > export MXNET_TEST_SEED=11
> > > export MXNET_MODULE_SEED=812478194
> > > export MXNET_TEST_COUNT=1
> > > nosetests-2.7 -v tests/python/unittest/test_
> > module.py:test_forward_reshape
> > >
> > > Was able to do the 10k runs successfully.
> > >
> > > Anirudh
> > >
> > > On Thu, May 3, 2018 at 8:46 AM, Anirudh  wrote:
> > >
> > > > Hi Pedro and Naveen,
> > > >
> > > > Is this issue reproducible when MXNet is built with USE_MKLDNN=0?
> > > > Also, there are a bunch of MKLDNN fixes that didn't go into the
> release
> > > > branch. Is this issue reproducible on the release branch ?
> > > > In my opinion, since we have marked MKLDNN as experimental feature
> for
> > > the
> > > > release, if it is confirmed to be a MKLDNN issue
> > > > we don't need to block the release on it.
> > > >
> > > > Anirudh
> > > >
> > > > On Thu, May 3, 2018 at 6:58 AM, Naveen Swamy 
> > wrote:
> > > >
> > > >> Thanks for raising this issue Pedro.
> > > >>
> > > >> -1(binding)
> > > >>
> > > >> We were in a similar state for a while a year ago, a lot of effort
> > went
> > > to
> > > >> stabilize the tests and the CI. I have seen the PR builds are
> > > >> non-deterministic and you have to retry over and over (wasting
> > resources
> > > >> and time) and hope you get lucky.
> > > >>
> > > >> Look at the dashboard for master build
> > > >> http://jenkins.mxnet-ci.amazon-ml.com/job/incubator-
> mxnet/job/master/
> > > >>
> > > >> -Naveen
> > > >>
> > > >> On Thu, May 3, 2018 at 5:11 AM, Pedro Larroy <
> > > >> pedro.larroy.li...@gmail.com>
> > > >> wrote:
> > > >>
> > > >> > -1 nondeterministic failures on CI master:
> > > >> > https://issues.apache.org/jira/browse/MXNET-396
> > > >> >
> > > >> > Was able to reproduce once in a fresh p3 instance with DLAMI;
> > > >> > can't reproduce it consistently.
> > > >> >
> > > >> > On Wed, May 2, 2018 at 9:51 PM, Anirudh 
> > > wrote:
> > > >> >
> > > >> > > Hi all,
> > > >> > >
> > > >> > > As part of RC2 release, we have addressed bugs and some concerns
> > > that
> > > >> > were
> > > >> > > raised.
> > > >> > >
> > > >> > > I would like to propose a vote to release Apache MXNet
> > (incubating)
> > > >> > version
> > > >> > > 1.2.0.RC2. Voting will start now (Wednesday, May 2nd) and end at
> > > >> 12:50 PM
> > > >> > > PDT, Sunday, May 6th.
> > > >> > >
> > > >> > > Link to release notes:
> > > >> > > https://cwiki.apache.org/confluence/display/MXNET/
> > > >> > > Apache+MXNet+%28incubating%29+1.2.0+Release+Notes
> > > >> > >
> > > >> > > Link to release candidate 1.2.0.rc2:
> > > >> > > 

Re: [VOTE] Release Apache MXNet(incubating) version 1.2.0.RC2

2018-05-04 Thread Anirudh
Hi Pedro,

Thank you for the suggestions. I will try to reproduce this without fixed
seeds and also run it for a longer time duration.
Having said that, running unit tests over and over for a couple of days
will likely cause problems, because there are around 42 open issues for
flaky tests:
https://github.com/apache/incubator-mxnet/issues?q=is%3Aopen+is%3Aissue+label%3AFlaky
Also, the release branch has diverged from master around 3 weeks back and
it doesn't have many of the changes merged to the master.
So, my question essentially is: what will be your benchmark for accepting
the release?
Is it that we run the test you provided on 1.2, without fixed seeds and for
a longer duration, without failures?
Or is it that all unit tests should pass over a period of 2 days without
issues? This may require fixing all of the flaky tests, which would delay
the release by a considerable amount of time.
Or is it something else?

Anirudh


On Fri, May 4, 2018 at 4:49 AM, Pedro Larroy 
wrote:

> Could you remove the fixed seeds and run it for a couple of hours with an
> additional loop?  Also I would suggest running the unit tests over and over
> for a couple of days if possible.
>
>
> Pedro.
>
> On Thu, May 3, 2018 at 8:33 PM, Anirudh  wrote:
>
> > Hi Pedro and Naveen,
> >
> > I was able to reproduce this issue with MKLDNN on the master but not on
> > the 1.2.RC2 branch.
> >
> > Did the following on 1.2.RC2 branch:
> >
> > make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas USE_DIST_KVSTORE=0
> > USE_CUDA=0 USE_CUDNN=0 USE_MKLDNN=1
> > export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
> > export MXNET_TEST_SEED=11
> > export MXNET_MODULE_SEED=812478194
> > export MXNET_TEST_COUNT=1
> > nosetests-2.7 -v tests/python/unittest/test_
> module.py:test_forward_reshape
> >
> > Was able to do the 10k runs successfully.
> >
> > Anirudh
> >
> > On Thu, May 3, 2018 at 8:46 AM, Anirudh  wrote:
> >
> > > Hi Pedro and Naveen,
> > >
> > > Is this issue reproducible when MXNet is built with USE_MKLDNN=0?
> > > Also, there are a bunch of MKLDNN fixes that didn't go into the release
> > > branch. Is this issue reproducible on the release branch ?
> > > In my opinion, since we have marked MKLDNN as experimental feature for
> > the
> > > release, if it is confirmed to be a MKLDNN issue
> > > we don't need to block the release on it.
> > >
> > > Anirudh
> > >
> > > On Thu, May 3, 2018 at 6:58 AM, Naveen Swamy 
> wrote:
> > >
> > >> Thanks for raising this issue Pedro.
> > >>
> > >> -1(binding)
> > >>
> > >> We were in a similar state for a while a year ago, a lot of effort
> went
> > to
> > >> stabilize the tests and the CI. I have seen the PR builds are
> > >> non-deterministic and you have to retry over and over (wasting
> resources
> > >> and time) and hope you get lucky.
> > >>
> > >> Look at the dashboard for master build
> > >> http://jenkins.mxnet-ci.amazon-ml.com/job/incubator-mxnet/job/master/
> > >>
> > >> -Naveen
> > >>
> > >> On Thu, May 3, 2018 at 5:11 AM, Pedro Larroy <
> > >> pedro.larroy.li...@gmail.com>
> > >> wrote:
> > >>
> > >> > -1 nondeterministic failures on CI master:
> > >> > https://issues.apache.org/jira/browse/MXNET-396
> > >> >
> > >> > Was able to reproduce once in a fresh p3 instance with DLAMI; can't
> > >> > reproduce it consistently.
> > >> >
> > >> > On Wed, May 2, 2018 at 9:51 PM, Anirudh 
> > wrote:
> > >> >
> > >> > > Hi all,
> > >> > >
> > >> > > As part of RC2 release, we have addressed bugs and some concerns
> > that
> > >> > were
> > >> > > raised.
> > >> > >
> > >> > > I would like to propose a vote to release Apache MXNet
> (incubating)
> > >> > version
> > >> > > 1.2.0.RC2. Voting will start now (Wednesday, May 2nd) and end at
> > >> 12:50 PM
> > >> > > PDT, Sunday, May 6th.
> > >> > >
> > >> > > Link to release notes:
> > >> > > https://cwiki.apache.org/confluence/display/MXNET/
> > >> > > Apache+MXNet+%28incubating%29+1.2.0+Release+Notes
> > >> > >
> > >> > > Link to release candidate 1.2.0.rc2:
> > >> > > https://github.com/apache/incubator-mxnet/releases/tag/1.2.0.rc2
> > >> > >
> > >> > > Voting results for 1.2.0.rc2:
> > >> > > https://lists.apache.org/thread.html/
> ebe561c609a8e32351dfe4aafc8876
> > >> > > 199560336472726b58c3455e85@%3Cdev.mxnet.apache.org%3E
> > >> > >
> > >> > > View this page, click on "Build from Source", and use the source
> > code
> > >> > > obtained from 1.2.0.rc2 tag:
> > >> > > https://mxnet.incubator.apache.org/install/index.html
> > >> > >
> > >> > > (Note: The README.md points to the 1.2.0 tag and does not work at
> > the
> > >> > > moment.)
> > >> > >
> > >> > > Please remember to test first before voting accordingly:
> > >> > >
> > >> > > +1 = approve
> > >> > > +0 = no opinion
> > >> > > -1 = disapprove (provide reason)
> > >> > >
> > >> > > Anirudh
> > >> > >
> > >> >
> > >>
> > >
> > >
> >
>


Re: [VOTE] Release Apache MXNet(incubating) version 1.2.0.RC2

2018-05-04 Thread Pedro Larroy
Could you remove the fixed seeds and run it for a couple of hours with an
additional loop?  Also I would suggest running the unit tests over and over
for a couple of days if possible.
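
A sketch of what that loop could look like, mirroring the while-loop from the
CI patch quoted elsewhere in this thread, but with the fixed-seed exports
removed so that every iteration draws new seeds:

ulimit -c unlimited   # keep core dumps around for debugging
# loop the test until it fails; no MXNET_TEST_SEED / MXNET_MODULE_SEED set
while nosetests-2.7 --verbose tests/python/unittest/test_module.py:test_forward_reshape; do
    echo round
done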


Pedro.

On Thu, May 3, 2018 at 8:33 PM, Anirudh  wrote:

> Hi Pedro and Naveen,
>
> I was able to reproduce this issue with MKLDNN on the master but not on
> the 1.2.RC2 branch.
>
> Did the following on 1.2.RC2 branch:
>
> make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas USE_DIST_KVSTORE=0
> USE_CUDA=0 USE_CUDNN=0 USE_MKLDNN=1
> export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
> export MXNET_TEST_SEED=11
> export MXNET_MODULE_SEED=812478194
> export MXNET_TEST_COUNT=1
> nosetests-2.7 -v tests/python/unittest/test_module.py:test_forward_reshape
>
> Was able to do the 10k runs successfully.
>
> Anirudh
>
> On Thu, May 3, 2018 at 8:46 AM, Anirudh  wrote:
>
> > Hi Pedro and Naveen,
> >
> > Is this issue reproducible when MXNet is built with USE_MKLDNN=0?
> > Also, there are a bunch of MKLDNN fixes that didn't go into the release
> > branch. Is this issue reproducible on the release branch ?
> > In my opinion, since we have marked MKLDNN as experimental feature for
> the
> > release, if it is confirmed to be a MKLDNN issue
> > we don't need to block the release on it.
> >
> > Anirudh
> >
> > On Thu, May 3, 2018 at 6:58 AM, Naveen Swamy  wrote:
> >
> >> Thanks for raising this issue Pedro.
> >>
> >> -1(binding)
> >>
> >> We were in a similar state for a while a year ago, a lot of effort went
> to
> >> stabilize the tests and the CI. I have seen the PR builds are
> >> non-deterministic and you have to retry over and over (wasting resources
> >> and time) and hope you get lucky.
> >>
> >> Look at the dashboard for master build
> >> http://jenkins.mxnet-ci.amazon-ml.com/job/incubator-mxnet/job/master/
> >>
> >> -Naveen
> >>
> >> On Thu, May 3, 2018 at 5:11 AM, Pedro Larroy <
> >> pedro.larroy.li...@gmail.com>
> >> wrote:
> >>
> >> > -1 nondeterministic failures on CI master:
> >> > https://issues.apache.org/jira/browse/MXNET-396
> >> >
> >> > Was able to reproduce once in a fresh p3 instance with DLAMI; can't
> >> > reproduce it consistently.
> >> >
> >> > On Wed, May 2, 2018 at 9:51 PM, Anirudh 
> wrote:
> >> >
> >> > > Hi all,
> >> > >
> >> > > As part of RC2 release, we have addressed bugs and some concerns
> that
> >> > were
> >> > > raised.
> >> > >
> >> > > I would like to propose a vote to release Apache MXNet (incubating)
> >> > version
> >> > > 1.2.0.RC2. Voting will start now (Wednesday, May 2nd) and end at
> >> 12:50 PM
> >> > > PDT, Sunday, May 6th.
> >> > >
> >> > > Link to release notes:
> >> > > https://cwiki.apache.org/confluence/display/MXNET/
> >> > > Apache+MXNet+%28incubating%29+1.2.0+Release+Notes
> >> > >
> >> > > Link to release candidate 1.2.0.rc2:
> >> > > https://github.com/apache/incubator-mxnet/releases/tag/1.2.0.rc2
> >> > >
> >> > > Voting results for 1.2.0.rc2:
> >> > > https://lists.apache.org/thread.html/ebe561c609a8e32351dfe4aafc8876
> >> > > 199560336472726b58c3455e85@%3Cdev.mxnet.apache.org%3E
> >> > >
> >> > > View this page, click on "Build from Source", and use the source
> code
> >> > > obtained from 1.2.0.rc2 tag:
> >> > > https://mxnet.incubator.apache.org/install/index.html
> >> > >
> >> > > (Note: The README.md points to the 1.2.0 tag and does not work at
> the
> >> > > moment.)
> >> > >
> >> > > Please remember to test first before voting accordingly:
> >> > >
> >> > > +1 = approve
> >> > > +0 = no opinion
> >> > > -1 = disapprove (provide reason)
> >> > >
> >> > > Anirudh
> >> > >
> >> >
> >>
> >
> >
>


Re: [VOTE] Release Apache MXNet(incubating) version 1.2.0.RC2

2018-05-04 Thread Pedro Larroy
Hi Anirudh

I see too many random failures, segfaults and other problems. Qualitatively,
I don't think we are in a situation to make a release. For that, I would
expect to see master stable for most of the builds, and that's not the case
right now.

My vote is still -1 (non-binding).

If someone is willing and able to revert some of the changes that
destabilized master, then the situation would be different.

Failing CI on PRs is creating problems for getting fixes and changes
merged.

Pedro.




On Thu, May 3, 2018 at 5:46 PM, Anirudh  wrote:

> Hi Pedro and Naveen,
>
> Is this issue reproducible when MXNet is built with USE_MKLDNN=0?
> Also, there are a bunch of MKLDNN fixes that didn't go into the release
> branch. Is this issue reproducible on the release branch ?
> In my opinion, since we have marked MKLDNN as experimental feature for the
> release, if it is confirmed to be a MKLDNN issue
> we don't need to block the release on it.
>
> Anirudh
>
> On Thu, May 3, 2018 at 6:58 AM, Naveen Swamy  wrote:
>
> > Thanks for raising this issue Pedro.
> >
> > -1(binding)
> >
> > We were in a similar state for a while a year ago, a lot of effort went
> to
> > stabilize the tests and the CI. I have seen the PR builds are
> > non-deterministic and you have to retry over and over (wasting resources
> > and time) and hope you get lucky.
> >
> > Look at the dashboard for master build
> > http://jenkins.mxnet-ci.amazon-ml.com/job/incubator-mxnet/job/master/
> >
> > -Naveen
> >
> > On Thu, May 3, 2018 at 5:11 AM, Pedro Larroy <
> pedro.larroy.li...@gmail.com
> > >
> > wrote:
> >
> > > -1 nondeterministic failures on CI master:
> > > https://issues.apache.org/jira/browse/MXNET-396
> > >
> > > Was able to reproduce once in a fresh p3 instance with DLAMI; can't
> > > reproduce it consistently.
> > >
> > > On Wed, May 2, 2018 at 9:51 PM, Anirudh  wrote:
> > >
> > > > Hi all,
> > > >
> > > > As part of RC2 release, we have addressed bugs and some concerns that
> > > were
> > > > raised.
> > > >
> > > > I would like to propose a vote to release Apache MXNet (incubating)
> > > version
> > > > 1.2.0.RC2. Voting will start now (Wednesday, May 2nd) and end at
> 12:50
> > PM
> > > > PDT, Sunday, May 6th.
> > > >
> > > > Link to release notes:
> > > > https://cwiki.apache.org/confluence/display/MXNET/
> > > > Apache+MXNet+%28incubating%29+1.2.0+Release+Notes
> > > >
> > > > Link to release candidate 1.2.0.rc2:
> > > > https://github.com/apache/incubator-mxnet/releases/tag/1.2.0.rc2
> > > >
> > > > Voting results for 1.2.0.rc2:
> > > > https://lists.apache.org/thread.html/ebe561c609a8e32351dfe4aafc8876
> > > > 199560336472726b58c3455e85@%3Cdev.mxnet.apache.org%3E
> > > >
> > > > View this page, click on "Build from Source", and use the source code
> > > > obtained from 1.2.0.rc2 tag:
> > > > https://mxnet.incubator.apache.org/install/index.html
> > > >
> > > > (Note: The README.md points to the 1.2.0 tag and does not work at the
> > > > moment.)
> > > >
> > > > Please remember to test first before voting accordingly:
> > > >
> > > > +1 = approve
> > > > +0 = no opinion
> > > > -1 = disapprove (provide reason)
> > > >
> > > > Anirudh
> > > >
> > >
> >
>


Re: segmentation fault in master using mkldnn

2018-05-04 Thread Pedro Larroy
Hi Da. I ran it both on my Ubuntu 16.04 workstation and in a p3 instance with
the DLAMI. I'm pretty confident it runs in most Linux environments.

Can you post the exact commands that you ran? It's not clear to me from your
paste what the problem is. Please make sure your repo and all your subrepos
are clean before starting the docker build.

ci/docker/runtime_functions.sh clean_repo

Pedro.

On Thu, May 3, 2018 at 7:17 PM, Zheng, Da  wrote:

> Hello Pedro,
>
> I tried your instructions. It seems I can't run the docker in EC2
> instances.
> Where did you reproduce the error?
>
> Thanks,
> Da
>
> + echo 'deb http://cran.rstudio.com/bin/linux/ubuntu trusty/'
> + gpg --keyserver keyserver.ubuntu.com --recv-key E084DAB9
> gpg: directory `/root/.gnupg' created
> gpg: new configuration file `/root/.gnupg/gpg.conf' created
> gpg: WARNING: options in `/root/.gnupg/gpg.conf' are not yet active during
> this run
> gpg: keyring `/root/.gnupg/secring.gpg' created
> gpg: keyring `/root/.gnupg/pubring.gpg' created
> gpg: requesting key E084DAB9 from hkp server keyserver.ubuntu.com
> gpg: keyserver timed out
> gpg: keyserver receive failed: keyserver error
> The command '/bin/sh -c /work/ubuntu_r.sh' returned a non-zero code: 2
> Traceback (most recent call last):
>   File "ci/build.py", line 263, in 
> sys.exit(main())
>   File "ci/build.py", line 197, in main
> build_docker(platform, docker_binary)
>   File "ci/build.py", line 73, in build_docker
> check_call(cmd)
>   File "/usr/lib/python3.5/subprocess.py", line 581, in check_call
> raise CalledProcessError(retcode, cmd)
> subprocess.CalledProcessError: Command '['docker', 'build', '-f',
> 'docker/Dockerfile.build.ubuntu_cpu', '--build-arg', 'USER_ID=1000',
> '-t', 'mxnet/build.ubuntu_cpu', 'docker']' returned non-zero exit status 2
>
>
> On 5/3/18, 8:01 AM, "Pedro Larroy"  wrote:
>
> Hi Da
>
> Reproduction instructions:
>
> On the host:
>
> Adjust core pattern:
>
> $ echo '/tmp/core.%h.%e.%t' > /proc/sys/kernel/core_pattern
>
>
> Use the following patch:
>
> ===
>
> diff --git a/3rdparty/mkldnn b/3rdparty/mkldnn
> --- a/3rdparty/mkldnn
> +++ b/3rdparty/mkldnn
> @@ -1 +1 @@
> -Subproject commit b4137dfc88e3bf5c6b62e833121802eb8c6696da
> +Subproject commit b4137dfc88e3bf5c6b62e833121802eb8c6696da-dirty
> diff --git a/ci/docker/runtime_functions.sh
> b/ci/docker/runtime_functions.sh
> index 027e287..62649c9 100755
> --- a/ci/docker/runtime_functions.sh
> +++ b/ci/docker/runtime_functions.sh
> @@ -360,9 +360,15 @@ unittest_ubuntu_python2_cpu() {
>  # https://github.com/apache/incubator-mxnet/issues/10026
>  #export MXNET_MKLDNN_DEBUG=1  # Ignored if not present
>  export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
> -nosetests-2.7 --verbose tests/python/unittest
> -nosetests-2.7 --verbose tests/python/train
> -nosetests-2.7 --verbose tests/python/quantization
> +export MXNET_TEST_SEED=11
> +export MXNET_MODULE_SEED=812478194
> +pwd
> +export MXNET_TEST_COUNT=1
> +ulimit -c unlimited
> +ulimit -c
> +while nosetests-2.7 --verbose
> tests/python/unittest/test_module.py:test_forward_reshape; do echo
> round;
> done
> +#nosetests-2.7 --verbose tests/python/train
> +#nosetests-2.7 --verbose tests/python/quantization
>  }
>
>  unittest_ubuntu_python3_cpu() {
>
>
>
> ==
>
> Build and execute the test, make sure the repo is clean
>
> $ ci/docker/runtime_functions.sh clean_repo
>
> $ ci/build.py -p ubuntu_cpu /work/runtime_functions.sh
> build_ubuntu_cpu_mkldnn && ci/build.py --platform ubuntu_cpu
> /work/runtime_functions.sh unittest_ubuntu_python2_cpu
>
>
> Once it crashes it will stop.
>
> Then go in the container:
>
>
> $ ci/build.py -p ubuntu_cpu --into-container --print-docker-run
>
> A core should be there.
>
> you might need to install gdb as root by executing the previous command
> without uid so you can use apt-get.
>
>
>
>
> Good luck.
>
>
>
>
>
>
>
> On Thu, May 3, 2018 at 4:51 PM, Zheng, Da  wrote:
>
> > Thanks a lot for locating the error.
> > Could you tell me How you reproduce the error?
> >
> > On 5/3/18, 7:45 AM, "Pedro Larroy" 
> wrote:
> >
> > Looks like a problem in mkl's same_shape
> >
> > the pointer to mkldnn::memory::desc   looks invalid.
> >
> > (More stack frames follow...)
> > (gdb) p desc
> > $1 = (const mkldnn::memory::desc &) @0x10:  variable>
> > (gdb) p dtype
> > $2 = 0
> > (gdb) p shape
> > $3 = (const mxnet::TShape &) @0x7f3905a58b50:
> { =
> > {static kStackCache = , ndim_ = 2,
> num_heap_allocated_
> > = 0,
> > 

Master broken due to race condition of MKLDNN PR merges

2018-05-04 Thread Marco de Abreu
Hello,

FYI, master is currently broken. This is caused by two conflicting PRs
being merged at the same time [1][2].

The reason why this is possible is the following: a PR will always be
rebased on top of master when it gets a new commit. GitHub stores the result
and shows the PR as successfully validated. If a new commit is pushed onto
master between the time the PR validation started and the time the PR gets
merged, the PR will not be re-validated, since the old check is reused. Most
of the time this does not cause any problems because most PRs are not
mutually exclusive, but unfortunately we just ran into this problem. To bring
master back into a stable state (I have had 30 test failures on my test
environment, and all PRs will fail), I created a revert PR at [3].

After this PR has been merged, all PRs should pass validation again. Please
excuse the inconvenience.
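
A hypothetical pre-merge check that illustrates the staleness problem (this
is not part of the MXNet CI; PR_TIP stands for the merge commit the last
validation actually ran against): if the current master tip is not contained
in that validated commit, master has moved since validation started and the
stored result may be stale.

# hypothetical helper, not part of the MXNet CI
git fetch origin master
if git merge-base --is-ancestor origin/master "$PR_TIP"; then
    echo "validation still covers the current master tip"
else
    echo "master moved since validation started; re-run CI before merging"
fi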

Best regards,
Marco

[1]: https://github.com/apache/incubator-mxnet/pull/10736
[2]: https://github.com/apache/incubator-mxnet/pull/10731
[3]: https://github.com/apache/incubator-mxnet/pull/10808