It seems this is not the only test: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-12726/5/pipeline

test_slice_batchnorm_reshape_batchnorm is also failing and hasn't been
touched for a while. It doesn't look like a problem with the test to me
(i.e., not a flaky test). It looks to me like we should find and address
the root cause instead of disabling the test in this case.

Pedro.

On Tue, Oct 2, 2018 at 2:39 AM Marco de Abreu
<marco.g.ab...@googlemail.com.invalid> wrote:

> I have created an issue at
> https://github.com/apache/incubator-mxnet/issues/12715 and a PR to disable
> the test at https://github.com/apache/incubator-mxnet/pull/12716.
>
> This test is pretty new and was submitted together with a number of other
> problematic (and disabled) tests:
> https://github.com/apache/incubator-mxnet/issues/11164. It is possible
> that the test is simply not stable enough. The PR that introduced the test
> is https://github.com/apache/incubator-mxnet/pull/10921 - it was merged
> two days ago.
>
> Best regards,
> Marco
>
> On Tue, Oct 2, 2018 at 8:43 AM Pedro Larroy <pedro.larroy.li...@gmail.com>
> wrote:
>
> > Thanks for checking, Lin. If it happens again we will have to dig
> > deeper. We have just one executor on GPU, so I wonder what the root
> > cause of this could be.
> >
> > On Mon, Oct 1, 2018 at 10:57 PM Lin Yuan <apefor...@gmail.com> wrote:
> >
> > > I could not reproduce the error on an EC2 g3x8 instance, which makes
> > > it hard to debug. I also suspect it was due to a resource usage limit
> > > on the CI instance.
> > >
> > > On Mon, Oct 1, 2018 at 10:40 PM Pedro Larroy
> > > <pedro.larroy.li...@gmail.com> wrote:
> > >
> > > > It doesn't look like flakiness to me at first sight. I think it
> > > > might be related to resource usage / allocation, or a leak in the
> > > > worst case.
> > > >
> > > > It could be that there was not enough GPU memory at the time of test
> > > > execution. But I'm just speculating, hence my original question.
> > > >
> > > > Pedro.
> > > >
> > > > On Mon, Oct 1, 2018 at 8:16 PM Lin Yuan <apefor...@gmail.com> wrote:
> > > >
> > > > > Hi Pedro,
> > > > >
> > > > > I also got this failure in my PR:
> > > > >
> > > > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-11742/27/pipeline
> > > > >
> > > > > I was not able to identify the root cause of it from the
> > > > > changelist. Are you suggesting there is some flakiness in the
> > > > > master branch too?
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Lin
> > > > >
> > > > > On Mon, Oct 1, 2018 at 4:55 PM Pedro Larroy
> > > > > <pedro.larroy.li...@gmail.com> wrote:
> > > > >
> > > > > > Hi
> > > > > >
> > > > > > I saw this failure on CI:
> > > > > >
> > > > > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/1697/pipeline
> > > > > >
> > > > > > Have you seen other cases where we fail to select the best CUDNN
> > > > > > algorithm? In which circumstances could this happen, and do you
> > > > > > think it is a good idea to have one selected by default as a
> > > > > > last resort?
> > > > > >
> > > > > > Pedro.
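
P.S. If someone wants to check locally whether this is genuine flakiness
before we disable it, below is a minimal sketch that reruns the test
repeatedly through the seed machinery in our test harness. The seed value
and the test file path are assumptions on my part -- take the real seed
from the failing CI log and adjust the path to wherever the test lives in
your checkout:

    import os
    import subprocess

    # Repeat the test body many times via the @with_seed() harness.
    os.environ["MXNET_TEST_COUNT"] = "200"  # number of repetitions
    # Hypothetical seed -- replace with the seed printed in the CI log.
    os.environ["MXNET_TEST_SEED"] = "1234"

    # The file path is an assumption; locate the test in your checkout first.
    subprocess.check_call([
        "nosetests", "--verbose",
        "tests/python/gpu/test_gluon_gpu.py:test_slice_batchnorm_reshape_batchnorm",
    ])

If it survives a couple of hundred iterations on a dedicated GPU but still
fails on CI, that would support the resource-contention theory over plain
flakiness.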