Another build where test_slice_batchnorm_reshape_batchnorm fails:
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-12721/7/pipeline
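For anyone trying to reproduce this locally, here is a minimal sketch in Python. It assumes the test lives in tests/python/unittest/test_gluon.py and that MXNET_TEST_SEED carries the seed printed in the failing CI log; both are assumptions, so check the log and the source tree first:

    import os

    # Placeholder seed: substitute the value printed in the failing CI log.
    os.environ['MXNET_TEST_SEED'] = '1234'

    import nose
    # Assumption: run from tests/python/unittest in the mxnet source tree,
    # where test_gluon.py defines the test.
    nose.run(argv=['nosetests', '-v', '-s',
                   'test_gluon:test_slice_batchnorm_reshape_batchnorm'])

Running the single test in a loop with different seeds would also help tell a seed-dependent numerical issue apart from a resource problem on the CI executor.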
— Piyush

> On Oct 3, 2018, at 9:32 AM, Pedro Larroy <pedro.larroy.li...@gmail.com> wrote:
>
> It seems it is not the only test:
> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-12726/5/pipeline
>
> test_slice_batchnorm_reshape_batchnorm is also failing and hasn't been
> touched for a while. It doesn't look like a problem with the test to me
> (not a flaky test). It looks to me like we should find and address the
> root cause instead of disabling the test in this case.
>
> Pedro.
>
> On Tue, Oct 2, 2018 at 2:39 AM Marco de Abreu
> <marco.g.ab...@googlemail.com.invalid> wrote:
>
>> I have created an issue at
>> https://github.com/apache/incubator-mxnet/issues/12715 and a PR to disable
>> the test at https://github.com/apache/incubator-mxnet/pull/12716.
>>
>> This test is pretty new and was submitted with a number of other
>> problematic (and disabled) tests:
>> https://github.com/apache/incubator-mxnet/issues/11164. It could be
>> that the test is simply not stable enough. The PR that introduced the
>> test is https://github.com/apache/incubator-mxnet/pull/10921 - it was
>> merged two days ago.
>>
>> Best regards,
>> Marco
>>
>> On Tue, Oct 2, 2018 at 8:43 AM Pedro Larroy <pedro.larroy.li...@gmail.com>
>> wrote:
>>
>>> Thanks for checking, Lin. If it happens again we will have to dig
>>> deeper. We have just one executor on GPU, so I wonder what the root
>>> cause of this could be.
>>>
>>> On Mon, Oct 1, 2018 at 10:57 PM Lin Yuan <apefor...@gmail.com> wrote:
>>>
>>>> I could not reproduce the error on an EC2 g3x8 instance, which makes
>>>> it hard to debug. I also suspect it was due to a resource usage limit
>>>> on the CI instance.
>>>>
>>>> On Mon, Oct 1, 2018 at 10:40 PM Pedro Larroy
>>>> <pedro.larroy.li...@gmail.com> wrote:
>>>>
>>>>> It doesn't look like flakiness to me at first sight. I think it might
>>>>> be related to resource usage / allocation / a leak in the worst case.
>>>>>
>>>>> It could be that there was not enough GPU memory at the time of test
>>>>> execution. But I'm just speculating, hence my original question.
>>>>>
>>>>> Pedro.
>>>>>
>>>>> On Mon, Oct 1, 2018 at 8:16 PM Lin Yuan <apefor...@gmail.com> wrote:
>>>>>
>>>>>> Hi Pedro,
>>>>>>
>>>>>> I also got this failure in my PR:
>>>>>> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-11742/27/pipeline
>>>>>>
>>>>>> I was not able to identify the root cause of it from the changelist.
>>>>>> Are you suggesting there is some flakiness in the master branch too?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Lin
>>>>>>
>>>>>> On Mon, Oct 1, 2018 at 4:55 PM Pedro Larroy
>>>>>> <pedro.larroy.li...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I saw this failure on CI:
>>>>>>> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/1697/pipeline
>>>>>>>
>>>>>>> Have you seen other cases where we fail to select the best cuDNN
>>>>>>> algorithm? Under which circumstances could this happen, and do you
>>>>>>> think it is a good idea to have one selected by default as a last
>>>>>>> resort?
>>>>>>>
>>>>>>> Pedro.
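On the last-resort question at the bottom of the thread: MXNet already exposes an environment variable that sidesteps the cuDNN algorithm search at runtime. Below is a minimal sketch, assuming the failure originates in the autotuning (Find) step rather than in cuDNN's heuristic path; setting MXNET_CUDNN_AUTOTUNE_DEFAULT=0 skips the benchmark-based search so cuDNN's heuristic choice is used instead:

    import os

    # Assumption: the CI failure comes from the cuDNN autotune (Find) step.
    # 0 disables the benchmark-based search, so MXNet falls back to cuDNN's
    # heuristic algorithm choice.
    os.environ['MXNET_CUDNN_AUTOTUNE_DEFAULT'] = '0'

    import mxnet as mx
    from mxnet.gluon import nn

    # Small convolution forward pass on GPU to exercise the cuDNN path.
    net = nn.Conv2D(channels=8, kernel_size=3)
    net.initialize(ctx=mx.gpu(0))
    x = mx.nd.random.uniform(shape=(1, 3, 32, 32), ctx=mx.gpu(0))
    y = net(x)
    y.wait_to_read()  # force execution so any cuDNN error surfaces here

This trades some convolution speed for a more predictable selection, which may be a reasonable default on memory-constrained CI executors where the autotuning workspace allocation itself can fail.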