Re: CUDNN algorithm selection failure

2018-10-01 Thread Lin Yuan
I could not reproduce the error on an EC2 g3x8 instance, which makes it hard to
debug. I also suspect it was due to a resource usage limit on the CI instance.
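
If the resource-limit theory holds, one cheap way to test it would be to log free
GPU memory on the CI instance right before the failing test runs. A minimal sketch,
assuming nvidia-smi is available on the instance (the query flags are standard
nvidia-smi options; wiring this into the actual test is left out):

    import subprocess

    def log_gpu_memory():
        # Print per-GPU used/total memory so the CI log shows how much headroom
        # the test had when it started.
        out = subprocess.check_output(
            ['nvidia-smi', '--query-gpu=index,memory.used,memory.total',
             '--format=csv,noheader'])
        print(out.decode().strip())

    log_gpu_memory()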

On Mon, Oct 1, 2018 at 10:40 PM Pedro Larroy 
wrote:

> It doesn't look like flakiness to me at first sight. I think it might be
> related to resource usage / allocation / leak in the worst case.
>
> Could be that there was not enough GPU memory at the time of test
> execution. But I'm just speculating, hence my original question.
>
> Pedro.
>
> On Mon, Oct 1, 2018 at 8:16 PM Lin Yuan  wrote:
>
> > Hi Pedro,
> >
> > I also got this failure in my PR
> >
> >
> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-11742/27/pipeline
> >
> > I was not able to identify the root cause of it from changelist. Are you
> > suggesting there is some flakiness in the master branch too?
> >
> > Thanks,
> >
> > Lin
> >
> > On Mon, Oct 1, 2018 at 4:55 PM Pedro Larroy <
> pedro.larroy.li...@gmail.com>
> > wrote:
> >
> > > Hi
> > >
> > > I saw this failure on CI:
> > >
> > >
> >
> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/1697/pipeline
> > >
> > > Have you seen other cases where we fail to select the best CUDNN
> > algorithm?
> > > In which circumstances this could happen, and do you think is a good
> idea
> > > to have one selected by default as a last resort?
> > >
> > >
> > > Pedro.
> > >
> >
>


RE: [Discuss] Next MXNet release

2018-10-01 Thread Zhao, Patric
Thanks for letting us know about this discussion.
We don't have enough bandwidth to track the different sources, like the
discussion forum.

I think the best way is to open an issue on GitHub so that we can answer/solve
the issue in time :)

Thanks,

--Patric

> -Original Message-
> From: Afrooze, Sina [mailto:sina@gmail.com]
> Sent: Tuesday, October 2, 2018 1:14 AM
> To: dev@mxnet.incubator.apache.org
> Cc: Ye, Jason Y ; Zai, Alexander
> ; Zheng, Da 
> Subject: Re: [Discuss] Next MXNet release
> 
> This post suggests there is a regression from 1.1.0 to 1.2.1 related to
> MKLDNN integration: https://discuss.mxnet.io/t/mxnet-1-2-1-module-get-
> outputs/1882
> 
> The error is related to MKLDNN layout not being converted back to MXNet
> layout in some operator: " !IsMKLDNNData() We can’t generate TBlob for
> MKLDNN data. Please use Reorder2Default() to generate a new NDArray
> first"
> 
> Sina
> 
> 
> 
> 
> On 9/30/18, 6:55 PM, "Steffen Rochel"  wrote:
> 
> Thanks Patrick.
> Updated roadmap and next release content.
> 
> Patrick - suggest to send a reminder to review the design doc and collect
> feedback.
> Are there still known issues or gaps before we declare MKL-DNN
> integration
> as GA?
> 
> Regards,
> Steffen
> 
> On Sat, Sep 29, 2018 at 1:31 AM Zhao, Patric 
> wrote:
> 
> > Thanks, Steffen.
> >
> > Regarding the next release note, two items from our side:
> >
> > 1. (-remove) MKL-DNN integration is done. I think we can remove this
> item.
> > 2. (+add) MKL-DNN based graph optimization and quantization by
> subgraph
> > Design doc:
> >
> https://cwiki.apache.org/confluence/display/MXNET/MXNet+Graph+Optimiz
> ation+and+Quantization+based+on+subgraph+and+MKL-DNN
> > Lead Contributor: Patric Zhao,  https://github.com/pengzhao-intel/
> >
> > Regarding the Roadmap
> > (+add) Q1 2019: MKL-DNN RNN API supports
> >
> > BR,
> >
> > Thanks,
> >
> > --Patric
> >
> >
> > > -Original Message-
> > > From: kellen sunderland [mailto:kellen.sunderl...@gmail.com]
> > > Sent: Saturday, September 29, 2018 11:31 AM
> > > To: dev@mxnet.incubator.apache.org
> > > Subject: Re: [Discuss] Next MXNet release
> > >
> > > Sorry I meant to say next 'Regarding the *minor* release'.
> > >
> > > On Sat, Sep 29, 2018 at 5:27 AM kellen sunderland <
> > > kellen.sunderl...@gmail.com> wrote:
> > >
> > > > Thanks for transparently setting a rough timeline Steffen.  I think
> > > > this will go a long way in helping the community plan their work, 
> even
> > > > if the details change somewhat on the road to the release.
> > > >
> > > > Regarding the major release: I would propose we unify TensorRT with
> > > > the subgraph operator work.
> > > >
> > > > Regarding the patch release:  There were a few minor stack/buffer
> > > > overflows exposed by ASAN that have been addressed.  It's probably
> a
> > > > good idea to include them in a patch release, as they at best result
> > > > in non-deterministic behaviour.
> > > >
> > > > -Kellen
> > > >
> > > >
> > > > On Sat, Sep 29, 2018 at 1:39 AM Steffen Rochel
> > > > 
> > > > wrote:
> > > >
> > > >> I updated
> > > >>
> > > >>
> https://cwiki.apache.org/confluence/display/MXNET/Project+Proposals+f
> > > >> or+next+MXNet+Release
> > > >> ,
> > > >> removed the completed items from 1.3 release and would like to
> kick
> > > >> off discussion about the next release. Please suggest what you
> would
> > > >> like to see included in the next release together with link to 
> design
> > > >> proposal (appropriately for the size and complexity of the 
> proposal)
> > > >> or suggest changes.
> > > >> I suggest to target the next release for December 2018 to frame the
> > > >> discussion.
> > > >> Lets include review of
> > > >>
> https://cwiki.apache.org/confluence/display/MXNET/MXNet+Roadmap -
> > > >> time to update and discuss changes.
> > > >>
> > > >> From the 1.3 release we had discussion regarding
> > > >> https://github.com/apache/incubator-mxnet/issues/11849 and
> resolution
> > > >> in
> > > >> https://github.com/apache/incubator-mxnet/pull/12412 .
> > > >> Are you aware of critical issues and feedback from user which we
> > > >> should consider for a potential 1.3.1 patch release. Should we
> > > >> include PR 12412 in a potential patch release?
> > > >>
> > > >> Regards,
> > > >> Steffen
> > > >>
> > > >
> >
> 
> 
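
For reference, the regression Sina quotes above surfaces through the Module API.
Below is a minimal sketch of the usage pattern described in the linked forum post;
the network and shapes are hypothetical, and it assumes an MKL-DNN-enabled CPU
build of MXNet 1.2.x:

    import mxnet as mx

    # Hypothetical small network; the forum report concerns reading outputs back
    # from a Module after a forward pass on an MKL-DNN build.
    data = mx.sym.Variable('data')
    net = mx.sym.Convolution(data, kernel=(3, 3), num_filter=8, name='conv')
    net = mx.sym.Activation(net, act_type='relu')

    mod = mx.mod.Module(net, data_names=['data'], label_names=None, context=mx.cpu())
    mod.bind(data_shapes=[('data', (1, 3, 32, 32))], for_training=False)
    mod.init_params()

    mod.forward(mx.io.DataBatch(data=[mx.nd.ones((1, 3, 32, 32))]), is_train=False)
    outputs = mod.get_outputs()   # the step where the "!IsMKLDNNData()" error was reported
    print(outputs[0].shape)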



RE: [Discuss] Next MXNet release

2018-10-01 Thread Zhao, Patric
Thanks, Steffen. 

I will send the reminder again; currently Da, Jun, Haibin and Marco are 
reviewing our 1st PR (12530).

Regarding MKL-DNN integration, the MKL-DNN backend has reached GA now from my view.
In the last development cycle, lots of tests, both unit tests and real models, 
were added to improve the quality,
and we don't see any big defects in the current solution. 

Really thanks for the efforts from Alex and Shufan in adding a bunch of test cases.
1) Unit tests
For example the PRs: concat 11371, pool 11608, LRN 11831, Sum 11272, backward 11232, 
gluon 10921. 
The new C++ tests are located in 
https://github.com/apache/incubator-mxnet/blob/master/tests/cpp/operator/mkldnn.cc
 and 
the Gluon tests in 
https://github.com/apache/incubator-mxnet/blob/master/tests/python/unittest/test_gluon.py.

2) Model level
Model-level coverage, including CV and non-CV models, is tracked weekly on our 
local servers against the official master branch.
The CV tests include ResNet50, Inception-BN, SSD, etc.; the non-CV tests include 
Sockeye/GNMT, lstm_bucketing models, etc.
All models we track converge with the expected accuracy and performance. 
 

BTW, is there a checklist for grading? If so, it would be easy to evaluate 
objectively :)

Thanks,

--Patric
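
For anyone who wants to reproduce the Python-side coverage listed above, here is a
hedged sketch of running the Gluon test file from a local MXNet checkout with nose
(the runner the project's Python tests used at the time); it assumes the repository
root as the working directory:

    import nose

    # Run only the Gluon unit tests referenced above; drop the path argument to
    # run the whole tests/python/unittest suite instead.
    nose.run(argv=['nosetests', '--verbose',
                   'tests/python/unittest/test_gluon.py'])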



> -Original Message-
> From: Steffen Rochel [mailto:steffenroc...@gmail.com]
> Sent: Monday, October 1, 2018 9:54 AM
> To: dev@mxnet.incubator.apache.org
> Cc: Ye, Jason Y ; Zai, Alexander
> ; Zheng, Da 
> Subject: Re: [Discuss] Next MXNet release
> 
> Thanks Patrick.
> Updated roadmap and next release content.
> 
> Patrick - suggest to send a reminder to review the design doc and collect
> feedback.
> Are there still known issues or gaps before we declare MKL-DNN integration
> as GA?
> 
> Regards,
> Steffen
> 
> On Sat, Sep 29, 2018 at 1:31 AM Zhao, Patric  wrote:
> 
> > Thanks, Steffen.
> >
> > Regarding the next release note, two items from our side:
> >
> > 1. (-remove) MKL-DNN integration is done. I think we can remove this item.
> > 2. (+add) MKL-DNN based graph optimization and quantization by
> subgraph
> > Design doc:
> >
> https://cwiki.apache.org/confluence/display/MXNET/MXNet+Graph+Optimiz
> ation+and+Quantization+based+on+subgraph+and+MKL-DNN
> > Lead Contributor: Patric Zhao,  https://github.com/pengzhao-intel/
> >
> > Regarding the Roadmap
> > (+add) Q1 2019: MKL-DNN RNN API supports
> >
> > BR,
> >
> > Thanks,
> >
> > --Patric
> >
> >
> > > -Original Message-
> > > From: kellen sunderland [mailto:kellen.sunderl...@gmail.com]
> > > Sent: Saturday, September 29, 2018 11:31 AM
> > > To: dev@mxnet.incubator.apache.org
> > > Subject: Re: [Discuss] Next MXNet release
> > >
> > > Sorry I meant to say next 'Regarding the *minor* release'.
> > >
> > > On Sat, Sep 29, 2018 at 5:27 AM kellen sunderland <
> > > kellen.sunderl...@gmail.com> wrote:
> > >
> > > > Thanks for transparently setting a rough timeline Steffen.  I
> > > > think this will go a long way in helping the community plan their
> > > > work, even if the details change somewhat on the road to the release.
> > > >
> > > > Regarding the major release: I would propose we unify TensorRT
> > > > with the subgraph operator work.
> > > >
> > > > Regarding the patch release:  There were a few minor stack/buffer
> > > > overflows exposed by ASAN that have been addressed.  It's probably
> > > > a good idea to include them in a patch release, as they at best
> > > > result in non-deterministic behaviour.
> > > >
> > > > -Kellen
> > > >
> > > >
> > > > On Sat, Sep 29, 2018 at 1:39 AM Steffen Rochel
> > > > 
> > > > wrote:
> > > >
> > > >> I updated
> > > >>
> > > >> https://cwiki.apache.org/confluence/display/MXNET/Project+Proposa
> > > >> ls+f
> > > >> or+next+MXNet+Release
> > > >> ,
> > > >> removed the completed items from 1.3 release and would like to
> > > >> kick off discussion about the next release. Please suggest what
> > > >> you would like to see included in the next release together with
> > > >> link to design proposal (appropriately for the size and
> > > >> complexity of the proposal) or suggest changes.
> > > >> I suggest to target the next release for December 2018 to frame
> > > >> the discussion.
> > > >> Lets include review of
> > > >>
> https://cwiki.apache.org/confluence/display/MXNET/MXNet+Roadmap -
> > > >> time to update and discuss changes.
> > > >>
> > > >> From the 1.3 release we had discussion regarding
> > > >> https://github.com/apache/incubator-mxnet/issues/11849 and
> > > >> resolution in
> > > >> https://github.com/apache/incubator-mxnet/pull/12412 .
> > > >> Are you aware of critical issues and feedback from user which we
> > > >> should consider for a potential 1.3.1 patch release. Should we
> > > >> include PR 12412 in a potential patch release?
> > > >>
> > > >> Regards,
> > > >> Steffen
> > > >>
> > > >
> >


Re: Time out for Travis CI

2018-10-01 Thread kellen sunderland
Well, I'd propose we get clarification from Travis before bringing the issue up
with Infra.  No point debating something with Infra or amongst ourselves if
it's not possible.

Orthogonal to the paid account option, let's merge this speedup to unblock
Intel.

On Oct 2, 2018 4:37 AM, "Marco de Abreu"
 wrote:

I think the timeout and other limitations have been employed by Apache
Infra and not by Travis. They didn't say that specifically, but they
already made me aware that we might get further restrictions if we consume
too many resources.


kellen sunderland  schrieb am Di., 2. Okt.
2018, 04:34:


> Still worth following up with Travis (I've already messaged them).
They're
> in the middle of reorganizing their business model and merging paid and
> free accounts into the same service, so maybe this policy is changing.  It
> doesn't make a lot of sense to me that public repo accounts would have
> timeout limits that are different to private repo accounts in cases where
> they are both paid.
>
> On Tue, Oct 2, 2018, 4:27 AM Marco de Abreu
>  wrote:
>
> > Apache has it's own shared Travis fleet. We are basically using an
> > on-premise version of the paid Travis plan. That was the information I
> got
> > from Infra when I had a chat with them a few days ago. But from that
> > conversation it was made pretty clear that we cannot increase the
limits.
> >
> > -Marco
> >
> > kellen sunderland  schrieb am Di., 2. Okt.
> > 2018, 03:25:
> >
> > > Interesting, this page seems to indicate that private projects do have
> a
> > > longer time out.  I'll drop Travis a quick email and see what the deal
> > > would be for our project.
> > > https://docs.travis-ci.com/user/customizing-the-build/#build-timeouts.
> > >
> > > On Tue, Oct 2, 2018, 3:15 AM kellen sunderland <
> > > kellen.sunderl...@gmail.com>
> > > wrote:
> > >
> > > > I actually thought we were already using a paid plan through Apache
> > > >
> https://blogs.apache.org/infra/entry/apache_gains_additional_travis_ci
> > > >
> > > > On Tue, Oct 2, 2018, 3:11 AM Qing Lan  wrote:
> > > >
> > > >> Are we currently on a free plan? If we are, probably the unlimited
> > build
> > > >> minutes would help
> > > >>
> > > >> Thanks,
> > > >> Qing
> > > >>
> > > >> On 10/1/18, 6:08 PM, "kellen sunderland" <
> > kellen.sunderl...@gmail.com>
> > > >> wrote:
> > > >>
> > > >> Does the global time out change for paid plans?  I looked into
> it
> > > >> briefly
> > > >> but didn't see anything that would indicate it does.
> > > >>
> > > >> On Tue, Oct 2, 2018, 2:25 AM Pedro Larroy <
> > > >> pedro.larroy.li...@gmail.com>
> > > >> wrote:
> > > >>
> > > >> > I think there's two approaches that we can take to mitigate
> the
> > > >> build &
> > > >> > test time problem, in one hand use a paid travis CI plan, in
> > other
> > > >> improve
> > > >> > the unit tests in suites and only run a core set of tests, as
> we
> > > >> should do
> > > >> > on devices, but on this case we reduce coverage.
> > > >> >
> > > >> > https://travis-ci.com/plans
> > > >> >
> > > >> > Pedro.
> > > >> >
> > > >> > On Sat, Sep 29, 2018 at 6:53 PM YiZhi Liu <
> eazhi@gmail.com>
> > > >> wrote:
> > > >> >
> > > >> > > This makes sense. Thanks
> > > >> > >
> > > >> > > On Sat, Sep 29, 2018 at 6:36 PM kellen sunderland <
> > > >> > > kellen.sunderl...@gmail.com> wrote:
> > > >> > >
> > > >> > > > Hey Zhennan, yes this is the exact problem, and I agree
> with
> > > >> your
> > > >> > points
> > > >> > > > completely.  This is why when we first added Travis we
> > > >> attempted to
> > > >> > > > communicate that it would be informational only, and that
> > we'd
> > > >> need to
> > > >> > > > iterate on the config before it would be a test that
> people
> > > >> should
> > > >> > > consider
> > > >> > > > 'required'.  Apologies, we should have been more
> > > >> straightforward about
> > > >> > > > those tradeoffs.  The strong point in favour of adding
> > Travis
> > > in
> > > >> > > > informational mode was that we had a serious MacOS
> specific
> > > bug
> > > >> that we
> > > >> > > > wanted to verify was fixed.
> > > >> > > >
> > > >> > > > The good news is I've opened a PR which I hope will speed
> up
> > > >> these
> > > >> > builds
> > > >> > > > to the point that they won't rely on caching.  Once it is
> > > >> merged it
> > > >> > would
> > > >> > > > be very helpful if you could rebase on this PR and test
to
> > > >> ensure that
> > > >> > > > large changes no longer hit the global timeout without
> > cache.
> > > >> > > > https://github.com/apache/incubator-mxnet/pull/12706
> > > >> > > >
> > > >> > > > On Sun, Sep 30, 2018 at 2:48 AM Qin, Zhennan <
> > > >> zhennan@intel.com>
> > > >> > > > wrote:
> > > >> > > >
> > > >> > > > > Hi YiZhi and Kellen,
> > > >> > > > >
> > > >> > > > > From my point of view, travis should 

Re: Time out for Travis CI

2018-10-01 Thread Marco de Abreu
I think the timeout and other limitations have been imposed by Apache
Infra and not by Travis. They didn't say that specifically, but they
already made me aware that we might get further restrictions if we consume
too many resources.

kellen sunderland  schrieb am Di., 2. Okt.
2018, 04:34:

> Still worth following up with Travis (I've already messaged them).  They're
> in the middle of reorganizing their business model and merging paid and
> free accounts into the same service, so maybe this policy is changing.  It
> doesn't make a lot of sense to me that public repo accounts would have
> timeout limits that are different to private repo accounts in cases where
> they are both paid.
>
> On Tue, Oct 2, 2018, 4:27 AM Marco de Abreu
>  wrote:
>
> > Apache has it's own shared Travis fleet. We are basically using an
> > on-premise version of the paid Travis plan. That was the information I
> got
> > from Infra when I had a chat with them a few days ago. But from that
> > conversation it was made pretty clear that we cannot increase the limits.
> >
> > -Marco
> >
> > kellen sunderland  schrieb am Di., 2. Okt.
> > 2018, 03:25:
> >
> > > Interesting, this page seems to indicate that private projects do have
> a
> > > longer time out.  I'll drop Travis a quick email and see what the deal
> > > would be for our project.
> > > https://docs.travis-ci.com/user/customizing-the-build/#build-timeouts.
> > >
> > > On Tue, Oct 2, 2018, 3:15 AM kellen sunderland <
> > > kellen.sunderl...@gmail.com>
> > > wrote:
> > >
> > > > I actually thought we were already using a paid plan through Apache
> > > >
> https://blogs.apache.org/infra/entry/apache_gains_additional_travis_ci
> > > >
> > > > On Tue, Oct 2, 2018, 3:11 AM Qing Lan  wrote:
> > > >
> > > >> Are we currently on a free plan? If we are, probably the unlimited
> > build
> > > >> minutes would help
> > > >>
> > > >> Thanks,
> > > >> Qing
> > > >>
> > > >> On 10/1/18, 6:08 PM, "kellen sunderland" <
> > kellen.sunderl...@gmail.com>
> > > >> wrote:
> > > >>
> > > >> Does the global time out change for paid plans?  I looked into
> it
> > > >> briefly
> > > >> but didn't see anything that would indicate it does.
> > > >>
> > > >> On Tue, Oct 2, 2018, 2:25 AM Pedro Larroy <
> > > >> pedro.larroy.li...@gmail.com>
> > > >> wrote:
> > > >>
> > > >> > I think there's two approaches that we can take to mitigate
> the
> > > >> build &
> > > >> > test time problem, in one hand use a paid travis CI plan, in
> > other
> > > >> improve
> > > >> > the unit tests in suites and only run a core set of tests, as
> we
> > > >> should do
> > > >> > on devices, but on this case we reduce coverage.
> > > >> >
> > > >> > https://travis-ci.com/plans
> > > >> >
> > > >> > Pedro.
> > > >> >
> > > >> > On Sat, Sep 29, 2018 at 6:53 PM YiZhi Liu <
> eazhi@gmail.com>
> > > >> wrote:
> > > >> >
> > > >> > > This makes sense. Thanks
> > > >> > >
> > > >> > > On Sat, Sep 29, 2018 at 6:36 PM kellen sunderland <
> > > >> > > kellen.sunderl...@gmail.com> wrote:
> > > >> > >
> > > >> > > > Hey Zhennan, yes this is the exact problem, and I agree
> with
> > > >> your
> > > >> > points
> > > >> > > > completely.  This is why when we first added Travis we
> > > >> attempted to
> > > >> > > > communicate that it would be informational only, and that
> > we'd
> > > >> need to
> > > >> > > > iterate on the config before it would be a test that
> people
> > > >> should
> > > >> > > consider
> > > >> > > > 'required'.  Apologies, we should have been more
> > > >> straightforward about
> > > >> > > > those tradeoffs.  The strong point in favour of adding
> > Travis
> > > in
> > > >> > > > informational mode was that we had a serious MacOS
> specific
> > > bug
> > > >> that we
> > > >> > > > wanted to verify was fixed.
> > > >> > > >
> > > >> > > > The good news is I've opened a PR which I hope will speed
> up
> > > >> these
> > > >> > builds
> > > >> > > > to the point that they won't rely on caching.  Once it is
> > > >> merged it
> > > >> > would
> > > >> > > > be very helpful if you could rebase on this PR and test to
> > > >> ensure that
> > > >> > > > large changes no longer hit the global timeout without
> > cache.
> > > >> > > > https://github.com/apache/incubator-mxnet/pull/12706
> > > >> > > >
> > > >> > > > On Sun, Sep 30, 2018 at 2:48 AM Qin, Zhennan <
> > > >> zhennan@intel.com>
> > > >> > > > wrote:
> > > >> > > >
> > > >> > > > > Hi YiZhi and Kellen,
> > > >> > > > >
> > > >> > > > > From my point of view, travis should be able to get
> passed
> > > >> from a
> > > >> > > scratch
> > > >> > > > > build. Pending result on ccache hit/miss is not a good
> > idea.
> > > >> For this
> > > >> > > PR,
> > > >> > > > > as it changed many header file, lots of files need be
> > > >> recompiled,
> > > >> > just

Re: Time out for Travis CI

2018-10-01 Thread kellen sunderland
Still worth following up with Travis (I've already messaged them).  They're
in the middle of reorganizing their business model and merging paid and
free accounts into the same service, so maybe this policy is changing.  It
doesn't make a lot of sense to me that public repo accounts would have
timeout limits that are different to private repo accounts in cases where
they are both paid.

On Tue, Oct 2, 2018, 4:27 AM Marco de Abreu
 wrote:

> Apache has it's own shared Travis fleet. We are basically using an
> on-premise version of the paid Travis plan. That was the information I got
> from Infra when I had a chat with them a few days ago. But from that
> conversation it was made pretty clear that we cannot increase the limits.
>
> -Marco
>
> kellen sunderland  schrieb am Di., 2. Okt.
> 2018, 03:25:
>
> > Interesting, this page seems to indicate that private projects do have a
> > longer time out.  I'll drop Travis a quick email and see what the deal
> > would be for our project.
> > https://docs.travis-ci.com/user/customizing-the-build/#build-timeouts.
> >
> > On Tue, Oct 2, 2018, 3:15 AM kellen sunderland <
> > kellen.sunderl...@gmail.com>
> > wrote:
> >
> > > I actually thought we were already using a paid plan through Apache
> > > https://blogs.apache.org/infra/entry/apache_gains_additional_travis_ci
> > >
> > > On Tue, Oct 2, 2018, 3:11 AM Qing Lan  wrote:
> > >
> > >> Are we currently on a free plan? If we are, probably the unlimited
> build
> > >> minutes would help
> > >>
> > >> Thanks,
> > >> Qing
> > >>
> > >> On 10/1/18, 6:08 PM, "kellen sunderland" <
> kellen.sunderl...@gmail.com>
> > >> wrote:
> > >>
> > >> Does the global time out change for paid plans?  I looked into it
> > >> briefly
> > >> but didn't see anything that would indicate it does.
> > >>
> > >> On Tue, Oct 2, 2018, 2:25 AM Pedro Larroy <
> > >> pedro.larroy.li...@gmail.com>
> > >> wrote:
> > >>
> > >> > I think there's two approaches that we can take to mitigate the
> > >> build &
> > >> > test time problem, in one hand use a paid travis CI plan, in
> other
> > >> improve
> > >> > the unit tests in suites and only run a core set of tests, as we
> > >> should do
> > >> > on devices, but on this case we reduce coverage.
> > >> >
> > >> > https://travis-ci.com/plans
> > >> >
> > >> > Pedro.
> > >> >
> > >> > On Sat, Sep 29, 2018 at 6:53 PM YiZhi Liu 
> > >> wrote:
> > >> >
> > >> > > This makes sense. Thanks
> > >> > >
> > >> > > On Sat, Sep 29, 2018 at 6:36 PM kellen sunderland <
> > >> > > kellen.sunderl...@gmail.com> wrote:
> > >> > >
> > >> > > > Hey Zhennan, yes this is the exact problem, and I agree with
> > >> your
> > >> > points
> > >> > > > completely.  This is why when we first added Travis we
> > >> attempted to
> > >> > > > communicate that it would be informational only, and that
> we'd
> > >> need to
> > >> > > > iterate on the config before it would be a test that people
> > >> should
> > >> > > consider
> > >> > > > 'required'.  Apologies, we should have been more
> > >> straightforward about
> > >> > > > those tradeoffs.  The strong point in favour of adding
> Travis
> > in
> > >> > > > informational mode was that we had a serious MacOS specific
> > bug
> > >> that we
> > >> > > > wanted to verify was fixed.
> > >> > > >
> > >> > > > The good news is I've opened a PR which I hope will speed up
> > >> these
> > >> > builds
> > >> > > > to the point that they won't rely on caching.  Once it is
> > >> merged it
> > >> > would
> > >> > > > be very helpful if you could rebase on this PR and test to
> > >> ensure that
> > >> > > > large changes no longer hit the global timeout without
> cache.
> > >> > > > https://github.com/apache/incubator-mxnet/pull/12706
> > >> > > >
> > >> > > > On Sun, Sep 30, 2018 at 2:48 AM Qin, Zhennan <
> > >> zhennan@intel.com>
> > >> > > > wrote:
> > >> > > >
> > >> > > > > Hi YiZhi and Kellen,
> > >> > > > >
> > >> > > > > From my point of view, travis should be able to get passed
> > >> from a
> > >> > > scratch
> > >> > > > > build. Pending result on ccache hit/miss is not a good
> idea.
> > >> For this
> > >> > > PR,
> > >> > > > > as it changed many header file, lots of files need be
> > >> recompiled,
> > >> > just
> > >> > > > like
> > >> > > > > a scratch build. I think that's the reason that travis
> > >> timeout. This
> > >> > > > should
> > >> > > > > be fixed before enabling travis, as it will block any
> change
> > >> to those
> > >> > > > base
> > >> > > > > header file. Again, it's not a special case with this PR
> > >> only, you
> > >> > can
> > >> > > > find
> > >> > > > > same problem on other PRs:
> > >> > > > >
> > >> > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> 

Re: Time out for Travis CI

2018-10-01 Thread Marco de Abreu
Apache has its own shared Travis fleet. We are basically using an
on-premise version of the paid Travis plan. That was the information I got
from Infra when I had a chat with them a few days ago. But from that
conversation it was made pretty clear that we cannot increase the limits.

-Marco

kellen sunderland  schrieb am Di., 2. Okt.
2018, 03:25:

> Interesting, this page seems to indicate that private projects do have a
> longer time out.  I'll drop Travis a quick email and see what the deal
> would be for our project.
> https://docs.travis-ci.com/user/customizing-the-build/#build-timeouts.
>
> On Tue, Oct 2, 2018, 3:15 AM kellen sunderland <
> kellen.sunderl...@gmail.com>
> wrote:
>
> > I actually thought we were already using a paid plan through Apache
> > https://blogs.apache.org/infra/entry/apache_gains_additional_travis_ci
> >
> > On Tue, Oct 2, 2018, 3:11 AM Qing Lan  wrote:
> >
> >> Are we currently on a free plan? If we are, probably the unlimited build
> >> minutes would help
> >>
> >> Thanks,
> >> Qing
> >>
> >> On 10/1/18, 6:08 PM, "kellen sunderland" 
> >> wrote:
> >>
> >> Does the global time out change for paid plans?  I looked into it
> >> briefly
> >> but didn't see anything that would indicate it does.
> >>
> >> On Tue, Oct 2, 2018, 2:25 AM Pedro Larroy <
> >> pedro.larroy.li...@gmail.com>
> >> wrote:
> >>
> >> > I think there's two approaches that we can take to mitigate the
> >> build &
> >> > test time problem, in one hand use a paid travis CI plan, in other
> >> improve
> >> > the unit tests in suites and only run a core set of tests, as we
> >> should do
> >> > on devices, but on this case we reduce coverage.
> >> >
> >> > https://travis-ci.com/plans
> >> >
> >> > Pedro.
> >> >
> >> > On Sat, Sep 29, 2018 at 6:53 PM YiZhi Liu 
> >> wrote:
> >> >
> >> > > This makes sense. Thanks
> >> > >
> >> > > On Sat, Sep 29, 2018 at 6:36 PM kellen sunderland <
> >> > > kellen.sunderl...@gmail.com> wrote:
> >> > >
> >> > > > Hey Zhennan, yes this is the exact problem, and I agree with
> >> your
> >> > points
> >> > > > completely.  This is why when we first added Travis we
> >> attempted to
> >> > > > communicate that it would be informational only, and that we'd
> >> need to
> >> > > > iterate on the config before it would be a test that people
> >> should
> >> > > consider
> >> > > > 'required'.  Apologies, we should have been more
> >> straightforward about
> >> > > > those tradeoffs.  The strong point in favour of adding Travis
> in
> >> > > > informational mode was that we had a serious MacOS specific
> bug
> >> that we
> >> > > > wanted to verify was fixed.
> >> > > >
> >> > > > The good news is I've opened a PR which I hope will speed up
> >> these
> >> > builds
> >> > > > to the point that they won't rely on caching.  Once it is
> >> merged it
> >> > would
> >> > > > be very helpful if you could rebase on this PR and test to
> >> ensure that
> >> > > > large changes no longer hit the global timeout without cache.
> >> > > > https://github.com/apache/incubator-mxnet/pull/12706
> >> > > >
> >> > > > On Sun, Sep 30, 2018 at 2:48 AM Qin, Zhennan <
> >> zhennan@intel.com>
> >> > > > wrote:
> >> > > >
> >> > > > > Hi YiZhi and Kellen,
> >> > > > >
> >> > > > > From my point of view, travis should be able to get passed
> >> from a
> >> > > scratch
> >> > > > > build. Pending result on ccache hit/miss is not a good idea.
> >> For this
> >> > > PR,
> >> > > > > as it changed many header file, lots of files need be
> >> recompiled,
> >> > just
> >> > > > like
> >> > > > > a scratch build. I think that's the reason that travis
> >> timeout. This
> >> > > > should
> >> > > > > be fixed before enabling travis, as it will block any change
> >> to those
> >> > > > base
> >> > > > > header file. Again, it's not a special case with this PR
> >> only, you
> >> > can
> >> > > > find
> >> > > > > same problem on other PRs:
> >> > > > >
> >> > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> https://travis-ci.org/apache/incubator-mxnet/builds/433172088?utm_source=github_status_medium=notification
> >> > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> https://travis-ci.org/apache/incubator-mxnet/builds/434404305?utm_source=github_status_medium=notification
> >> > > > >
> >> > > > >
> >> > > > > Thanks,
> >> > > > > Zhennan
> >> > > > >
> >> > > > > -Original Message-
> >> > > > > From: YiZhi Liu [mailto:eazhi@gmail.com]
> >> > > > > Sent: Sunday, September 30, 2018 5:15 AM
> >> > > > > To: eazhi@gmail.com
> >> > > > > Cc: dev@mxnet.incubator.apache.org
> >> > > > > Subject: Re: Time out for Travis CI
> >> > > > >
> >> > > > > while other PRs are all 

Re: Time out for Travis CI

2018-10-01 Thread kellen sunderland
Interesting, this page seems to indicate that private projects do have a
longer time out.  I'll drop Travis a quick email and see what the deal
would be for our project.
https://docs.travis-ci.com/user/customizing-the-build/#build-timeouts.

On Tue, Oct 2, 2018, 3:15 AM kellen sunderland 
wrote:

> I actually thought we were already using a paid plan through Apache
> https://blogs.apache.org/infra/entry/apache_gains_additional_travis_ci
>
> On Tue, Oct 2, 2018, 3:11 AM Qing Lan  wrote:
>
>> Are we currently on a free plan? If we are, probably the unlimited build
>> minutes would help
>>
>> Thanks,
>> Qing
>>
>> On 10/1/18, 6:08 PM, "kellen sunderland" 
>> wrote:
>>
>> Does the global time out change for paid plans?  I looked into it
>> briefly
>> but didn't see anything that would indicate it does.
>>
>> On Tue, Oct 2, 2018, 2:25 AM Pedro Larroy <
>> pedro.larroy.li...@gmail.com>
>> wrote:
>>
>> > I think there's two approaches that we can take to mitigate the
>> build &
>> > test time problem, in one hand use a paid travis CI plan, in other
>> improve
>> > the unit tests in suites and only run a core set of tests, as we
>> should do
>> > on devices, but on this case we reduce coverage.
>> >
>> > https://travis-ci.com/plans
>> >
>> > Pedro.
>> >
>> > On Sat, Sep 29, 2018 at 6:53 PM YiZhi Liu 
>> wrote:
>> >
>> > > This makes sense. Thanks
>> > >
>> > > On Sat, Sep 29, 2018 at 6:36 PM kellen sunderland <
>> > > kellen.sunderl...@gmail.com> wrote:
>> > >
>> > > > Hey Zhennan, yes this is the exact problem, and I agree with
>> your
>> > points
>> > > > completely.  This is why when we first added Travis we
>> attempted to
>> > > > communicate that it would be informational only, and that we'd
>> need to
>> > > > iterate on the config before it would be a test that people
>> should
>> > > consider
>> > > > 'required'.  Apologies, we should have been more
>> straightforward about
>> > > > those tradeoffs.  The strong point in favour of adding Travis in
>> > > > informational mode was that we had a serious MacOS specific bug
>> that we
>> > > > wanted to verify was fixed.
>> > > >
>> > > > The good news is I've opened a PR which I hope will speed up
>> these
>> > builds
>> > > > to the point that they won't rely on caching.  Once it is
>> merged it
>> > would
>> > > > be very helpful if you could rebase on this PR and test to
>> ensure that
>> > > > large changes no longer hit the global timeout without cache.
>> > > > https://github.com/apache/incubator-mxnet/pull/12706
>> > > >
>> > > > On Sun, Sep 30, 2018 at 2:48 AM Qin, Zhennan <
>> zhennan@intel.com>
>> > > > wrote:
>> > > >
>> > > > > Hi YiZhi and Kellen,
>> > > > >
>> > > > > From my point of view, travis should be able to get passed
>> from a
>> > > scratch
>> > > > > build. Pending result on ccache hit/miss is not a good idea.
>> For this
>> > > PR,
>> > > > > as it changed many header file, lots of files need be
>> recompiled,
>> > just
>> > > > like
>> > > > > a scratch build. I think that's the reason that travis
>> timeout. This
>> > > > should
>> > > > > be fixed before enabling travis, as it will block any change
>> to those
>> > > > base
>> > > > > header file. Again, it's not a special case with this PR
>> only, you
>> > can
>> > > > find
>> > > > > same problem on other PRs:
>> > > > >
>> > > > >
>> > > > >
>> > > >
>> > >
>> >
>> https://travis-ci.org/apache/incubator-mxnet/builds/433172088?utm_source=github_status_medium=notification
>> > > > >
>> > > > >
>> > > >
>> > >
>> >
>> https://travis-ci.org/apache/incubator-mxnet/builds/434404305?utm_source=github_status_medium=notification
>> > > > >
>> > > > >
>> > > > > Thanks,
>> > > > > Zhennan
>> > > > >
>> > > > > -Original Message-
>> > > > > From: YiZhi Liu [mailto:eazhi@gmail.com]
>> > > > > Sent: Sunday, September 30, 2018 5:15 AM
>> > > > > To: eazhi@gmail.com
>> > > > > Cc: dev@mxnet.incubator.apache.org
>> > > > > Subject: Re: Time out for Travis CI
>> > > > >
>> > > > > while other PRs are all good.
>> > > > > On Sat, Sep 29, 2018 at 2:13 PM YiZhi Liu <
>> eazhi@gmail.com>
>> > wrote:
>> > > > > >
>> > > > > > Honestly I don't know yet. I can help to investigate. Just
>> given
>> > the
>> > > > > > evidence that, travis timeout every time it gets
>> re-triggered - 2
>> > > > > > times at least. Correct me if I'm wrong @ Zhennan On Sat,
>> Sep 29,
>> > > 2018
>> > > > > > at 1:54 PM kellen sunderland 
>> wrote:
>> > > > > > >
>> > > > > > > Reading over the PR I don't see what aspects would cause
>> extra
>> > > > > > > runtime YiZhi, could you point them out?
>> > > > > > >
>> 

Re: Time out for Travis CI

2018-10-01 Thread Qing Lan
From the link it looks like "Travis CI offers a free account" rather than Apache 
buying it. It may just be a free user account with an extension on the number of 
nodes it can run on. I think we may need to reach out to Travis or Apache to 
clarify whether we currently have the service that the paid version provides, or 
just an extension of a "free user account".

Thanks,
Qing

On 10/1/18, 6:15 PM, "kellen sunderland"  wrote:

I actually thought we were already using a paid plan through Apache
https://blogs.apache.org/infra/entry/apache_gains_additional_travis_ci

On Tue, Oct 2, 2018, 3:11 AM Qing Lan  wrote:

> Are we currently on a free plan? If we are, probably the unlimited build
> minutes would help
>
> Thanks,
> Qing
>
> On 10/1/18, 6:08 PM, "kellen sunderland" 
> wrote:
>
> Does the global time out change for paid plans?  I looked into it
> briefly
> but didn't see anything that would indicate it does.
>
> On Tue, Oct 2, 2018, 2:25 AM Pedro Larroy <
> pedro.larroy.li...@gmail.com>
> wrote:
>
> > I think there's two approaches that we can take to mitigate the
> build &
> > test time problem, in one hand use a paid travis CI plan, in other
> improve
> > the unit tests in suites and only run a core set of tests, as we
> should do
> > on devices, but on this case we reduce coverage.
> >
> > https://travis-ci.com/plans
> >
> > Pedro.
> >
> > On Sat, Sep 29, 2018 at 6:53 PM YiZhi Liu 
> wrote:
> >
> > > This makes sense. Thanks
> > >
> > > On Sat, Sep 29, 2018 at 6:36 PM kellen sunderland <
> > > kellen.sunderl...@gmail.com> wrote:
> > >
> > > > Hey Zhennan, yes this is the exact problem, and I agree with 
your
> > points
> > > > completely.  This is why when we first added Travis we attempted
> to
> > > > communicate that it would be informational only, and that we'd
> need to
> > > > iterate on the config before it would be a test that people
> should
> > > consider
> > > > 'required'.  Apologies, we should have been more straightforward
> about
> > > > those tradeoffs.  The strong point in favour of adding Travis in
> > > > informational mode was that we had a serious MacOS specific bug
> that we
> > > > wanted to verify was fixed.
> > > >
> > > > The good news is I've opened a PR which I hope will speed up
> these
> > builds
> > > > to the point that they won't rely on caching.  Once it is merged
> it
> > would
> > > > be very helpful if you could rebase on this PR and test to
> ensure that
> > > > large changes no longer hit the global timeout without cache.
> > > > https://github.com/apache/incubator-mxnet/pull/12706
> > > >
> > > > On Sun, Sep 30, 2018 at 2:48 AM Qin, Zhennan <
> zhennan@intel.com>
> > > > wrote:
> > > >
> > > > > Hi YiZhi and Kellen,
> > > > >
> > > > > From my point of view, travis should be able to get passed
> from a
> > > scratch
> > > > > build. Pending result on ccache hit/miss is not a good idea.
> For this
> > > PR,
> > > > > as it changed many header file, lots of files need be
> recompiled,
> > just
> > > > like
> > > > > a scratch build. I think that's the reason that travis
> timeout. This
> > > > should
> > > > > be fixed before enabling travis, as it will block any change
> to those
> > > > base
> > > > > header file. Again, it's not a special case with this PR only,
> you
> > can
> > > > find
> > > > > same problem on other PRs:
> > > > >
> > > > >
> > > > >
> > > >
> > >
> >
> 
https://travis-ci.org/apache/incubator-mxnet/builds/433172088?utm_source=github_status_medium=notification
> > > > >
> > > > >
> > > >
> > >
> >
> 
https://travis-ci.org/apache/incubator-mxnet/builds/434404305?utm_source=github_status_medium=notification
> > > > >
> > > > >
> > > > > Thanks,
> > > > > Zhennan
> > > > >
> > > > > -Original Message-
> > > > > From: YiZhi Liu [mailto:eazhi@gmail.com]
> > > > > Sent: Sunday, September 30, 2018 5:15 AM
> > > > > To: eazhi@gmail.com
> > > > > Cc: dev@mxnet.incubator.apache.org
> > > > > Subject: Re: Time out for Travis CI
> > > > >
> > > > > while other PRs are all good.
> > > > > On Sat, Sep 29, 2018 at 2:13 PM YiZhi Liu  >
> > wrote:
> > > > > >
> > > > > > Honestly I don't know yet. I can help to 

Re: Time out for Travis CI

2018-10-01 Thread kellen sunderland
I actually thought we were already using a paid plan through Apache
https://blogs.apache.org/infra/entry/apache_gains_additional_travis_ci

On Tue, Oct 2, 2018, 3:11 AM Qing Lan  wrote:

> Are we currently on a free plan? If we are, probably the unlimited build
> minutes would help
>
> Thanks,
> Qing
>
> On 10/1/18, 6:08 PM, "kellen sunderland" 
> wrote:
>
> Does the global time out change for paid plans?  I looked into it
> briefly
> but didn't see anything that would indicate it does.
>
> On Tue, Oct 2, 2018, 2:25 AM Pedro Larroy <
> pedro.larroy.li...@gmail.com>
> wrote:
>
> > I think there's two approaches that we can take to mitigate the
> build &
> > test time problem, in one hand use a paid travis CI plan, in other
> improve
> > the unit tests in suites and only run a core set of tests, as we
> should do
> > on devices, but on this case we reduce coverage.
> >
> > https://travis-ci.com/plans
> >
> > Pedro.
> >
> > On Sat, Sep 29, 2018 at 6:53 PM YiZhi Liu 
> wrote:
> >
> > > This makes sense. Thanks
> > >
> > > On Sat, Sep 29, 2018 at 6:36 PM kellen sunderland <
> > > kellen.sunderl...@gmail.com> wrote:
> > >
> > > > Hey Zhennan, yes this is the exact problem, and I agree with your
> > points
> > > > completely.  This is why when we first added Travis we attempted
> to
> > > > communicate that it would be informational only, and that we'd
> need to
> > > > iterate on the config before it would be a test that people
> should
> > > consider
> > > > 'required'.  Apologies, we should have been more straightforward
> about
> > > > those tradeoffs.  The strong point in favour of adding Travis in
> > > > informational mode was that we had a serious MacOS specific bug
> that we
> > > > wanted to verify was fixed.
> > > >
> > > > The good news is I've opened a PR which I hope will speed up
> these
> > builds
> > > > to the point that they won't rely on caching.  Once it is merged
> it
> > would
> > > > be very helpful if you could rebase on this PR and test to
> ensure that
> > > > large changes no longer hit the global timeout without cache.
> > > > https://github.com/apache/incubator-mxnet/pull/12706
> > > >
> > > > On Sun, Sep 30, 2018 at 2:48 AM Qin, Zhennan <
> zhennan@intel.com>
> > > > wrote:
> > > >
> > > > > Hi YiZhi and Kellen,
> > > > >
> > > > > From my point of view, travis should be able to get passed
> from a
> > > scratch
> > > > > build. Pending result on ccache hit/miss is not a good idea.
> For this
> > > PR,
> > > > > as it changed many header file, lots of files need be
> recompiled,
> > just
> > > > like
> > > > > a scratch build. I think that's the reason that travis
> timeout. This
> > > > should
> > > > > be fixed before enabling travis, as it will block any change
> to those
> > > > base
> > > > > header file. Again, it's not a special case with this PR only,
> you
> > can
> > > > find
> > > > > same problem on other PRs:
> > > > >
> > > > >
> > > > >
> > > >
> > >
> >
> https://travis-ci.org/apache/incubator-mxnet/builds/433172088?utm_source=github_status_medium=notification
> > > > >
> > > > >
> > > >
> > >
> >
> https://travis-ci.org/apache/incubator-mxnet/builds/434404305?utm_source=github_status_medium=notification
> > > > >
> > > > >
> > > > > Thanks,
> > > > > Zhennan
> > > > >
> > > > > -Original Message-
> > > > > From: YiZhi Liu [mailto:eazhi@gmail.com]
> > > > > Sent: Sunday, September 30, 2018 5:15 AM
> > > > > To: eazhi@gmail.com
> > > > > Cc: dev@mxnet.incubator.apache.org
> > > > > Subject: Re: Time out for Travis CI
> > > > >
> > > > > while other PRs are all good.
> > > > > On Sat, Sep 29, 2018 at 2:13 PM YiZhi Liu  >
> > wrote:
> > > > > >
> > > > > > Honestly I don't know yet. I can help to investigate. Just
> given
> > the
> > > > > > evidence that, travis timeout every time it gets
> re-triggered - 2
> > > > > > times at least. Correct me if I'm wrong @ Zhennan On Sat,
> Sep 29,
> > > 2018
> > > > > > at 1:54 PM kellen sunderland 
> wrote:
> > > > > > >
> > > > > > > Reading over the PR I don't see what aspects would cause
> extra
> > > > > > > runtime YiZhi, could you point them out?
> > > > > > >
> > > > > > > On Sat, Sep 29, 2018 at 8:46 PM YiZhi Liu <
> eazhi@gmail.com>
> > > > wrote:
> > > > > > >
> > > > > > > > Kellen, I think this PR introduces extra runtime in CI,
> thus
> > > > > > > > causes the timeout. Which means, once merged, every PR
> later
> > will
> > > > > > > > see same timeout in travis.
> > > > > > > >
> > > > > > > > So shall we modify the changes to decrease the test
> running
> > time?
> > > 

Re: Time out for Travis CI

2018-10-01 Thread Qing Lan
Are we currently on a free plan? If we are, probably the unlimited build 
minutes would help

Thanks,
Qing

On 10/1/18, 6:08 PM, "kellen sunderland"  wrote:

Does the global time out change for paid plans?  I looked into it briefly
but didn't see anything that would indicate it does.

On Tue, Oct 2, 2018, 2:25 AM Pedro Larroy 
wrote:

> I think there's two approaches that we can take to mitigate the build &
> test time problem, in one hand use a paid travis CI plan, in other improve
> the unit tests in suites and only run a core set of tests, as we should do
> on devices, but on this case we reduce coverage.
>
> https://travis-ci.com/plans
>
> Pedro.
>
> On Sat, Sep 29, 2018 at 6:53 PM YiZhi Liu  wrote:
>
> > This makes sense. Thanks
> >
> > On Sat, Sep 29, 2018 at 6:36 PM kellen sunderland <
> > kellen.sunderl...@gmail.com> wrote:
> >
> > > Hey Zhennan, yes this is the exact problem, and I agree with your
> points
> > > completely.  This is why when we first added Travis we attempted to
> > > communicate that it would be informational only, and that we'd need to
> > > iterate on the config before it would be a test that people should
> > consider
> > > 'required'.  Apologies, we should have been more straightforward about
> > > those tradeoffs.  The strong point in favour of adding Travis in
> > > informational mode was that we had a serious MacOS specific bug that 
we
> > > wanted to verify was fixed.
> > >
> > > The good news is I've opened a PR which I hope will speed up these
> builds
> > > to the point that they won't rely on caching.  Once it is merged it
> would
> > > be very helpful if you could rebase on this PR and test to ensure that
> > > large changes no longer hit the global timeout without cache.
> > > https://github.com/apache/incubator-mxnet/pull/12706
> > >
> > > On Sun, Sep 30, 2018 at 2:48 AM Qin, Zhennan 
> > > wrote:
> > >
> > > > Hi YiZhi and Kellen,
> > > >
> > > > From my point of view, travis should be able to get passed from a
> > scratch
> > > > build. Pending result on ccache hit/miss is not a good idea. For 
this
> > PR,
> > > > as it changed many header file, lots of files need be recompiled,
> just
> > > like
> > > > a scratch build. I think that's the reason that travis timeout. This
> > > should
> > > > be fixed before enabling travis, as it will block any change to 
those
> > > base
> > > > header file. Again, it's not a special case with this PR only, you
> can
> > > find
> > > > same problem on other PRs:
> > > >
> > > >
> > > >
> > >
> >
> 
https://travis-ci.org/apache/incubator-mxnet/builds/433172088?utm_source=github_status_medium=notification
> > > >
> > > >
> > >
> >
> 
https://travis-ci.org/apache/incubator-mxnet/builds/434404305?utm_source=github_status_medium=notification
> > > >
> > > >
> > > > Thanks,
> > > > Zhennan
> > > >
> > > > -Original Message-
> > > > From: YiZhi Liu [mailto:eazhi@gmail.com]
> > > > Sent: Sunday, September 30, 2018 5:15 AM
> > > > To: eazhi@gmail.com
> > > > Cc: dev@mxnet.incubator.apache.org
> > > > Subject: Re: Time out for Travis CI
> > > >
> > > > while other PRs are all good.
> > > > On Sat, Sep 29, 2018 at 2:13 PM YiZhi Liu 
> wrote:
> > > > >
> > > > > Honestly I don't know yet. I can help to investigate. Just given
> the
> > > > > evidence that, travis timeout every time it gets re-triggered - 2
> > > > > times at least. Correct me if I'm wrong @ Zhennan On Sat, Sep 29,
> > 2018
> > > > > at 1:54 PM kellen sunderland  wrote:
> > > > > >
> > > > > > Reading over the PR I don't see what aspects would cause extra
> > > > > > runtime YiZhi, could you point them out?
> > > > > >
> > > > > > On Sat, Sep 29, 2018 at 8:46 PM YiZhi Liu 
> > > wrote:
> > > > > >
> > > > > > > Kellen, I think this PR introduces extra runtime in CI, thus
> > > > > > > causes the timeout. Which means, once merged, every PR later
> will
> > > > > > > see same timeout in travis.
> > > > > > >
> > > > > > > So shall we modify the changes to decrease the test running
> time?
> > > > > > > or just disable the Travis CI?
> > > > > > >
> > > > > > >
> > > > > > > On Fri, Sep 28, 2018 at 9:17 PM Qin, Zhennan
> > > > > > > 
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > Hi Kellen,
> > > > > > > >
> > > > > > > > Thanks for your explanation. Do you have a time plan to 
solve
> > > > > > > > the
> > > > > > > timeout issue? Rebasing can't work for my case. Or shall we 
run
> > it
> > > > > > > silently to disallow it voting X for overall CI result? 
Because
> > 

Re: Time out for Travis CI

2018-10-01 Thread kellen sunderland
Does the global time out change for paid plans?  I looked into it briefly
but didn't see anything that would indicate it does.

On Tue, Oct 2, 2018, 2:25 AM Pedro Larroy 
wrote:

> I think there's two approaches that we can take to mitigate the build &
> test time problem, in one hand use a paid travis CI plan, in other improve
> the unit tests in suites and only run a core set of tests, as we should do
> on devices, but on this case we reduce coverage.
>
> https://travis-ci.com/plans
>
> Pedro.
>
> On Sat, Sep 29, 2018 at 6:53 PM YiZhi Liu  wrote:
>
> > This makes sense. Thanks
> >
> > On Sat, Sep 29, 2018 at 6:36 PM kellen sunderland <
> > kellen.sunderl...@gmail.com> wrote:
> >
> > > Hey Zhennan, yes this is the exact problem, and I agree with your
> points
> > > completely.  This is why when we first added Travis we attempted to
> > > communicate that it would be informational only, and that we'd need to
> > > iterate on the config before it would be a test that people should
> > consider
> > > 'required'.  Apologies, we should have been more straightforward about
> > > those tradeoffs.  The strong point in favour of adding Travis in
> > > informational mode was that we had a serious MacOS specific bug that we
> > > wanted to verify was fixed.
> > >
> > > The good news is I've opened a PR which I hope will speed up these
> builds
> > > to the point that they won't rely on caching.  Once it is merged it
> would
> > > be very helpful if you could rebase on this PR and test to ensure that
> > > large changes no longer hit the global timeout without cache.
> > > https://github.com/apache/incubator-mxnet/pull/12706
> > >
> > > On Sun, Sep 30, 2018 at 2:48 AM Qin, Zhennan 
> > > wrote:
> > >
> > > > Hi YiZhi and Kellen,
> > > >
> > > > From my point of view, travis should be able to get passed from a
> > scratch
> > > > build. Pending result on ccache hit/miss is not a good idea. For this
> > PR,
> > > > as it changed many header file, lots of files need be recompiled,
> just
> > > like
> > > > a scratch build. I think that's the reason that travis timeout. This
> > > should
> > > > be fixed before enabling travis, as it will block any change to those
> > > base
> > > > header file. Again, it's not a special case with this PR only, you
> can
> > > find
> > > > same problem on other PRs:
> > > >
> > > >
> > > >
> > >
> >
> https://travis-ci.org/apache/incubator-mxnet/builds/433172088?utm_source=github_status_medium=notification
> > > >
> > > >
> > >
> >
> https://travis-ci.org/apache/incubator-mxnet/builds/434404305?utm_source=github_status_medium=notification
> > > >
> > > >
> > > > Thanks,
> > > > Zhennan
> > > >
> > > > -Original Message-
> > > > From: YiZhi Liu [mailto:eazhi@gmail.com]
> > > > Sent: Sunday, September 30, 2018 5:15 AM
> > > > To: eazhi@gmail.com
> > > > Cc: dev@mxnet.incubator.apache.org
> > > > Subject: Re: Time out for Travis CI
> > > >
> > > > while other PRs are all good.
> > > > On Sat, Sep 29, 2018 at 2:13 PM YiZhi Liu 
> wrote:
> > > > >
> > > > > Honestly I don't know yet. I can help to investigate. Just given
> the
> > > > > evidence that, travis timeout every time it gets re-triggered - 2
> > > > > times at least. Correct me if I'm wrong @ Zhennan On Sat, Sep 29,
> > 2018
> > > > > at 1:54 PM kellen sunderland  wrote:
> > > > > >
> > > > > > Reading over the PR I don't see what aspects would cause extra
> > > > > > runtime YiZhi, could you point them out?
> > > > > >
> > > > > > On Sat, Sep 29, 2018 at 8:46 PM YiZhi Liu 
> > > wrote:
> > > > > >
> > > > > > > Kellen, I think this PR introduces extra runtime in CI, thus
> > > > > > > causes the timeout. Which means, once merged, every PR later
> will
> > > > > > > see same timeout in travis.
> > > > > > >
> > > > > > > So shall we modify the changes to decrease the test running
> time?
> > > > > > > or just disable the Travis CI?
> > > > > > >
> > > > > > >
> > > > > > > On Fri, Sep 28, 2018 at 9:17 PM Qin, Zhennan
> > > > > > > 
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > Hi Kellen,
> > > > > > > >
> > > > > > > > Thanks for your explanation. Do you have a time plan to solve
> > > > > > > > the
> > > > > > > timeout issue? Rebasing can't work for my case. Or shall we run
> > it
> > > > > > > silently to disallow it voting X for overall CI result? Because
> > > > > > > most developers are used to ignore the PRs with 'X'.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Zhennan
> > > > > > > >
> > > > > > > > -Original Message-
> > > > > > > > From: kellen sunderland [mailto:kellen.sunderl...@gmail.com]
> > > > > > > > Sent: Friday, September 28, 2018 10:38 PM
> > > > > > > > To: dev@mxnet.incubator.apache.org
> > > > > > > > Subject: Re: Time out for Travis CI
> > > > > > > >
> > > > > > > > Hey Zhennan, you're safe to ignore Travis failures for now.
> > > > > > > > They're
> > > > > > > just informational.
> > > > > > > >
> > > > > > > > The reason you sometimes see quick builds and 

MXNet Podling Report - October

2018-10-01 Thread Haibin Lin
Hi MXNet community,

The podling report for MXNet is due on October 3rd. The report covers
MXNet's progress on community development and project development (the
previous one can be found here).
You can search "MXNet" at https://wiki.apache.org/incubator/October2018 for
MXNet's draft report for October. Please help review and contribute to the
report before it's due.

If you have any suggestions on improving the report, please let me know and
I'm happy to update the report based on the feedback. Thanks!

Best regards,
Haibin


Re: Time out for Travis CI

2018-10-01 Thread Pedro Larroy
I think there are two approaches we can take to mitigate the build &
test time problem: on one hand, use a paid Travis CI plan; on the other, organize
the unit tests into suites and only run a core set of tests, as we should do
on devices, though in that case we reduce coverage.

https://travis-ci.com/plans

Pedro.
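
On the second approach, a minimal sketch of how a core subset could be tagged and
selected with nose's attrib plugin (nose being the runner the Python tests use);
the 'core' attribute name and the example tests are hypothetical, not the project's
actual configuration:

    from nose.plugins.attrib import attr
    import mxnet as mx

    @attr('core')                      # selected with: nosetests -a core tests/python/unittest
    def test_elementwise_add_core():
        a = mx.nd.ones((2, 2))
        assert ((a + a).asnumpy() == 2).all()

    def test_large_tensor_extended():  # untagged: only runs in the full suite
        assert mx.nd.zeros((1024, 1024)).sum().asscalar() == 0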

On Sat, Sep 29, 2018 at 6:53 PM YiZhi Liu  wrote:

> This makes sense. Thanks
>
> On Sat, Sep 29, 2018 at 6:36 PM kellen sunderland <
> kellen.sunderl...@gmail.com> wrote:
>
> > Hey Zhennan, yes this is the exact problem, and I agree with your points
> > completely.  This is why when we first added Travis we attempted to
> > communicate that it would be informational only, and that we'd need to
> > iterate on the config before it would be a test that people should
> consider
> > 'required'.  Apologies, we should have been more straightforward about
> > those tradeoffs.  The strong point in favour of adding Travis in
> > informational mode was that we had a serious MacOS specific bug that we
> > wanted to verify was fixed.
> >
> > The good news is I've opened a PR which I hope will speed up these builds
> > to the point that they won't rely on caching.  Once it is merged it would
> > be very helpful if you could rebase on this PR and test to ensure that
> > large changes no longer hit the global timeout without cache.
> > https://github.com/apache/incubator-mxnet/pull/12706
> >
> > On Sun, Sep 30, 2018 at 2:48 AM Qin, Zhennan 
> > wrote:
> >
> > > Hi YiZhi and Kellen,
> > >
> > > From my point of view, travis should be able to get passed from a
> scratch
> > > build. Pending result on ccache hit/miss is not a good idea. For this
> PR,
> > > as it changed many header file, lots of files need be recompiled, just
> > like
> > > a scratch build. I think that's the reason that travis timeout. This
> > should
> > > be fixed before enabling travis, as it will block any change to those
> > base
> > > header file. Again, it's not a special case with this PR only, you can
> > find
> > > same problem on other PRs:
> > >
> > >
> > >
> >
> https://travis-ci.org/apache/incubator-mxnet/builds/433172088?utm_source=github_status_medium=notification
> > >
> > >
> >
> https://travis-ci.org/apache/incubator-mxnet/builds/434404305?utm_source=github_status_medium=notification
> > >
> > >
> > > Thanks,
> > > Zhennan
> > >
> > > -Original Message-
> > > From: YiZhi Liu [mailto:eazhi@gmail.com]
> > > Sent: Sunday, September 30, 2018 5:15 AM
> > > To: eazhi@gmail.com
> > > Cc: dev@mxnet.incubator.apache.org
> > > Subject: Re: Time out for Travis CI
> > >
> > > while other PRs are all good.
> > > On Sat, Sep 29, 2018 at 2:13 PM YiZhi Liu  wrote:
> > > >
> > > > Honestly I don't know yet. I can help to investigate. Just given the
> > > > evidence that, travis timeout every time it gets re-triggered - 2
> > > > times at least. Correct me if I'm wrong @ Zhennan On Sat, Sep 29,
> 2018
> > > > at 1:54 PM kellen sunderland  wrote:
> > > > >
> > > > > Reading over the PR I don't see what aspects would cause extra
> > > > > runtime YiZhi, could you point them out?
> > > > >
> > > > > On Sat, Sep 29, 2018 at 8:46 PM YiZhi Liu 
> > > > > wrote:
> > > > >
> > > > > > Kellen, I think this PR introduces extra runtime in CI, which
> > > > > > causes the timeout. That means that once it is merged, every later
> > > > > > PR will see the same timeout in Travis.
> > > > > >
> > > > > > So shall we modify the changes to decrease the test running time,
> > > > > > or just disable the Travis CI?
> > > > > >
> > > > > >
> > > > > > On Fri, Sep 28, 2018 at 9:17 PM Qin, Zhennan
> > > > > > 
> > > > > > wrote:
> > > > > > >
> > > > > > > Hi Kellen,
> > > > > > >
> > > > > > > Thanks for your explanation. Do you have a timeline for solving
> > > > > > > the timeout issue? Rebasing doesn't work in my case. Or shall we
> > > > > > > run it silently, so that it doesn't vote an X for the overall CI
> > > > > > > result? Most developers are used to ignoring PRs with an 'X'.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Zhennan
> > > > > > >
> > > > > > > -Original Message-
> > > > > > > From: kellen sunderland [mailto:kellen.sunderl...@gmail.com]
> > > > > > > Sent: Friday, September 28, 2018 10:38 PM
> > > > > > > To: dev@mxnet.incubator.apache.org
> > > > > > > Subject: Re: Time out for Travis CI
> > > > > > >
> > > > > > > Hey Zhennan, you're safe to ignore Travis failures for now.  They're
> > > > > > > just informational.
> > > > > > >
> > > > > > > The reason you sometimes see quick builds and sometimes see slow
> > > > > > > builds is that we're making use of ccache in between builds.  If your
> > > > > > > PR is similar to what's in master you should build very quickly, if
> > > > > > > not it's going to take a while and likely time out.  If you see
> > > > > > > timeouts rebasing may speed things up.  Unfortunately the timeouts
> > > > > > > are global and we're not able to increase them.  I'm hoping that
> > > > > > >

CUDNN algorithm selection failure

2018-10-01 Thread Pedro Larroy
Hi

I saw this failure on CI:
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/1697/pipeline

Have you seen other cases where we fail to select the best CUDNN algorithm?
Under which circumstances could this happen, and do you think it is a good idea
to have one selected by default as a last resort?
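
For reference, a minimal sketch of the knob that exists today, assuming the
documented MXNET_CUDNN_AUTOTUNE_DEFAULT environment variable (shapes below
are arbitrary and this is illustrative only, not a fix for the CI failure):

    # Hedged sketch: MXNET_CUDNN_AUTOTUNE_DEFAULT (0 = no autotune, 1 = best
    # algo within the workspace limit, 2 = fastest algo) controls cuDNN
    # algorithm selection.  Setting it before import keeps the example simple.
    import os
    os.environ['MXNET_CUDNN_AUTOTUNE_DEFAULT'] = '0'

    import mxnet as mx

    ctx = mx.gpu(0)
    data = mx.nd.random.uniform(shape=(1, 3, 32, 32), ctx=ctx)
    weight = mx.nd.random.uniform(shape=(8, 3, 3, 3), ctx=ctx)
    out = mx.nd.Convolution(data=data, weight=weight, no_bias=True,
                            kernel=(3, 3), num_filter=8)
    out.wait_to_read()
    print(out.shape)  # (1, 8, 30, 30)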


Pedro.


Re: [Discuss] Next MXNet release

2018-10-01 Thread Haibin Lin
I found two bugs related to the Gluon Trainer with a distributed KVStore.
Basically, if someone uses Gluon for distributed training with a learning rate
schedule (e.g., training ResNet-50 for image classification), it won't work.

https://github.com/apache/incubator-mxnet/issues/12713

I have the fix for the first bug locally, but I don't have the fix for the
second one.
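
For reference, a minimal sketch of the affected usage pattern (the network,
optimizer settings and schedule are arbitrary placeholders, and running it
for real requires a launched distributed cluster):

    # Sketch of the pattern described above: a Gluon Trainer backed by a
    # distributed KVStore combined with a learning rate schedule.  Everything
    # numeric here is an arbitrary placeholder.
    import mxnet as mx
    from mxnet import gluon

    net = gluon.nn.Dense(10)
    net.initialize()

    schedule = mx.lr_scheduler.FactorScheduler(step=1000, factor=0.9)
    trainer = gluon.Trainer(
        net.collect_params(), 'sgd',
        {'learning_rate': 0.1, 'lr_scheduler': schedule},
        kvstore='dist_sync')  # distributed KVStore

    # Explicit learning-rate updates exercise the same code path:
    trainer.set_learning_rate(0.05)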

Best,
Haibin

On Mon, Oct 1, 2018 at 10:14 AM Afrooze, Sina  wrote:

> This post suggests there is a regression from 1.1.0 to 1.2.1 related to
> MKLDNN integration:
> https://discuss.mxnet.io/t/mxnet-1-2-1-module-get-outputs/1882
>
> The error is related to MKLDNN layout not being converted back to MXNet
> layout in some operator: " !IsMKLDNNData() We can’t generate TBlob for
> MKLDNN data. Please use Reorder2Default() to generate a new NDArray first"
>
> Sina
>
>
>
>
> On 9/30/18, 6:55 PM, "Steffen Rochel"  wrote:
>
> Thanks Patrick.
> Updated roadmap and next release content.
>
> Patrick - suggest to send a reminder to review the design doc and
> collect
> feedback.
> Are there still known issues or gaps before we declare MKL-DNN
> integration
> as GA?
>
> Regards,
> Steffen
>
> On Sat, Sep 29, 2018 at 1:31 AM Zhao, Patric 
> wrote:
>
> > Thanks, Steffen.
> >
> > Regarding the next release note, two items from our side:
> >
> > 1. (-remove) MKL-DNN integration is done. I think we can remove this
> item.
> > 2. (+add) MKL-DNN based graph optimization and quantization by
> subgraph
> > Design doc:
> >
> https://cwiki.apache.org/confluence/display/MXNET/MXNet+Graph+Optimization+and+Quantization+based+on+subgraph+and+MKL-DNN
> > Lead Contributor: Patric Zhao,
> https://github.com/pengzhao-intel/
> >
> > Regarding the Roadmap
> > (+add) Q1 2019: MKL-DNN RNN API supports
> >
> > BR,
> >
> > Thanks,
> >
> > --Patric
> >
> >
> > > -Original Message-
> > > From: kellen sunderland [mailto:kellen.sunderl...@gmail.com]
> > > Sent: Saturday, September 29, 2018 11:31 AM
> > > To: dev@mxnet.incubator.apache.org
> > > Subject: Re: [Discuss] Next MXNet release
> > >
> > > Sorry I meant to say next 'Regarding the *minor* release'.
> > >
> > > On Sat, Sep 29, 2018 at 5:27 AM kellen sunderland <
> > > kellen.sunderl...@gmail.com> wrote:
> > >
> > > > Thanks for transparently setting a rough timeline Steffen.  I
> think
> > > > this will go a long way in helping the community plan their
> work, even
> > > > if the details change somewhat on the road to the release.
> > > >
> > > > Regarding the major release: I would propose we unify TensorRT
> with
> > > > the subgraph operator work.
> > > >
> > > > Regarding the patch release:  There were a few minor stack/buffer
> > > > overflows exposed by ASAN that have been addressed.  It's
> probably a
> > > > good idea to include them in a patch release, as they at best
> result
> > > > in non-deterministic behaviour.
> > > >
> > > > -Kellen
> > > >
> > > >
> > > > On Sat, Sep 29, 2018 at 1:39 AM Steffen Rochel
> > > > 
> > > > wrote:
> > > >
> > > >> I updated
> > > >>
> > > >>
> https://cwiki.apache.org/confluence/display/MXNET/Project+Proposals+f
> > > >> or+next+MXNet+Release
> > > >> ,
> > > >> removed the completed items from 1.3 release and would like to
> kick
> > > >> off discussion about the next release. Please suggest what you
> would
> > > >> like to see included in the next release together with link to
> design
> > > >> proposal (appropriately for the size and complexity of the
> proposal)
> > > >> or suggest changes.
> > > >> I suggest to target the next release for December 2018 to frame
> the
> > > >> discussion.
> > > >> Lets include review of
> > > >> https://cwiki.apache.org/confluence/display/MXNET/MXNet+Roadmap
> -
> > > >> time to update and discuss changes.
> > > >>
> > > >> From the 1.3 release we had discussion regarding
> > > >> https://github.com/apache/incubator-mxnet/issues/11849 and
> resolution
> > > >> in
> > > >> https://github.com/apache/incubator-mxnet/pull/12412 .
> > > >> Are you aware of critical issues and feedback from user which we
> > > >> should consider for a potential 1.3.1 patch release. Should we
> > > >> include PR 12412 in a potential patch release?
> > > >>
> > > >> Regards,
> > > >> Steffen
> > > >>
> > > >
> >
>
>
>
>


Re: [Discuss] Next MXNet release

2018-10-01 Thread Afrooze, Sina
This post suggests there is a regression from 1.1.0 to 1.2.1 related to MKLDNN 
integration: https://discuss.mxnet.io/t/mxnet-1-2-1-module-get-outputs/1882

The error is related to MKLDNN layout not being converted back to MXNet layout 
in some operator: " !IsMKLDNNData() We can’t generate TBlob for MKLDNN data. 
Please use Reorder2Default() to generate a new NDArray first"
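
For reference, a rough sketch of the kind of Module workflow the thread
describes (shapes and network are arbitrary; whether a given build hits the
error depends on MKL-DNN being enabled and on the operator involved):

    # Hedged repro sketch: run a symbolic Module forward pass on CPU with an
    # MKL-DNN-enabled build and read the outputs back.  The network and shapes
    # are arbitrary; the report is that on 1.2.1 get_outputs() can return an
    # NDArray still in MKL-DNN layout, triggering the error above when read.
    import mxnet as mx

    data = mx.sym.Variable('data')
    net = mx.sym.Convolution(data, kernel=(3, 3), num_filter=8, name='conv')

    mod = mx.mod.Module(net, data_names=['data'], label_names=None, context=mx.cpu())
    mod.bind(data_shapes=[('data', (1, 3, 32, 32))], for_training=False)
    mod.init_params()

    mod.forward(mx.io.DataBatch(data=[mx.nd.ones((1, 3, 32, 32))]), is_train=False)
    out = mod.get_outputs()[0]
    print(out.asnumpy().shape)  # reading the output back is where the error is reported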

Sina




On 9/30/18, 6:55 PM, "Steffen Rochel"  wrote:

Thanks Patrick.
Updated roadmap and next release content.

Patrick - suggest to send a reminder to review the design doc and collect
feedback.
Are there still known issues or gaps before we declare MKL-DNN integration
as GA?

Regards,
Steffen

On Sat, Sep 29, 2018 at 1:31 AM Zhao, Patric  wrote:

> Thanks, Steffen.
>
> Regarding the next release note, two items from our side:
>
> 1. (-remove) MKL-DNN integration is done. I think we can remove this item.
> 2. (+add) MKL-DNN based graph optimization and quantization by subgraph
> Design doc:
> 
https://cwiki.apache.org/confluence/display/MXNET/MXNet+Graph+Optimization+and+Quantization+based+on+subgraph+and+MKL-DNN
> Lead Contributor: Patric Zhao,  https://github.com/pengzhao-intel/
>
> Regarding the Roadmap
> (+add) Q1 2019: MKL-DNN RNN API supports
>
> BR,
>
> Thanks,
>
> --Patric
>
>
> > -Original Message-
> > From: kellen sunderland [mailto:kellen.sunderl...@gmail.com]
> > Sent: Saturday, September 29, 2018 11:31 AM
> > To: dev@mxnet.incubator.apache.org
> > Subject: Re: [Discuss] Next MXNet release
> >
> > Sorry I meant to say next 'Regarding the *minor* release'.
> >
> > On Sat, Sep 29, 2018 at 5:27 AM kellen sunderland <
> > kellen.sunderl...@gmail.com> wrote:
> >
> > > Thanks for transparently setting a rough timeline Steffen.  I think
> > > this will go a long way in helping the community plan their work, even
> > > if the details change somewhat on the road to the release.
> > >
> > > Regarding the major release: I would propose we unify TensorRT with
> > > the subgraph operator work.
> > >
> > > Regarding the patch release:  There were a few minor stack/buffer
> > > overflows exposed by ASAN that have been addressed.  It's probably a
> > > good idea to include them in a patch release, as they at best result
> > > in non-deterministic behaviour.
> > >
> > > -Kellen
> > >
> > >
> > > On Sat, Sep 29, 2018 at 1:39 AM Steffen Rochel
> > > 
> > > wrote:
> > >
> > >> I updated
> > >>
> > >> https://cwiki.apache.org/confluence/display/MXNET/Project+Proposals+f
> > >> or+next+MXNet+Release
> > >> ,
> > >> removed the completed items from 1.3 release and would like to kick
> > >> off discussion about the next release. Please suggest what you would
> > >> like to see included in the next release together with link to design
> > >> proposal (appropriately for the size and complexity of the proposal)
> > >> or suggest changes.
> > >> I suggest to target the next release for December 2018 to frame the
> > >> discussion.
> > >> Lets include review of
> > >> https://cwiki.apache.org/confluence/display/MXNET/MXNet+Roadmap -
> > >> time to update and discuss changes.
> > >>
> > >> From the 1.3 release we had discussion regarding
> > >> https://github.com/apache/incubator-mxnet/issues/11849 and resolution
> > >> in
> > >> https://github.com/apache/incubator-mxnet/pull/12412 .
> > >> Are you aware of critical issues and feedback from user which we
> > >> should consider for a potential 1.3.1 patch release. Should we
> > >> include PR 12412 in a potential patch release?
> > >>
> > >> Regards,
> > >> Steffen
> > >>
> > >
>





Re: Subscription

2018-10-01 Thread Naveen Swamy
Invited

On Mon, Oct 1, 2018 at 12:39 PM Jim Jagielski  wrote:

> I'd like an invite as well, please :)
>
> > On Sep 29, 2018, at 12:03 PM, Naveen Swamy  wrote:
> >
> > Invite sent. Welcome to Apache MXNet Cosmin :).
> >
> >
> > On Sat, Sep 29, 2018 at 11:38 AM Cosmin Cătălin Sanda <
> > cosmincata...@gmail.com> wrote:
> >
> >> Hi, I would like to subscribe to the ASF mxnet channel.
> >> 
> >> *Cosmin Catalin SANDA*
> >> Data Scientist & Engineer
> >> Phone: +45.27.30.60.35
> >> Web: https://cosminsanda.com
> >>
>
>


Re: Subscription

2018-10-01 Thread Jim Jagielski
I'd like an invite as well, please :)

> On Sep 29, 2018, at 12:03 PM, Naveen Swamy  wrote:
> 
> Invite sent. Welcome to Apache MXNet Cosmin :).
> 
> 
> On Sat, Sep 29, 2018 at 11:38 AM Cosmin Cătălin Sanda <
> cosmincata...@gmail.com> wrote:
> 
>> Hi, I would like to subscribe to the ASF mxnet channel.
>> 
>> *Cosmin Catalin SANDA*
>> Data Scientist & Engineer
>> Phone: +45.27.30.60.35
>> Web: https://cosminsanda.com
>>