Re: CI Pipeline Change Proposal

2020-03-31 Thread Joe Evans
Thanks everyone for your input. I've created an issue for tracking the increased sanity build time, but this should be treated as a separate project. https://github.com/apache/incubator-mxnet/issues/17945 In the meantime, to keep momentum going on the staggered build pipeline project, please let

Re: CI Pipeline Change Proposal

2020-03-27 Thread Marco de Abreu
You can use the docker cache images yourself. They're available on Dockerhub; you just have to tweak the docker run method. The thing is that the scripts CI uses are written with the intention that layers change and thus the cache is used. If you want to be able to change the layers, then you have to accept
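
To illustrate the idea of reusing the published cache image locally, here is a minimal sketch built around docker pull and docker build --cache-from; the image name and tag are placeholders, not the exact ones CI publishes:

    # Hypothetical sketch of reusing the CI Docker cache locally.
    # The image name/tag is a placeholder, not the exact one CI publishes.
    import subprocess

    CACHE_IMAGE = "mxnetci/build.ubuntu_cpu:latest"  # placeholder tag

    def build_with_ci_cache(dockerfile_dir: str, local_tag: str) -> None:
        """Pull the published cache image, then build locally reusing its layers."""
        subprocess.run(["docker", "pull", CACHE_IMAGE], check=True)
        subprocess.run(
            ["docker", "build",
             "--cache-from", CACHE_IMAGE,   # reuse layers from the pulled cache image
             "-t", local_tag,
             dockerfile_dir],
            check=True,
        )

    if __name__ == "__main__":
        build_with_ci_cache("ci/docker", "local/build.ubuntu_cpu")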

Re: CI Pipeline Change Proposal

2020-03-27 Thread Aaron Markham
Sure. That's the fix for now. But, I've noticed that when that's done and there's no process to enforce upgrades and patching, these get really out of date and the problems compound. Plus, when I build locally using docker, I can never seem to get the benefit of the cache. Or at least not in the

Re: CI Pipeline Change Proposal

2020-03-27 Thread Marco de Abreu
What about dependency pinning? The cache should not be our method for dependency pinning and synchronization. -Marco Aaron Markham wrote on Fri., Mar 27, 2020, 03:45: > I'm dealing with a Ruby dep breaking the site build right now. > I wish this would happen on an occasion that I choose, not
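
A minimal sketch of the pinning idea, using Python packages as a stand-in (the breakage in this thread was a Ruby gem; the package names and versions below are placeholders):

    # Check installed package versions against a pinned manifest so that a new
    # upstream release cannot silently change what CI runs against.
    from importlib.metadata import version, PackageNotFoundError

    PINNED = {
        "numpy": "1.18.2",     # placeholder pins
        "requests": "2.23.0",
    }

    def check_pins(pins: dict) -> list:
        """Return (name, pinned, installed) for every package that drifted."""
        mismatches = []
        for name, wanted in pins.items():
            try:
                installed = version(name)
            except PackageNotFoundError:
                installed = None
            if installed != wanted:
                mismatches.append((name, wanted, installed))
        return mismatches

    if __name__ == "__main__":
        for name, wanted, installed in check_pins(PINNED):
            print(f"{name}: pinned {wanted}, installed {installed}")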

Re: CI Pipeline Change Proposal

2020-03-26 Thread Aaron Markham
I'm dealing with a Ruby dep breaking the site build right now. I wish this would happen on an occasion that I choose, not when Ruby or dependency x releases a new version. When the cache expires for Jekyll, the site won't publish anymore... and CI will be blocked for the website test. If we built the

Re: CI Pipeline Change Proposal

2020-03-26 Thread Marco de Abreu
Correct. But I'm surprised it takes 2m50s to pull down the images. Maybe it makes sense to use ECR as a mirror? -Marco Joe Evans wrote on Thu., Mar 26, 2020, 22:02: > +1 on rebuilding the containers regularly without caching layers. > > We are both pulling down a bunch of docker layers (when
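
A rough sketch of what mirroring an image into ECR could look like, assuming boto3 and the docker CLI are available on the host; the repository and image names are placeholders:

    # Mirror a Docker Hub image into ECR so CI agents pull from a same-region registry.
    import base64
    import subprocess
    import boto3

    def mirror_to_ecr(source_image: str, repo_name: str, region: str = "us-west-2") -> str:
        ecr = boto3.client("ecr", region_name=region)

        # Ensure the target repository exists.
        try:
            ecr.create_repository(repositoryName=repo_name)
        except ecr.exceptions.RepositoryAlreadyExistsException:
            pass

        # Log docker in to ECR with a temporary authorization token.
        auth = ecr.get_authorization_token()["authorizationData"][0]
        user, password = base64.b64decode(auth["authorizationToken"]).decode().split(":")
        registry = auth["proxyEndpoint"].replace("https://", "")
        subprocess.run(["docker", "login", "-u", user, "-p", password, registry], check=True)

        # Pull from Docker Hub, retag, and push to the ECR mirror.
        target = f"{registry}/{repo_name}:latest"
        subprocess.run(["docker", "pull", source_image], check=True)
        subprocess.run(["docker", "tag", source_image, target], check=True)
        subprocess.run(["docker", "push", target], check=True)
        return target

    if __name__ == "__main__":
        mirror_to_ecr("mxnetci/build.ubuntu_cpu:latest", "mxnet-ci-cache-mirror")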

Re: CI Pipeline Change Proposal

2020-03-26 Thread Joe Evans
+1 on rebuilding the containers regularly without caching layers. We are both pulling down a bunch of docker layers (when docker pulls an image) and then building a new container to run the sanity build in. Pulling down all the layers is what is taking so long (2m50s). Within the docker build,

Re: CI Pipeline Change Proposal

2020-03-26 Thread Marco de Abreu
The job which rebuilds the cache has a property where you can set whether to rebuild the cache from scratch or not. You could duplicate that job, disable publishing, and enable the rebuild. Then add an alarm to the result and you should be golden. -Marco Lausen, Leonard wrote on Thu., Mar 26
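
A minimal sketch of such a scheduled rebuild-from-scratch check, under the assumption that the CI Dockerfiles live in ci/docker and follow a Dockerfile.build.* naming scheme; this is not the actual Jenkins job configuration:

    # Build every CI Dockerfile with --no-cache and fail loudly if any of them
    # breaks, so a scheduled job can alarm on the exit code.
    import pathlib
    import subprocess
    import sys

    DOCKER_DIR = pathlib.Path("ci/docker")  # assumed location of the CI Dockerfiles

    def rebuild_all_from_scratch() -> list:
        failures = []
        for dockerfile in sorted(DOCKER_DIR.glob("Dockerfile.build.*")):
            tag = f"scratch-check/{dockerfile.name.split('.')[-1]}"
            result = subprocess.run(
                ["docker", "build", "--no-cache",   # ignore any cached layers
                 "-f", str(dockerfile), "-t", tag, str(DOCKER_DIR)],
            )
            if result.returncode != 0:
                failures.append(dockerfile.name)
        return failures

    if __name__ == "__main__":
        broken = rebuild_all_from_scratch()
        if broken:
            print("Builds broken without cache:", ", ".join(broken))
            sys.exit(1)  # non-zero exit lets the scheduled job trigger an alarm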

Re: CI Pipeline Change Proposal

2020-03-26 Thread Lausen, Leonard
WRT the Docker cache: we need to add a mechanism to invalidate the cache and rebuild the containers on a set schedule. The builds break too often, and the breakage is only detected when a contributor touches the Dockerfiles (manually causing cache invalidation). On Thu, 2020-03-26 at 16:06 -0400, Aaron

Re: CI Pipeline Change Proposal

2020-03-26 Thread Aaron Markham
I think it is a good idea to do the sanity check first. Even at 10 minutes. And also try to fix the docker cache situation, but those can be separate tasks. On Thu, Mar 26, 2020, 12:52 Marco de Abreu wrote: > Jenkins doesn't load for me, so let me ask this way: are we actually > rebuilding

Re: CI Pipeline Change Proposal

2020-03-26 Thread Marco de Abreu
Jenkins doesn't load for me, so let me ask this way: are we actually rebuilding every single time, or do you mean the docker cache? Pulling the cache should only take a few seconds in my experience - docker build should be a no-op in most cases. -Marco Joe Evans wrote on Thu., Mar 26, 2020,

Re: CI Pipeline Change Proposal

2020-03-26 Thread Joe Evans
The sanity-lint check pulls a docker image cache, builds a new container and runs inside. The docker setup is taking around 3 minutes, at least: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fsanity/detail/master/1764/pipeline/39 We could improve this by not

Re: CI Pipeline Change Proposal

2020-03-26 Thread Marco de Abreu
Do you know what's driving the duration for sanity? It used to be 50 sec execution and 60 sec preparation. -Marco Joe Evans wrote on Thu., Mar 26, 2020, 20:31: > Thanks Marco and Aaron for your input. > > > Can you show by how much the duration will increase? > > The average sanity build

Re: CI Pipeline Change Proposal

2020-03-26 Thread Joe Evans
Thanks Marco and Aaron for your input. > Can you show by how much the duration will increase? The average sanity build time is around 10min, while the average build time for unix-cpu is about 2 hours, so the entire build pipeline would increase by 2 hours if we required both unix-cpu and sanity
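
A toy calculation behind the figures quoted above, assuming the rest of the pipeline waits on whichever gating stage finishes last (the exact gating scheme under discussion is an assumption here):

    # Averages quoted in this mail, in minutes.
    SANITY_MIN = 10       # average sanity build time
    UNIX_CPU_MIN = 120    # average unix-cpu build time

    # If only sanity gates the rest of the pipeline, it adds roughly 10 minutes.
    print("gate on sanity only:", SANITY_MIN, "min added")

    # If both sanity and unix-cpu must pass first (running in parallel with each
    # other), the rest waits for the slower of the two: roughly 2 hours.
    print("gate on sanity + unix-cpu:", max(SANITY_MIN, UNIX_CPU_MIN), "min added")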

Re: CI Pipeline Change Proposal

2020-03-25 Thread Marco de Abreu
Back then, I created a system which exports all Jenkins results to CloudWatch. It does not include individual test results, but rather stages and jobs. The data for the sanity check should be available there. Something I'd also be curious about is the percentage of failures in a single run.
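
For context, a sketch of how per-stage results can be pushed to CloudWatch for this kind of failure-rate analysis; the namespace and dimension names are placeholders, not the ones the existing exporter uses:

    # Emit one data point per stage run; averaging the metric over time gives
    # the failure rate per stage.
    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-west-2")

    def record_stage_result(job: str, stage: str, passed: bool) -> None:
        """Emit 1 for a failed stage run, 0 for a passing one."""
        cloudwatch.put_metric_data(
            Namespace="MXNetCI/Jenkins",   # placeholder namespace
            MetricData=[{
                "MetricName": "StageFailed",
                "Dimensions": [
                    {"Name": "Job", "Value": job},
                    {"Name": "Stage", "Value": stage},
                ],
                "Value": 0.0 if passed else 1.0,
                "Unit": "Count",
            }],
        )

    record_stage_result("mxnet-validation/sanity", "Sanity Check", passed=False)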

Re: CI Pipeline Change Proposal

2020-03-25 Thread Aaron Markham
+1 for sanity check - that's fast. -1 for unix-cpu - that's slow and can just hang. So my suggestion would be to look at the data separately - what's the failure rate on the sanity check, and on unix-cpu? Actually, can we get a table of all of the tests with this data?! If the sanity check fails... let's

Re: CI Pipeline Change Proposal

2020-03-25 Thread Marco de Abreu
We had this structure in the past, and the community was bothered by CI taking more time, so we moved to the current model with everything parallelized. We'd basically be reverting that, then. Can you show by how much the duration will increase? Also, we have zero test parallelisation, meaning we are

CI Pipeline Change Proposal

2020-03-24 Thread Joe Evans
Hi, First, I just wanted to introduce myself to the MXNet community. I’m Joe and will be working with Chai and the AWS team to improve some issues around MXNet CI. One of our goals is to reduce the costs associated with running MXNet CI. The task I’m working on now is this issue: