A quick update: unfortunately we saw DAGBag parsing times increase (~10x for some DAGs) on the webservers with 1.7.1rc3. Because of this I will be working on a staging cluster that has a copy of our production DAGBag and mirrors our production Airflow infrastructure, just without the workers. This will let us debug the release outside of production.
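[Editor's note: to make the parse-time comparison concrete, here is a minimal standalone timing sketch. It only parses/compiles each DAG file with the builtin compile(); Airflow's DagBag actually executes DAG files, so real numbers will differ, but a tool like this is enough to compare two release candidates side by side. The function name is illustrative, not part of Airflow.]

```python
# Hypothetical sketch: time how long each .py file under a DAG folder
# takes to parse. compile() measures parse + bytecode cost only; it does
# NOT execute module-level code the way Airflow's DagBag does.
import os
import time

def time_dag_parsing(dag_folder):
    """Return {path: seconds} for compiling each .py file under dag_folder."""
    timings = {}
    for root, _dirs, files in os.walk(dag_folder):
        for name in files:
            if not name.endswith(".py"):
                continue
            path = os.path.join(root, name)
            with open(path) as f:
                source = f.read()
            start = time.perf_counter()
            compile(source, path, "exec")  # parse only, no execution
            timings[path] = time.perf_counter() - start
    return timings

if __name__ == "__main__":
    import tempfile
    # Demo on a throwaway folder containing one trivial file.
    with tempfile.TemporaryDirectory() as d:
        with open(os.path.join(d, "example_dag.py"), "w") as f:
            f.write("x = 1\n")
        for path, secs in sorted(time_dag_parsing(d).items()):
            print(path, round(secs, 6))
```

Running this against the same DAG folder on the rc2 and rc3 webserver hosts would show whether the ~10x regression is in parsing itself or in what the files do at import time.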
On Thu, Apr 28, 2016 at 10:20 AM, Dan Davydov <dan.davy...@airbnb.com> wrote:
> Definitely, here were the issues we hit:
> - airbnb/airflow#1365 occurred
> - Webservers/scheduler were timing out and stuck in restart cycles due to
>   increased time spent parsing DAGs (airbnb/airflow#1213/files)
> - Failed tasks that ran after the upgrade and the revert (after we
>   reverted the upgrade) could not be cleared (but running the tasks
>   through the UI worked without clearing them)
> - The way log files are stored on S3 was changed (Airflow now requires a
>   connection to be set up), which broke log storage
> - Some DAGs were broken (unable to be parsed) because the package
>   reorganization in open-source changed the import paths (the utils
>   refactor commit)
>
> On Thu, Apr 28, 2016 at 12:17 AM, Bolke de Bruin <bdbr...@gmail.com> wrote:
>
>> Dan,
>>
>> Are you able to share some of the bugs you have been hitting and the
>> connected commits?
>>
>> We could at the very least learn from them and maybe even improve testing.
>>
>> Bolke
>>
>>> On Apr 28, 2016, at 06:51, Dan Davydov
>>> <dan.davy...@airbnb.com.INVALID> wrote:
>>>
>>> All of the blockers were fixed as of yesterday (there was some issue
>>> that Jeremiah was looking at with the last release candidate, which I
>>> think is fixed, but I'm not sure). I started staging the airbnb_1.7.1rc3
>>> tag earlier today, so as long as metrics look OK and the 1.7.1rc2 issues
>>> seem resolved tomorrow, I will release internally either tomorrow or
>>> Monday (we try to avoid releases on Fridays). If there aren't any issues,
>>> we can push the 1.7.1 tag on Monday/Tuesday.
>>>
>>> @Sid
>>> I think we were originally aiming to deploy internally once every two
>>> weeks, but we decided to do it once a month in the end. I'm not too sure
>>> about that, so Max can comment there.
>>>
>>> We have been running 1.7.0 in production for about a month now and it
>>> is stable.
>>>
>>> I think what really slowed down this release cycle is some commits that
>>> caused severe bugs that we decided to roll forward with instead of
>>> rolling back. Next time we can potentially try reverting such commits
>>> while the fixes are prepared for the next version, although this is not
>>> always trivial to do.
>>>
>>> On Wed, Apr 27, 2016 at 9:31 PM, Siddharth Anand
>>> <siddharthan...@yahoo.com.invalid> wrote:
>>>
>>>> Btw, are any of the committers running 1.7.0 or later in any staging
>>>> or production env? I have to say that the fact that 1.6.2, now four or
>>>> more months old, remains our most stable release does not say much for
>>>> our release cadence or process. What's our plan for 1.7.1?
>>>>
>>>> Sent from Sid's iPhone
>>>>
>>>>> On Apr 27, 2016, at 9:05 PM, Chris Riccomini <criccom...@apache.org>
>>>>> wrote:
>>>>>
>>>>> Hey all,
>>>>>
>>>>> I just wanted to check in on the 1.7.1 release status. I know there
>>>>> have been some major-ish bugs, as well as several people doing tests.
>>>>> Should we create a 1.7.1 release JIRA and track outstanding issues
>>>>> there?
>>>>>
>>>>> Cheers,
>>>>> Chris
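[Editor's note: for the import-path breakage mentioned in the thread (DAGs failing to parse after the utils refactor), the usual workaround in DAG files is a try/except compatibility shim. This sketch illustrates the pattern only; the module pair below stands in for whatever path the refactor actually moved, and is not a real Airflow path.]

```python
# Illustrative compatibility shim for a renamed module path.
# shlex.quote / pipes.quote are stand-ins for a "new path" / "old path"
# pair; substitute the actual modules changed by the refactor.
try:
    from shlex import quote as shell_quote   # pretend this is the new path
except ImportError:
    from pipes import quote as shell_quote   # pretend this is the old path

# The rest of the DAG file uses the one name regardless of Airflow version.
print(shell_quote("a b"))
```

A shim like this lets the same DAG file parse under both the pre- and post-refactor release, which matters when rolling forward and back between release candidates as described above.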