Report of last night's test run of master (note that I patched master so that the duplicate process killer doesn't really kill the process).
I notice that on a *Celery* worker, after *exactly 1 hour*, a new process gets started and everything gets confused. Also note that it *doesn't happen with the local runner*. For now, my plan is to:

- enhance logging to log to Stackdriver and add extra logging information to troubleshoot (private branch for now)
- dive some more into the scheduler/worker

My hunch is that the worker starts some process after an hour and starts up a new task (*if anyone has an idea?!*), or the scheduler thinks the sensor is dead after one hour...

Here are log extracts:

[2017-01-04 00:00:10,172] {models.py:168} INFO - Filling up the DagBag from /home/airflow/dags/user_product_interaction.py
[2017-01-04 00:00:11,500] {jobs.py:2012} INFO - Subprocess PID is 87
[2017-01-04 00:00:15,474] {models.py:168} INFO - Filling up the DagBag from /home/airflow/dags/user_product_interaction.py
[2017-01-04 00:00:17,088] {models.py:1062} INFO - Dependencies all met for <TaskInstance: user-product-interactions.wait-for-orders 2017-01-03 00:00:00 [queued]>
[2017-01-04 00:00:17,126] {models.py:1062} INFO - Dependencies all met for <TaskInstance: user-product-interactions.wait-for-orders 2017-01-03 00:00:00 [queued]>
[2017-01-04 00:00:17,127] {models.py:1250} INFO -
--------------------------------------------------------------------------------
Starting attempt 1 of 1
--------------------------------------------------------------------------------
[2017-01-04 00:00:17,205] {models.py:1273} INFO - Executing <Task(GoogleCloudStorageObjectSensor): wait-for-orders> on 2017-01-03 00:00:00

Exactly 1 hour later, with lots of messages in between:

[2017-01-04 01:00:42,077] {transport.py:151} INFO - Attempting refresh to obtain initial access_token
[2017-01-04 01:00:42,126] {client.py:795} INFO - Refreshing access_token
[2017-01-04 01:01:26,425] {models.py:168} INFO - Filling up the DagBag from /home/airflow/dags/user_product_interaction.py
[2017-01-04 01:01:28,620] {jobs.py:2012} INFO - Subprocess PID is 244
[2017-01-04 01:01:32,663] {models.py:168} INFO - Filling up the DagBag from /home/airflow/dags/user_product_interaction.py
[2017-01-04 01:01:33,527] {jobs.py:2081} WARNING - Recorded hostname and pid of airflow-worker-1705741-9ncug and 244 do not match this instance's which are airflow-worker-1705741-9ncug and 87. Taking the poison pill. So long.
[2017-01-04 01:01:35,134] {models.py:1059} WARNING - Dependencies not met for <TaskInstance: user-product-interactions.wait-for-orders 2017-01-03 00:00:00 [running]>, dependency 'Task Instance Not Already Running' FAILED: Task is already running, it started on 2017-01-04 00:00:17.088903.
[2017-01-04 01:01:38,592] {jobs.py:2081} WARNING - Recorded hostname and pid of airflow-worker-1705741-9ncug and 244 do not match this instance's which are airflow-worker-1705741-9ncug and 87. Taking the poison pill. So long.

And probably from the other process that starts up:

[2017-01-04 01:01:42,393] {gcp_api_base_hook.py:81} INFO - Getting connection using a JSON key file.
[2017-01-04 01:01:42,417] {discovery.py:852} INFO - URL being requested: GET https://www.googleapis.com/storage/v1/b/vex-eu-data/o/datasets%2Fmarker%2Fexport%2F2017%2F01%2F04%2F_orders20170101?alt=json
[2017-01-04 01:01:42,417] {transport.py:151} INFO - Attempting refresh to obtain initial access_token
[2017-01-04 01:01:42,465] {client.py:795} INFO - Refreshing access_token
[2017-01-04 01:01:43,602] {jobs.py:2081} WARNING - Recorded hostname and pid of airflow-worker-1705741-9ncug and 244 do not match this instance's which are airflow-worker-1705741-9ncug and 87. Taking the poison pill. So long.
[2017-01-04 01:01:48,628] {jobs.py:2081} WARNING - Recorded hostname and pid of airflow-worker-1705741-9ncug and 244 do not match this instance's which are airflow-worker-1705741-9ncug and 87. Taking the poison pill. So long.
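
To put a number on the "exactly 1 hour" gap, here is a minimal Python sketch that picks the two "Subprocess PID is" lines out of the extracts and computes the time between them. The helper name `subprocess_starts` and the regular expression are my own illustration and not part of Airflow; the two log lines are copied from the extracts in this mail.

```python
import re
from datetime import datetime

# Two lines copied verbatim from the log extracts in this mail.
LOG_LINES = [
    "[2017-01-04 00:00:11,500] {jobs.py:2012} INFO - Subprocess PID is 87",
    "[2017-01-04 01:01:28,620] {jobs.py:2012} INFO - Subprocess PID is 244",
]

# Hypothetical pattern for this log layout: "[timestamp] ... Subprocess PID is N".
PATTERN = re.compile(
    r"^\[(?P<ts>[\d\- :,]+)\] .* Subprocess PID is (?P<pid>\d+)$"
)


def subprocess_starts(lines):
    """Return a list of (timestamp, pid) for every 'Subprocess PID is N' line."""
    starts = []
    for line in lines:
        match = PATTERN.match(line)
        if match:
            # The logger writes milliseconds after a comma, which %f accepts.
            ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S,%f")
            starts.append((ts, int(match.group("pid"))))
    return starts


starts = subprocess_starts(LOG_LINES)
gap = starts[1][0] - starts[0][0]
print("PIDs:", [pid for _, pid in starts])
print("Time between subprocess starts:", gap)
```

With the timestamps as logged, the second subprocess (PID 244) starts 1:01:17 after the first (PID 87), so the new process appears a little over an hour into the run.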

For reference, here is the full log (note that it is confusing, probably because logs for different processes are appended and uploaded to Cloud Storage):
https://storage.googleapis.com/vex-eu-data/airflow/default/logs/user-product-interactions/wait-for-orders/2017-01-03T00%3A00%3A00

On Tue, Jan 3, 2017 at 8:34 PM Chris Riccomini <criccom...@apache.org> wrote:

> Hey Bolke,
>
> Thanks for taking this on. I'm definitely up for running stuff in our
> environments to verify everything is working.
>
> Can I ask that you create a 1.8 alpha 1 branch in the git repo? This will
> make it easier for us to track what changes are getting cherry picked into
> the branch, and will also make it easier for users to pip install, if they
> want to do so via github.
>
> Also, yea, when we switch to beta, we need to stop merging anything other
> than bug fixes into the release branch.
>
> Cheers,
> Chris
>
> On Tue, Jan 3, 2017 at 10:31 AM, Dan Davydov <dan.davy...@airbnb.com.invalid> wrote:
>
> > All very reasonable to me, one reason we may not have hit the bugs in our
> > production is because we are running off a different merge base and our
> > cherries aren't 1-1 with what we are running in production (we still test
> > them but we can't run them in production), that being said I don't think I
> > authored the commits you are referring to so I don't have full context.
> >
> > On Tue, Jan 3, 2017 at 1:27 PM, Bolke de Bruin <bdbr...@gmail.com> wrote:
> >
> > > Hi Dan et al,
> > >
> > > That sounds good to me, however I will be pretty critical of the changes
> > > in the scheduler and the cleanliness of the patches. This is due to the
> > > fact I have been chasing quite some bugs in master that were pretty hard to
> > > track down even with a debugger at hand. I’m surprised that those didn’t
> > > pop up in your production or maybe I am concerned ;-). Anyways, I hope you
> > > understand I might be a bit picky in understanding and needing (design)
> > > documentation for some of the changes.
> > >
> > > What I would like to suggest is that for the Alpha versions we still
> > > accept “new” features so these PRs can get in, but from Beta we will not
> > > accept new features anymore. For new features in the area of the scheduler
> > > an integration DummyDag should be supplied, so others can test the
> > > behaviour. Does this sound ok?
> > >
> > > My list of open code items for a release looks now like this:
> > >
> > > Blockers
> > > * one_failed not honoured
> > > * Alex’s sensor issue
> > >
> > > New features:
> > > * Schedule all pending DAGs in a single loop
> > > * Add support for backfill true/false
> > > * Impersonation
> > > * CGroups
> > > * Add Cloud Storage updated sensor
> > >
> > > Alpha2 I will package tomorrow. Packages are signed now by my apache.org
> > > key. Please verify and let me know if something is off. I’m still
> > > waiting for access to the incubating dist repository.
> > >
> > > Bolke
> > >
> > > > On 3 Jan 2017, at 14:38, Dan Davydov <dan.davy...@airbnb.com.INVALID> wrote:
> > > >
> > > > I have also started on this effort, recently Alex Guziel and I have been
> > > > pushing Airbnb's custom cherries onto master to get Airbnb back onto master
> > > > in order for us to do a release.
> > > >
> > > > I think it might make sense to wait for these two commits to get merged in
> > > > since they would be quite nice to have for all Airflow users and seem like
> > > > they will be merged soon:
> > > > Schedule all pending DAG runs in a single scheduler loop -
> > > > https://github.com/apache/incubator-airflow/pull/1906
> > > > Add Support for dag.backfill=(True|False) Option -
> > > > https://github.com/apache/incubator-airflow/pull/1830
> > > > Impersonation Support + Cgroups -
> > > > https://github.com/apache/incubator-airflow/pull/1934 (this is kind of
> > > > important from the Airbnb side so that we can help test the new master
> > > > without having to cherrypick this PR on top of it which would make the
> > > > testing unreliable for others).
> > > >
> > > > If there are PRs that affect the core of Airflow that other committers
> > > > think are important to merge we could include these too. I can commit to
> > > > pushing out the Impersonation/Cgroups PR this week pending PR comments.
> > > > What do you think Bolke?
> > > >
> > > > On Tue, Jan 3, 2017 at 4:26 AM, Bolke de Bruin <bdbr...@gmail.com> wrote:
> > > >
> > > >> Hey Alex,
> > > >>
> > > >> I have noticed the same, and it is also the reason why we have Alpha
> > > >> versions. For now I have noticed the following:
> > > >>
> > > >> * Tasks can get in limbo between scheduler and executor:
> > > >> https://github.com/apache/incubator-airflow/pull/1948
> > > >> * Try_number not increased due to reset in LocalTaskJob:
> > > >> https://github.com/apache/incubator-airflow/pull/1969
> > > >> * one_failed trigger not executed
> > > >>
> > > >> My idea is to move to a Samba style of releases eventually, but for now I
> > > >> would like to get master into a state that we understand and therefore not
> > > >> accept any patches that do not address any bugs.
> > > >>
> > > >> If you (or anyone else) can review the above PRs and add your own as well
> > > >> then I can create another Alpha version. I’ll be on gitter as much as I can
> > > >> so we can speed up if needed.
> > > >>
> > > >> - Bolke
> > > >>
> > > >>> On 3 Jan 2017, at 08:51, Alex Van Boxel <a...@vanboxel.be> wrote:
> > > >>>
> > > >>> Hey Bolke,
> > > >>>
> > > >>> thanks for getting this moving. But I already have some blockers, since I
> > > >>> moved up master to this release (moved from end November to now) stability
> > > >>> has gone down (certainly on Celery). I'm trying to identify the core
> > > >>> problems and see if I can fix them.
> > > >>>
> > > >>> On Sat, Dec 31, 2016 at 9:52 PM Bolke de Bruin <bdbr...@gmail.com> wrote:
> > > >>>
> > > >>> Dear All,
> > > >>>
> > > >>> On the verge of the New Year, I decided to be a little bit cheeky and to
> > > >>> make available an Airflow 1.8.0 Alpha 1. We have been talking about it for
> > > >>> a long time now and by doing this I wanted to bootstrap the process. It
> > > >>> should by no means be considered an Apache release yet. This is for testing
> > > >>> purposes in the dev community around Airflow, nothing else.
> > > >>>
> > > >>> The build is exactly the same as the state of master (git 410736d) plus the
> > > >>> change to version “1.8.0.alpha1” in version.py.
> > > >>>
> > > >>> I am dedicating quite some time next week and beyond to get a release out.
> > > >>> Hopefully we can get some help with testing, changelog etc. To make this
> > > >>> possible I would like to propose a freeze to adding new features for at
> > > >>> least two weeks - say until Jan 15.
> > > >>>
> > > >>> You can find the tar here: http://people.apache.org/~bolke/ .
> > > >>> It isn’t signed. Following versions will be. SHA is available.
> > > >>>
> > > >>> Lastly, Alpha 1 does not have the fix for retries yet. So we will get an
> > > >>> Alpha 2 :-). @Max / @Dan / @Paul: a potential fix is in
> > > >>> https://github.com/apache/incubator-airflow/pull/1948 , but your feedback
> > > >>> is required as it is entrenched in new processing code that you are running
> > > >>> in production afaik - so I wonder what happens in your fork.
> > > >>>
> > > >>> Happy New Year!
> > > >>>
> > > >>> Bolke
> > > >>>
> > > >>> --
> > > >>> _/
> > > >>> _/ Alex Van Boxel
> > >
> >
>
--
_/
_/ Alex Van Boxel