Re: Scheduler silently dies

2017-03-25 Thread Bolke de Bruin
In case you *think* you have encountered a scheduler *hang*, please provide an
strace of the parent process, process list output that shows defunct
scheduler processes, and *all* logging (main logs, scheduler processing
log, task logs), preferably in debug mode (settings.py). Also show memory
limits, CPU count and airflow.cfg.

Thanks
Bolke
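
A minimal sketch of how the requested process list could be checked for defunct scheduler children, assuming a Linux `ps` that accepts `-eo pid,ppid,stat,args`. The function names are hypothetical and not part of Airflow; the parent PID is whatever PID your scheduler runs under.

```python
import subprocess


def find_defunct_children(ps_output, parent_pid):
    """Return (pid, command) pairs for zombie processes whose parent is parent_pid.

    Expects the text produced by: ps -eo pid,ppid,stat,args
    """
    defunct = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 3)
        if len(fields) < 3:
            continue
        pid, ppid, stat = fields[0], fields[1], fields[2]
        command = fields[3] if len(fields) > 3 else ""
        # A STAT value starting with "Z" marks a zombie (defunct) process.
        if ppid == str(parent_pid) and stat.startswith("Z"):
            defunct.append((int(pid), command))
    return defunct


def collect_process_list():
    """Run ps and return its raw output so it can be attached to a report."""
    return subprocess.check_output(
        ["ps", "-eo", "pid,ppid,stat,args"], universal_newlines=True
    )
```

Calling `find_defunct_children(collect_process_list(), scheduler_pid)` should list the zombie children to include in the report, alongside the strace and the logs.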


> On 25 Mar 2017, at 18:16, Bolke de Bruin  wrote:
> 
> Please specify what “stop doing its job” means. It doesn’t log anything 
> anymore? If it does, the scheduler hasn’t died and hasn’t stopped.
> 
> B.
> 
> 
>> On 24 Mar 2017, at 18:20, Gael Magnan  wrote:
>> 
>> We encountered the same kind of problem with the scheduler that stopped
>> doing its job even after rebooting. I thought changing the start date or
>> the state of a task instance might be to blame but I've never been able to
>> pinpoint the problem either.
>> 
>> We are using celery and docker if it helps.
>> 
>> On Sat, 25 Mar 2017 at 01:53, Bolke de Bruin wrote:
>> 
>>> We are running *without* num runs for over a year (and never have). It is
>>> a very elusive issue which has not been reproducible.
>>> 
>>> I'd like more info on this but it needs to be very elaborate, even to the
>>> point of access to the system exposing the behavior.
>>> 
>>> Bolke
>>> 
>>> Sent from my iPhone
>>> 
>>>> On 24 Mar 2017, at 16:04, Vijay Ramesh wrote:
>>>> 
>>>> We literally have a cron job that restarts the scheduler every 30 min. Num
>>>> runs didn't work consistently in rc4, sometimes it would restart itself and
>>>> sometimes we'd end up with a few zombie scheduler processes and things
>>>> would get stuck. Also running locally, without celery.
>>>> 
>>>>> On Mar 24, 2017 16:02,  wrote:
>>>>> 
>>>>> We have max runs set and still hit this. Our solution is dumber:
>>>>> monitoring log output, and kill the scheduler if it stops emitting. Works
>>>>> like a charm.
>>>>> 
>>>>>> On Mar 24, 2017, at 5:50 PM, F. Hakan Koklu wrote:
>>>>>> 
>>>>>> Some solutions to this problem are restarting the scheduler frequently or
>>>>>> some sort of monitoring on the scheduler. We have set up a dag that pings
>>>>>> cronitor  (a dead man's snitch type of service) every
>>>>>> 10 minutes and the snitch pages you when the scheduler dies and does not
>>>>>> send a ping to it.
>>>>>> 
>>>>>> On Fri, Mar 24, 2017 at 1:49 PM, Andrew Phillips <aphill...@qrmedia.com>
>>>>>> wrote:
>>>>>> 
>>>>>>> We use celery and run into it from time to time.
>>>>>>> 
>>>>>>> Bang goes my theory ;-) At least, assuming it's the same underlying
>>>>>>> cause...
>>>>>>> 
>>>>>>> Regards
>>>>>>> 
>>>>>>> ap



Re: Scheduler silently dies

2017-03-25 Thread Bolke de Bruin
Please specify what “stop doing its job” means. It doesn’t log anything 
anymore? If it does, the scheduler hasn’t died and hasn’t stopped.

B.
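
One way to check whether the scheduler is still logging, as a minimal sketch: compare the modification time of a scheduler log file with the current time. The function names, the log path you pass in, and the 300-second threshold are assumptions for illustration, not Airflow defaults.

```python
import os
import time


def seconds_since_last_write(log_path, now=None):
    """Return how many seconds ago log_path was last modified."""
    now = time.time() if now is None else now
    return now - os.path.getmtime(log_path)


def scheduler_looks_silent(log_path, max_silence_seconds=300, now=None):
    """True if nothing has been written to log_path within the threshold.

    A silent log is only a hint that the scheduler may have stopped; a
    scheduler that is still writing log lines has not died.
    """
    return seconds_since_last_write(log_path, now) > max_silence_seconds
```

If `scheduler_looks_silent(...)` returns False, the scheduler is still emitting log lines, and the question becomes what exactly it has stopped doing.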


> On 24 Mar 2017, at 18:20, Gael Magnan  wrote:
> 
> We encountered the same kind of problem with the scheduler that stopped
> doing its job even after rebooting. I thought changing the start date or
> the state of a task instance might be to blame but I've never been able to
> pinpoint the problem either.
> 
> We are using celery and docker if it helps.
> 
> On Sat, 25 Mar 2017 at 01:53, Bolke de Bruin wrote:
> 
>> We are running *without* num runs for over a year (and never have). It is
>> a very elusive issue which has not been reproducible.
>> 
>> I'd like more info on this but it needs to be very elaborate, even to the
>> point of access to the system exposing the behavior.
>> 
>> Bolke
>> 
>> Sent from my iPhone
>> 
>>> On 24 Mar 2017, at 16:04, Vijay Ramesh wrote:
>>> 
>>> We literally have a cron job that restarts the scheduler every 30 min. Num
>>> runs didn't work consistently in rc4, sometimes it would restart itself and
>>> sometimes we'd end up with a few zombie scheduler processes and things
>>> would get stuck. Also running locally, without celery.
>>> 
>>>> On Mar 24, 2017 16:02,  wrote:
>>>> 
>>>> We have max runs set and still hit this. Our solution is dumber:
>>>> monitoring log output, and kill the scheduler if it stops emitting. Works
>>>> like a charm.
>>>> 
>>>>> On Mar 24, 2017, at 5:50 PM, F. Hakan Koklu wrote:
>>>>> 
>>>>> Some solutions to this problem are restarting the scheduler frequently or
>>>>> some sort of monitoring on the scheduler. We have set up a dag that pings
>>>>> cronitor  (a dead man's snitch type of service) every
>>>>> 10 minutes and the snitch pages you when the scheduler dies and does not
>>>>> send a ping to it.
>>>>> 
>>>>> On Fri, Mar 24, 2017 at 1:49 PM, Andrew Phillips <aphill...@qrmedia.com>
>>>>> wrote:
>>>>> 
>>>>>> We use celery and run into it from time to time.
>>>>>> 
>>>>>> Bang goes my theory ;-) At least, assuming it's the same underlying
>>>>>> cause...
>>>>>> 
>>>>>> Regards
>>>>>> 
>>>>>> ap



Re: Scheduler silently dies

2017-03-25 Thread Bolke de Bruin
Hi Harish,

The below does *not* indicate a scheduler hang, it is a valid exception as 
mentioned earlier.

Bolke.
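
The trace quoted below ends in `handle_timeout` in `airflow/utils/timeout.py`, raised while code in `collections.py` was executing. That is consistent with a signal-based timeout interrupting a DAG file that takes too long to import. A minimal sketch of that kind of mechanism, assuming a Unix system with `SIGALRM` and code running in the main thread; the `timeout` class and `ImportTimeout` exception here are illustrative stand-ins, not Airflow's actual code.

```python
import signal


class ImportTimeout(Exception):
    """Hypothetical stand-in for the AirflowTaskTimeout seen in the trace."""


class timeout(object):
    """Context manager that raises ImportTimeout after `seconds` have elapsed."""

    def __init__(self, seconds, error_message="Timeout"):
        self.seconds = seconds
        self.error_message = error_message
        self._previous = None

    def handle_timeout(self, signum, frame):
        # Runs in the main thread, on top of whatever code was executing
        # when the alarm fired, so the traceback ends in unrelated code.
        raise ImportTimeout(self.error_message)

    def __enter__(self):
        self._previous = signal.signal(signal.SIGALRM, self.handle_timeout)
        signal.setitimer(signal.ITIMER_REAL, self.seconds)
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Cancel the pending alarm and restore the earlier handler.
        signal.setitimer(signal.ITIMER_REAL, 0)
        previous = signal.SIG_DFL if self._previous is None else self._previous
        signal.signal(signal.SIGALRM, previous)
        return False
```

Wrapping a slow import in `with timeout(30): ...` would produce an exception of this shape without the surrounding process hanging, which is why such a trace on its own does not indicate a scheduler hang.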

> On 24 Mar 2017, at 19:07, harish singh  wrote:
> 
> We have been using 1.7 for over a year and never faced this issue.
> The moment we switched to 1.8, I think we have hit this issue.
> The reason why I say "I think" is because I am not sure if it is the same
> issue. But whenever I restart, my pipeline proceeds.
> 
> *Airflow 1.7*
> Having said that, in 1.7 I did face a similar issue (less than
> 5 times over a year): I saw that there were a lot of processes marked ""
> with the parent process being "scheduler".
> 
> Somebody mentioned it in this jira ->
> https://issues.apache.org/jira/browse/AIRFLOW-401
> 
> Workaround: Restart scheduler
> 
> *Airflow 1.8*
> Now the issue in 1.8 may be different than the issue in 1.7. But again
> the issue gets solved and the pipeline progresses on a SCHEDULER RESTART.
> 
> If it may help, this is the trace in 1.8:
> [2017-03-22 19:35:16,332] {models.py:167} INFO - Filling up the DagBag from /usr/local/airflow/pipeline/pipeline.py
> [2017-03-22 19:35:22,451] {airflow_configuration.py:40} INFO - loading setup.cfg file
> [2017-03-22 19:35:51,041] {timeout.py:37} ERROR - Process timed out
> [2017-03-22 19:35:51,041] {models.py:266} ERROR - Failed to import: /usr/local/airflow/pipeline/pipeline.py
> Traceback (most recent call last):
>   File "/usr/local/lib/python2.7/dist-packages/airflow/models.py", line 263, in process_file
>     m = imp.load_source(mod_name, filepath)
>   File "/usr/local/airflow/pipeline/pipeline.py", line 167, in <module>
>     create_tasks(dbguid, version, dag, override_start_date)
>   File "/usr/local/airflow/pipeline/pipeline.py", line 104, in create_tasks
>     t = create_task(dbguid, dag, taskInfo, version, override_date)
>   File "/usr/local/airflow/pipeline/pipeline.py", line 85, in create_task
>     retries, 1, depends_on_past, version, override_dag_date)
>   File "/usr/local/airflow/pipeline/dags/base_pipeline.py", line 90, in create_python_operator
>     depends_on_past=depends_on_past)
>   File "/usr/local/lib/python2.7/dist-packages/airflow/utils/decorators.py", line 86, in wrapper
>     result = func(*args, **kwargs)
>   File "/usr/local/lib/python2.7/dist-packages/airflow/operators/python_operator.py", line 65, in __init__
>     super(PythonOperator, self).__init__(*args, **kwargs)
>   File "/usr/local/lib/python2.7/dist-packages/airflow/utils/decorators.py", line 70, in wrapper
>     sig = signature(func)
>   File "/usr/local/lib/python2.7/dist-packages/funcsigs/__init__.py", line 105, in signature
>     return Signature.from_function(obj)
>   File "/usr/local/lib/python2.7/dist-packages/funcsigs/__init__.py", line 594, in from_function
>     __validate_parameters__=False)
>   File "/usr/local/lib/python2.7/dist-packages/funcsigs/__init__.py", line 518, in __init__
>     for param in parameters))
>   File "/usr/lib/python2.7/collections.py", line 52, in __init__
>     self.__update(*args, **kwds)
>   File "/usr/lib/python2.7/_abcoll.py", line 548, in update
>     self[key] = value
>   File "/usr/lib/python2.7/collections.py", line 61, in __setitem__
>     last[1] = root[0] = self.__map[key] = [last, root, key]
>   File "/usr/local/lib/python2.7/dist-packages/airflow/utils/timeout.py", line 38, in handle_timeout
>     raise AirflowTaskTimeout(self.error_message)
> AirflowTaskTimeout: Timeout
> 
> 
> 
> 
> On Fri, Mar 24, 2017 at 5:45 PM, Bolke de Bruin  wrote:
> 
>> We are running *without* num runs for over a year (and never have). It is
>> a very elusive issue which has not been reproducible.
>> 
>> I'd like more info on this but it needs to be very elaborate, even to the
>> point of access to the system exposing the behavior.
>> 
>> Bolke
>> 
>> Sent from my iPhone
>> 
>>> On 24 Mar 2017, at 16:04, Vijay Ramesh wrote:
>>> 
>>> We literally have a cron job that restarts the scheduler every 30 min. Num
>>> runs didn't work consistently in rc4, sometimes it would restart itself and
>>> sometimes we'd end up with a few zombie scheduler processes and things
>>> would get stuck. Also running locally, without celery.
>>> 
>>>> On Mar 24, 2017 16:02,  wrote:
>>>> 
>>>> We have max runs set and still hit this. Our solution is dumber:
>>>> monitoring log output, and kill the scheduler if it stops emitting. Works
>>>> like a charm.
>>>> 
>>>>> On Mar 24, 2017, at 5:50 PM, F. Hakan Koklu wrote:
>>>>> 
>>>>> Some solutions to this problem are restarting the scheduler frequently or
>>>>> some sort of monitoring on the scheduler. We have set up a dag that pings
>>>>> cronitor  (a dead man's snitch type of service) every
>>>>> 10 minutes and the snitch pages you when the scheduler dies and does not
>>>>> send a ping to it.
>>>>> 
>>>>> On Fri, Mar 24, 2017 at 1:49 PM, Andrew Phillips <aphill...@qrmedia.com>

Re: 1.8.1 release

2017-03-25 Thread Bolke de Bruin
I have set it to blocker.

> On 25 Mar 2017, at 17:56, Vincent Poulain  
> wrote:
> 
> Hello,
> 
> For some people who are running airflow on prod with docker, this one is
> quite important : https://issues.apache.org/jira/browse/AIRFLOW-1018. I
> don't have log anymore :/
> 
> cheers,
> 
> On Fri, Mar 24, 2017 at 6:59 PM, Bolke de Bruin  wrote:
> 
>> Hi Chris
>> 
>> I think some jira are missing from the blocker list, I'll supply them
>> soon. Also some fixes are already in the v1-8-test branch, that are not
>> part of your list yet and some need to be (check jira on fixes for 1.8.1).
>> 
>> 982 and 983 might be fixed by reverting a change that we did as part of
>> 1.8.0 and including the "wait for all tasks" patch, that is already in
>> master. Let me pick this up.
>> 
>> To help you out I already did some work on the jira classifications (e.g.
>> try filtering on blocking issues) which should make it easier to find out
>> what needs to go into 1.8.1.
>> 
>> Bolke
>> 
>> Sent from my iPhone
>> 
>>> On 24 Mar 2017, at 10:21, Chris Riccomini  wrote:
>>> 
>>> Hey all,
>>> 
>>> I've let this thread sit for a while. Here is a list of the issues that
>>> were raised:
>>> 
>>> BLOCKERS:
>>> https://issues.apache.org/jira/browse/AIRFLOW-982
>>> https://issues.apache.org/jira/browse/AIRFLOW-983
>>> https://issues.apache.org/jira/browse/AIRFLOW-1019
>>> https://issues.apache.org/jira/browse/AIRFLOW-1017
>>> 
>>> NICE TO HAVE:
>>> https://issues.apache.org/jira/browse/AIRFLOW-1015
>>> https://issues.apache.org/jira/browse/AIRFLOW-1013
>>> https://issues.apache.org/jira/browse/AIRFLOW-1004
>>> https://issues.apache.org/jira/browse/AIRFLOW-1003
>>> https://issues.apache.org/jira/browse/AIRFLOW-1001
>>> 
>>> It looks like AIRFLOW-1017 is done, though the JIRA is not closed.
>>> 
>>> The rest remain open. I will wait on the release until the remaining
>>> blockers are finished. Dan/Daniel, can you comment on status?
>>> 
>>> Ruslan, if you want to work on your nice to haves, and submit patches,
>>> that's great, otherwise I don't believe they'll get fixed as part of
>>> 1.8.1.
>>> 
>>> Cheers,
>>> Chris
>>> 
>>> On Wed, Mar 22, 2017 at 9:19 AM, Ruslan Dautkhanov wrote:
>>> 
>>>> Thank you Sid!
>>>> 
>>>> 
>>>> Best regards,
>>>> Ruslan
>>>> 
>>>> On Wed, Mar 22, 2017 at 12:01 AM, siddharth anand wrote:
>>>> 
>>>>> Ruslan,
>>>>> Thanks for sharing this list. I can pick a few up. I agree we should
>>>>> aim to get some of them into 1.8.1.
>>>>> 
>>>>> -s
>>>>> 
>>>>> On Tue, Mar 21, 2017 at 2:29 PM, Ruslan Dautkhanov <dautkha...@gmail.com>
>>>>> wrote:
>>>>> 
>>>>>> Some of the issues I ran into while testing 1.8rc5 :
>>>>>> 
>>>>>> https://issues.apache.org/jira/browse/AIRFLOW-1015
>>>>>>> https://issues.apache.org/jira/browse/AIRFLOW-1013
>>>>>>> https://issues.apache.org/jira/browse/AIRFLOW-1004
>>>>>>> https://issues.apache.org/jira/browse/AIRFLOW-1003
>>>>>>> https://issues.apache.org/jira/browse/AIRFLOW-1001
>>>>>>> https://issues.apache.org/jira/browse/AIRFLOW-1015
>>>>>> 
>>>>>> 
>>>>>> It would be great to have at least some of them fixed in 1.8.1.
>>>>>> 
>>>>>> Thank you.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> Ruslan Dautkhanov
>>>>>> 
>>>>>> On Tue, Mar 21, 2017 at 3:02 PM, Dan Davydov wrote:
>>>>>> 
>>>>>>> Here is my list for targeted 1.8.1 fixes:
>>>>>>> https://issues.apache.org/jira/browse/AIRFLOW-982
>>>>>>> https://issues.apache.org/jira/browse/AIRFLOW-983
>>>>>>> https://issues.apache.org/jira/browse/AIRFLOW-1019 (and in general the
>>>>>>> slow startup time from this new logic of orphaned/reset task)
>>>>>>> https://issues.apache.org/jira/browse/AIRFLOW-1017 (which I will hopefully
>>>>>>> have a fix out for soon just finishing up tests)
>>>>>>> 
>>>>>>> We are also hitting a new issue with subdags with rc5 that we weren't
>>>>>>> hitting with rc4 where subdags will occasionally just hang (had to roll
>>>>>>> back from rc5 to rc4), I'll try to spin up a JIRA for it soon which should
>>>>>>> be on the list too.
>>>>>>> 
>>>>>>> 
>>>>>>> On Tue, Mar 21, 2017 at 1:54 PM, Chris Riccomini <criccom...@apache.org>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> Agreed. I'm looking for a list of checksums/JIRAs that we want in the
>>>>>>>> bugfix release.
>>>>>>>> 
>>>>>>>> On Tue, Mar 21, 2017 at 12:54 PM, Bolke de Bruin <bdbr...@gmail.com>
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> On 21 Mar 2017, at 12:51, Bolke de Bruin wrote:
>>>>>>>>>> 
>>>>>>>>>> My suggestion, as we are using semantic versioning is:
>>>>>>>>>> 
>>>>>>>>>> 1) no new features in the 1.8 branch
>>>>>>>>>> 2) only bug fixes in the 1.8 branch
>>>>>>>>>> 3) new features to land in 1.9
>>>>>>>>>> 
>>>>>>>>>> This allows companies to
>>>>>>>>> 
>>>>>>>>> Have a

Re: 1.8.1 release

2017-03-25 Thread Vincent Poulain
Hello,

For some people who are running airflow on prod with docker, this one is
quite important : https://issues.apache.org/jira/browse/AIRFLOW-1018. I
don't have log anymore :/

cheers,

On Fri, Mar 24, 2017 at 6:59 PM, Bolke de Bruin  wrote:

> Hi Chris
>
> I think some jira are missing from the blocker list, I'll supply them
> soon. Also some fixes are already in the v1-8-test branch, that are not
> part of your list yet and some need to be (check jira on fixes for 1.8.1).
>
> 982 and 983 might be fixed by reverting a change that we did as part of
> 1.8.0 and including the "wait for all tasks" patch, that is already in
> master. Let me pick this up.
>
> To help you out I already did some work on the jira classifications (e.g.
> try filtering on blocking issues) which should make it easier to find out
> what needs to go into 1.8.1.
>
> Bolke
>
> Sent from my iPhone
>
> > On 24 Mar 2017, at 10:21, Chris Riccomini  wrote:
> >
> > Hey all,
> >
> > I've let this thread sit for a while. Here is a list of the issues that
> > were raised:
> >
> > BLOCKERS:
> > https://issues.apache.org/jira/browse/AIRFLOW-982
> > https://issues.apache.org/jira/browse/AIRFLOW-983
> > https://issues.apache.org/jira/browse/AIRFLOW-1019
> > https://issues.apache.org/jira/browse/AIRFLOW-1017
> >
> > NICE TO HAVE:
> > https://issues.apache.org/jira/browse/AIRFLOW-1015
> > https://issues.apache.org/jira/browse/AIRFLOW-1013
> > https://issues.apache.org/jira/browse/AIRFLOW-1004
> > https://issues.apache.org/jira/browse/AIRFLOW-1003
> > https://issues.apache.org/jira/browse/AIRFLOW-1001
> >
> > It looks like AIRFLOW-1017 is done, though the JIRA is not closed.
> >
> > The rest remain open. I will wait on the release until the remaining
> > blockers are finished. Dan/Daniel, can you comment on status?
> >
> > Ruslan, if you want to work on your nice to haves, and submit patches,
> > that's great, otherwise I don't believe they'll get fixed as part of
> > 1.8.1.
> >
> > Cheers,
> > Chris
> >
> > On Wed, Mar 22, 2017 at 9:19 AM, Ruslan Dautkhanov wrote:
> >
> >> Thank you Sid!
> >>
> >>
> >> Best regards,
> >> Ruslan
> >>
> >> On Wed, Mar 22, 2017 at 12:01 AM, siddharth anand wrote:
> >>
> >>> Ruslan,
> >>> Thanks for sharing this list. I can pick a few up. I agree we should
> >>> aim to get some of them into 1.8.1.
> >>>
> >>> -s
> >>>
> >>> On Tue, Mar 21, 2017 at 2:29 PM, Ruslan Dautkhanov <dautkha...@gmail.com>
> >>> wrote:
> >>>
> >>>> Some of the issues I ran into while testing 1.8rc5 :
> >>>>
> >>>> https://issues.apache.org/jira/browse/AIRFLOW-1015
> >>>>> https://issues.apache.org/jira/browse/AIRFLOW-1013
> >>>>> https://issues.apache.org/jira/browse/AIRFLOW-1004
> >>>>> https://issues.apache.org/jira/browse/AIRFLOW-1003
> >>>>> https://issues.apache.org/jira/browse/AIRFLOW-1001
> >>>>> https://issues.apache.org/jira/browse/AIRFLOW-1015
> >>>>
> >>>>
> >>>> It would be great to have at least some of them fixed in 1.8.1.
> >>>>
> >>>> Thank you.
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> Ruslan Dautkhanov
> >>>>
> >>>> On Tue, Mar 21, 2017 at 3:02 PM, Dan Davydov wrote:
> >>>>
> >>>>> Here is my list for targeted 1.8.1 fixes:
> >>>>> https://issues.apache.org/jira/browse/AIRFLOW-982
> >>>>> https://issues.apache.org/jira/browse/AIRFLOW-983
> >>>>> https://issues.apache.org/jira/browse/AIRFLOW-1019 (and in general the
> >>>>> slow startup time from this new logic of orphaned/reset task)
> >>>>> https://issues.apache.org/jira/browse/AIRFLOW-1017 (which I will hopefully
> >>>>> have a fix out for soon just finishing up tests)
> >>>>>
> >>>>> We are also hitting a new issue with subdags with rc5 that we weren't
> >>>>> hitting with rc4 where subdags will occasionally just hang (had to roll
> >>>>> back from rc5 to rc4), I'll try to spin up a JIRA for it soon which should
> >>>>> be on the list too.
> >>>>>
> >>>>>
> >>>>> On Tue, Mar 21, 2017 at 1:54 PM, Chris Riccomini <criccom...@apache.org>
> >>>>> wrote:
> >>>>>
> >>>>>> Agreed. I'm looking for a list of checksums/JIRAs that we want in the
> >>>>>> bugfix release.
> >>>>>>
> >>>>>> On Tue, Mar 21, 2017 at 12:54 PM, Bolke de Bruin <bdbr...@gmail.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>> On 21 Mar 2017, at 12:51, Bolke de Bruin wrote:
> >>>>>>>>
> >>>>>>>> My suggestion, as we are using semantic versioning is:
> >>>>>>>>
> >>>>>>>> 1) no new features in the 1.8 branch
> >>>>>>>> 2) only bug fixes in the 1.8 branch
> >>>>>>>> 3) new features to land in 1.9
> >>>>>>>>
> >>>>>>>> This allows companies to
> >>>>>>>
> >>>>>>> Have a "known" version and can move to the new branch when they want
> >>>>>>> to get new features. Obviously we only support N-1, so when 1.10 comes
> >>>>>>> out we stop