In case you *think* you have encountered a scheduler *hang*, please provide an strace of the parent process, process list output showing any defunct scheduler processes, and *all* logging (main logs, scheduler processing log, task logs), preferably in debug mode (settings.py). Also include memory limits, CPU count, and your airflow.cfg.
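As a rough sketch of the process-list evidence requested above, here is a small helper that filters `ps` output for defunct (zombie) scheduler processes. The function name and the assumed `ps -eo pid,stat,args` format are my own illustrative choices, not part of Airflow:

```python
def find_defunct_schedulers(ps_lines):
    """Return ps lines that look like defunct (zombie) scheduler processes.

    Expects lines in `ps -eo pid,stat,args` format: a STAT starting with
    'Z', or an explicit '<defunct>' marker, indicates a zombie.
    """
    hits = []
    for line in ps_lines:
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue
        _pid, stat, args = parts
        if "scheduler" in args and (stat.startswith("Z") or "<defunct>" in args):
            hits.append(line)
    return hits

# Example: flag zombies in a captured process table.
sample = [
    "  PID STAT ARGS",
    " 4242 Z    [airflow scheduler] <defunct>",
    " 4243 S    airflow scheduler",
]
print(find_defunct_schedulers(sample))  # only the defunct 4242 line
```

Attaching the raw `ps` output plus the filtered lines to a report makes it much easier to tell a hung scheduler apart from a dead one.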
Thanks
Bolke

> On 25 Mar 2017, at 18:16, Bolke de Bruin <bdbr...@gmail.com> wrote:
>
> Please specify what “stop doing its job” means. It doesn’t log anything
> anymore? If it does, the scheduler hasn’t died and hasn’t stopped.
>
> B.
>
>> On 24 Mar 2017, at 18:20, Gael Magnan <gaelmag...@gmail.com> wrote:
>>
>> We encountered the same kind of problem, with the scheduler stopping
>> doing its job even after rebooting. I thought changing the start date or
>> the state of a task instance might be to blame, but I've never been able
>> to pinpoint the problem either.
>>
>> We are using celery and docker, if that helps.
>>
>> On Sat, 25 Mar 2017 at 01:53, Bolke de Bruin <bdbr...@gmail.com> wrote:
>>
>>> We have been running *without* num runs for over a year (and never have
>>> used it). It is a very elusive issue which has not been reproducible.
>>>
>>> I would like more info on this, but it needs to be very elaborate, even
>>> to the point of access to the system exposing the behavior.
>>>
>>> Bolke
>>>
>>> Sent from my iPhone
>>>
>>>> On 24 Mar 2017, at 16:04, Vijay Ramesh <vi...@change.org> wrote:
>>>>
>>>> We literally have a cron job that restarts the scheduler every 30 min.
>>>> Num runs didn't work consistently in rc4: sometimes it would restart
>>>> itself, and sometimes we'd end up with a few zombie scheduler
>>>> processes and things would get stuck. Also running locally, without
>>>> celery.
>>>>
>>>>> On Mar 24, 2017 16:02, <lro...@quartethealth.com> wrote:
>>>>>
>>>>> We have max runs set and still hit this. Our solution is dumber:
>>>>> monitor the log output, and kill the scheduler if it stops emitting.
>>>>> Works like a charm.
>>>>>
>>>>>> On Mar 24, 2017, at 5:50 PM, F. Hakan Koklu
>>>>>> <fhakan.ko...@gmail.com> wrote:
>>>>>>
>>>>>> Some solutions to this problem are restarting the scheduler
>>>>>> frequently or some sort of monitoring on the scheduler.
>>>>>> We have set up a dag that pings
>>>>>> cronitor <https://cronitor.io/> (a dead man's snitch type of
>>>>>> service) every 10 minutes, and the snitch pages you when the
>>>>>> scheduler dies and does not send a ping to it.
>>>>>>
>>>>>> On Fri, Mar 24, 2017 at 1:49 PM, Andrew Phillips
>>>>>> <aphill...@qrmedia.com> wrote:
>>>>>>
>>>>>>>> We use celery and run into it from time to time.
>>>>>>>
>>>>>>> Bang goes my theory ;-) At least, assuming it's the same underlying
>>>>>>> cause...
>>>>>>>
>>>>>>> Regards
>>>>>>>
>>>>>>> ap
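The "kill it when the log goes quiet" watchdog mentioned in the thread could be sketched roughly like this. The five-minute threshold, the log path, and the PID handling are illustrative assumptions, not details from any poster's setup:

```python
import os
import signal
import time

def scheduler_is_stale(log_path, max_silence_secs=300):
    """True if the scheduler log hasn't been written to recently."""
    try:
        age = time.time() - os.path.getmtime(log_path)
    except OSError:
        return True  # a missing log counts as a dead scheduler
    return age > max_silence_secs

def kill_if_stale(log_path, pid, max_silence_secs=300):
    """Send SIGTERM to the scheduler if its log has gone quiet.

    Meant to be run periodically (e.g. from cron) alongside something
    that restarts the scheduler, such as a supervisor or the 30-minute
    restart cron job mentioned in the thread.
    """
    if scheduler_is_stale(log_path, max_silence_secs):
        os.kill(pid, signal.SIGTERM)
        return True
    return False
```

Using the log's mtime rather than parsing its content keeps the check cheap and independent of log format.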
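The cronitor dead man's snitch approach might look roughly like the following. The monitor URL is a placeholder, the heartbeat itself is plain stdlib, and the operator wiring in the comment assumes a stock Airflow `PythonOperator`:

```python
from urllib.request import urlopen

# Placeholder -- substitute the ping URL of your own cronitor monitor.
CRONITOR_URL = "https://cronitor.link/<monitor-code>/run"

def ping_snitch(url=CRONITOR_URL, timeout=10, dry_run=False):
    """Ping the dead man's snitch.

    If the scheduler wedges, this task stops running, the pings stop,
    and the snitch pages you. With dry_run=True, just report the URL
    that would be hit (useful for testing without network access).
    """
    if dry_run:
        return url
    with urlopen(url, timeout=timeout) as resp:
        return resp.status

# In a real deployment this would be the callable of a task scheduled
# every 10 minutes, e.g. (assuming Airflow's PythonOperator):
#
#   PythonOperator(task_id="ping_cronitor",
#                  python_callable=ping_snitch, dag=heartbeat_dag)
#
# with the DAG's schedule_interval set to "*/10 * * * *".
```

The point of the design is that the monitoring signal is the *absence* of pings, so it still fires when the scheduler hangs silently rather than crashing.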