Max,

Thank you for the quick response. That is very helpful and great material for my investigations!
Thanks again,
Stéphane

> On Jun 11, 2018, at 3:11 PM, Maxime Beauchemin <maximebeauche...@gmail.com>
> wrote:
>
> DagBag import timeouts happen when people do more than just "configuration
> as code" in their module scope (say doing actual compute in module scope,
> which is a no-no). They may also happen if you read things from flimsy
> external systems that may introduce delays. Say you read pipeline
> configuration from Zookeeper or from a database or network drive and
> somehow that operation is timing out.
>
> Also with Airflow (at the moment) you are responsible to synchronize the
> pipeline definitions (DAGS_FOLDER) on all machines across the cluster. If
> they are not in sync you'll have problems with symptoms that may look like
> "dag_id not found". That happens when the scheduler is aware of DAGs that
> workers may not be aware of.
>
> Max
>
> On Mon, Jun 11, 2018 at 12:42 PM Stephane Bonneaud <steph...@fathomhealth.co>
> wrote:
>
>> Hi there,
>>
>> We’re using Airflow in our startup and it’s been great in many ways,
>> thanks for the work you guys are doing!
>>
>> Unfortunately, we’re hitting a bunch of issues with ops timing out, DAGs
>> failing for unclear reasons, with no logs or the following error:
>> "airflow.exceptions.AirflowException: dag_id could not be found". This
>> seems to happen when enough DAGs are running at the same time, though it
>> can also happen more rarely here and there. But, the best way to reproduce
>> the error with our setup is to run enough DAGs at once. Most of the time,
>> clearing the DAG run or ops that have failed and letting the DAG re-run is
>> enough to fix the problem.
>>
>> I found resources pointing to the dagbag_import_timeout, e.g.,
>> https://stackoverflow.com/questions/43235130/airflow-dag-id-could-not-be-found
>> I did play with that parameter, and other parameters as well. And it does
>> seem that they help, i.e., I can run more DAGs at once, but
>> (1) if I run enough DAGs at once, I still see ops and DAGs
>> failing, so the problem is not fixed;
>> (2) more importantly, I don’t fully understand the problem. I have
>> some ideas on what is happening, but maybe I’m totally wrong?
>>
>> Any recommendations on how I should investigate that?
>>
>> Thank you very much!
>> Have a nice rest of the day,
>> Stéphane
>> http://stephanebonneaud.com
>>
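
For anyone investigating along the two lines Max describes (slow module-scope code and an out-of-sync DAGS_FOLDER), here is a minimal sketch of two checks using only the Python standard library. It is not part of Airflow: the helper names time_dag_file_import and folder_fingerprint are hypothetical, and the paths passed to them are whatever your own setup uses. Importing a real DAG file this way executes its module-scope code and needs Airflow installed in the same interpreter.

```python
import hashlib
import importlib.util
import time
from pathlib import Path


def time_dag_file_import(path):
    """Import one Python file in isolation and return the seconds it took.

    Hypothetical helper: it executes all module-scope code in the file,
    which is roughly the work a DAG file parse has to do.
    """
    spec = importlib.util.spec_from_file_location("dag_file_under_test", str(path))
    module = importlib.util.module_from_spec(spec)
    start = time.perf_counter()
    spec.loader.exec_module(module)  # slow module-scope work shows up here
    return time.perf_counter() - start


def folder_fingerprint(dags_folder):
    """Map each .py file (path relative to the folder) to a SHA-256 digest.

    Hypothetical helper: run it on each machine and compare the results
    to spot files that are missing or differ between machines.
    """
    root = Path(dags_folder)
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*.py"))
    }
```

A file whose import time approaches the configured dagbag_import_timeout is a candidate for the first problem. A fingerprint that differs between the scheduler machine and a worker is a candidate for the second.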