Got it! I don’t think it’s this last case, but I’ll keep an eye out for it anyway. Really, thanks again, I appreciate the help! I’ll let you know what I find if I think it may be of some use to you.
Stéphane

> On Jun 11, 2018, at 3:31 PM, Maxime Beauchemin <maximebeauche...@gmail.com> wrote:
>
> One more thing is if one of your workers has a missing dependency required
> for a specific DAG. For example, you read configuration from Zookeeper in
> the DAG file, but one worker is missing the Zookeeper client Python lib
> while the scheduler has it. You can imagine that the scheduler will send
> the job over to the worker, and the worker can't interpret the DAG file.
>
> On Mon, Jun 11, 2018 at 3:22 PM Stephane Bonneaud <steph...@fathomhealth.co> wrote:
>
>> Max,
>>
>> Thank you for the quick response, that is very helpful and great material
>> for my investigations!
>>
>> Thanks again,
>> Stéphane
>>
>>> On Jun 11, 2018, at 3:11 PM, Maxime Beauchemin <maximebeauche...@gmail.com> wrote:
>>>
>>> DagBag import timeouts happen when people do more than just
>>> "configuration as code" in their module scope (say, doing actual compute
>>> in module scope, which is a no-no). They may also happen if you read
>>> things from flimsy external systems that may introduce delays. Say you
>>> read pipeline configuration from Zookeeper, a database, or a network
>>> drive, and somehow that operation is timing out.
>>>
>>> Also, with Airflow (at the moment) you are responsible for synchronizing
>>> the pipeline definitions (DAGS_FOLDER) on all machines across the
>>> cluster. If they are not in sync you'll have problems with symptoms that
>>> may look like "dag_id not found". That happens when the scheduler is
>>> aware of DAGs that workers may not be aware of.
>>>
>>> Max
>>>
>>> On Mon, Jun 11, 2018 at 12:42 PM Stephane Bonneaud <steph...@fathomhealth.co> wrote:
>>>
>>>> Hi there,
>>>>
>>>> We’re using Airflow in our startup and it’s been great in many ways,
>>>> thanks for the work you guys are doing!
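Max's point about module scope can be sketched in plain Python. This is a minimal illustration, not real Airflow code: the Zookeeper read is replaced by a stand-in function, and the helper names (`fetch_config_from_zookeeper`, `run_pipeline_step`) are hypothetical.

```python
import time

def fetch_config_from_zookeeper():
    """Stand-in for a slow or flaky external read (hypothetical helper)."""
    time.sleep(0.01)  # pretend this is a network round-trip
    return {"tables": ["a", "b"]}

# Anti-pattern: calling the helper at module scope means the external read
# runs every time the scheduler re-parses the DAG file, and that time counts
# against dagbag_import_timeout:
#
#     CONFIG = fetch_config_from_zookeeper()   # runs at import/parse time
#
# Safer pattern: defer the read into the task callable, so it runs at
# execute time on the worker rather than at parse time on the scheduler.
def run_pipeline_step():
    config = fetch_config_from_zookeeper()  # runs only when the task runs
    return len(config["tables"])
```

With this shape, importing the module stays cheap; the slow call only happens when the callable is actually invoked.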
>>>>
>>>> Unfortunately, we’re hitting a bunch of issues with ops timing out and
>>>> DAGs failing for unclear reasons, with no logs or with the following
>>>> error: "airflow.exceptions.AirflowException: dag_id could not be found".
>>>> This seems to happen when enough DAGs are running at the same time,
>>>> though it can also happen more rarely here and there. The most reliable
>>>> way to reproduce the error with our setup is to run enough DAGs at once.
>>>> Most of the time, clearing the failed DAG run or ops and letting the DAG
>>>> re-run is enough to fix the problem.
>>>>
>>>> I found resources pointing to the dagbag_import_timeout parameter, e.g.,
>>>> https://stackoverflow.com/questions/43235130/airflow-dag-id-could-not-be-found
>>>> I did play with that parameter, and with other parameters as well, and
>>>> they do seem to help, i.e., I can run more DAGs at once. But:
>>>> (1) if I run enough DAGs at once, I still see ops and DAGs failing,
>>>> so the problem is not fixed;
>>>> (2) more importantly, I don’t fully understand the problem. I have
>>>> some ideas on what is happening, but maybe I’m totally wrong?
>>>>
>>>> Any recommendations on how I should investigate this?
>>>>
>>>> Thank you very much!
>>>> Have a nice rest of the day,
>>>> Stéphane
>>>> http://stephanebonneaud.com
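One way to test Max's DAGS_FOLDER-out-of-sync hypothesis is to fingerprint the folder on the scheduler and on each worker and compare digests. A minimal sketch (the function name and the choice of MD5 are just illustrative, not anything Airflow ships):

```python
import hashlib
import os

def dags_folder_fingerprint(dags_folder):
    """Hash the relative paths and contents of all .py files under dags_folder.

    Run this on the scheduler and on every worker; differing digests mean
    the machines are parsing different pipeline definitions, which can
    surface as "dag_id could not be found".
    """
    digest = hashlib.md5()
    for root, _dirs, files in sorted(os.walk(dags_folder)):
        for name in sorted(files):
            if not name.endswith(".py"):
                continue
            path = os.path.join(root, name)
            # Include the relative path so renames change the fingerprint too.
            digest.update(os.path.relpath(path, dags_folder).encode())
            with open(path, "rb") as fh:
                digest.update(fh.read())
    return digest.hexdigest()
```

Identical folders yield identical digests; any added, removed, renamed, or edited DAG file changes the result.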