> On Nov 26, 2018, at 7:50 AM, Maxime Beauchemin <maximebeauche...@gmail.com> wrote:
>
> The historical reason is that people would check scripts into the repo
> that had actual compute or other undesired side effects in module scope
> (scripts with no "if __name__ == '__main__':"), and Airflow would just run
> these scripts while searching for DAGs. So we added this mitigation patch
> to confirm that there's something Airflow-related in the .py file. Not
> elegant, and confusing at times, but it has also probably prevented some
> issues over the years.
>
> The solution here is to have a more explicit way of adding DAGs to the
> DagBag (instead of the folder-crawling approach). The DagFetcher proposal
> offers solutions around that, with a central "manifest" file that
> provides explicit pointers to all DAGs in the environment.
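>
> (To make the idea concrete, here is a minimal sketch of what manifest-driven
> loading could look like. All names below, DAG_MANIFEST, load_dags, and the
> module paths, are hypothetical illustrations, not part of the actual
> DagFetcher proposal:)
>
> import importlib
>
> # Hypothetical central manifest: explicit pointers to every DAG module
> # in the environment, instead of crawling a folder for .py files.
> DAG_MANIFEST = {
>     "daily_etl": "my_company.dags.daily_etl",
>     "model_training": "my_company.dags.model_training",
> }
>
> def load_dags(manifest=DAG_MANIFEST):
>     """Import each listed module and collect the DAG object it exposes."""
>     dags = {}
>     for dag_id, module_path in manifest.items():
>         module = importlib.import_module(module_path)
>         dags[dag_id] = getattr(module, "dag")  # assumes each module defines `dag`
>     return dags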
Some rebasing needs to happen. When I looked at the 1.8 code base almost a
year ago, it felt more complex than necessary. What Airflow is trying to
promise from an architectural standpoint was not clear to me. The feeling I
got is that it is trying to do too many things, scattered in too many places.
As a result, I stopped digging in and just trust that it works (which it
does, by the way). I tend to think that Airflow has outgrown its original
intent, and that a sort of micro-services architecture has to be brought in.
I may sound critical, but no offense is meant. I truly appreciate the
contributions.
>
> Max
>
> On Sat, Nov 24, 2018 at 5:04 PM Beau Barker <beauinmelbou...@gmail.com> wrote:
>
>> In my opinion, this searching for DAGs is not ideal.
>>
>> We should be explicitly specifying somewhere which DAGs to load.
>>
>>
>>> On 25 Nov 2018, at 10:41 am, Kevin Yang <yrql...@gmail.com> wrote:
>>>
>>> I believe that is mostly because we want to skip parsing/loading .py files
>>> that don't contain DAG defs, to save time: the scheduler is going to
>>> parse/load the .py files over and over again, and some files can take
>>> quite long to load.
>>>
>>> Cheers,
>>> Kevin Y
>>>
>>> On Fri, Nov 23, 2018 at 12:44 AM soma dhavala <soma.dhav...@gmail.com> wrote:
>>>
>>>> Happy to report that the "fix" worked. Thanks, Alex.
>>>>
>>>> BTW, wondering why it was there in the first place? How does it help:
>>>> saving time, early termination, what?
>>>>
>>>>
>>>>> On Nov 23, 2018, at 8:18 AM, Alex Guziel <alex.guz...@airbnb.com> wrote:
>>>>>
>>>>> Yup.
>>>>>
>>>>> On Thu, Nov 22, 2018 at 3:16 PM soma dhavala <soma.dhav...@gmail.com> wrote:
>>>>>
>>>>>
>>>>>> On Nov 23, 2018, at 3:28 AM, Alex Guziel <alex.guz...@airbnb.com> wrote:
>>>>>>
>>>>>> It’s because of this
>>>>>>
>>>>>> “When searching for DAGs, Airflow will only consider files where the strings “airflow” and “DAG” both appear in the contents of the .py file.”
>>>>>>
>>>>>
>>>>> Had not noticed it. From airflow/models.py, in process_file (both in
>>>>> 1.9 and 1.10):
>>>>> ..
>>>>> if not all([s in content for s in (b'DAG', b'airflow')]):
>>>>> ..
>>>>> it looks for those strings, and if they are not found, it returns
>>>>> without loading the DAGs.
>>>>>
>>>>>
>>>>> So having “airflow” and “DAG” dummy strings placed somewhere will make it work?
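>>>>>
>>>>> (If so, since the check is a plain substring match on the file's raw
>>>>> bytes, a comment should be enough to satisfy it. A minimal sketch, where
>>>>> build_dags_from_yaml is a hypothetical helper:)
>>>>>
>>>>> # Mentioning airflow and DAG in this comment satisfies the substring
>>>>> # check, so the scheduler will not skip this loader file in safe mode.
>>>>> from my_loaders import build_dags_from_yaml  # hypothetical helper
>>>>>
>>>>> for dag in build_dags_from_yaml("/path/to/yaml/configs"):
>>>>>     globals()[dag.dag_id] = dag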
>>>>>
>>>>>
>>>>>> On Thu, Nov 22, 2018 at 2:27 AM soma dhavala <soma.dhav...@gmail.com> wrote:
>>>>>>
>>>>>>
>>>>>>> On Nov 22, 2018, at 3:37 PM, Alex Guziel <alex.guz...@airbnb.com> wrote:
>>>>>>>
>>>>>>> I think this is what is going on. The DAGs are picked up from
>>>>>>> module-level variables, i.e. if you do
>>>>>>> dag = DAG(...)
>>>>>>> dag = DAG(...)
>>>>>>
>>>>>> from my_module import create_dag
>>>>>>
>>>>>> for file in yaml_files:
>>>>>>     dag = create_dag(file)
>>>>>>     globals()[dag.dag_id] = dag
>>>>>>
>>>>>> Notice that create_dag is in a different module. If it is in the same
>>>>>> scope (file), it works fine.
>>>>>>
>>>>>>>
>>>>>>
>>>>>>> Only the second dag will be picked up.
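>>>>>>>
>>>>>>> (A minimal illustration of that point; discovery scans the module's
>>>>>>> namespace, so each DAG needs its own module-level binding:)
>>>>>>>
>>>>>>> from datetime import datetime
>>>>>>> from airflow import DAG
>>>>>>>
>>>>>>> # Anti-pattern: rebinding one name leaves only "second" in globals():
>>>>>>> # dag = DAG("first", start_date=datetime(2018, 1, 1))
>>>>>>> # dag = DAG("second", start_date=datetime(2018, 1, 1))
>>>>>>>
>>>>>>> # Distinct module-level names keep both DAGs discoverable:
>>>>>>> first_dag = DAG("first", start_date=datetime(2018, 1, 1))
>>>>>>> second_dag = DAG("second", start_date=datetime(2018, 1, 1))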
>>>>>>>
>>>>>>> On Thu, Nov 22, 2018 at 2:04 AM Soma S Dhavala <soma.dhav...@gmail.com> wrote:
>>>>>>> Hey AirFlow Devs:
>>>>>>> In our organization, we build a Machine Learning WorkBench with AirFlow
>>>>>>> as an orchestrator of the ML workflows, and have wrapped AirFlow Python
>>>>>>> operators to customize the behaviour. These workflows are specified in
>>>>>>> YAML.
>>>>>>>
>>>>>>> We drop a DAG loader (written in Python) in the default location where
>>>>>>> airflow expects the DAG files. This DAG loader reads the specified YAML
>>>>>>> files and converts them into airflow DAG objects. Essentially, we are
>>>>>>> programmatically creating the DAG objects. In order to support multiple
>>>>>>> parsers (YAML, JSON, etc.), we separated the DAG creation from the
>>>>>>> loading. But when a DAG is created (in a separate module) and made
>>>>>>> available to the DAG loaders, airflow does not pick it up. As an example,
>>>>>>> consider that I created a DAG, pickled it, and will simply unpickle the
>>>>>>> DAG and give it to airflow.
>>>>>>>
>>>>>>> However, in the current avatar of airflow, the very creation of the DAG
>>>>>>> has to happen in the loader itself. As far as I am concerned, airflow
>>>>>>> should not care where and how the DAG object is created, so long as it is
>>>>>>> a valid DAG object. The workaround for us is to mix the parser and the
>>>>>>> loader in the same file and drop it in the airflow default dags folder.
>>>>>>> During dag_bag creation, this file is loaded up with the import_modules
>>>>>>> utility and shows up in the UI. While this is a solution, it is not clean.
>>>>>>>
>>>>>>> What do DEVs think about a solution to this problem? Will saving the DAG
>>>>>>> to the db and reading it from the db work? Or do some core changes need
>>>>>>> to happen in the dag_bag creation? Can dag_bag take a bunch of "created"
>>>>>>> DAGs?
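>>>>>>>
>>>>>>> (For what it is worth, a rough sketch of the kind of API I have in mind,
>>>>>>> using DagBag.bag_dag(), which registers an already-constructed DAG
>>>>>>> object; the exact signature varies across versions, and my_created_dags
>>>>>>> is hypothetical:)
>>>>>>>
>>>>>>> from airflow.models import DagBag
>>>>>>>
>>>>>>> dag_bag = DagBag(dag_folder="/nonexistent")  # avoid crawling the default folder
>>>>>>> for dag in my_created_dags:  # DAGs built programmatically from YAML
>>>>>>>     dag_bag.bag_dag(dag, parent_dag=dag, root_dag=dag)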
>>>>>>>
>>>>>>> thanks,
>>>>>>> -soma
>>>>>>