Hello, I am using Apache Airflow for fun and to gain experience, and it is great!
I hope this is the right address for questions; please correct me if
I'm wrong.

I was wondering: why shouldn't I let the DAG itself do any data gathering?

For example, and for the sake of simplicity: I have a pipeline that reads a
file name from an S3 bucket and then stores it in a MySQL table.

Normally I would use one sensor or operator to get the file name, and then
a second operator to store it in MySQL (using XCom, for example, to pass
the name between them).
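Concretely, my current two-operator version looks roughly like the sketch
below (the S3 listing and the MySQL insert are stubbed out with prints,
and all names are placeholders I made up):

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def get_file_name():
    # Stub: in reality I'd list the bucket via S3Hook or boto3.
    # The return value is pushed to XCom automatically.
    return "some/key/data.csv"


def store_in_mysql(**context):
    # Pull the name the first task pushed to XCom.
    file_name = context["ti"].xcom_pull(task_ids="get_file_name")
    # Stub: in reality an INSERT via MySqlHook.
    print(f"storing {file_name} in MySQL")


with DAG(
    dag_id="s3_to_mysql",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    get_name = PythonOperator(task_id="get_file_name",
                              python_callable=get_file_name)
    store = PythonOperator(task_id="store_in_mysql",
                           python_callable=store_in_mysql)
    get_name >> store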

I understand this might be the preferred course of action, and it is what
I currently do!
However, what I don't understand is why I can't just get the file name
within the DAG file itself.
Why is it considered bad practice to do any data-related processing
or gathering in the DAG?

I could use the AWS API to easily retrieve the file name and store it in a
regular Python "global" variable. Then I would only need one operator,
which takes this file name and stores it in MySQL.
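Something like this sketch (again, the bucket name and query are
placeholders, and boto3 stands in for whatever AWS client I'd use; the
MySQL insert is stubbed out with a print):

from datetime import datetime

import boto3

from airflow import DAG
from airflow.operators.python import PythonOperator

# Module-level ("global") code: this runs every time the scheduler
# parses the DAG file, not only when the DAG actually executes.
s3 = boto3.client("s3")
response = s3.list_objects_v2(Bucket="my-bucket", MaxKeys=1)
FILE_NAME = response["Contents"][0]["Key"]


def store_in_mysql():
    # Stub: in reality an INSERT via MySqlHook, using the global name.
    print(f"storing {FILE_NAME} in MySQL")


with DAG(
    dag_id="s3_to_mysql_parse_time",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    PythonOperator(task_id="store_in_mysql",
                   python_callable=store_in_mysql)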

Each time the DAG file is parsed, my code that uses the AWS
API will run again and provide me with a fresh file name.

Am I missing something?

Thank you very much, this has gotten me so curious!
