Oh, I thought the DAG was parsed only prior to execution. Thank you so much! :)
On Tue, Sep 22, 2020, 20:15 Tomasz Urbaszek <[email protected]> wrote:

> The DAG is parsed every few seconds (by the scheduler). That means any
> top-level code is executed every few seconds. So if you request an
> external API or database at the DAG level (not in an operator), the
> request will be sent quite often, and that's definitely not the expected
> behavior :)
>
> Cheers,
> Tomek
>
> On Mon, Sep 21, 2020 at 11:23 PM Chen Michaeli <[email protected]> wrote:
>
>> Hello, I am using Apache Airflow for fun and experience and it is
>> great!
>> I hope I was meant to send questions to this address; please correct me
>> if I'm wrong.
>>
>> I was wondering why I shouldn't let the DAG itself do any data gathering.
>>
>> For example, and for the sake of simplicity, I have a pipeline that reads
>> a file name from an S3 bucket and then stores it in a MySQL table.
>>
>> Normally I would use one sensor or operator to get the file name, and
>> then a second operator to store it in MySQL (while, for example, using
>> XCom to communicate the name between them).
>>
>> I understand this might be the preferred course of action, and that is
>> what I currently do!
>> However, what I don't understand is why I can't just get the file name
>> within the DAG itself.
>> Why is it considered bad practice to do any data-related processing or
>> gathering in the DAG?
>>
>> I could use the AWS API to easily retrieve the file name and store it in
>> a regular Python "global" variable. Then I would only have one operator
>> that takes this file name and stores it in MySQL.
>>
>> Each time the DAG is parsed for execution, my code that uses the AWS API
>> will run again and provide me with a new file name.
>>
>> Am I missing something?
>>
>> Thank you very much, this has gotten me so curious!
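
For reference, here is a minimal sketch of the pattern Tomek describes:
the S3 call lives inside a task callable (so it runs only when the task
executes, not on every parse), and the file name travels to the MySQL
task via XCom. The import paths match Airflow 1.x, which was current at
the time of this thread; the bucket name, table name, and connection IDs
("my-bucket", "filenames", "aws_default", "mysql_default") are made-up
placeholders, not anything from the original messages.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator
    from airflow.hooks.S3_hook import S3Hook
    from airflow.hooks.mysql_hook import MySqlHook


    def get_file_name(**context):
        # The S3 request happens here, at task execution time only.
        # The return value is pushed to XCom automatically.
        keys = S3Hook(aws_conn_id="aws_default").list_keys(bucket_name="my-bucket")
        return keys[0]


    def store_file_name(**context):
        # Pull the file name pushed by the upstream task via XCom.
        file_name = context["ti"].xcom_pull(task_ids="get_file_name")
        MySqlHook(mysql_conn_id="mysql_default").run(
            "INSERT INTO filenames (name) VALUES (%s)", parameters=(file_name,)
        )


    with DAG(
        dag_id="s3_to_mysql_example",
        start_date=datetime(2020, 9, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        # No S3/MySQL calls at this level: the scheduler re-parses this
        # file every few seconds, so top-level code runs on every parse.
        get_name = PythonOperator(
            task_id="get_file_name",
            python_callable=get_file_name,
            provide_context=True,
        )
        store_name = PythonOperator(
            task_id="store_file_name",
            python_callable=store_file_name,
            provide_context=True,
        )
        get_name >> store_name

The key design point is that the hooks are instantiated inside the
callables, so parsing the file performs no network I/O at all. (On
Airflow 2 the hook and operator import paths move into provider
packages, but the pattern is the same.)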
