Thanks Jeremiah, this is exactly what was bugging me. I am going to rewrite that code and look at persistent storage. Your explanation helped, thanks!
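
A rough sketch of the persistent-storage idea, assuming Python 3, the standard-library shelve module, and a dict with string keys and picklable values. The store path and the helper names build_store and lookup are made up for illustration; the thread does not prescribe any particular storage.

```python
import os
import shelve
import tempfile

# Hypothetical location; a real deployment would use a path every worker can read.
STORE_PATH = os.path.join(tempfile.mkdtemp(), "outer_dict_store")


def build_store(data, path=STORE_PATH):
    """Write the dict to disk once, e.g. from a separate preparation job."""
    with shelve.open(path, flag="n") as db:  # "n" always creates a fresh store
        for key, value in data.items():
            db[key] = value  # shelve keys must be str


def lookup(key, path=STORE_PATH, default=None):
    """Fetch a single entry; only that entry's value is unpickled."""
    with shelve.open(path, flag="r") as db:  # read-only access
        return db.get(key, default)


# Tiny stand-in for the real 60MB dict.
build_store({"table_a": {"rows": 10}, "table_b": {"rows": 20}})

result = lookup("table_b")
missing = lookup("table_z", default={})
print(result)  # {'rows': 20}
```

With something like this, the DAG file itself would only call lookup for the keys a given task needs, so parsing the file in each task process would not rebuild the whole structure.
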
On Wed, Mar 22, 2017 at 2:29 PM, Jeremiah Lowin <[email protected]> wrote:

> In vanilla Python, your DAGs will all reference the same object, so when
> your DAG file is parsed and 200 DAGs are created, there will still only be
> one 60MB dict object created (I say vanilla because there are obviously
> ways to create copies of the object).
>
> HOWEVER, you should assume that each Airflow TASK is being run in a
> different process, and each process is going to load your DAG file when it
> runs. If resource use is a concern, I suggest you look at out-of-core or
> persistent storage for the object so you don't need to load the whole thing
> every time.
>
> On Wed, Mar 22, 2017 at 11:20 AM Boris Tyukin <[email protected]>
> wrote:
>
> > Hi Jeremiah, thanks for the explanation!
> >
> > I am very new to Python, so I was surprised that it works and that my
> > external dictionary object was still accessible to all the DAGs
> > generated. I think it makes sense, but I would like to confirm one thing
> > and I do not know how to test it myself.
> >
> > Do you think that the large dictionary object will still be loaded into
> > memory only once even if I generate 200 DAGs that will be accessing it?
> > So basically, will they just use a reference to it, or would they each
> > create a copy of the same 60MB structure?
> >
> > I hope my question makes sense :)
> >
> > On Wed, Mar 22, 2017 at 10:54 AM, Jeremiah Lowin <[email protected]>
> > wrote:
> >
> > > At the risk of oversimplifying things, your DAG definition file is
> > > loaded *every* time a DAG (or any task in that DAG) is run. Think of it
> > > as a literal Python import of your DAG-defining module: any variables
> > > are loaded along with the DAGs, which are then executed. That's why
> > > your dict is always available. This will work with Celery since it
> > > follows the same approach, parsing your DAG file to run each task.
> > >
> > > (By the way, this is why it's critical that all parts of your Airflow
> > > infrastructure have access to the same DAGS_FOLDER.)
> > >
> > > Now it is true that the DagBag loads DAG objects, but think of it as
> > > more of an "index" so that the scheduler/webserver know what DAGs are
> > > available. When it's time to actually run one of those DAGs, the
> > > executor loads it from the underlying source file.
> > >
> > > Jeremiah
> > >
> > > On Wed, Mar 22, 2017 at 8:45 AM Boris Tyukin <[email protected]>
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > I have a weird question that has been bugging me. I have some code
> > > > like the one below to generate DAGs dynamically, using Max's example
> > > > code from the FAQ.
> > > >
> > > > It works fine, but I have one large dict (let's call it
> > > > my_outer_dict) that takes over 60MB in memory and I need to access it
> > > > from all generated DAGs. Needless to say, I do not want to recreate
> > > > that dict for every DAG, as I want to load it into memory only once.
> > > >
> > > > To my surprise, if I define that dict outside of my DAG definition
> > > > code, I can still access it.
> > > >
> > > > Can someone explain why, and where it is stored? I thought only DAG
> > > > definitions are loaded into the DagBag and not the variables outside
> > > > them.
> > > >
> > > > Is it even good practice, and will it still work if I switch to the
> > > > Celery executor?
> > > >
> > > >
> > > > from airflow import DAG
> > > >
> > > > def get_dag(i):
> > > >     dag_id = 'foo_{}'.format(i)
> > > >     dag = DAG(dag_id)
> > > >     # ....
> > > >     print(my_outer_dict)  # resolved as a module-level global
> > > >     return dag  # needed so the loop below gets a DAG back
> > > >
> > > > my_outer_dict = {}
> > > > for i in range(10):
> > > >     dag = get_dag(i)
> > > >     # expose each DAG at module level so Airflow can find it
> > > >     globals()[dag.dag_id] = dag
> > > >
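
The "same object, not a copy" point from the thread can be checked in plain Python without Airflow. This is only a sketch: FakeDag is a hypothetical stand-in for airflow.DAG, and the small dict stands in for the 60MB one. It shows behaviour inside a single process; it says nothing about separate task processes, which each parse the file again.

```python
import copy


class FakeDag:
    """Hypothetical stand-in for airflow.DAG so the sketch runs without Airflow."""

    def __init__(self, dag_id, lookup):
        self.dag_id = dag_id
        self.lookup = lookup  # stores a reference to the dict, not a copy


# Small stand-in for the large module-level dict.
my_outer_dict = {"key_{}".format(n): n for n in range(1000)}


def get_dag(i):
    return FakeDag("foo_{}".format(i), my_outer_dict)


dags = [get_dag(i) for i in range(200)]

# Every generated object points at the very same dict.
all_shared = all(d.lookup is my_outer_dict for d in dags)
distinct_ids = {id(d.lookup) for d in dags}

# A second dict only exists when a copy is made explicitly.
explicit_copy = copy.deepcopy(my_outer_dict)

print(all_shared)                      # True
print(len(distinct_ids))               # 1
print(explicit_copy is my_outer_dict)  # False
```
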
