Thank you for your reply, Max.
Our dynamic DAGs query the database for tables and generate DAGs and tasks
based on the output.
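The pattern is roughly this (the connection id, query, and names below are
illustrative, not our actual code):

    from datetime import datetime

    from airflow import DAG
    from airflow.hooks.mysql_hook import MySqlHook  # Airflow 1.10 import path
    from airflow.operators.dummy_operator import DummyOperator

    # This query runs at module scope, i.e. on every DagBag parse --
    # that is where the parse-time cost and extra DB connections come from.
    tables = [row[0] for row in
              MySqlHook(mysql_conn_id="my_db").get_records("SHOW TABLES")]

    for table in tables:
        dag_id = "PPAD_{}".format(table)
        dag = DAG(dag_id,
                  start_date=datetime(2019, 1, 1),
                  schedule_interval="*/30 * * * *")
        DummyOperator(task_id="process_{}".format(table), dag=dag)
        globals()[dag_id] = dag  # register each generated DAG with the DagBag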
As plain Python, the files do not take long to execute:
Dynamic - 500 tasks:
time python PPAD_OIS_MASTER_IDI.py
[2019-08-15 12:57:48,522] {settings.py:174} INFO - setting.configure_orm():
Using pool settings. pool_size=30, pool_recycle=300
real 0m1.830s
user 0m1.622s
sys 0m0.188s
Static - 100 tasks:
time python PPAD_OPS_CANARY_CONNECTIONS_TEST_8.py
[2019-08-15 12:59:24,959] {settings.py:174} INFO - setting.configure_orm():
Using pool settings. pool_size=30, pool_recycle=300
real 0m1.009s
user 0m0.898s
sys 0m0.108s
We have 44 DAGs with 1003 dynamic tasks. Parsing during quiet time (scheduler idle):
DagBag parsing time: 3.9385959999999995
Parsing at execution time, when the scheduler submits the DAGs:
DagBag parsing time: 99.820316
The delay between task runs inside a single DAG grows from 30 sec to 10 min,
then it drops back even though tasks are still running.
Eugene
On 8/15/19, 11:52 AM, "Maxime Beauchemin" <[email protected]> wrote:
What is your dynamic DAG doing? How long does it take to execute it just as
a python script (`time python mydag.py`)?
As an Airflow admin, you may want to lower the DAG parsing timeout
configuration key to force people not to do crazy things in DAG module
scope. At some point at Airbnb we had someone running a Hive query in DAG
scope; clearly that needs to be prevented.
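For example, in airflow.cfg (the key I mean is `dagbag_import_timeout`
under [core]; 30s is the default, the value below is just a sketch):

    [core]
    # abort importing any DAG file that takes longer than this many seconds
    dagbag_import_timeout = 15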
Loading DAGs by calling a database can bring all sorts of surprises that
can drive everyone crazy. As mentioned in a recent post, repo-contained,
deterministic "less dynamic" DAGs are great, because they are
self-contained and allow you to use source-control properly (revert a bad
change for instance). That may mean having a process or script that
compiles external things that are dynamic into things like yaml files
checked into the code repo. Things as simple as parsing duration become
more predictable (network latency and database load are not part of that
equation), but more importantly, all changes become tracked in the code
repo.
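A minimal sketch of that approach (file paths, connection id, and query
are made up for illustration):

    # compile_tables.py -- run out-of-band (cron/CI), never at DAG parse time
    import yaml
    from airflow.hooks.mysql_hook import MySqlHook

    tables = [row[0] for row in
              MySqlHook(mysql_conn_id="my_db").get_records("SHOW TABLES")]
    with open("dags/config/tables.yaml", "w") as f:
        yaml.safe_dump({"tables": tables}, f)

    # mydag.py -- parsing now only reads a local, version-controlled file
    import yaml
    with open("dags/config/tables.yaml") as f:
        tables = yaml.safe_load(f)["tables"]
    # ...then generate one task per entry in `tables` as before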
yaml parsing in python can be pretty slow too, and there are solutions /
alternatives there. Hocon is great. Also C-accelerated yaml is possible:
https://stackoverflow.com/questions/27743711/can-i-speedup-yaml
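e.g. (a sketch; CSafeLoader requires PyYAML compiled against libyaml):

    import yaml

    try:
        from yaml import CSafeLoader as Loader  # libyaml-backed C loader
    except ImportError:
        from yaml import SafeLoader as Loader   # pure-Python fallback

    with open("dags/config/tables.yaml") as f:
        config = yaml.load(f, Loader=Loader)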
Max
On Wed, Aug 14, 2019 at 9:56 PM Bacal, Eugene <[email protected]>
wrote:
> Hello Airflow team,
>
> Please advise if you can. In our environment, we have noticed that dynamic
> tasks place quite a bit of stress on the scheduler and webserver and
> increase MySQL DB connections.
> We run about 1000 dynamic tasks every 30 min, and parsing time
> increases from 5 to 65 sec, with runtime going from 2 sec to 350+ sec. This
> happens at execution time, then it drops to normal while tasks are still
> executing. The webserver hangs for a few minutes.
>
> Airflow 1.10.1.
> MySQL DB
>
> Example:
>
> Dynamic Tasks:
> Number of DAGs: 44
> Total task number: 950
> DagBag parsing time: 65.879642000000001
>
> Static Tasks:
> Number of DAGs: 73
> Total task number: 1351
> DagBag parsing time: 1.731088
>
> Is this something you are aware of? Any advice on dynamic task
> optimization/best practices?
>
> Thank you in advance,
> Eugene