Hi Eugene -

We run an Airflow deployment that executes ~6,000 hourly tasks across 32 
dynamic DAGs - happy to jump on a call some time and talk about our setup, 
which now includes a step to generate one file per dynamic DAG, so that each 
DAG gets its own scheduler loop. Hit me up r...@astronomer.io if interested. We 
want to write a guide about it, but it's been hard to find time lately for that.

-Ry

Sent via Superhuman ( https://sprh.mn/?vip=r...@rywalker.com )

On Thu, Sep 12, 2019 at 1:21 PM, Eugene Bacal < eba...@paypal.com.invalid > 
wrote:

> 
> 
> 
> Wanted to ask for an advise...
> 
> 
> 
> Because I run 1300 task every 30 min in parallel such scenario creates a
> lot of DB connections that freezes the UI. I would like to segregate
> result_backend and Airflow DB.
> 
> 
> 
> Current setup:
> broker_url = rabbitmq
> result_backend = mysqldb
> sql_alchemy_conn = mysqldb
> 
> 
> 
> New setup:
> broker_url = redis
> result_backend = redis
> sql_alchemy_conn = mysqldb
> 
> 
> 
> I have tested it and see 2x reduction connection wise. But is there
> something else I maybe missing, since Airflow does not recommend to use
> redis as result_backend
> 
> 
> 
> Please advise,
> Eugene
> 
> 
> 
> On 8/29/19, 4:35 PM, "Bacal, Eugene" < ebacal@ paypal. com. INVALID (
> eba...@paypal.com.INVALID ) > wrote:
> 
> 
> 
> Thank you, Daniel! Sorry for spamming.
> 
> 
> 
> https:/ / nam03. safelinks. protection. outlook. com/ 
> ?url=https%3A%2F%2Fstackoverflow.
> com%2Fquestions%2F57718658%2Fairflow-celery-workers-too-many-mysql-connections&amp;data=01%7C01%7Cebacal%40paypal.
> com%7C1af3b2b9d588408e315508d72cd99b17%7Cfb00791460204374977e21bac5f3f4c8%7C1&amp;sdata=YcjJBkRhpHzT1UFSwdJabCDMAdKvTDu6eujQOB4dPhk%3D&amp;reserved=0
> (
> https://nam03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fstackoverflow.com%2Fquestions%2F57718658%2Fairflow-celery-workers-too-many-mysql-connections&amp;data=01%7C01%7Cebacal%40paypal.com%7C1af3b2b9d588408e315508d72cd99b17%7Cfb00791460204374977e21bac5f3f4c8%7C1&amp;sdata=YcjJBkRhpHzT1UFSwdJabCDMAdKvTDu6eujQOB4dPhk%3D&amp;reserved=0
> )
> 
> 
> 
> On 8/29/19, 1:14 PM, "Daniel Standish" < dpstandish@ gmail. com (
> dpstand...@gmail.com ) > wrote:
> 
> 
> 
> Eugene
> 
> 
> 
> Why don't you create a stack overflow post and give us the link?
> 
> 
> 
> That is probably a better way to help you through this.
> 
> 
> 
> We will need to see what exactly you are doing in your dag files and
> potentially also hooks / operators.
> 
> 
> 
> Thanks
> 
> 
> 
> On Thu, Aug 29, 2019 at 10:41 AM Bacal, Eugene < ebacal@ paypal. com. invalid
> ( eba...@paypal.com.invalid ) > wrote:
> 
> 
>> 
>> 
>> Can someone advise if this is expecting behavior, please?
>> 
>> 
>> 
>> - DB connections are not being re-used
>> - Connections stay open while active only 5:
>> mysql> show global status like 'Thread%';
>> +-------------------------+---------+
>> | Variable_name | Value |
>> +-------------------------+---------+
>> | Threadpool_idle_threads | 0 |
>> | Threadpool_threads | 0 |
>> | Threads_cached | 775 |
>> | Threads_connected | 5323 |
>> | Threads_created | 4846609 |
>> | Threads_running | 5 |
>> +-------------------------+---------+
>> - Workers create hungreds of conections that’s remains open until DB
>> cleares them (900 sec)
>> 
>> 
>> 
>> Eugene Bacal
>> 
>> 
>> 
>> Workload Automation Engineer
>> Scheduling Platform Experience
>> Cell: 4802028764
>> Slack: @ebacal
>> DL: dl-pp-cdp-spx@ paypal. com ( dl-pp-cdp-...@paypal.com )
>> 
>> 
>> 
>> Scottsdale, Arizona
>> 
>> 
>> 
>> On 8/21/19, 2:36 PM, "Bacal, Eugene" < ebacal@ paypal. com. INVALID (
>> eba...@paypal.com.INVALID ) > wrote:
>> 
>> 
>> 
>> Hi Max,
>> 
>> 
>> 
>> We have ran few testing today from DB side and noticed that:
>> 
>> 
>> 
>> - DB connections are not being re-used
>> - Connections stay open while active only 5:
>> mysql> show global status like 'Thread%';
>> +-------------------------+---------+
>> | Variable_name | Value |
>> +-------------------------+---------+
>> | Threadpool_idle_threads | 0 |
>> | Threadpool_threads | 0 |
>> | Threads_cached | 775 |
>> | Threads_connected | 5323 |
>> | Threads_created | 4846609 |
>> | Threads_running | 5 |
>> +-------------------------+---------+
>> - Workers create hungreds of conections that’s remains open until DB
>> cleares them (900 sec)
>> 
>> 
>> 
>> Seems that similar cases were reported some time back(2016),
>> 
>> 
>> 
>> https:/ / nam03. safelinks. protection. outlook. com/ 
>> ?url=https%3A%2F%2Fissues.
>> apache. org%2Fjira%2Fbrowse%2FAIRFLOW-680&amp;data=01%7C01%7Cebacal%40paypal.
>> com%7C1af3b2b9d588408e315508d72cd99b17%7Cfb00791460204374977e21bac5f3f4c8%7C1&amp;sdata=7pjGdmuyIuI4GRHPPa7LH3%2Bkq6TVeZJjrMFQf0HHuZg%3D&amp;reserved=0
>> (
>> https://nam03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FAIRFLOW-680&amp;data=01%7C01%7Cebacal%40paypal.com%7C1af3b2b9d588408e315508d72cd99b17%7Cfb00791460204374977e21bac5f3f4c8%7C1&amp;sdata=7pjGdmuyIuI4GRHPPa7LH3%2Bkq6TVeZJjrMFQf0HHuZg%3D&amp;reserved=0
>> )
>> 
>> 
>> 
>> On 8/20/19, 5:19 PM, "Bacal, Eugene" < ebacal@ paypal. com. INVALID (
>> eba...@paypal.com.INVALID ) > wrote:
>> 
>> 
>> 
>> Celery executor.
>> 12 BareMetals: CPU(s):40, MHz 2494.015, RAM 378G - each box
>> 
>> 
>> 
>> Worker .cfg:
>> [core]
>> sql_alchemy_pool_size = 5
>> sql_alchemy_pool_recycle = 900
>> sql_alchemy_reconnect_timeout = 300
>> parallelism = 1200
>> dag_concurrency = 800
>> non_pooled_task_slot_count = 1200
>> max_active_runs_per_dag = 10
>> dagbag_import_timeout = 30
>> [celery]
>> worker_concurrency = 100
>> 
>> 
>> 
>> Scheduler .cfg:
>> [core]
>> sql_alchemy_pool_size = 30
>> sql_alchemy_pool_recycle = 300
>> sql_alchemy_reconnect_timeout = 300
>> parallelism = 1200
>> dag_concurrency = 800
>> non_pooled_task_slot_count = 1200
>> max_active_runs_per_dag = 10
>> [scheduler]
>> job_heartbeat_sec = 5
>> scheduler_heartbeat_sec = 5
>> run_duration = 1800
>> min_file_process_interval = 10
>> min_file_parsing_loop_time = 1
>> dag_dir_list_interval = 300
>> print_stats_interval = 30
>> scheduler_zombie_task_threshold = 300
>> max_tis_per_query = 1024
>> max_threads = 29
>> 
>> 
>> 
>> From workers I see 350 + connections at the start time, then it drops to
>> 200 and then to 1-10 once tasks complete
>> From Scheduler very low 1-10:
>> MySQL connections:
>> 331 worker1
>> 215 worker2
>> 349 worker53
>> 335 worker54
>> 347 worker55
>> 336 worker56
>> 336 worker57
>> 354 worker58
>> 339 worker59
>> 328 worker60
>> 333 worker61
>> 337 worker62
>> 2 scheduler
>> 
>> 
>> 
>> - Eugene
>> 
>> 
>> 
>> On 8/20/19, 8:51 AM, "Maxime Beauchemin" < maximebeauchemin@ gmail. com (
>> maximebeauche...@gmail.com ) > wrote:
>> 
>> 
>> 
>> Delay between tasks could be due to not having enough worker slots. What
>> type of executor are you using, how is it configured?
>> 
>> 
>> 
>> Max
>> 
>> 
>> 
>> On Tue, Aug 20, 2019 at 7:50 AM Bacal, Eugene
>> < ebacal@ paypal. com. invalid ( eba...@paypal.com.invalid ) >
>> wrote:
>> 
>> 
>>> 
>>> 
>>> Absolutely possible, Daniel,
>>> 
>>> 
>>> 
>>> We are looking in all directions. Has anyone noticed
>>> 
>>> 
>> 
>> 
>> 
>> performance
>> 
>> 
>>> 
>>> 
>>> improvements with PostgreSQL vs MySQL ?
>>> 
>>> 
>>> 
>>> -Eugene
>>> 
>>> 
>>> 
>>> On 8/15/19, 2:03 PM, "Daniel Standish" < dpstandish@ gmail. com (
>>> dpstand...@gmail.com ) >
>>> 
>>> 
>> 
>> 
>> 
>> wrote:
>> 
>> 
>>> 
>>> 
>>> It's not just webserver and scheduler that will parse
>>> 
>>> 
>> 
>> 
>> 
>> your dag file.
>> 
>> 
>>> 
>>> 
>>> During the execution of a dag run, dag file will be
>>> 
>>> 
>> 
>> 
>> 
>> re-parsed at the
>> 
>> 
>>> 
>>> 
>>> start
>>> of every task instance. If you have 1000 tasks running
>>> 
>>> 
>> 
>> 
>> 
>> in short
>> 
>> 
>>> 
>>> 
>>> period of
>>> time, that's 1000 queries. It's possible these queries
>>> 
>>> 
>> 
>> 
>> 
>> are piling up
>> 
>> 
>>> 
>>> 
>>> in a
>>> queue on your database. Dag read time has to be very
>>> 
>>> 
>> 
>> 
>> 
>> fast for this
>> 
>> 
>>> 
>>> 
>>> reason.
>>> 
>>> 
>>> 
>>> On Thu, Aug 15, 2019 at 1:45 PM Bacal, Eugene
>>> < ebacal@ paypal. com. invalid ( eba...@paypal.com.invalid ) >
>>> wrote:
>>> 
>>> 
>>>> 
>>>> 
>>>> Thank you for your reply, Max
>>>> 
>>>> 
>>>> 
>>>> Dynamic DAGs query the database for tables and
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> generates DAGs and
>> 
>> 
>>> 
>>> 
>>> tasks
>>> 
>>> 
>>>> 
>>>> 
>>>> based on the output.
>>>> For Python does not take much to execute:
>>>> 
>>>> 
>>>> 
>>>> Dynamic - 500 tasks:
>>>> time python PPAD_OIS_MASTER_IDI. py ( http://ppad_ois_master_idi.py/ )
>>>> [2019-08-15 12:57:48,522] { settings. py:174 ( http://settings.py:174/ ) }
>>>> INFO - setting.configure_orm(): Using pool settings.
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> pool_size=30,
>> 
>> 
>>> 
>>> 
>>> pool_recycle=300
>>> 
>>> 
>>>> 
>>>> 
>>>> real 0m1.830s
>>>> user 0m1.622s
>>>> sys 0m0.188s
>>>> 
>>>> 
>>>> 
>>>> Static - 100 tasks:
>>>> time python PPAD_OPS_CANARY_CONNECTIONS_TEST_8. py (
>>>> http://ppad_ops_canary_connections_test_8.py/ )
>>>> [2019-08-15 12:59:24,959] { settings. py:174 ( http://settings.py:174/ ) }
>>>> INFO - setting.configure_orm(): Using pool settings.
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> pool_size=30,
>> 
>> 
>>> 
>>> 
>>> pool_recycle=300
>>> 
>>> 
>>>> 
>>>> 
>>>> real 0m1.009s
>>>> user 0m0.898s
>>>> sys 0m0.108s
>>>> 
>>>> 
>>>> 
>>>> We have 44 DAGs with 1003 Dynamic tasks. Parsing in
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> quite time:
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> DagBag parsing time: 3.9385959999999995
>>>> 
>>>> 
>>>> 
>>>> Parsing in time of execution, when scheduler submits
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> the DAGs:
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> DagBag parsing time: 99.820316
>>>> 
>>>> 
>>>> 
>>>> Delay between the task run inside a single DAG grow
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> from 30 sec to
>> 
>> 
>>> 
>>> 
>>> 10 min,
>>> 
>>> 
>>>> 
>>>> 
>>>> then it drops back even thou tasks are runnign.
>>>> 
>>>> 
>>>> 
>>>> Eugene
>>>> 
>>>> 
>>>> 
>>>> On 8/15/19, 11:52 AM, "Maxime Beauchemin" <
>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> maximebeauchemin@ gmail. com ( maximebeauche...@gmail.com ) >
>>> 
>>> 
>>>> 
>>>> 
>>>> wrote:
>>>> 
>>>> 
>>>> 
>>>> What is your dynamic DAG doing? How long does it
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> take to execute
>> 
>> 
>>> 
>>> 
>>> it
>>> 
>>> 
>>>> 
>>>> 
>>>> just as
>>>> a python script (`time python mydag. py ( http://mydag.py/ ) `)?
>>>> 
>>>> 
>>>> 
>>>> As an Airflow admin, people may want to lower the
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> DAG parsing
>> 
>> 
>>> 
>>> 
>>> timeout
>>> 
>>> 
>>>> 
>>>> 
>>>> configuration key to force people to not do crazy
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> thing in DAG
>> 
>> 
>>> 
>>> 
>>> module
>>> 
>>> 
>>>> 
>>>> 
>>>> scope. At some point at Airbnb we had someone
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> running a Hive
>> 
>> 
>>> 
>>> 
>>> query in
>>> 
>>> 
>>>> 
>>>> 
>>>> DAG
>>>> scope, clearly that needs to be prevented.
>>>> 
>>>> 
>>>> 
>>>> Loading DAGs by calling a database can bring all
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> sorts of
>> 
>> 
>>> 
>>> 
>>> surprises
>>> 
>>> 
>>>> 
>>>> 
>>>> that
>>>> can drive everyone crazy. As mentioned in a recent
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> post,
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> repo-contained,
>>>> deterministic "less dynamic" DAGs are great,
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> because they are
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> self-contained and allow you to use source-control
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> properly
>> 
>> 
>>> 
>>> 
>>> (revert a
>>> 
>>> 
>>>> 
>>>> 
>>>> bad
>>>> change for instance). That may mean having a
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> process or script
>> 
>> 
>>> 
>>> 
>>> that
>>> 
>>> 
>>>> 
>>>> 
>>>> compiles external things that are dynamic into
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> things like yaml
>> 
>> 
>>> 
>>> 
>>> files
>>> 
>>> 
>>>> 
>>>> 
>>>> checked into the code repo. Things as simple as
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> parsing duration
>> 
>> 
>>> 
>>> 
>>> become
>>> 
>>> 
>>>> 
>>>> 
>>>> more predictable (network latency and database
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> load are not part
>> 
>> 
>>> 
>>> 
>>> of
>>> 
>>> 
>>>> 
>>>> 
>>>> that
>>>> equation), but more importantly, all changes
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> become tracked in
>> 
>> 
>>> 
>>> 
>>> the code
>>> 
>>> 
>>>> 
>>>> 
>>>> repo.
>>>> 
>>>> 
>>>> 
>>>> yaml parsing in python can be pretty slow too, and
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> there are
>> 
>> 
>>> 
>>> 
>>> solutions
>>> 
>>> 
>>>> 
>>>> 
>>>> /
>>>> alternatives there. Hocon is great. Also
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> C-accelerated yaml is
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> possible:
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> https:/ / nam03. safelinks. protection. outlook. com/ 
>> ?url=https%3A%2F%2Fstackoverflow.
>> com%2Fquestions%2F27743711%2Fcan-i-speedup-yaml&amp;data=01%7C01%7Cebacal%40paypal.
>> com%7C1af3b2b9d588408e315508d72cd99b17%7Cfb00791460204374977e21bac5f3f4c8%7C1&amp;sdata=4T1xq7QBztFw24GvJEL0kR8%2BNFmDEtnSfDLtgWL9f%2BU%3D&amp;reserved=0
>> (
>> https://nam03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fstackoverflow.com%2Fquestions%2F27743711%2Fcan-i-speedup-yaml&amp;data=01%7C01%7Cebacal%40paypal.com%7C1af3b2b9d588408e315508d72cd99b17%7Cfb00791460204374977e21bac5f3f4c8%7C1&amp;sdata=4T1xq7QBztFw24GvJEL0kR8%2BNFmDEtnSfDLtgWL9f%2BU%3D&amp;reserved=0
>> )
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> Max
>>>> 
>>>> 
>>>> 
>>>> On Wed, Aug 14, 2019 at 9:56 PM Bacal, Eugene
>>>> < ebacal@ paypal. com. invalid ( eba...@paypal.com.invalid ) >
>>>> wrote:
>>>> 
>>>> 
>>>>> 
>>>>> 
>>>>> Hello Airflow team,
>>>>> 
>>>>> 
>>>>> 
>>>>> Please advise if you can. In our environment, we
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> have noticed
>> 
>> 
>>> 
>>> 
>>> that
>>> 
>>> 
>>>> 
>>>> 
>>>> dynamic
>>>> 
>>>> 
>>>>> 
>>>>> 
>>>>> tasks place quite of stress on scheduler,
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> webserver and
>> 
>> 
>>> 
>>> 
>>> increase
>>> 
>>> 
>>>> 
>>>> 
>>>> MySQL DB
>>>> 
>>>> 
>>>>> 
>>>>> 
>>>>> connections.
>>>>> We are run about 1000 Dynamic Tasks every 30 min
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> and parsing
>> 
>> 
>>> 
>>> 
>>> time
>>> 
>>> 
>>>> 
>>>>> 
>>>>> 
>>>>> increases from 5 to 65 sec with Runtime from
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> 2sec to 350+ .
>> 
>> 
>>> 
>>> 
>>> This
>>> 
>>> 
>>>> 
>>>> 
>>>> happens at
>>>> 
>>>> 
>>>>> 
>>>>> 
>>>>> execution time then it drops to normal while
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> still executing
>> 
>> 
>>> 
>>> 
>>> tasks.
>>> 
>>> 
>>>> 
>>>>> 
>>>>> 
>>>>> Webserver hangs for few minutes.
>>>>> 
>>>>> 
>>>>> 
>>>>> Airflow 1.10.1.
>>>>> MySQL DB
>>>>> 
>>>>> 
>>>>> 
>>>>> Example:
>>>>> 
>>>>> 
>>>>> 
>>>>> Dynamic Tasks:
>>>>> Number of DAGs: 44
>>>>> Total task number: 950
>>>>> DagBag parsing time: 65.879642000000001
>>>>> 
>>>>> 
>>>>> 
>>>>> Static Tasks:
>>>>> Number of DAGs: 73
>>>>> Total task number: 1351
>>>>> DagBag parsing time: 1.731088
>>>>> 
>>>>> 
>>>>> 
>>>>> Is this something you aware of? Any advises on
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> Dynamic tasks
>> 
>> 
>>> 
>>>> 
>>>>> 
>>>>> 
>>>>> optimization/best practices?
>>>>> 
>>>>> 
>>>>> 
>>>>> Thank you in advance,
>>>>> Eugene
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
> 
> 
>

Reply via email to