Can someone advise if this is expected behavior, please?

- DB connections are not being re-used
        - Connections stay open while only 5 are active:
                mysql>  show global status like 'Thread%';
                +-------------------------+---------+
                | Variable_name           | Value   |
                +-------------------------+---------+
                | Threadpool_idle_threads | 0       |
                | Threadpool_threads      | 0       |
                | Threads_cached          | 775     |
                | Threads_connected       | 5323    |
                | Threads_created         | 4846609 |
                | Threads_running         | 5       |
                +-------------------------+---------+ 
        - Workers create hundreds of connections that remain open until the DB 
clears them (900 sec); a sketch of why this happens follows below
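
A minimal sketch of why that happens, using plain SQLAlchemy outside of Airflow (the connection URL is a placeholder and the pool numbers mirror our settings): every process builds its own engine and pool, so connections are never shared across worker/task processes and sit open until the server drops them.

        from sqlalchemy import create_engine, text

        # Each forked worker/task process builds its own engine and pool;
        # pools are never shared across processes, so every process can hold
        # up to pool_size connections of its own.
        engine = create_engine(
            "mysql://user:pass@dbhost/airflow",  # placeholder URL
            pool_size=5,                         # mirrors sql_alchemy_pool_size
            pool_recycle=900,                    # mirrors sql_alchemy_pool_recycle
        )

        with engine.connect() as conn:
            conn.execute(text("SELECT 1"))
        # Leaving the 'with' block only returns the connection to the pool;
        # it stays open on the MySQL side until the server's idle timeout
        # (900 sec in our case) kills it, unless engine.dispose() is called
        # before the process exits.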



Eugene Bacal
 
Workload Automation Engineer
Scheduling Platform Experience
Cell: 4802028764
Slack: @ebacal
DL: dl-pp-cdp-...@paypal.com
 

Scottsdale, Arizona
 

On 8/21/19, 2:36 PM, "Bacal, Eugene" <eba...@paypal.com.INVALID> wrote:

    Hi Max, 
    
    We ran a few tests today from the DB side and noticed that:
    
        - DB connections are not being re-used
        - Connections stay open while only 5 are active:
                mysql>  show global status like 'Thread%';
                +-------------------------+---------+
                | Variable_name           | Value   |
                +-------------------------+---------+
                | Threadpool_idle_threads | 0       |
                | Threadpool_threads      | 0       |
                | Threads_cached          | 775     |
                | Threads_connected       | 5323    |
                | Threads_created         | 4846609 |
                | Threads_running         | 5       |
                +-------------------------+---------+ 
        - Workers create hundreds of connections that remain open until the DB 
clears them (900 sec)
    
    It seems similar cases were reported some time back (2016):
    
    https://issues.apache.org/jira/browse/AIRFLOW-680
    
    
    On 8/20/19, 5:19 PM, "Bacal, Eugene" <eba...@paypal.com.INVALID> wrote:
    
        Celery executor.
        12 bare-metal boxes, each with 40 CPUs @ 2494 MHz and 378 GB RAM
        
        Worker .cfg:
        [core]
        sql_alchemy_pool_size = 5
        sql_alchemy_pool_recycle = 900
        sql_alchemy_reconnect_timeout = 300
        parallelism = 1200
        dag_concurrency = 800
        non_pooled_task_slot_count = 1200
        max_active_runs_per_dag = 10
        dagbag_import_timeout = 30
        [celery]
        worker_concurrency = 100
        
        Scheduler .cfg:
        [core]
        sql_alchemy_pool_size = 30
        sql_alchemy_pool_recycle = 300
        sql_alchemy_reconnect_timeout = 300
        parallelism = 1200
        dag_concurrency = 800
        non_pooled_task_slot_count = 1200
        max_active_runs_per_dag = 10
        [scheduler]
        job_heartbeat_sec = 5
        scheduler_heartbeat_sec = 5
        run_duration = 1800
        min_file_process_interval = 10
        min_file_parsing_loop_time = 1
        dag_dir_list_interval = 300
        print_stats_interval = 30
        scheduler_zombie_task_threshold = 300
        max_tis_per_query = 1024
        max_threads = 29
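
        As a rough back-of-the-envelope check (my own sketch, ignoring the task subprocesses and the scheduler), the worker settings above already allow:

            worker_concurrency = 100      # celery worker processes per box
            sql_alchemy_pool_size = 5     # pooled connections per process
            boxes = 12

            per_box = worker_concurrency * sql_alchemy_pool_size   # 500 per box
            fleet = per_box * boxes                                 # 6000 fleet-wide
            print(per_box, fleet)

        which is the same order of magnitude as the ~5300 Threads_connected reported above, before counting the extra connections each task subprocess opens while parsing its DAG.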
        
        
        From the workers I see 350+ connections at start time, then it drops to 200
        and then to 1-10 once tasks complete.
        From the scheduler, very low: 1-10.
        MySQL connections by host (a query sketch for pulling these counts follows the list):
            331 worker1
            215 worker2
            349 worker53
            335 worker54
            347 worker55
            336 worker56
            336 worker57
            354 worker58
            339 worker59
            328 worker60
            333 worker61
            337 worker62
             2 scheduler
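
        For reference, a minimal sketch of how such per-host counts can be pulled (the connection URL is a placeholder):

            from sqlalchemy import create_engine, text

            # Count open MySQL connections grouped by client host.
            engine = create_engine("mysql://user:pass@dbhost/airflow")  # placeholder
            sql = text(
                "SELECT SUBSTRING_INDEX(host, ':', 1) AS client, COUNT(*) AS conns "
                "FROM information_schema.processlist "
                "GROUP BY client ORDER BY conns DESC"
            )
            with engine.connect() as conn:
                for client, conns in conn.execute(sql):
                    print(conns, client)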
        
        
        
        - Eugene
         
        
        On 8/20/19, 8:51 AM, "Maxime Beauchemin" <maximebeauche...@gmail.com> wrote:
        
            Delay between tasks could be due to not having enough worker slots. What
            type of executor are you using, and how is it configured?
            
            Max
            
            On Tue, Aug 20, 2019 at 7:50 AM Bacal, Eugene <eba...@paypal.com.invalid> wrote:
            
            > Absolutely possible, Daniel,
            >
            > We are looking in all directions. Has anyone noticed performance
            > improvements with PostgreSQL vs MySQL?
            >
            > -Eugene
            >
            >
            > On 8/15/19, 2:03 PM, "Daniel Standish" <dpstand...@gmail.com> wrote:
            >
            >     It's not just the webserver and scheduler that will parse your dag file.
            >     During the execution of a dag run, the dag file will be re-parsed at the
            >     start of every task instance.  If you have 1000 tasks running in a short
            >     period of time, that's 1000 queries.  It's possible these queries are
            >     piling up in a queue on your database.  Dag read time has to be very
            >     fast for this reason.
            >
            >
            >
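            >     To get a feel for how heavy that per-task re-parse is, a minimal timing
            >     sketch with DagBag (the file path is a placeholder):
            >
            >         import time
            >         from airflow.models import DagBag
            >
            >         start = time.time()
            >         # Parse a single DAG file -- the same work every task instance
            >         # repeats when it starts.
            >         bag = DagBag(dag_folder="/path/to/your_dag.py")  # placeholder path
            >         print("DAGs found:", len(bag.dags))
            >         print("parse time: %.2f s" % (time.time() - start))
            >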
            >     On Thu, Aug 15, 2019 at 1:45 PM Bacal, Eugene <eba...@paypal.com.invalid> wrote:
            >
            >     >
            >     > Thank you for your reply, Max
            >     >
            >     > Dynamic DAGs query the database for tables and generate DAGs and tasks
            >     > based on the output.
            >     > As a plain Python script it does not take much to execute:
            >     >
            >     > Dynamic - 500 tasks:
            >     > time python PPAD_OIS_MASTER_IDI.py
            >     > [2019-08-15 12:57:48,522] {settings.py:174} INFO - setting.configure_orm(): Using pool settings. pool_size=30, pool_recycle=300
            >     > real    0m1.830s
            >     > user    0m1.622s
            >     > sys     0m0.188s
            >     >
            >     >
            >     > Static - 100 tasks:
            >     > time python PPAD_OPS_CANARY_CONNECTIONS_TEST_8.py
            >     > [2019-08-15 12:59:24,959] {settings.py:174} INFO - setting.configure_orm(): Using pool settings. pool_size=30, pool_recycle=300
            >     > real    0m1.009s
            >     > user    0m0.898s
            >     > sys     0m0.108s
            >     >
            >     >
            >     > We have 44 DAGs with 1003 dynamic tasks. Parsing during quiet time:
            >     > DagBag parsing time: 3.9385959999999995
            >     >
            >     > Parsing at execution time, when the scheduler submits the DAGs:
            >     > DagBag parsing time: 99.820316
            >     >
            >     > Delay between task runs inside a single DAG grows from 30 sec to
            >     > 10 min, then it drops back even though tasks are running.
            >     >
            >     > Eugene
            >     >
            >     >
            >     >
            >     >
            >     >
            >     > On 8/15/19, 11:52 AM, "Maxime Beauchemin" <maximebeauche...@gmail.com> wrote:
            >     >
            >     >     What is your dynamic DAG doing? How long does it take to execute it
            >     >     just as a python script (`time python mydag.py`)?
            >     >
            >     >     As an Airflow admin, you may want to lower the DAG parsing timeout
            >     >     configuration key to force people not to do crazy things in DAG module
            >     >     scope. At some point at Airbnb we had someone running a Hive query in
            >     >     DAG scope; clearly that needs to be prevented.
            >     >
            >     >     Loading DAGs by calling a database can bring all sorts of surprises
            >     >     that can drive everyone crazy. As mentioned in a recent post,
            >     >     repo-contained, deterministic "less dynamic" DAGs are great, because
            >     >     they are self-contained and allow you to use source control properly
            >     >     (revert a bad change, for instance). That may mean having a process or
            >     >     script that compiles external things that are dynamic into things like
            >     >     yaml files checked into the code repo. Things as simple as parsing
            >     >     duration become more predictable (network latency and database load
            >     >     are not part of that equation), but more importantly, all changes
            >     >     become tracked in the code repo.
            >     >
            >     >     yaml parsing in python can be pretty slow too, and there are
            >     >     solutions / alternatives there. Hocon is great. Also C-accelerated
            >     >     yaml is possible:
            >     >
            >     >     https://stackoverflow.com/questions/27743711/can-i-speedup-yaml
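            >     >
            >     >     One possible shape of that, sketched with a hypothetical tables.yaml
            >     >     checked into the repo and PyYAML's C loader when libyaml is available:
            >     >
            >     >         from datetime import datetime
            >     >
            >     >         import yaml
            >     >         from airflow import DAG
            >     >         from airflow.operators.dummy_operator import DummyOperator
            >     >
            >     >         # tables.yaml is produced by a separate compile step and checked
            >     >         # in, so parsing here needs no network or database call.
            >     >         Loader = getattr(yaml, "CSafeLoader", yaml.SafeLoader)
            >     >         with open("tables.yaml") as f:            # hypothetical file
            >     >             tables = yaml.load(f, Loader=Loader)  # e.g. a list of table names
            >     >
            >     >         dag = DAG("yaml_driven_example", start_date=datetime(2019, 8, 1),
            >     >                   schedule_interval="*/30 * * * *")
            >     >
            >     >         for table in tables:
            >     >             DummyOperator(task_id="process_%s" % table, dag=dag)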
            >     >
            >     >     Max
            >     >
            >     >     On Wed, Aug 14, 2019 at 9:56 PM Bacal, Eugene <eba...@paypal.com.invalid> wrote:
            >     >
            >     >     > Hello Airflow team,
            >     >     >
            >     >     > Please advise if you can. In our environment, we have noticed that
            >     >     > dynamic tasks place quite a lot of stress on the scheduler and
            >     >     > webserver and increase MySQL DB connections.
            >     >     > We run about 1000 dynamic tasks every 30 min, and parsing time
            >     >     > increases from 5 to 65 sec, with runtime from 2 sec to 350+ sec.
            >     >     > This happens at execution time, then it drops back to normal while
            >     >     > tasks are still executing. The webserver hangs for a few minutes.
            >     >     >
            >     >     > Airflow 1.10.1.
            >     >     > MySQL DB
            >     >     >
            >     >     > Example:
            >     >     >
            >     >     > Dynamic Tasks:
            >     >     > Number of DAGs: 44
            >     >     > Total task number: 950
            >     >     > DagBag parsing time: 65.879642000000001
            >     >     >
            >     >     > Static Tasks:
            >     >     > Number of DAGs: 73
            >     >     > Total task number: 1351
            >     >     > DagBag parsing time: 1.731088
            >     >     >
            >     >     > Is this something you are aware of? Any advice on dynamic task
            >     >     > optimization / best practices?
            >     >     >
            >     >     > Thank you in advance,
            >     >     > Eugene
            >     >     >
            >     >     >
            >     >     >
            >     >
            >     >
            >     >
            >
            >
            >
            
        
        
    
    
