If you are interested in the journey that led to that PR, I've just published a 
blog post about it: 
https://www.astronomer.io/blog/profiling-the-airflow-scheduler/

Improving Airflow's scheduler is one of our top priorities at Astronomer, and I 
think this should help anyone with short-running tasks.

-ash


> On 5 Dec 2019, at 21:33, Aaron Grubb <[email protected]> wrote:
> 
> That’s great! Thanks for your reply!
>  
> From: Kamil Breguła <[email protected]> 
> Sent: Thursday, December 5, 2019 4:19 PM
> To: [email protected]
> Subject: Re: Celery Task Startup Overhead
>  
> Hello,
>  
> This is caused by strict process isolation. Each task is started in a new 
> process, where the Python interpreter is loaded completely anew. 
> This change can help solve some of your problems.
> https://github.com/apache/airflow/pull/6627
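>  
> As a rough way to see this cost for yourself (just a sketch, not a benchmark - 
> the numbers depend heavily on your environment), you can time how long a fresh 
> interpreter takes to import airflow, since every isolated task process pays 
> that price:
>  
> import subprocess
> import time
>  
> # Each task process starts a new Python interpreter and re-imports airflow,
> # so this is roughly the fixed startup cost paid per task.
> start = time.monotonic()
> subprocess.run(["python", "-c", "import airflow"], check=True)
> print("fresh interpreter + airflow import: {:.2f}s".format(time.monotonic() - start))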
>  
> Best regards,
> Kamil
>  
> On Thu, Dec 5, 2019 at 9:41 PM Aaron Grubb <[email protected]> wrote:
> Hi everyone,
>  
> I’ve been testing celery workers with both prefork and eventlet pools and I'm 
> noticing massive startup overhead for simple BashOperators. For example, 20x 
> instances of:
>  
> BashOperator(
>     task_id='test0',
>     bash_command="echo 'test'",
>     dag=dag)
>  
> executed concurrently spikes my worker machine's memory from ~150 MB to ~3 GB 
> (eventlet) or ~3.5 GB (prefork) and takes ~50 seconds. Is this an expected 
> artifact of the 20 separate Python executions, or is there some way to reduce 
> it?
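>  
> (For reference, the full test DAG is just that operator repeated 20 times - a 
> minimal sketch, where the dag_id, start_date and schedule are placeholders:)
>  
> from datetime import datetime
> from airflow import DAG
> from airflow.operators.bash_operator import BashOperator
>  
> # 20 identical echo tasks, intended to run concurrently on the Celery worker.
> dag = DAG(
>     dag_id="celery_overhead_test",
>     start_date=datetime(2019, 12, 1),
>     schedule_interval=None,
> )
>  
> for i in range(20):
>     BashOperator(
>         task_id="test{}".format(i),
>         bash_command="echo 'test'",
>         dag=dag,
>     )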
>  
> Thanks,
> Aaron
