Do you use the Pool mechanism? Pools let you limit the number of active
tasks that use a given resource. In this case, Airflow tasks themselves
seem to be your scarce resource.
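
A minimal sketch of what that looks like (the DAG, task, and pool names are
made up; it assumes a pool called "curl_pool" with, say, 4 slots has been
created under Admin -> Pools):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    # Hypothetical hourly DAG; the pool throttles the curl tasks so that
    # no more than 4 of them run at once, across all DAG runs and backfills.
    dag = DAG('hourly_ingest',
              start_date=datetime(2016, 5, 1),
              schedule_interval='@hourly')

    fetch = BashOperator(
        task_id='fetch_customer_data',
        bash_command='curl -sf http://example.com/api/ingest',  # placeholder URL
        pool='curl_pool',
        dag=dag)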

We do not have the db connection problem because we use separate web
services to do most of our database work, rather than the built-in
operators. This also lets us deploy the web services on other machines, so
we have just one Airflow machine and a separate server that runs lots of
Java apps (the Embulk program).
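
Roughly, the split looks like this (endpoint and task name are made up, and
it reuses the BashOperator import and the dag object from the sketch above).
The Airflow side stays a thin curl trigger while the service on the other
machine owns the database connections:

    # Hypothetical: Airflow only fires the request; the web service on the
    # other machine holds the database connections and does the real work.
    trigger_load = BashOperator(
        task_id='trigger_embulk_load',
        bash_command='curl -sf -X POST http://etl-host:8080/jobs/daily_load',
        dag=dag)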

On Mon, May 16, 2016 at 9:38 AM, harish singh <harish.sing...@gmail.com>
wrote:

>  We have now restricted parallelism to 4. The db is external to the
> container.
>
> I still notice that even if I do no backfill and just start the container:
> 1. For just 4 customers (4 DAGs in parallel), utilization goes above
> 4 GB (the container dies with OOM).
> 2. Sometimes, I notice that when the container starts, the webserver gets
> killed. By the time I try to find out why, the container has died again :(
> so I couldn't get to a point where I could find out why. My guess is the
> webserver is not getting enough memory to start (or it stops after it
> starts because the scheduler starts in parallel and uses up memory?).
> I have given Airflow 8 GB and it's doing well. But in my opinion, 8 GB
> seems like too much just for scheduling curls? What do you think?
>
> Specifically, how is the scheduler handling memory management?
>
> Why does airflow run a copy of itself for every task?
> Each task having its own pool of db connections seems very expensive. How
> can we avoid this?
>
> Thanks.
>
>
>
> On Mon, May 16, 2016 at 2:38 AM, Bolke de Bruin <bdbr...@gmail.com> wrote:
>
> >
> >
> > > On 15 May 2016, at 22:50, harish singh <harish.sing...@gmail.com>
> > > wrote:
> > >
> > > Our DAG (hourly) has 10 tasks (all of them BashOperators issuing
> > > curl commands).
> > > We run Airflow on Docker.
> > >
> > > When we do a backfill for, say, the last 10 days, we see that Airflow
> > > consistently hits the memory limit (4 GB) and the container dies (OOM
> > > killed).
> > >
> > > We increased the memory to 8 GB. I still see memory utilization at
> > > around 90%.
> > >
> > > When I do ps -ef, I see a lot of backfill processes, all of them
> > > running the same command. I used the pid to learn more about each
> > > process (environment variables etc.).
> > > All these processes are exactly the same. Why so many processes?
> > >
> > >
> > > Also, my real worry is how much memory is enough. How is the memory
> > > management done (object pools etc.)?
> >
> > Airflow runs as many tasks as defined by parallelism in the config,
> > which defaults to 32. If you are backfilling a couple of days it will
> > easily reach this limit.
> >
> > For every task, Airflow runs a copy of itself, and each copy uses its
> > own pool of database connections, which can be quite significant.
> >
> > So if you are running your db and Airflow in the same container you can
> > indeed quite quickly reach 8 GB+, depending also on your db caching and
> > parallelism settings.
> >
> > Bolke
>
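
For anyone following along: the limits Bolke describes live in airflow.cfg.
A rough sketch of where to cap them (the values below are only illustrative,
not recommendations):

    [core]
    # Global cap on concurrently running task instances (defaults to 32).
    parallelism = 8
    # Cap on concurrently running task instances per DAG.
    dag_concurrency = 4
    # Size of the SQLAlchemy connection pool each Airflow process opens.
    sql_alchemy_pool_size = 5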



-- 
Lance Norskog
lance.nors...@gmail.com
Redwood City, CA
