I am not sure that pushing the responsibility for generating DAGs to clients outright is the right approach (I think it might raise the barrier to entry for newcomers), but I like the design that you are proposing. If we were to completely decouple DAG parsing from the scheduler, then DAG processing could be scaled horizontally.
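To make the client-side idea below concrete, here is a minimal sketch of what such a publish API could look like. Everything here is hypothetical: the publish endpoint, payload shape, and helper name are my invention, not an existing Airflow API.

    import json

    import requests  # any HTTP client would do; assumed for the sketch

    # Hypothetical publish endpoint; no such API exists in Airflow today.
    AIRFLOW_API = "https://airflow.example.com/api/v1/dags"

    def publish_dag(dag, auth_token):
        """Serialize a DAG to JSON on the client and push it to Airflow."""
        payload = {
            "dag_id": dag.dag_id,
            "schedule_interval": str(dag.schedule_interval),
            "tasks": [
                {
                    "task_id": task.task_id,
                    "operator": type(task).__name__,
                    "upstream": [t.task_id for t in task.upstream_list],
                }
                for task in dag.tasks
            ],
        }
        # Auth happens at publish time (Dan's point 3B below): the
        # scheduler only ever consumes JSON, never executes client code.
        response = requests.post(
            f"{AIRFLOW_API}/{dag.dag_id}",
            data=json.dumps(payload),
            headers={"Authorization": f"Bearer {auth_token}"},
        )
        response.raise_for_status()

With something like this, heavy parse-time dependencies (e.g. the mysql client library from Dan's point 4) live only on the publishing side, and every scheduler can run the exact same dependency set.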
On Wed, Jul 31, 2019 at 8:13 AM Dan Davydov <ddavy...@twitter.com.invalid> wrote:

> An idea for serialization of dynamic DAGs is moving the serialization to the actual clients.
> This would require having a Python Airflow API that the clients could call, like dag.publish(). This enables a couple of things:
> 1) Clients can serialize as often as they like, and can even serialize in an event-driven approach (e.g. when one of the underlying data sources changes) as opposed to polling.
> 2) Very lightweight processing for the scheduler (it only needs to parse JSON rather than DAG Python files), especially for DAGs that are harder to parse. Eventually we can even make the scheduler event-based once this is done (DAG "publishes" retrigger DAGs to be parsed, rather than polling on an interval).
> 3) Enables security/multi-tenancy: A) the scheduler no longer has to run arbitrary Python code from DAG definitions, which could potentially do something nefarious, and wouldn't even need to "sudo su"; B) there would be first-class support for auth at the publish stage.
> 4) Clients can decide what kinds of dependencies they want to use in order to parse their DAGs; the scheduler no longer needs to include e.g. the mysql Python library, and all Airflow schedulers can have the exact same set of dependencies.
>
> On Wed, Jul 31, 2019 at 9:19 AM Driesprong, Fokko <fo...@driesprong.frl> wrote:
>
> > Hi Jon,
> >
> > I would argue that that would be the wrong approach. The processing of the DAGs is being moved to one single place, and there they will be parsed in parallel. This is already being done in the scheduler (look for max_threads in the code). As soon as we have the DAG serialized properly, the work on the webserver side should be minimal, i.e. deserializing and showing the DAG. The dynamic generation will be done on the scheduler.
> >
> > The issue with dynamic DAGs is similar. For example, my consultancy company implements a LOT of Airflow DAGs at customers, and dynamic DAGs are still challenging. For example, we had one Python file which would generate around 350 DAGs with 12 tasks each. Each time one of the executors would run one of the tasks, it would inflate the whole DAG and run the specific task. This introduced a lot of overhead. @Bas Harenslak <basharens...@gmail.com> solved this by generating the 350 DAGs from a Jinja template, which works quite well.
> >
> > Keeping AIP-24 small and simple sounds like a very good idea; we've seen scope creep before on similar AIPs :-)
> >
> > Cheers, Fokko Driesprong
> >
> > On Tue, 30 Jul 2019 at 15:46, Ash Berlin-Taylor <a...@apache.org> wrote:
> >
> > > Hi Jon,
> > >
> > > As part of this AIP (24) we aren't going to touch the scheduler any more than absolutely required, but yes, better support for dynamic DAGs is _very much_ on Kaxil's and my hit list.
> > >
> > > Our rough approach right now is to design the serialisation format well enough (including versioning it, so we can change it over time) that we can change the scheduler to not be as coupled to the DAG parsing loop. But for the sake of small, reviewable PRs we'll do it bit-by-bit.
> > >
> > > -ash
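On Ash's point above about versioning the serialisation format: a minimal sketch of what a versioned JSON envelope could look like. The field names are my own, not the actual AIP-24 format.

    import json

    SERIALIZATION_VERSION = 1  # bumped whenever the format changes

    def serialize_dag(dag):
        # Wrap the serialized DAG in a versioned envelope so old blobs
        # stay readable after the format evolves.
        return json.dumps({
            "version": SERIALIZATION_VERSION,
            "dag": {
                "dag_id": dag.dag_id,
                "tasks": [{"task_id": t.task_id} for t in dag.tasks],
            },
        })

    def deserialize_dag(blob):
        data = json.loads(blob)
        version = data.get("version", 0)
        if version > SERIALIZATION_VERSION:
            raise ValueError(f"blob written by a newer format (v{version})")
        # Per-version upgrade shims would be dispatched here as the
        # format changes over time.
        return data["dag"]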
> > > On 2019/07/30 13:33:53, Jonathan Miles <j...@cybus.co.uk> wrote:
> > > > Another ask for the long-term list.
> > > >
> > > > From a superficial read of the code, it looks like this asynchronous DAG loading approach could also be a stepping stone towards loading DAGs in parallel? I've come across a case of someone dynamically generating a DAG based on an external data source. The problem with that is that when the data source isn't available or is slow, it can block the loading of other DAGs. Loading in parallel could isolate the failing or slow DAGs from the good ones.
> > > >
> > > > I suppose even with this patch, randomising the load order of DAGs could also provide some basic protection against a small set of failing DAGs. At least some would get updated.
> > > >
> > > > Do the changes only affect the webserver, or also loading in the scheduler?
> > > >
> > > > Thanks,
> > > >
> > > > Jon
> > > >
> > > > On 29/07/2019 22:18, Zhou Fang wrote:
> > > > > Hi Kevin,
> > > > >
> > > > > The problem that DAG parsing takes a long time can be solved by asynchronous DAG loading: https://github.com/apache/airflow/pull/5594
> > > > >
> > > > > The idea is that a background process parses the DAG files and sends DAGs to the webserver process every [webserver] dagbag_sync_interval = 10s.
> > > > >
> > > > > We have launched it in Composer, so our users can set the webserver worker restart interval to 1 hour (or longer). The background DAG parsing process refreshes all DAGs per [webserver] collect_dags_interval = 30s.
> > > > >
> > > > > If parsing all DAGs takes 15 min, you can see DAGs being gradually refreshed with this feature.
> > > > >
> > > > > Thanks,
> > > > > Zhou
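A stripped-down sketch of the background-parsing pattern Zhou describes above. The real implementation is in PR 5594; the queue-based hand-off and the toy parse function here are just an illustration.

    import time
    from multiprocessing import Process, Queue

    COLLECT_DAGS_INTERVAL = 30  # mirrors [webserver] collect_dags_interval
    DAGBAG_SYNC_INTERVAL = 10   # mirrors [webserver] dagbag_sync_interval

    def parse_dag_file(path):
        # Stand-in for real DAG parsing; returns {dag_id: serialized dag}.
        return {path: {"dag_id": path, "tasks": []}}

    def dag_parsing_loop(out_queue, dag_files):
        # Background process: re-parse every DAG file on an interval and
        # ship the results to the webserver, so a slow or failing file
        # only delays itself, not the gunicorn workers.
        while True:
            for path in dag_files:
                try:
                    out_queue.put(parse_dag_file(path))
                except Exception:
                    pass  # skip broken files; the rest still refresh
            time.sleep(COLLECT_DAGS_INTERVAL)

    if __name__ == "__main__":
        queue = Queue()
        Process(target=dag_parsing_loop,
                args=(queue, ["a.py", "b.py"]),
                daemon=True).start()
        dagbag = {}
        while True:
            # Webserver side: fold freshly parsed DAGs into the live bag,
            # so DAGs refresh gradually instead of all-or-nothing.
            while not queue.empty():
                dagbag.update(queue.get_nowait())
            time.sleep(DAGBAG_SYNC_INTERVAL)

This is also roughly why Jon's failing-DAG scenario stops hurting: the exception stays inside the parsing process instead of blocking the load of every other DAG.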
> > > > > On Sat, Jul 27, 2019 at 2:43 AM Kevin Yang <yrql...@gmail.com> wrote:
> > > > >
> > > > >> Nice job Zhou!
> > > > >>
> > > > >> Really excited, this is exactly what we wanted for the webserver scaling issue. I want to add another big driver for Airbnb to have started thinking about this previously, in support of the effort: it can not only bring consistency between webservers but also bring consistency between the webserver and the scheduler/workers. It may be less of a problem if the total DAG parsing time is small, but for us the total DAG parsing time is 15+ mins and we had to set the webserver (gunicorn subprocesses) restart interval to 20 mins, which leads to a worst-case 15+20+15=50 mins delay between the scheduler starting to schedule things and users seeing their deployed DAGs/changes...
> > > > >>
> > > > >> I'm not so sure about the scheduler performance improvement: currently we already feed the main scheduler process with SimpleDag through a DagFileProcessorManager running in a subprocess; in the future we would feed it with data from the DB, which is likely slower (though the difference should have a negligible impact on scheduler performance). In fact, if we keep the existing behavior of scheduling only freshly parsed DAGs, then we may need to deal with a consistency issue: the DAG processor and the scheduler race to update the flag indicating whether the DAG is newly parsed. No big deal there, but just some thoughts off the top of my head that will hopefully be helpful.
> > > > >>
> > > > >> And good idea on pre-rendering the template; I believe template rendering was the biggest concern in the previous discussion. We've also chosen the pre-rendering+JSON approach in our smart sensor API <https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-17+Airflow+sensor+optimization> and it seems to be working fine, so that's a supporting case for your proposal ;) There's a WIP PR <https://github.com/apache/airflow/pull/5499> for it just in case you are interested; maybe we can even share some logic.
> > > > >>
> > > > >> Thumbs-up again for this, and please don't hesitate to reach out if you want to discuss further with us or need any help from us.
> > > > >>
> > > > >> Cheers,
> > > > >> Kevin Y
> > > > >>
> > > > >> On Sat, Jul 27, 2019 at 12:54 AM Driesprong, Fokko <fo...@driesprong.frl> wrote:
> > > > >>
> > > > >>> Looks great Zhou,
> > > > >>>
> > > > >>> I have one thing that popped into my mind while reading the AIP: should we keep the caching at the webserver level? As the famous quote goes: *"There are only two hard things in Computer Science: cache invalidation and naming things." -- Phil Karlton*
> > > > >>>
> > > > >>> Right now, the fundamental change being proposed in the AIP is fetching the DAGs from the database in a serialized format, rather than parsing the Python files all the time. This will already give a great performance improvement on the webserver side because it removes a lot of the processing. However, since we're still fetching the DAGs from the database at a regular interval and caching them in the local process, we still have the two issues that Airflow is suffering from right now:
> > > > >>>
> > > > >>> 1. No snappy UI, because it is still polling the database at a regular interval.
> > > > >>> 2. Inconsistency between webservers, because they might poll at a different interval. I think we've all seen this: https://www.youtube.com/watch?v=sNrBruPS3r4
> > > > >>>
> > > > >>> As I also mentioned in the Slack channel, I strongly feel that we should be able to render most views from the tables in the database, so without touching the blob. For specific views, we could just pull the blob from the database. In this case we always have the latest version, and we tackle the second point above.
> > > > >>>
> > > > >>> To tackle the first one, I also have an idea. We should change the DAG parser from a loop to something that uses inotify https://pypi.org/project/inotify_simple/. This will change it from polling to an event-driven design, which is much more performant and less resource-hungry. But this would be an AIP on its own.
> > > > >>>
> > > > >>> Again, great design and a comprehensive AIP, but I would include the caching on the webserver to greatly improve the user experience in the UI. Looking forward to the opinion of others on this.
> > > > >>>
> > > > >>> Cheers, Fokko
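Fokko's inotify idea above, sketched with inotify_simple (Linux-only). The DAG folder path and the "one watch, re-parse on change" shape are assumptions, not a worked-out design:

    from inotify_simple import INotify, flags

    DAG_FOLDER = "/usr/local/airflow/dags"  # placeholder path

    inotify = INotify()
    watch_flags = flags.CREATE | flags.MODIFY | flags.DELETE | flags.MOVED_TO
    inotify.add_watch(DAG_FOLDER, watch_flags)

    while True:
        # read() blocks until the kernel reports a change: no polling
        # loop, and no wasted re-parses when nothing has changed.
        for event in inotify.read():
            print(f"change in {event.name}: {flags.from_mask(event.mask)}")
            # here the parser would re-parse only the touched file

A real version would also need recursive watches for subfolders and a fallback for non-Linux deployments, which is presumably why Fokko calls it an AIP of its own.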
> > > > >>> On Sat, 27 Jul 2019 at 01:44, Zhou Fang <zhouf...@google.com.invalid> wrote:
> > > > >>>
> > > > >>>> Hi Kaxil,
> > > > >>>>
> > > > >>>> Just sent out the AIP:
> > > > >>>>
> > > > >>>> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-24+DAG+Persistence+in+DB+using+JSON+for+Airflow+Webserver+and+%28optional%29+Scheduler
> > > > >>>>
> > > > >>>> Thanks!
> > > > >>>> Zhou
> > > > >>>>
> > > > >>>> On Fri, Jul 26, 2019 at 1:33 PM Zhou Fang <zhouf...@google.com> wrote:
> > > > >>>>
> > > > >>>>> Hi Kaxil,
> > > > >>>>>
> > > > >>>>> We are also working on persisting DAGs into the DB using JSON for the Airflow webserver in Google Composer. We are aiming to minimize the change to the current Airflow code. Happy to get synced on this!
> > > > >>>>>
> > > > >>>>> Here is our progress:
> > > > >>>>>
> > > > >>>>> (1) Serializing DAGs using Pickle to be used in the webserver. It has been launched in Composer, and I am working on the PR to upstream it: https://github.com/apache/airflow/pull/5594. Currently it does not support non-Airflow operators, and we are working on a fix.
> > > > >>>>>
> > > > >>>>> (2) Caching pickled DAGs in the DB to be used by the webserver. We have a proof-of-concept implementation, and are working on an AIP now.
> > > > >>>>>
> > > > >>>>> (3) Using JSON instead of Pickle in (1) and (2). We decided to use JSON because Pickle is neither secure nor human-readable. The serialization approach is very similar to (1).
> > > > >>>>>
> > > > >>>>> I will update the PR (https://github.com/apache/airflow/pull/5594) to replace Pickle with JSON, and send our design of (2) as an AIP next week. Glad to check together whether our implementation makes sense and to make improvements on it.
> > > > >>>>>
> > > > >>>>> Thanks!
> > > > >>>>> Zhou
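For anyone wondering why Pickle "is not secure" in Zhou's point (3) above: unpickling executes arbitrary code, so a tampered blob in the metadata DB becomes a code-execution vector. The classic demonstration:

    import pickle

    class Nefarious:
        # pickle invokes __reduce__ during deserialization, so loading
        # a hostile blob runs whatever command its author chose.
        def __reduce__(self):
            import os
            return (os.system, ("echo pwned",))

    blob = pickle.dumps(Nefarious())
    pickle.loads(blob)  # prints "pwned": code ran during load

    # json.loads() on untrusted input just returns plain data (or
    # raises); nothing in a JSON document can execute.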
> > > > >>>>> On Fri, Jul 26, 2019 at 7:37 AM Kaxil Naik <kaxiln...@gmail.com> wrote:
> > > > >>>>>
> > > > >>>>>> Hi all,
> > > > >>>>>>
> > > > >>>>>> We, at Astronomer, are going to spend time working on DAG Serialisation. There are 2 AIPs that are somewhat related to what we plan to work on:
> > > > >>>>>>
> > > > >>>>>> - AIP-18 Persist all information from DAG file in DB <https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-18+Persist+all+information+from+DAG+file+in+DB>
> > > > >>>>>> - AIP-19 Making the webserver stateless <https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-19+Making+the+webserver+stateless>
> > > > >>>>>>
> > > > >>>>>> We plan to use JSON as the serialisation format and store it as a blob in the metadata DB.
> > > > >>>>>>
> > > > >>>>>> *Goals:*
> > > > >>>>>>
> > > > >>>>>> - Make the webserver stateless
> > > > >>>>>> - Use the same version of the DAG across the webserver & scheduler
> > > > >>>>>> - Keep backward compatibility, and have a flag (globally & at DAG level) to turn this feature on/off
> > > > >>>>>> - Enable DAG versioning (extended goal)
> > > > >>>>>>
> > > > >>>>>> We will be preparing a proposal (AIP) after some research and some initial work, and will open it up for suggestions from the community.
> > > > >>>>>>
> > > > >>>>>> We already had some good brainstorming sessions with Twitter folks (DanD & Sumit), folks from GoDataDriven (Fokko & Bas) & Alex (from Uber), which will be a good starting point for us.
> > > > >>>>>>
> > > > >>>>>> If anyone in the community is interested in it or has some experience with the same and wants to collaborate, please let me know and join the #dag-serialisation channel on Airflow Slack.
> > > > >>>>>>
> > > > >>>>>> Regards,
> > > > >>>>>> Kaxil
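Finally, a rough sketch of what "a blob in the metadata DB" from Kaxil's email could look like as a table. The table and column names are my guesses for illustration, not the final AIP-24 schema:

    from sqlalchemy import Column, DateTime, Index, String, Text
    from sqlalchemy.ext.declarative import declarative_base

    Base = declarative_base()

    class SerializedDag(Base):
        # One row per DAG: a stateless webserver reads this row
        # instead of parsing Python files itself.
        __tablename__ = "serialized_dag"

        dag_id = Column(String(250), primary_key=True)
        fileloc = Column(String(2000), nullable=False)   # source .py, for display
        data = Column(Text, nullable=False)              # the JSON blob
        last_updated = Column(DateTime, nullable=False)  # staleness checks

        __table_args__ = (
            Index("idx_serialized_dag_last_updated", last_updated),
        )

A last_updated column (or similar) would also be a natural hook for the versioning goal above: webservers could cheaply poll the timestamp, or later react to an event, instead of re-reading every blob.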