I implemented the first version of the DAG serialization part of AIP-24:
https://github.com/apache/airflow/pull/5701. Please take a look if you are
interested @all. Thanks!

The serialization contains almost all fields of DAGs and tasks (an
example of a serialized DAG is here:
https://github.com/apache/airflow/blob/35e38f19b09646a0f85a2a7866a8d9aacc345252/tests/dags/test_dag_serialization.py#L100).
So the webserver can still treat them as before, and no webserver UI code
change is needed. The benefit is that we can use it for 1.10.*.
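
To give a feel for what "almost all fields in the serialization" means, here
is a minimal sketch of the general idea. It is not the code in the PR: ToyDag,
ToyTask and to_jsonable are hypothetical names made up for this example, and
the "__type" tagging is only an assumption about how non-JSON values such as
datetime and timedelta could be encoded.

```python
import json
from datetime import datetime, timedelta


# Hypothetical stand-ins for illustration only; these are NOT Airflow classes.
class ToyTask:
    def __init__(self, task_id, owner, retry_delay):
        self.task_id = task_id
        self.owner = owner
        self.retry_delay = retry_delay


class ToyDag:
    def __init__(self, dag_id, start_date, tasks):
        self.dag_id = dag_id
        self.start_date = start_date
        self.tasks = tasks


def to_jsonable(obj):
    """Recursively convert an object into JSON-compatible values."""
    if obj is None or isinstance(obj, (str, int, float, bool)):
        return obj
    if isinstance(obj, datetime):
        return {"__type": "datetime", "value": obj.isoformat()}
    if isinstance(obj, timedelta):
        return {"__type": "timedelta", "value": obj.total_seconds()}
    if isinstance(obj, (list, tuple)):
        return [to_jsonable(x) for x in obj]
    if isinstance(obj, dict):
        return {str(k): to_jsonable(v) for k, v in obj.items()}
    # Fallback: keep every attribute of the object, tagged with its class name.
    return {"__type": type(obj).__name__, "fields": to_jsonable(vars(obj))}


dag = ToyDag(
    "example",
    datetime(2019, 7, 1),
    [ToyTask("t1", "airflow", timedelta(minutes=5))],
)
serialized = json.dumps(to_jsonable(dag), sort_keys=True)
restored = json.loads(serialized)  # plain dicts/lists, no DAG file needed
```

Because every attribute is kept, a reader of the JSON still finds fields such
as the task owner where it expects them, which is why the UI code would not
have to change in this kind of approach.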

Of course, it is a short-term fix compared to many long-term proposals.

This PR only contains serialization. I verified its usage in the UI end-to-end
by using the Async DAG Loader in https://github.com/apache/airflow/pull/5594.
I split the DAG serialization out of 5594 since the Async DAG Loader is
optional. (I suddenly recall that if there are N webserver processes + 1
async DAG loading process, it may solve the webserver inconsistency problem??)
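
A rough sketch of the "N webservers + 1 loader" thought, only to show why a
single writer could remove the inconsistency. SharedDagStore and ToyWebserver
are hypothetical names, not part of Airflow or of either PR, and one
in-process object stands in here for whatever the real processes would share
(for example a DB table).

```python
import json


class SharedDagStore:
    """Hypothetical stand-in for the single place the loader writes to."""

    def __init__(self):
        self.version = 0
        self._blob = json.dumps({})

    def publish(self, dags):
        # Called only by the single loader: replace the whole snapshot at once.
        self._blob = json.dumps(dags, sort_keys=True)
        self.version += 1

    def snapshot(self):
        # Called by every webserver: always decode the latest published blob.
        return self.version, json.loads(self._blob)


class ToyWebserver:
    """Hypothetical webserver that keeps no DAG state of its own."""

    def __init__(self, store):
        self._store = store

    def dag_ids(self):
        _, dags = self._store.snapshot()
        return sorted(dags)


store = SharedDagStore()
webservers = [ToyWebserver(store) for _ in range(3)]  # N = 3

store.publish({"dag_a": {"tasks": ["t1"]}})
views_before = [w.dag_ids() for w in webservers]

store.publish({"dag_a": {"tasks": ["t1"]}, "dag_b": {"tasks": ["t2"]}})
views_after = [w.dag_ids() for w in webservers]
```

Since no webserver parses or caches DAGs itself in this sketch, all of them
report the same DAG list after each publish; whether that carries over to
real webserver processes is exactly the open question above.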


On Wed, Jul 31, 2019 at 10:33 AM Tao Feng <fengta...@gmail.com> wrote:

> hey Zhou,
>
> Great to see this happen and be made backward compatible. I think persisting
> DAGs into the DB is definitely needed, and it will make migration easier with
> a lightweight approach. At Lyft we sometimes observe a nondeterministic
> increase in scheduling delay once users add some dynamically generated large
> DAGs with thousands of tasks.
>
> I will spend some time looking at your proposal in more detail. But I
> agree that this is the most important pain point that we should address.
> Let me know if there is anything I can do to help facilitate this.
>
>
> On Mon, Jul 29, 2019 at 2:13 PM Zhou Fang <zhouf...@google.com.invalid>
> wrote:
>
> > Thanks everyone for the discussion. The comments are very helpful.
> >
> > AIP-24 that we proposed here is really a short-term one, to minimize the
> > change for fast launch and compatibility. I agree with the benefits of the
> > long-term proposals. It would be great if AIP-24 can be a first step (if we
> > can agree on the basic serialization approach). Then we can gradually
> > apply long-term fixes.
> >
> > I summarized a few long-term proposals (from Fokko and Ash) and added a
> > 'timeline' in AIP-24 (to make things clearer):
> >
> > *Terms*
> >
> >    - (this) stringified DAG: a patch to current DAG that can be JSONified
> >    - (long-term) serialized DAG: a new serializable DAG class used by
> >    webserver/scheduler
> >
> > *Proposed timeline*
> >
> >    1. (this) JSON serialization of DAGs
> >       1. will be out with https://github.com/apache/airflow/pull/5594
> >
> >    2. (this, optional) Asynchronous DAG loading in webserver
> >       1. webserver process uses a background process to collect DAGs,
> >       solving the scalability issue before DAG persistence in DB is out
> >       2. webserver process itself does not need to restart every 30s to
> >       collect DAGs
> >       3. will be out with https://github.com/apache/airflow/pull/5594
> >
> >    3. (this) DAG persistence in DB for webserver
> >       1. minimal Airflow code change
> >       2. an optional feature enabled via configuration
> >       3. rolled out with Airflow 1.10.5
> >
> >    4. (this, optional) Using DAG cached in DB for scheduling
> >
> >    5. (long-term) Defining serialized DAG for webserver
> >       1. this proposal keeps all fields of DAG/Operator; however, some
> >       fields are not used by webserver or scheduler
> >       2. trimming these fields is easy, just provide a list of fields to
> >       include or exclude (Sec 2.3): _serialize_object(x, visited_dags)
> >       => _serialize_object(x, visited_dags, include=['foo'], exclude=['bar'])
> >       3. we should carefully check all webserver/scheduler code to make
> >       sure trimmed fields are not used, e.g., *task.owner* is used in
> >       webserver
> >
> >    6. (long-term) Defining serialized DAG for scheduler
> >       1. once we have 'stringified DAG' or 'serialized DAG',
> >       SimpleDAG/SimpleTaskInstance used by scheduler are not needed
> >       2. adding more fields to stringified DAGs to be compatible with
> >       scheduler
> >
> >    7. (long-term) Directly reading DAGs from DB in webserver
> >       1. let webserver process fetch data from DB, instead of making a DAG
> >       bag and refreshing it
> >       2. it solves the webserver inconsistency issue
> >
> >    8. (long-term) Event-driven DAG parsing
> >       1. instead of polling DAG files for updating/deleting DAGs, event
> >       based approaches, *e.g.*, inotify (
> >       https://pypi.org/project/inotify_simple/) can be used
> >
> >
> >
> >
> >
> > On Mon, Jul 29, 2019 at 3:23 AM Kaxil Naik <kaxiln...@gmail.com> wrote:
> >
> > > Thanks all for the input and thanks Zhou too for the detailed AIP.
> > >
> > > The WIP PR can be a good first step to overall optimization.
> > >
> > > Let's sync-up on the progress you have already made & what we want to
> > > target.
> > >
> > > @Jarek Potiuk <jarek.pot...@polidea.com> & @Fokko - If we manage to make
> > > it entirely backward-compatible with an enable/disable flag as we
> > > mentioned, we can think of including it in 1.10.5, but I am in favor of
> > > removing / cleaning up stuff like pickles, dropping Python 2 support,
> > > cutting Airflow 2.0 and including this change there.
> > >
> > >
> > >
> > >
> > > > On Mon, Jul 29, 2019 at 1:03 PM Jarek Potiuk <jarek.pot...@polidea.com>
> > > > wrote:
> > >
> > > > Actually, I have also been doing a lot of v1-10-test merges over the last
> > > > few months (probably several tens of them already). In fact, the conflicts
> > > > are rarely difficult to solve. We usually have small, localised changes,
> > > > and until we go for full Black file re-formatting we should be OK (and the
> > > > change from Zhou seems rather small and localised).
> > > >
> > > > J.
> > > >
> > > > On Mon, Jul 29, 2019 at 9:25 AM Driesprong, Fokko <fo...@driesprong.frl>
> > > > wrote:
> > > >
> > > > > I would be hesitant to merge it into 1.10.5. When I try to backport
> > > > > anything into the 1.x branch, I get a whole bunch of merge conflicts,
> > > > > even on the trivial tickets. For me, the only one who can really comment
> > > > > on this would be Ash, since he's doing the bulk of the conflict
> > > > > resolving. Apart from that, I'm really excited to make this happen!
> > > > >
> > > > > Cheers, Fokko
> > > > >
> > > > >
> > > > >
> > > > > > On Sun, Jul 28, 2019 at 20:23, Jarek Potiuk <jarek.pot...@polidea.com>
> > > > > > wrote:
> > > > >
> > > > > > > Some thoughts I have after looking at the proposal from Zhou.
> > > > > > >
> > > > > > > I think this is one of the most important things feature-wise for
> > > > > > > Airflow. It looks like we have several in-progress attempts to solve
> > > > > > > the problem, and I guess we should agree on a common approach.
> > > > > >
> > > > > > > I very much like the approach of Zhou (AIP-24). It does seem to
> > > > > > > minimise the changes needed in Airflow, and it means that with some
> > > > > > > optimisations (the caching mentioned by Fokko) it can solve the major
> > > > > > > pain points, I think relatively quickly, and it is potentially
> > > > > > > portable to 1.10.5 if we have it.
> > > > > >
> > > > > > > I wonder how much it overlaps with or differs from Kaxil's and Ash's
> > > > > > > ideas. If I read it correctly, it sounds like this idea will contain
> > > > > > > some more "fundamental" changes: ones that are likely less
> > > > > > > backwards-compatible and potentially take longer to implement and
> > > > > > > test, while likely solving some of the problems better or even
> > > > > > > solving other problems. Am I right with my assumptions?
> > > > > >
> > > > > > > I think more information on this might be helpful so that we all know
> > > > > > > if those are two different AIPs or whether they can be joined in one
> > > > > > > effort, and how they relate to AIP-18/AIP-19 (should those be
> > > > > > > deprecated or independently implemented?). Also, since the 2.0.0
> > > > > > > release is half a year ahead, we should consider how it impacts the
> > > > > > > roadmap.
> > > > > >
> > > > > > > I can see three approaches here that we as a community can follow
> > > > > > > (maybe I am missing some :) ):
> > > > > > >
> > > > > > > 1) focus our work on a single "complete" solution that will take a
> > > > > > > longer time and targets 2.0.0.
> > > > > > > 2) work on two of them: one quick/fast, potentially portable to
> > > > > > > 1.10.5, and one longer-term for 2.0.0.
> > > > > > > 3) decide that the simple solution we have from Zhou (maybe with some
> > > > > > > modifications) is our target solution (for both 1.10.5 if we have it
> > > > > > > and 2.0.0).
> > > > > >
> > > > > > J.
> > > > > >
> > > > > > > On Sat, Jul 27, 2019 at 11:43 AM Kevin Yang <yrql...@gmail.com>
> > > > > > > wrote:
> > > > > >
> > > > > > > Nice job Zhou!
> > > > > > >
> > > > > > > > Really excited, this is exactly what we wanted for the webserver
> > > > > > > > scaling issue. I want to add another big driver that got Airbnb
> > > > > > > > thinking about this previously, in support of the effort: it can not
> > > > > > > > only bring consistency between webservers but also bring consistency
> > > > > > > > between the webserver and the scheduler/workers. It may be less of a
> > > > > > > > problem if the total DAG parsing time is small, but for us the total
> > > > > > > > DAG parsing time is 15+ mins and we had to set the webserver
> > > > > > > > (gunicorn subprocesses) restart interval to 20 mins, which leads to
> > > > > > > > a worst case 15+20+15=50 mins delay between the scheduler starting
> > > > > > > > to schedule things and users being able to see their deployed
> > > > > > > > DAGs/changes...
> > > > > > >
> > > > > > > > I'm not so sure about the scheduler performance improvement:
> > > > > > > > currently we already feed the main scheduler process with SimpleDag
> > > > > > > > through DagFileProcessorManager running in a subprocess; in the
> > > > > > > > future we would feed it with data from the DB, which is likely
> > > > > > > > slower (though the difference should have negligible impact on
> > > > > > > > scheduler performance). In fact, if we keep the existing behavior
> > > > > > > > and try to schedule only freshly parsed DAGs, then we may need to
> > > > > > > > deal with a consistency issue: the DAG processor and the scheduler
> > > > > > > > race to update the flag indicating whether the DAG is newly parsed.
> > > > > > > > No big deal there, just some thoughts off the top of my head that
> > > > > > > > will hopefully be helpful.
> > > > > > >
> > > > > > > > And good idea on pre-rendering the template; I believe template
> > > > > > > > rendering was the biggest concern in the previous discussion. We've
> > > > > > > > also chosen the pre-rendering+JSON approach in our smart sensor API
> > > > > > > > <https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-17+Airflow+sensor+optimization>
> > > > > > > > and it seems to be working fine, a supporting case for your
> > > > > > > > proposal ;) There's a WIP PR
> > > > > > > > <https://github.com/apache/airflow/pull/5499> for it just in case
> > > > > > > > you are interested; maybe we can even share some logic.
> > > > > > >
> > > > > > > > Thumbs-up again for this, and please don't hesitate to reach out if
> > > > > > > > you want to discuss further with us or need any help from us.
> > > > > > >
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Kevin Y
> > > > > > >
> > > > > > > > On Sat, Jul 27, 2019 at 12:54 AM Driesprong, Fokko
> > > > > > > > <fo...@driesprong.frl> wrote:
> > > > > > >
> > > > > > > > Looks great Zhou,
> > > > > > > >
> > > > > > > > > I have one thing that pops into my mind while reading the AIP;
> > > > > > > > > should keep the caching on the webserver level. As the famous
> > > > > > > > > quote goes: *"There are only two hard things in Computer Science:
> > > > > > > > > cache invalidation and naming things." -- Phil Karlton*
> > > > > > > >
> > > > > > > > > Right now, the fundamental change that is being proposed in the
> > > > > > > > > AIP is fetching the DAGs from the database in a serialized format,
> > > > > > > > > and not parsing the Python files all the time. This will already
> > > > > > > > > give a great performance improvement on the webserver side because
> > > > > > > > > it removes a lot of the processing. However, since we're still
> > > > > > > > > fetching the DAGs from the database at a regular interval and
> > > > > > > > > caching them in the local process, we still have the two issues
> > > > > > > > > that Airflow is suffering from right now:
> > > > > > > >
> > > > > > > > >    1. No snappy UI because it is still polling the database at a
> > > > > > > > >    regular interval.
> > > > > > > > >    2. Inconsistency between webservers because they might poll at
> > > > > > > > >    a different interval, I think we've all seen this:
> > > > > > > > >    https://www.youtube.com/watch?v=sNrBruPS3r4
> > > > > > > >
> > > > > > > > > As I also mentioned in the Slack channel, I strongly feel that we
> > > > > > > > > should be able to render most views from the tables in the
> > > > > > > > > database, so without touching the blob. For specific views, we
> > > > > > > > > could just pull the blob from the database. In this case we always
> > > > > > > > > have the latest version, and we tackle the second point above.
> > > > > > > >
> > > > > > > > > To tackle the first one, I also have an idea. We should change the
> > > > > > > > > DAG parser from a loop to something that uses inotify
> > > > > > > > > https://pypi.org/project/inotify_simple/. This will change it from
> > > > > > > > > polling to an event-driven design, which is much more performant
> > > > > > > > > and less resource hungry. But this would be an AIP on its own.
> > > > > > > >
> > > > > > > > > Again, great design and a comprehensive AIP, but I would include
> > > > > > > > > the caching on the webserver to greatly improve the user
> > > > > > > > > experience in the UI. Looking forward to the opinion of others on
> > > > > > > > > this.
> > > > > > > >
> > > > > > > > Cheers, Fokko
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > On Sat, Jul 27, 2019 at 01:44, Zhou Fang
> > > > > > > > > <zhouf...@google.com.invalid> wrote:
> > > > > > > >
> > > > > > > > > > Hi Kaxil,
> > > > > > > > > >
> > > > > > > > > > Just sent out the AIP:
> > > > > > > > > >
> > > > > > > > > > https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-24+DAG+Persistence+in+DB+using+JSON+for+Airflow+Webserver+and+%28optional%29+Scheduler
> > > > > > > > >
> > > > > > > > > Thanks!
> > > > > > > > > Zhou
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > On Fri, Jul 26, 2019 at 1:33 PM Zhou Fang <zhouf...@google.com>
> > > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi Kaxil,
> > > > > > > > > >
> > > > > > > > > > > We are also working on persisting DAGs into the DB using JSON
> > > > > > > > > > > for the Airflow webserver in Google Composer. We aim to
> > > > > > > > > > > minimize the change to the current Airflow code. Happy to get
> > > > > > > > > > > synced on this!
> > > > > > > > > >
> > > > > > > > > > > Here is our progress:
> > > > > > > > > > > (1) Serializing DAGs using Pickle to be used in webserver
> > > > > > > > > > > It has been launched in Composer. I am working on the PR to
> > > > > > > > > > > upstream it: https://github.com/apache/airflow/pull/5594
> > > > > > > > > > > Currently it does not support non-Airflow operators and we are
> > > > > > > > > > > working on a fix.
> > > > > > > > > >
> > > > > > > > > > > (2) Caching Pickled DAGs in DB to be used by webserver
> > > > > > > > > > > We have a proof-of-concept implementation, working on an AIP
> > > > > > > > > > > now.
> > > > > > > > > > >
> > > > > > > > > > > (3) Using JSON instead of Pickle in (1) and (2)
> > > > > > > > > > > Decided to use JSON because Pickle is not secure and not human
> > > > > > > > > > > readable. The serialization approach is very similar to (1).
> > > > > > > > > >
> > > > > > > > > > > I will update the PR
> > > > > > > > > > > (https://github.com/apache/airflow/pull/5594) to replace Pickle
> > > > > > > > > > > with JSON, and send our design of (2) as an AIP next week.
> > > > > > > > > > > Glad to check together whether our implementation makes sense
> > > > > > > > > > > and make improvements on that.
> > > > > > > > > >
> > > > > > > > > > Thanks!
> > > > > > > > > > Zhou
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > On Fri, Jul 26, 2019 at 7:37 AM Kaxil Naik <kaxiln...@gmail.com>
> > > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > >> Hi all,
> > > > > > > > > >>
> > > > > > > > > > >> We, at Astronomer, are going to spend time working on DAG
> > > > > > > > > > >> Serialisation. There are 2 AIPs that are somewhat related to
> > > > > > > > > > >> what we plan to work on:
> > > > > > > > > > >>
> > > > > > > > > > >>    - AIP-18 Persist all information from DAG file in DB
> > > > > > > > > > >>    <https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-18+Persist+all+information+from+DAG+file+in+DB>
> > > > > > > > > > >>    - AIP-19 Making the webserver stateless
> > > > > > > > > > >>    <https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-19+Making+the+webserver+stateless>
> > > > > > > > > > >>
> > > > > > > > > > >> We plan to use JSON as the Serialisation format and store it
> > > > > > > > > > >> as a blob in the metadata DB.
> > > > > > > > > >>
> > > > > > > > > >> *Goals:*
> > > > > > > > > >>
> > > > > > > > > >>    - Make Webserver Stateless
> > > > > > > > > > >>    - Use the same version of the DAG across Webserver &
> > > > > > > > > > >>    Scheduler
> > > > > > > > > > >>    - Keep backward compatibility and have a flag (globally &
> > > > > > > > > > >>    at DAG level) to turn this feature on/off
> > > > > > > > > >>    - Enable DAG Versioning (extended Goal)
> > > > > > > > > >>
> > > > > > > > > >>
> > > > > > > > > > >> We will be preparing a proposal (AIP) after some research and
> > > > > > > > > > >> some initial work and open it for the suggestions of the
> > > > > > > > > > >> community.
> > > > > > > > > > >>
> > > > > > > > > > >> We already had some good brain-storming sessions with Twitter
> > > > > > > > > > >> folks (DanD & Sumit), folks from GoDataDriven (Fokko & Bas) &
> > > > > > > > > > >> Alex (from Uber), which will be a good starting point for us.
> > > > > > > > > > >>
> > > > > > > > > > >> If anyone in the community is interested in it or has some
> > > > > > > > > > >> experience with the same and wants to collaborate, please let
> > > > > > > > > > >> me know and join the #dag-serialisation channel on Airflow
> > > > > > > > > > >> Slack.
> > > > > > > > > >>
> > > > > > > > > >> Regards,
> > > > > > > > > >> Kaxil
> > > > > > > > > >>
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > >
> > > > > > Jarek Potiuk
> > > > > > Polidea <https://www.polidea.com/> | Principal Software Engineer
> > > > > >
> > > > > > M: +48 660 796 129 <+48660796129>
> > > > > >
> > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > >
> > > > Jarek Potiuk
> > > > Polidea <https://www.polidea.com/> | Principal Software Engineer
> > > >
> > > > M: +48 660 796 129 <+48660796129>
> > > >
> > >
> > >
> > > --
> > > *Kaxil Naik*
> > > *Big Data Consultant | DevOps Data Engineer*
> > > *Certified* Google Cloud Data Engineer | *Certified* Apache Spark & Neo4j
> > > Developer
> > > *LinkedIn*: https://www.linkedin.com/in/kaxil
> > >
> >
>
