Thanks everyone for the discussion. The comments are very helpful.

AIP-24, as proposed here, is really a short-term solution that minimizes
code changes for a fast launch and compatibility. I agree with the benefits
of the long-term proposals. It would be great if AIP-24 could be a first
step (if we can agree on the basic serialization approach). Then we can
gradually apply the long-term fixes.

I summarized a few long-term proposals (from Fokko and Ash) and added a
'timeline' to AIP-24 to make things clearer:

*Terms*

   - (this) stringified DAG: a patch to current DAG that can be JSONified
   - (long-term) serialized DAG: a new serializable DAG class used by
   webserver/scheduler

*Proposed timeline*

   1. (this) JSON serialization of DAGs
      1. will be out with https://github.com/apache/airflow/pull/5594
   2. (this, optional) Asynchronous DAG loading in webserver
      1. the webserver process uses a background process to collect DAGs,
      which solves the scalability issue before DAG persistence in DB is out
      2. the webserver process itself does not need to restart every 30s to
      collect DAGs
      3. will be out with https://github.com/apache/airflow/pull/5594
   3. (this) DAG persistence in DB for webserver
      1. minimal Airflow code change
      2. an optional feature enabled via configuration
      3. rolled out with Airflow 1.10.5
   4. (this, optional) Using DAG cached in DB for scheduling
   5. (long-term) Defining serialized DAG for webserver
      1. this proposal keeps all fields of DAG/Operator; however, some
      fields are not used by webserver or scheduler
      2. trimming these fields is easy: just provide a list of fields to
      include or exclude (Sec 2.3), e.g., _serialize_object(x, visited_dags)
      => _serialize_object(x, visited_dags, include=['foo'], exclude=['bar'])
      3. we should carefully check all webserver/scheduler code to make
      sure trimmed fields are not used, e.g., *task.owner* is used in
      webserver
   6. (long-term) Defining serialized DAG for scheduler
      1. once we have 'stringified DAG' or 'serialized DAG', the
      SimpleDAG/SimpleTaskInstance used by scheduler are no longer needed
      2. adding more fields to stringified DAGs to be compatible with
      scheduler
   7. (long-term) Directly reading DAGs from DB in webserver
      1. let the webserver process fetch data from DB, instead of making a
      DAG bag and refreshing it
      2. it solves the webserver inconsistency issue
   8. (long-term) Event-driven DAG parsing
      1. instead of polling DAG files for updating/deleting DAGs,
      event-based approaches, *e.g.*, inotify (
      https://pypi.org/project/inotify_simple/), can be used
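
To make the asynchronous DAG loading step in the timeline a bit more
concrete, here is a minimal sketch of the idea. It is only an illustration
under assumptions: it uses a background thread instead of a separate process
to stay short, and BackgroundDagCache, collect_fn and interval_seconds are
hypothetical names that do not come from the PR.

```python
import threading
import time


class BackgroundDagCache:
    """Keeps the latest DAG snapshot; a background thread refreshes it.

    Hypothetical helper for illustration only, not Airflow code.
    """

    def __init__(self, collect_fn, interval_seconds):
        self._collect_fn = collect_fn
        self._interval = interval_seconds
        self._lock = threading.Lock()
        # Load once up front so readers never see an empty cache.
        self._dags = collect_fn()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        if self._thread.is_alive():
            self._thread.join()

    def _run(self):
        # Event.wait returns False on timeout and True once stop() is called.
        while not self._stop.wait(self._interval):
            fresh = self._collect_fn()
            with self._lock:
                self._dags = fresh

    def get_dags(self):
        # Request handlers read the cached snapshot; they never parse files.
        with self._lock:
            return self._dags


if __name__ == "__main__":
    # Stand-in for the real DAG collection, which would parse DAG files.
    cache = BackgroundDagCache(lambda: {"example_dag": time.time()}, 0.05)
    cache.start()
    time.sleep(0.2)
    print(sorted(cache.get_dags()))
    cache.stop()
```

The point of the design is that the serving process only swaps in a fresh
snapshot, so it does not have to restart to pick up new DAGs.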

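The include/exclude trimming mentioned for the long-term serialized DAG can
be sketched roughly as follows. This is an assumption-laden toy, not the
_serialize_object from the PR: serialize_fields and ToyTask are hypothetical
names, and real DAG/Operator objects have nested fields that need more care.

```python
import json


def serialize_fields(obj, include=None, exclude=None):
    """Return a JSON-ready dict of obj's public attributes, optionally trimmed.

    Hypothetical helper: 'include' keeps only the listed fields,
    'exclude' drops the listed fields.
    """
    fields = {k: v for k, v in vars(obj).items() if not k.startswith("_")}
    if include is not None:
        fields = {k: v for k, v in fields.items() if k in include}
    if exclude is not None:
        fields = {k: v for k, v in fields.items() if k not in exclude}
    return fields


class ToyTask:
    """Stand-in for an operator; the field names are chosen for illustration."""

    def __init__(self, task_id, owner, retries):
        self.task_id = task_id
        self.owner = owner
        self.retries = retries


task = ToyTask("extract", "airflow", 3)
# Keep only what the webserver needs, e.g. task.owner is used there.
trimmed = serialize_fields(task, include=["task_id", "owner"])
print(json.dumps(trimmed, sort_keys=True))
```

Whichever fields end up trimmed, the webserver/scheduler code still has to be
checked to make sure none of them is read later.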




On Mon, Jul 29, 2019 at 3:23 AM Kaxil Naik <kaxiln...@gmail.com> wrote:

> Thanks all for the input and thanks Zhou too for the detailed AIP.
>
> The WIP PR can be a good first step to overall optimization.
>
> Let's sync-up on the progress you have already made & what we want to
> target.
>
> @Jarek Potiuk <jarek.pot...@polidea.com> & @Fokko - If we manage to make
> it entirely backward-compatible with an enable/disable flag as we
> mentioned, we can think of including it in 1.10.5, but I am in favor of
> removing / cleaning up stuff like pickles, dropping Python 2, cutting
> Airflow 2.0 and including this change there.
>
>
>
>
> On Mon, Jul 29, 2019 at 1:03 PM Jarek Potiuk <jarek.pot...@polidea.com>
> wrote:
>
> > Actually I have also been doing a lot of v1-10-test merges over the last
> > few months (probably several tens of them already). The conflicts are
> > rarely difficult to solve, in fact. We usually have small, localised
> > changes and until we go for full Black file re-formatting, we should be
> > ok (and the change from Zhou seems rather small and localised).
> >
> > J.
> >
> > On Mon, Jul 29, 2019 at 9:25 AM Driesprong, Fokko <fo...@driesprong.frl>
> > wrote:
> >
> > > I would be hesitant to merge it into 1.10.5. When I try to backport
> > > anything into the 1.x branch, I get a whole bunch of merge conflicts,
> > > even on the trivial tickets. For me, the only one who can really
> > > comment on this would be Ash, since he's doing the bulk of the
> > > conflict resolving. Apart from that, I'm really excited to make this
> > > happen!
> > >
> > > Cheers, Fokko
> > >
> > >
> > >
> > > On Sun, Jul 28, 2019 at 20:23, Jarek Potiuk <jarek.pot...@polidea.com>
> > > wrote:
> > >
> > > > Some thoughts I have after looking at the proposal from Zhou.
> > > >
> > > > I think this is one of the most important things feature-wise for
> > > > Airflow. It looks like we have several in-progress attempts to solve
> > > > the problem and I guess we should agree on a common approach.
> > > >
> > > > I like very much the approach of Zhou (AIP-24). It does seem to
> > > > minimise the changes needed in Airflow, and it means that with some
> > > > optimisations (caching mentioned by Fokko) it can solve the major
> > > > pain points. I think it is relatively quick and is potentially
> > > > portable to 1.10.5 if we have it.
> > > >
> > > > I wonder how much it overlaps with or differs from what Kaxil's and
> > > > Ash's ideas are. If I read it correctly, it sounds like this idea
> > > > will contain some more "fundamental" changes: ones that are likely
> > > > less backwards-compatible and potentially take longer to implement
> > > > and test, while likely solving some of the problems better or even
> > > > solving other problems. Am I right with my assumptions?
> > > >
> > > > I think more information on this might be helpful so that we all
> > > > know if those are two different AIPs, or whether they can be joined
> > > > in one effort, and how they relate to AIP-18/AIP-19 (should those be
> > > > deprecated or independently implemented?). Also, since the 2.0.0
> > > > release is half a year ahead, we should consider how it impacts the
> > > > roadmap.
> > > >
> > > > I can see three approaches here that we as a community can follow
> > > > (maybe I am missing some :) ):
> > > >
> > > > 1) focus our work on a single "complete" solution that will take
> > > > longer and targets 2.0.0.
> > > > 2) work on two of them: one quick/fast, potentially portable to
> > > > 1.10.5, and one longer-term for 2.0.0.
> > > > 3) decide that the simple solution we have from Zhou (maybe with
> > > > some modifications) is our target solution (for both 1.10.5, if we
> > > > have it, and 2.0.0).
> > > >
> > > > J.
> > > >
> > > > On Sat, Jul 27, 2019 at 11:43 AM Kevin Yang <yrql...@gmail.com>
> > > > wrote:
> > > >
> > > > > Nice job Zhou!
> > > > >
> > > > > Really excited, exactly what we wanted for the webserver scaling
> > > > > issue. To support the effort, I want to add another big driver
> > > > > that made Airbnb start thinking about this previously: it can
> > > > > bring consistency not only between webservers but also between
> > > > > webserver and scheduler/workers. It may be less of a problem if
> > > > > total DAG parsing time is small, but for us the total DAG parsing
> > > > > time is 15+ mins and we had to set the webserver (gunicorn
> > > > > subprocesses) restart interval to 20 mins, which leads to a
> > > > > worst-case 15+20+15=50 mins delay between when the scheduler
> > > > > starts to schedule things and when users can see their deployed
> > > > > DAGs/changes...
> > > > >
> > > > > I'm not so sure about the scheduler performance improvement:
> > > > > currently we already feed the main scheduler process with
> > > > > SimpleDag through DagFileProcessorManager running in a
> > > > > subprocess; in the future we would feed it with data from the DB,
> > > > > which is likely slower (though the difference should have
> > > > > negligible impact on scheduler performance). In fact, if we keep
> > > > > the existing behavior of scheduling only freshly parsed DAGs,
> > > > > then we may need to deal with a consistency issue: the DAG
> > > > > processor and the scheduler race to update the flag indicating
> > > > > whether the DAG is newly parsed. No big deal there, just some
> > > > > thoughts off the top of my head that hopefully can be helpful.
> > > > >
> > > > > And good idea on pre-rendering the template; I believe template
> > > > > rendering was the biggest concern in the previous discussion.
> > > > > We've also chosen the pre-rendering+JSON approach in our smart
> > > > > sensor API
> > > > > <https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-17+Airflow+sensor+optimization>
> > > > > and it seems to be working fine, a supporting case for your
> > > > > proposal ;) There's a WIP PR
> > > > > <https://github.com/apache/airflow/pull/5499> for it just in case
> > > > > you are interested; maybe we can even share some logic.
> > > > >
> > > > > Thumbs-up again for this and please don't hesitate to reach out
> > > > > if you want to discuss further with us or need any help from us.
> > > > >
> > > > >
> > > > > Cheers,
> > > > > Kevin Y
> > > > >
> > > > > On Sat, Jul 27, 2019 at 12:54 AM Driesprong, Fokko
> > > > > <fo...@driesprong.frl> wrote:
> > > > >
> > > > > > Looks great Zhou,
> > > > > >
> > > > > > I have one thing that pops in my mind while reading the AIP;
> > > > > > should keep the caching on the webserver level. As the famous
> > > > > > quote goes: *"There are only two hard things in Computer
> > > > > > Science: cache invalidation and naming things." -- Phil Karlton*
> > > > > >
> > > > > > Right now, the fundamental change that is being proposed in the
> > > > > > AIP is fetching the DAGs from the database in a serialized
> > > > > > format, and not parsing the Python files all the time. This
> > > > > > will already give a great performance improvement on the
> > > > > > webserver side because it removes a lot of the processing.
> > > > > > However, since we're still fetching the DAGs from the database
> > > > > > at a regular interval and caching them in the local process, we
> > > > > > still have the two issues that Airflow is suffering from right
> > > > > > now:
> > > > > >
> > > > > >    1. No snappy UI because it is still polling the database at
> > > > > >    a regular interval.
> > > > > >    2. Inconsistency between webservers because they might poll
> > > > > >    at different intervals. I think we've all seen this:
> > > > > >    https://www.youtube.com/watch?v=sNrBruPS3r4
> > > > > >
> > > > > > As I also mentioned in the Slack channel, I strongly feel that
> > > > > > we should be able to render most views from the tables in the
> > > > > > database, so without touching the blob. For specific views, we
> > > > > > could just pull the blob from the database. In this case we
> > > > > > always have the latest version, and we tackle the second point
> > > > > > above.
> > > > > >
> > > > > > To tackle the first one, I also have an idea. We should change
> > > > > > the DAG parser from a loop to something that uses inotify
> > > > > > (https://pypi.org/project/inotify_simple/). This will change it
> > > > > > from polling to an event-driven design, which is much more
> > > > > > performant and less resource hungry. But this would be an AIP
> > > > > > on its own.
> > > > > >
> > > > > > Again, great design and a comprehensive AIP, but I would
> > > > > > include the caching on the webserver to greatly improve the
> > > > > > user experience in the UI. Looking forward to the opinion of
> > > > > > others on this.
> > > > > >
> > > > > > Cheers, Fokko
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Sat, Jul 27, 2019 at 01:44, Zhou Fang
> > > > > > <zhouf...@google.com.invalid> wrote:
> > > > > >
> > > > > > > Hi Kaxil,
> > > > > > >
> > > > > > > Just sent out the AIP:
> > > > > > >
> > > > > > > https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-24+DAG+Persistence+in+DB+using+JSON+for+Airflow+Webserver+and+%28optional%29+Scheduler
> > > > > > >
> > > > > > > Thanks!
> > > > > > > Zhou
> > > > > > >
> > > > > > >
> > > > > > > On Fri, Jul 26, 2019 at 1:33 PM Zhou Fang <zhouf...@google.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi Kaxil,
> > > > > > > >
> > > > > > > > We are also working on persisting DAGs into DB using JSON
> > > > > > > > for the Airflow webserver in Google Composer. We aim to
> > > > > > > > minimize the change to the current Airflow code. Happy to
> > > > > > > > get synced on this!
> > > > > > > >
> > > > > > > > Here is our progress:
> > > > > > > > (1) Serializing DAGs using Pickle to be used in webserver
> > > > > > > > It has been launched in Composer. I am working on the PR to
> > > > > > > > upstream it: https://github.com/apache/airflow/pull/5594
> > > > > > > > Currently it does not support non-Airflow operators and we
> > > > > > > > are working on a fix.
> > > > > > > >
> > > > > > > > (2) Caching Pickled DAGs in DB to be used by webserver
> > > > > > > > We have a proof-of-concept implementation, working on an
> > > > > > > > AIP now.
> > > > > > > >
> > > > > > > > (3) Using JSON instead of Pickle in (1) and (2)
> > > > > > > > Decided to use JSON because Pickle is neither secure nor
> > > > > > > > human-readable. The serialization approach is very similar
> > > > > > > > to (1).
> > > > > > > >
> > > > > > > > I will update the PR
> > > > > > > > (https://github.com/apache/airflow/pull/5594) to replace
> > > > > > > > Pickle with JSON, and send our design of (2) as an AIP next
> > > > > > > > week. Glad to check together whether our implementation
> > > > > > > > makes sense and make improvements on that.
> > > > > > > >
> > > > > > > > Thanks!
> > > > > > > > Zhou
> > > > > > > >
> > > > > > > >
> > > > > > > > On Fri, Jul 26, 2019 at 7:37 AM Kaxil Naik
> > > > > > > > <kaxiln...@gmail.com> wrote:
> > > > > > > >
> > > > > > > >> Hi all,
> > > > > > > >>
> > > > > > > >> We, at Astronomer, are going to spend time working on DAG
> > > > > > > >> Serialisation. There are 2 AIPs that are somewhat related
> > > > > > > >> to what we plan to work on:
> > > > > > > >>
> > > > > > > >>    - AIP-18 Persist all information from DAG file in DB
> > > > > > > >>    <https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-18+Persist+all+information+from+DAG+file+in+DB>
> > > > > > > >>    - AIP-19 Making the webserver stateless
> > > > > > > >>    <https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-19+Making+the+webserver+stateless>
> > > > > > > >>
> > > > > > > >> We plan to use JSON as the Serialisation format and store
> > > > > > > >> it as a blob in metadata DB.
> > > > > > > >>
> > > > > > > >> *Goals:*
> > > > > > > >>
> > > > > > > >>    - Make Webserver Stateless
> > > > > > > >>    - Use the same version of the DAG across Webserver &
> > > > > > > >>    Scheduler
> > > > > > > >>    - Keep backward compatibility and have a flag (globally
> > > > > > > >>    & at DAG level) to turn this feature on/off
> > > > > > > >>    - Enable DAG Versioning (extended Goal)
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> We will be preparing a proposal (AIP) after some research
> > > > > > > >> and some initial work and open it for the suggestions of
> > > > > > > >> the community.
> > > > > > > >>
> > > > > > > >> We already had some good brainstorming sessions with
> > > > > > > >> Twitter folks (DanD & Sumit), folks from GoDataDriven
> > > > > > > >> (Fokko & Bas) & Alex (from Uber), which will be a good
> > > > > > > >> starting point for us.
> > > > > > > >>
> > > > > > > >> If anyone in the community is interested in it or has some
> > > > > > > >> experience with the same and wants to collaborate, please
> > > > > > > >> let me know and join the #dag-serialisation channel on
> > > > > > > >> Airflow Slack.
> > > > > > > >>
> > > > > > > >> Regards,
> > > > > > > >> Kaxil
> > > > > > > >>
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > >
> > > > Jarek Potiuk
> > > > Polidea <https://www.polidea.com/> | Principal Software Engineer
> > > >
> > > > M: +48 660 796 129 <+48660796129>
> > > > [image: Polidea] <https://www.polidea.com/>
> > > >
> > >
> > > >
> > >
> >
> >
> > --
> >
> > Jarek Potiuk
> > Polidea <https://www.polidea.com/> | Principal Software Engineer
> >
> > M: +48 660 796 129 <+48660796129>
> > [image: Polidea] <https://www.polidea.com/>
> >
>
>
> --
> *Kaxil Naik*
> *Big Data Consultant | DevOps Data Engineer*
> *Certified *Google Cloud Data Engineer | *Certified* Apache Spark & Neo4j
> Developer
> *LinkedIn*: https://www.linkedin.com/in/kaxil
>
