-1 (binding), this may sound a bit FUD-y but I don't feel this has been
thought through enough...

Having both a SimpleDagBag representation and the JSON representation
doesn't make sense to me at the moment: *"**Quoting from Airflow code, it
is “a simplified representation of a DAG that contains all attributes
required for instantiating and scheduling its associated tasks.”. It does
not contain enough information required by the webserver.". *Why not create
a representation that can be used by both? This is going to be a big
headache to both understand and work with in the codebase since it will be
another definition that we need to keep in sync.

Not sure if fileloc/fileloc_hash is the right solution, the longterm
solution I am imagining has clients responsible for uploading DAGs rather
than retrieving them from the filesystem so fileloc/fileloc_hash wouldn't
even exist (dag_id would be used for indexing here).

Versioning isn't really addressed either (e.g. if a DAG topology changes
with some update you want to be able to show both the old and new ones, or
at least have a way to deal with them), there is an argument that this is
acceptable since it isn't currently addressed now, but I'm worried that
large schema changes should think through the long term plan a bit more.

I feel like getting this wrong is going to make it very hard to migrate
things in the future, and make the codebase worse (another representation
of DAGs that we need to support/understand/keep parity for). If I'm wrong
about this then I would be more willing to +1 this change. This doc is a
1-2 pager and I feel like it is not thorough or deep enough and doesn't
give me enough confidence that the work in the PR is going to make it
easier to complete the future milestones instead of harder.

On Tue, Oct 15, 2019 at 11:26 AM Kamil Breguła <kamil.breg...@polidea.com>
wrote:

> +1 (binding)
>
> On Tue, Oct 15, 2019 at 2:57 AM Kaxil Naik <kaxiln...@gmail.com> wrote:
>
> > Hello, Airflow community,
> >
> > This email calls for a vote to add the DAG Serialization feature at
> > https://github.com/apache/airflow/pull/5743.
> >
> > *AIP*:
> >
> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-24+DAG+Persistence+in+DB+using+JSON+for+Airflow+Webserver+and+%28optional%29+Scheduler
> >
> > *Previous Mailing List discussion*:
> >
> https://lists.apache.org/thread.html/65d282368e0a7c19815badb8b1c6c8d72b0975ce94f601e13af44f74@%3Cdev.airflow.apache.org%3E
> >  .
> >
> > *Authors*: Kaxil Naik, Zhou Fang, Ash-Berlin Taylor
> >
> > *Summary*:
> >
> >    - DAGs are serialized using JSON format and stored in a SerializedDag
> >    table
> >    - The Webserver now instead of having to parse the DAG file again,
> >    reads the serialized DAGs in JSON, de-serializes them and creates the
> >    DagBag and uses it to show in the UI.
> >    - Instead of loading an entire DagBag when the WebServer starts we
> >    only load each DAG on demand from the Serialized Dag table. This helps
> >    reduce Webserver startup time and memory. The reduction is notable
> when you
> >    have a large number of DAGs.
> >    - A JSON Schema has been defined and we validate the serialized dag
> >    before writing it to the database
> >
> > [image: image.png]
> >
> > A PR (https://github.com/apache/airflow/pull/5743) is ready for review
> > from the committers and community.
> >
> > We also have a WIP PR (https://github.com/apache/airflow/pull/5992) to
> > backport this feature to 1.10.* branch.
> >
> > A big thank you to Zhou and Ash for their continuous help in improving
> > this feature/PR.
> >
> > This email is formally calling for a vote to accept the AIP and PR.
> Please
> > note that we will update the PR / feature to fix bugs if we find any.
> >
> > Cheers,
> > Kaxil
> >
> >
> >
> >
> >
> >
> >
>

Reply via email to