@Dan Davydov

Continuing the discussion from the PR
<https://github.com/apache/airflow/pull/4396> here--so we discuss only code
change in the PR.

For the AIP:
I agree with what Max and Dan mentioned in the last thread discussion this
AIP: for long term we want DAG serialization and that should serve as a
source of truth for scheduling/execution/rendering. And just as what Max
mentioned, when I last time look into serializing DAG in webserver, it was
template rendering that blocked me and we might have to drop it before we
proceed. Another challenge is about serializing execution environment,
maybe a python package manager that can package the dependencies, or
something like a docker image. Shorter term we can at least make the
serialization work for webserver, so we can solve the scaling issue and
potentially the consistency issue between gunicorn workers.

For the PR:
Serializing the DagRun along with its graph can work in the long term plan.
We just need to think through the relationship between the graph inside
DagRun and the graph inside DAG--maybe keep the list of tasks and the edges
only in serializable DagRun but not the SimpleDag. Later on after we can
capture the execution environment we just append it to the serialized
DagRun. This is related to our previous discussion about versioning DagRun.
Now since we're already trying to have multiple graphs for one
execution_date, maybe we should just have multiple DagRun.

The entire DAG serialization is a huge topic with huge potential and as
mentioned we have intensively discussed it. I've being thinking and
discussion this topic for a while and would love to have some brainstorming
session with the community--Dan's point about set up for the long term is
good. I'll be more comfortable investing resources pushing this PR with a
plan that works for long term.

Cheers,
Kevin Y

On Sun, Feb 24, 2019 at 12:29 AM Kevin Yang <yrql...@gmail.com> wrote:

> Sorry about being late to this. Love this change. Will look deep into this
> in the next couple of days.
>
> Thank you,
> Kevin Y
>
> On Sat, Feb 23, 2019 at 10:54 PM Tao Feng <fengta...@gmail.com> wrote:
>
>> I manage to test out the latest version of the branch which I think all my
>> previous concerns don't exist anymore. Great job! The latest screenshot
>> for
>> the new graph view could be found in the pr comment.
>>
>> On Sat, Feb 23, 2019 at 3:30 PM Tao Feng <fengta...@gmail.com> wrote:
>>
>> > Thanks for bringing this up.  I look at the pr before and here recaps
>> some
>> > of the previous concerns:
>> > 1. There will have a new *giant *button next to "go" button in graph
>> > view(it could be found in the screenshot in the pr) based on my last
>> read
>> > of the pr. It even exists after DAGRun is created. I don't think it is
>> > ideal. Have you removed the button or not? I understand we need to read
>> the
>> > DAG File when run doesn't exist. But IMO we need a better UI than this.
>> If
>> > the latest UI removed the button, how user found out this link(
>> > http://host/graph?dag_id=my_dag&read_from_file=True) ?
>> > 2. Previous discussion mentioned that it requires user to run `airflow
>> > resetdb` if there are historical DAGRun. Is it still the case or not?
>> > 3. During Andrew's testing, he found that the DAG shape keeps changing
>> > during the backfill. Have we fixed the issue?
>> > 4. By reading the pr, I recall the graph_id is always increment by 1
>> based
>> > on last graph_id. The last graph_id matches the latest execution date
>> > DAGRun. What happens to the case when the latest execution date doesn't
>> > change while user backfill multiple days? Will all these backfill
>> DAGRuns
>> > get the same graph id in this case? Is it intentded?
>> >
>> > Let me if the latest version made this change. And I would suggest the
>> > community to test out the pr in your setup(with historical DAGRuns) if
>> > possible as it is potentially a big(breaking) change.  We would like to
>> > make sure nothing unexpected happens if we are going with this
>> direction.
>> >
>> > Thanks,
>> > -Tao
>> >
>> > On Sat, Feb 23, 2019 at 11:41 AM Bas Harenslak <
>> > basharens...@godatadriven.com> wrote:
>> >
>> >> Let's discuss AIP-12 here:
>> >>
>> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-12+Persist+DAG+into+DB
>> .
>> >> It involves persisting the entire DAG into the metastore. For full
>> details,
>> >> please read the AIP.
>> >>
>> >> A PR was made to create “versioned graphs” given by option #3 in the
>> AIP:
>> >> https://github.com/apache/airflow/pull/4396. This led to a long
>> >> discussion but has been quiet for the last few days. It would be sad
>> to see
>> >> the effort put in by @ffinfo not leading to anything. To recap the
>> summary
>> >> at the end of the PR, the current status is:
>> >>
>> >> Internal changes:
>> >>
>> >>
>> >>   *   This PR persists task dependencies in a new table `dag_edge`.
>> >>   *   The term "graph" is introduced in the code, this contains the
>> >> structure of a DAG, so the "edges" (dependencies) and "nodes" (tasks).
>> >>   *   A DagRun is bound to one `graph_id`.
>> >>   *   Currently in Airflow only the latest version of a DAG is
>> displayed
>> >> in the UI (both graph & tree view). This means if you delete a task,
>> you
>> >> cannot see runs of that task in the past anymore.
>> >>   *   In the graph view you can now see different graph versions,
>> because
>> >> we store both tasks and edges.
>> >>   *   For the record: in the tree view you still only the latest
>> version
>> >> because it is not possible to combine all history into a single view.
>> >>
>> >> Changes from a user perspective:
>> >>
>> >>
>> >>   *   Nothing in the tree view.
>> >>   *   In the graph view, you can now view different "graphs" if you
>> >> change the structure of your DAG. Note the graph view shows DAG runs.
>> If
>> >> you change your DAG without running it, it does not show in the graph
>> view.
>> >>   *   When you have no DAG runs, there is no graph to show. So, as
>> >> @ffinfo described above he then reads the graph from the DAG file
>> instead.
>> >> You can see this behaviour in the graph view url:
>> >>      *   if DagRuns exist: http://host/graph?dag_id=my_dag
>> >>      *   if no DagRuns exist:
>> >> http://host/graph?dag_id=my_dag&read_from_file=True
>> >>   *   In the screenshots in
>> >> https://github.com/apache/airflow/pull/4396#issuecomment-465217731,
>> you
>> >> see this case. Since this is more of an internal thing how Airflow
>> works,
>> >> and not really informative for the user, @ffinfo removed the message
>> in his
>> >> last commit.
>> >>
>> >> Judging by the PR comments, everybody likes the idea of persisting more
>> >> of the DAG in the DB. All issues mentioned were addressed. It would be
>> >> great to see this work merged in Airflow, so please discuss anything
>> about
>> >> the PR/AIP here.
>> >>
>> >> Cheers,
>> >> Bas
>> >>
>> >
>>
>

Reply via email to