+1, proposal looks good.

The original intention was really to have tasks groups and a zoom-in/out in
the UI. The original reasoning was to reuse the DAG object since it is a
group of tasks, but as highlighted here it does create underlying
confusions since a DAG is much more than just a group of tasks.

Max

On Mon, Jun 15, 2020 at 2:43 AM Poornima Joshi <joshipoornim...@gmail.com>
wrote:

> Thank you for your email.
>
> On Sat, Jun 13, 2020 at 12:18 AM Xinbin Huang <bin.huan...@gmail.com>
> wrote:
>
> > > >   - *Unpack SubDags during dag parsing*: This rewrites the
> > *DagBag.bag_dag*
> > > >   method to unpack subdag while parsing, and it will give a flat
> > > > structure at
> > > >   the task level
> > >
> > > The serialized_dag representation already does this I think. At least
> if
> > > I've understood your idea here correctly.
> >
> > I am not sure about serialized_dag representation, but at least it will
> > still keep the subdag entry in the DAG table? In my proposal as also in
> the
> > draft PR, the idea is to *extract the tasks from the subdag and add them
> > back to the root_dag. *So the runtime DAG graph will look exactly the
> > same as without subdag but with metadata attached to those sections.
> These
> > metadata will be later on used to render in the UI. So after parsing (
> > *DagBag.process_file()*), it will just output the *root_dag *instead of
> *root_dag +
> > subdag + subdag + nested subdag* etc.
> >
> >    - e.g. section-1-* will have metadata current_group=section-1,
> >    parent_group=<the-root-dag-id> (welcome for naming suggestions), the
> >    reason for parent_group is that we can have nested group and still be
> >    able to capture the dependency.
> >
> > Runtime DAG:
> > [image: image.png]
> >
> > While at the UI, what we see would be something like this by utilizing
> the
> > metadata, and then we can expand or zoom into in some way.
> > [image: image.png]
> >
> > The benefits I can see is that:
> > 1. We don't need to deal with the extra complexity of SubDag for
> execution
> > and scheduling. It will be the same as not using SubDag.
> > 2. Still have the benefits of modularized and reusable dag code and
> > declare dependencies between them. And with the new SubDagOperator (see
> AIP
> > or draft PR), we can use the same dag_factory function for generating 1
> > dag, a lot of dynamic dags, or used for SubDag (in this case, it will
> just
> > extract all underlying tasks and append to the root dag).
> >
> >    - Then it gets to the idea of replacing subdag with a simpler concept
> >    by Ash:  the proposed change basically drains out the contents of a
> SubDag
> >    and becomes more like ExtractSubdagTasksAndAppendToRootdagOperator
> (forgive
> >    me about the crazy name..). In this case, it is still necessary to
> keep the
> >    concept of subdag as it is nothing more than a name?
> >
> > That's why the TaskGroup idea comes up. Thanks Chris Palmer for helping
> > conceptualize the functionality of TaskGroup, I will just paste it here.
> >
> > >   - Tasks can be added to a TaskGroup
> > >  - You *can* have dependencies between Tasks in the same TaskGroup, but
> > >   *cannot* have dependencies between a Task in a TaskGroup and either a
> > >   Task in a different TaskGroup or a Task not in any group
> > >   - You *can* have dependencies between a TaskGroup and either other
> > >   TaskGroups or Tasks not in any group
> > >   - The UI will by default render a TaskGroup as a single "object", but
> > >   which you expand or zoom into in some way
> > >   - You'd need some way to determine what the "status" of a TaskGroup
> was
> > >   at least for UI display purposes
> >
> > I agree with Chris:
> > - From the backend's view (scheduler & executor), I think TaskGroup
> should
> > be ignored during execution. (unless we decide to implement some metadata
> > operations that allows start/stop a group of tasks etc.)
> > - From the UI's View, it should be able to pick up the individual tasks'
> > status and then determine the TaskGroup's status
> >
> > Bin
> >
> > On Fri, Jun 12, 2020 at 10:28 AM Daniel Imberman <
> > daniel.imber...@gmail.com> wrote:
> >
> >> I hadn’t thought about using the `>>` operator to tie dags together but
> I
> >> think that sounds pretty great! I wonder if we could essentially write
> in
> >> the ability to set dependencies to all starter-tasks for that DAG.
> >>
> >> I’m personally ok with SubDag being a mostly UI concept. It doesn’t need
> >> to execute separately, you’re just adding more tasks to the queue that
> will
> >> be executed when there are resources available.
> >>
> >> via Newton Mail [
> >>
> https://cloudmagic.com/k/d/mailapp?ct=dx&cv=10.0.50&pv=10.14.6&source=email_footer_2
> >> ]
> >> On Fri, Jun 12, 2020 at 9:45 AM, Chris Palmer <ch...@crpalmer.com>
> wrote:
> >> I agree that SubDAGs are an overly complex abstraction. I think what is
> >> needed/useful is a TaskGroup concept. On a high level I think you want
> >> this
> >> functionality:
> >>
> >> - Tasks can be added to a TaskGroup
> >> - You *can* have dependencies between Tasks in the same TaskGroup, but
> >> *cannot* have dependencies between a Task in a TaskGroup and either a
> >> Task in a different TaskGroup or a Task not in any group
> >> - You *can* have dependencies between a TaskGroup and either other
> >> TaskGroups or Tasks not in any group
> >> - The UI will by default render a TaskGroup as a single "object", but
> >> which you expand or zoom into in some way
> >> - You'd need some way to determine what the "status" of a TaskGroup was
> >> at least for UI display purposes
> >>
> >> Not sure if it would need to be a top level object with its own database
> >> table and model or just another attribute on tasks. I think you could
> >> build
> >> it in a way such that from the schedulers point of view a DAG with
> >> TaskGroups doesn't get treated any differently. So it really just
> becomes
> >> a
> >> shortcut for setting dependencies between sets of Tasks, and allows the
> UI
> >> to simplify the render of the DAG structure.
> >>
> >> Chris
> >>
> >> On Fri, Jun 12, 2020 at 12:12 PM Dan Davydov
> <ddavy...@twitter.com.invalid
> >> >
> >> wrote:
> >>
> >> > Agree with James (and think it's actually the more important issue to
> >> fix),
> >> > but I am still convinced Ash' idea is the right way forward (just it
> >> might
> >> > require a bit more work to deprecate than adding visual grouping in
> the
> >> > UI).
> >> >
> >> > There was a previous thread about this FYI with more context on why
> >> subdags
> >> > are bad and potential solutions:
> >> > https://www.mail-archive.com/dev@airflow.apache.org/msg01202.html . A
> >> > solution I outline there to Jame's problem is e.g. enabling the >>
> >> operator
> >> > for Airflow operators to work with DAGs as well. I see this being
> >> separate
> >> > from Ash' solution for DAG grouping in the UI but one of the two items
> >> > required to replace all existing subdag functionality.
> >> >
> >> > I've been working with subdags for 3 years and they are always a giant
> >> pain
> >> > to use. They are a constant source of user confusion and breakages
> >> during
> >> > upgrades. Would love to see them gone :).
> >> >
> >> > On Fri, Jun 12, 2020 at 11:11 AM James Coder <jcode...@gmail.com>
> >> wrote:
> >> >
> >> > > I'm not sure I totally agree it's just a UI concept. I use the
> subdag
> >> > > operator to simplify dependencies too. If you have a group of tasks
> >> that
> >> > > need to finish before another group of tasks start, using a subdag
> is
> >> a
> >> > > pretty quick way to set those dependencies and I think also make it
> >> > easier
> >> > > to follow the dag code.
> >> > >
> >> > > On Fri, Jun 12, 2020 at 9:53 AM Kyle Hamlin <hamlin...@gmail.com>
> >> wrote:
> >> > >
> >> > > > I second Ash’s grouping concept.
> >> > > >
> >> > > > On Fri, Jun 12, 2020 at 5:10 AM Ash Berlin-Taylor <a...@apache.org
> >
> >> > > wrote:
> >> > > >
> >> > > > > Question:
> >> > > > >
> >> > > > > Do we even need the SubDagOperator anymore?
> >> > > > >
> >> > > > > Would removing it entirely and just replacing it with a UI
> >> grouping
> >> > > > > concept be conceptually simpler, less to get wrong, and closer
> to
> >> > what
> >> > > > > users actually want to achieve with subdags?
> >> > > > >
> >> > > > > With your proposed change, tasks in subdags could start running
> in
> >> > > > > parallel (a good change) -- so should we not also just
> _enitrely_
> >> > > remove
> >> > > > > the concept of a sub dag and replace it with something simpler.
> >> > > > >
> >> > > > > Problems with subdags (I think. I haven't used them extensively
> so
> >> > may
> >> > > > > be wrong on some of these):
> >> > > > > - They need their own dag_id, but it has(?) to be of the form
> >> > > > > `parent_dag_id.subdag_id`.
> >> > > > > - They need their own schedule_interval, but it has to match the
> >> > parent
> >> > > > dag
> >> > > > > - Sub dags can be paused on their own. (Does it make sense to do
> >> > this?
> >> > > > > Pausing just a sub dag would mean the sub dag would never
> >> execute, so
> >> > > > > the SubDagOperator would fail too.
> >> > > > > - You had to choose the executor to operator a subdag with --
> >> always
> >> > a
> >> > > > > bit of a kludge.
> >> > > > >
> >> > > > > Thoughts?
> >> > > > >
> >> > > > > -ash
> >> > > > >
> >> > > > > On Jun 12 2020, at 12:01 pm, Ash Berlin-Taylor <a...@apache.org>
> >> > wrote:
> >> > > > >
> >> > > > > > Workon sub-dags is much needed, I'm excited to see how this
> >> > > progresses.
> >> > > > > >
> >> > > > > >
> >> > > > > >> - *Unpack SubDags during dag parsing*: This rewrites the
> >> > > > > *DagBag.bag_dag*
> >> > > > > >> method to unpack subdag while parsing, and it will give a
> flat
> >> > > > > >> structure at
> >> > > > > >> the task level
> >> > > > > >
> >> > > > > > The serialized_dag representation already does this I think.
> At
> >> > least
> >> > > > if
> >> > > > > > I've understood your idea here correctly.
> >> > > > > >
> >> > > > > > -ash
> >> > > > > >
> >> > > > > >
> >> > > > > > On Jun 12 2020, at 9:51 am, Xinbin Huang <
> bin.huan...@gmail.com
> >> >
> >> > > > wrote:
> >> > > > > >
> >> > > > > >> Hi everyone,
> >> > > > > >>
> >> > > > > >> Sending a message to everyone and collect feedback on the
> >> AIP-34
> >> > on
> >> > > > > >> rewriting SubDagOperator. This was previously briefly
> >> mentioned in
> >> > > the
> >> > > > > >> discussion about what needs to be done for Airflow 2.0, and
> >> one of
> >> > > the
> >> > > > > >> ideas is to make SubDagOperator attach tasks back to the root
> >> DAG.
> >> > > > > >>
> >> > > > > >> This AIP-34 focuses on solving SubDagOperator related issues
> by
> >> > > > > reattaching
> >> > > > > >> all tasks back to the root dag while respecting dependencies
> >> > during
> >> > > > > >> parsing. The original grouping effect on the UI will be
> >> achieved
> >> > > > through
> >> > > > > >> grouping related tasks by metadata.
> >> > > > > >>
> >> > > > > >> This also makes the dag_factory function more reusable
> because
> >> you
> >> > > > don't
> >> > > > > >> need to have parent_dag_name and child_dag_name in the
> function
> >> > > > > signature
> >> > > > > >> anymore.
> >> > > > > >>
> >> > > > > >> Changes proposed:
> >> > > > > >>
> >> > > > > >> - *Unpack SubDags during dag parsing*: This rewrites the
> >> > > > > *DagBag.bag_dag*
> >> > > > > >> method to unpack subdag while parsing, and it will give a
> flat
> >> > > > > >> structure at
> >> > > > > >> the task level
> >> > > > > >> - *Simplify SubDagOperator*: The new SubDagOperator acts
> like a
> >> > > > > >> container and most of the original methods are removed. The
> >> > > > > >> signature is
> >> > > > > >> also changed to *subdag_factory *with *subdag_args *and
> >> > > > > *subdag_kwargs*.
> >> > > > > >> This is similar to the PythonOperator signature.
> >> > > > > >> - *Add a TaskGroup model and add current_group & parent_group
> >> > > > > attributes
> >> > > > > >> to BaseOperator*: This metadata is used to group tasks for
> >> > > > > >> rendering at
> >> > > > > >> UI level. It may potentially extend further to group
> arbitrary
> >> > > tasks
> >> > > > > >> outside the context of subdag to allow group-level operations
> >> > > (i.e.
> >> > > > > >> stop/trigger a group of task within the dag)
> >> > > > > >> - *Webserver UI for SubDag*: Proposed UI modification to
> allow
> >> > > > > >> (un)collapse a group of tasks for a flat structure to pair
> with
> >> > > the
> >> > > > > first
> >> > > > > >> change instead of the original hierarchical structure.
> >> > > > > >>
> >> > > > > >>
> >> > > > > >> Please see related documents and PRs for details:
> >> > > > > >> AIP:
> >> > > > > >>
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-34+Rewrite+SubDagOperator
> >> > > > > >>
> >> > > > > >> Original Issue:
> https://github.com/apache/airflow/issues/8078
> >> > > > > >> Draft PR: https://github.com/apache/airflow/pull/9243
> >> > > > > >>
> >> > > > > >> Please let me know if there are any aspects that you
> >> > agree/disagree
> >> > > > > >> with or
> >> > > > > >> need more clarification (especially the third change
> regarding
> >> > > > > TaskGroup).
> >> > > > > >> Any comments are welcome and I am looking forward to it!
> >> > > > > >>
> >> > > > > >> Cheers
> >> > > > > >> Bin
> >> > > > > >>
> >> > > > >
> >> > > > --
> >> > > > Kyle Hamlin
> >> > > >
> >> > >
> >> >
> >
> >
>
> --
> Thanks & Regards
>  Poornima
>

Reply via email to