+1, proposal looks good. The original intention was really to have tasks groups and a zoom-in/out in the UI. The original reasoning was to reuse the DAG object since it is a group of tasks, but as highlighted here it does create underlying confusions since a DAG is much more than just a group of tasks.
Max On Mon, Jun 15, 2020 at 2:43 AM Poornima Joshi <joshipoornim...@gmail.com> wrote: > Thank you for your email. > > On Sat, Jun 13, 2020 at 12:18 AM Xinbin Huang <bin.huan...@gmail.com> > wrote: > > > > > - *Unpack SubDags during dag parsing*: This rewrites the > > *DagBag.bag_dag* > > > > method to unpack subdag while parsing, and it will give a flat > > > > structure at > > > > the task level > > > > > > The serialized_dag representation already does this I think. At least > if > > > I've understood your idea here correctly. > > > > I am not sure about serialized_dag representation, but at least it will > > still keep the subdag entry in the DAG table? In my proposal as also in > the > > draft PR, the idea is to *extract the tasks from the subdag and add them > > back to the root_dag. *So the runtime DAG graph will look exactly the > > same as without subdag but with metadata attached to those sections. > These > > metadata will be later on used to render in the UI. So after parsing ( > > *DagBag.process_file()*), it will just output the *root_dag *instead of > *root_dag + > > subdag + subdag + nested subdag* etc. > > > > - e.g. section-1-* will have metadata current_group=section-1, > > parent_group=<the-root-dag-id> (welcome for naming suggestions), the > > reason for parent_group is that we can have nested group and still be > > able to capture the dependency. > > > > Runtime DAG: > > [image: image.png] > > > > While at the UI, what we see would be something like this by utilizing > the > > metadata, and then we can expand or zoom into in some way. > > [image: image.png] > > > > The benefits I can see is that: > > 1. We don't need to deal with the extra complexity of SubDag for > execution > > and scheduling. It will be the same as not using SubDag. > > 2. Still have the benefits of modularized and reusable dag code and > > declare dependencies between them. And with the new SubDagOperator (see > AIP > > or draft PR), we can use the same dag_factory function for generating 1 > > dag, a lot of dynamic dags, or used for SubDag (in this case, it will > just > > extract all underlying tasks and append to the root dag). > > > > - Then it gets to the idea of replacing subdag with a simpler concept > > by Ash: the proposed change basically drains out the contents of a > SubDag > > and becomes more like ExtractSubdagTasksAndAppendToRootdagOperator > (forgive > > me about the crazy name..). In this case, it is still necessary to > keep the > > concept of subdag as it is nothing more than a name? > > > > That's why the TaskGroup idea comes up. Thanks Chris Palmer for helping > > conceptualize the functionality of TaskGroup, I will just paste it here. > > > > > - Tasks can be added to a TaskGroup > > > - You *can* have dependencies between Tasks in the same TaskGroup, but > > > *cannot* have dependencies between a Task in a TaskGroup and either a > > > Task in a different TaskGroup or a Task not in any group > > > - You *can* have dependencies between a TaskGroup and either other > > > TaskGroups or Tasks not in any group > > > - The UI will by default render a TaskGroup as a single "object", but > > > which you expand or zoom into in some way > > > - You'd need some way to determine what the "status" of a TaskGroup > was > > > at least for UI display purposes > > > > I agree with Chris: > > - From the backend's view (scheduler & executor), I think TaskGroup > should > > be ignored during execution. (unless we decide to implement some metadata > > operations that allows start/stop a group of tasks etc.) > > - From the UI's View, it should be able to pick up the individual tasks' > > status and then determine the TaskGroup's status > > > > Bin > > > > On Fri, Jun 12, 2020 at 10:28 AM Daniel Imberman < > > daniel.imber...@gmail.com> wrote: > > > >> I hadn’t thought about using the `>>` operator to tie dags together but > I > >> think that sounds pretty great! I wonder if we could essentially write > in > >> the ability to set dependencies to all starter-tasks for that DAG. > >> > >> I’m personally ok with SubDag being a mostly UI concept. It doesn’t need > >> to execute separately, you’re just adding more tasks to the queue that > will > >> be executed when there are resources available. > >> > >> via Newton Mail [ > >> > https://cloudmagic.com/k/d/mailapp?ct=dx&cv=10.0.50&pv=10.14.6&source=email_footer_2 > >> ] > >> On Fri, Jun 12, 2020 at 9:45 AM, Chris Palmer <ch...@crpalmer.com> > wrote: > >> I agree that SubDAGs are an overly complex abstraction. I think what is > >> needed/useful is a TaskGroup concept. On a high level I think you want > >> this > >> functionality: > >> > >> - Tasks can be added to a TaskGroup > >> - You *can* have dependencies between Tasks in the same TaskGroup, but > >> *cannot* have dependencies between a Task in a TaskGroup and either a > >> Task in a different TaskGroup or a Task not in any group > >> - You *can* have dependencies between a TaskGroup and either other > >> TaskGroups or Tasks not in any group > >> - The UI will by default render a TaskGroup as a single "object", but > >> which you expand or zoom into in some way > >> - You'd need some way to determine what the "status" of a TaskGroup was > >> at least for UI display purposes > >> > >> Not sure if it would need to be a top level object with its own database > >> table and model or just another attribute on tasks. I think you could > >> build > >> it in a way such that from the schedulers point of view a DAG with > >> TaskGroups doesn't get treated any differently. So it really just > becomes > >> a > >> shortcut for setting dependencies between sets of Tasks, and allows the > UI > >> to simplify the render of the DAG structure. > >> > >> Chris > >> > >> On Fri, Jun 12, 2020 at 12:12 PM Dan Davydov > <ddavy...@twitter.com.invalid > >> > > >> wrote: > >> > >> > Agree with James (and think it's actually the more important issue to > >> fix), > >> > but I am still convinced Ash' idea is the right way forward (just it > >> might > >> > require a bit more work to deprecate than adding visual grouping in > the > >> > UI). > >> > > >> > There was a previous thread about this FYI with more context on why > >> subdags > >> > are bad and potential solutions: > >> > https://www.mail-archive.com/dev@airflow.apache.org/msg01202.html . A > >> > solution I outline there to Jame's problem is e.g. enabling the >> > >> operator > >> > for Airflow operators to work with DAGs as well. I see this being > >> separate > >> > from Ash' solution for DAG grouping in the UI but one of the two items > >> > required to replace all existing subdag functionality. > >> > > >> > I've been working with subdags for 3 years and they are always a giant > >> pain > >> > to use. They are a constant source of user confusion and breakages > >> during > >> > upgrades. Would love to see them gone :). > >> > > >> > On Fri, Jun 12, 2020 at 11:11 AM James Coder <jcode...@gmail.com> > >> wrote: > >> > > >> > > I'm not sure I totally agree it's just a UI concept. I use the > subdag > >> > > operator to simplify dependencies too. If you have a group of tasks > >> that > >> > > need to finish before another group of tasks start, using a subdag > is > >> a > >> > > pretty quick way to set those dependencies and I think also make it > >> > easier > >> > > to follow the dag code. > >> > > > >> > > On Fri, Jun 12, 2020 at 9:53 AM Kyle Hamlin <hamlin...@gmail.com> > >> wrote: > >> > > > >> > > > I second Ash’s grouping concept. > >> > > > > >> > > > On Fri, Jun 12, 2020 at 5:10 AM Ash Berlin-Taylor <a...@apache.org > > > >> > > wrote: > >> > > > > >> > > > > Question: > >> > > > > > >> > > > > Do we even need the SubDagOperator anymore? > >> > > > > > >> > > > > Would removing it entirely and just replacing it with a UI > >> grouping > >> > > > > concept be conceptually simpler, less to get wrong, and closer > to > >> > what > >> > > > > users actually want to achieve with subdags? > >> > > > > > >> > > > > With your proposed change, tasks in subdags could start running > in > >> > > > > parallel (a good change) -- so should we not also just > _enitrely_ > >> > > remove > >> > > > > the concept of a sub dag and replace it with something simpler. > >> > > > > > >> > > > > Problems with subdags (I think. I haven't used them extensively > so > >> > may > >> > > > > be wrong on some of these): > >> > > > > - They need their own dag_id, but it has(?) to be of the form > >> > > > > `parent_dag_id.subdag_id`. > >> > > > > - They need their own schedule_interval, but it has to match the > >> > parent > >> > > > dag > >> > > > > - Sub dags can be paused on their own. (Does it make sense to do > >> > this? > >> > > > > Pausing just a sub dag would mean the sub dag would never > >> execute, so > >> > > > > the SubDagOperator would fail too. > >> > > > > - You had to choose the executor to operator a subdag with -- > >> always > >> > a > >> > > > > bit of a kludge. > >> > > > > > >> > > > > Thoughts? > >> > > > > > >> > > > > -ash > >> > > > > > >> > > > > On Jun 12 2020, at 12:01 pm, Ash Berlin-Taylor <a...@apache.org> > >> > wrote: > >> > > > > > >> > > > > > Workon sub-dags is much needed, I'm excited to see how this > >> > > progresses. > >> > > > > > > >> > > > > > > >> > > > > >> - *Unpack SubDags during dag parsing*: This rewrites the > >> > > > > *DagBag.bag_dag* > >> > > > > >> method to unpack subdag while parsing, and it will give a > flat > >> > > > > >> structure at > >> > > > > >> the task level > >> > > > > > > >> > > > > > The serialized_dag representation already does this I think. > At > >> > least > >> > > > if > >> > > > > > I've understood your idea here correctly. > >> > > > > > > >> > > > > > -ash > >> > > > > > > >> > > > > > > >> > > > > > On Jun 12 2020, at 9:51 am, Xinbin Huang < > bin.huan...@gmail.com > >> > > >> > > > wrote: > >> > > > > > > >> > > > > >> Hi everyone, > >> > > > > >> > >> > > > > >> Sending a message to everyone and collect feedback on the > >> AIP-34 > >> > on > >> > > > > >> rewriting SubDagOperator. This was previously briefly > >> mentioned in > >> > > the > >> > > > > >> discussion about what needs to be done for Airflow 2.0, and > >> one of > >> > > the > >> > > > > >> ideas is to make SubDagOperator attach tasks back to the root > >> DAG. > >> > > > > >> > >> > > > > >> This AIP-34 focuses on solving SubDagOperator related issues > by > >> > > > > reattaching > >> > > > > >> all tasks back to the root dag while respecting dependencies > >> > during > >> > > > > >> parsing. The original grouping effect on the UI will be > >> achieved > >> > > > through > >> > > > > >> grouping related tasks by metadata. > >> > > > > >> > >> > > > > >> This also makes the dag_factory function more reusable > because > >> you > >> > > > don't > >> > > > > >> need to have parent_dag_name and child_dag_name in the > function > >> > > > > signature > >> > > > > >> anymore. > >> > > > > >> > >> > > > > >> Changes proposed: > >> > > > > >> > >> > > > > >> - *Unpack SubDags during dag parsing*: This rewrites the > >> > > > > *DagBag.bag_dag* > >> > > > > >> method to unpack subdag while parsing, and it will give a > flat > >> > > > > >> structure at > >> > > > > >> the task level > >> > > > > >> - *Simplify SubDagOperator*: The new SubDagOperator acts > like a > >> > > > > >> container and most of the original methods are removed. The > >> > > > > >> signature is > >> > > > > >> also changed to *subdag_factory *with *subdag_args *and > >> > > > > *subdag_kwargs*. > >> > > > > >> This is similar to the PythonOperator signature. > >> > > > > >> - *Add a TaskGroup model and add current_group & parent_group > >> > > > > attributes > >> > > > > >> to BaseOperator*: This metadata is used to group tasks for > >> > > > > >> rendering at > >> > > > > >> UI level. It may potentially extend further to group > arbitrary > >> > > tasks > >> > > > > >> outside the context of subdag to allow group-level operations > >> > > (i.e. > >> > > > > >> stop/trigger a group of task within the dag) > >> > > > > >> - *Webserver UI for SubDag*: Proposed UI modification to > allow > >> > > > > >> (un)collapse a group of tasks for a flat structure to pair > with > >> > > the > >> > > > > first > >> > > > > >> change instead of the original hierarchical structure. > >> > > > > >> > >> > > > > >> > >> > > > > >> Please see related documents and PRs for details: > >> > > > > >> AIP: > >> > > > > >> > >> > > > > > >> > > > > >> > > > >> > > >> > https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-34+Rewrite+SubDagOperator > >> > > > > >> > >> > > > > >> Original Issue: > https://github.com/apache/airflow/issues/8078 > >> > > > > >> Draft PR: https://github.com/apache/airflow/pull/9243 > >> > > > > >> > >> > > > > >> Please let me know if there are any aspects that you > >> > agree/disagree > >> > > > > >> with or > >> > > > > >> need more clarification (especially the third change > regarding > >> > > > > TaskGroup). > >> > > > > >> Any comments are welcome and I am looking forward to it! > >> > > > > >> > >> > > > > >> Cheers > >> > > > > >> Bin > >> > > > > >> > >> > > > > > >> > > > -- > >> > > > Kyle Hamlin > >> > > > > >> > > > >> > > > > > > > -- > Thanks & Regards > Poornima >