Permissions granted, let me know if you face any issues. On Fri, Aug 14, 2020 at 1:10 PM Yu Qian <yuqian1...@gmail.com> wrote:
> Hi, Kaxil, my ID for cwiki.apache.org is yuqian1990. Thank you! > > On Fri, Aug 14, 2020 at 7:35 PM Kaxil Naik <kaxiln...@gmail.com> wrote: > > > What's your ID i.e. if you haven't created an account yet, please create > > one at https://cwiki.apache.org/confluence/signup.action and send us > your > > ID and we will add permissions. > > > > Thanks. I'll edit the AIP. May I request permission to edit it? > > > My wiki user email is yuqian1...@gmail.com. > > > > > > On Fri, Aug 14, 2020 at 9:45 AM Yu Qian <yuqian1...@gmail.com> wrote: > > > > > Re, Xinbin. Thanks. I'll edit the AIP. May I request permission to edit > > it? > > > My wiki user email is yuqian1...@gmail.com. > > > > > > Re Gerard: yes the UI loads all the nodes as json from the web server > at > > > once. However, it only adds the top level nodes and edges to the graph > > when > > > the Graph View page is first opened. And then adds the expanded nodes > to > > > the graph as the user expands them. From what I've experienced with > DAGs > > > containing around 400 tasks (not using TaskGroup or SubDagOperator), > > > opening the whole dag in Graph View usually takes 5 seconds. Less than > > 60ms > > > of that is taken by loading the data from webserver. The remaining > 4.9s+ > > is > > > taken by javascript functions in dagre-d3.min.js such as createNodes, > > > createEdgeLabels, etc and by rendering the graph. With TaskGroup being > > used > > > to group tasks into a smaller number of top-level nodes, the amount of > > data > > > loaded from webserver will remain about the same compared to a flat dag > > of > > > the same size, but the number of nodes and edges needed to be plot on > the > > > graph can be reduced significantly. So in theory this should speed up > the > > > time it takes to open Graph View even without lazy-loading the data > (I'll > > > experiment to find out). That said, if it comes to a point lazy-loading > > > helps, we can still implement it as an improvement. > > > > > > Re James: the Tree View looks as if all all the groups are fully > > expanded. > > > (because under the hood all the tasks are in a single DAG). I'm less > > > worried about Tree View at the moment because it already has a > mechanism > > > for collapsing tasks by the dependency tree. That said, the Tree View > can > > > definitely be improved too with TaskGroup. (e.g. collapse tasks in the > > same > > > TaskGroup when Tree View is first opened). > > > > > > For both suggestions, implementing them don't require fundamental > changes > > > to the idea. I think we can have a basic working TaskGroup first, and > > then > > > improve it incrementally in several PRs as we get more feedback from > the > > > community. What do you think? > > > > > > Qian > > > > > > > > > On Wed, Aug 12, 2020 at 9:15 AM James Coder <jcode...@gmail.com> > wrote: > > > > > > > I agree this looks great, one question, how does the tree view look? > > > > > > > > James Coder > > > > > > > > > On Aug 11, 2020, at 6:48 PM, Gerard Casas Saez < > > gcasass...@twitter.com > > > .invalid> > > > > wrote: > > > > > > > > > > First of all, this is awesome!! > > > > > > > > > > Secondly, checking your UI code, seems you are loading all > operators > > at > > > > > once. Wondering if we can load them as needed (aka load whenever we > > > click > > > > > the TaskGroup). Some of our DAGs are so large that take forever to > > load > > > > on > > > > > the Graph view, so worried about this still being an issue here. It > > may > > > > be > > > > > easily solvable by implementing lazy loading of the graph. Not sure > > how > > > > > easy to implement/add to the UI extension (and dont want to push > for > > > > early > > > > > optimization as its the root of all evil). > > > > > Gerard Casas Saez > > > > > Twitter | Cortex | @casassaez <http://twitter.com/casassaez> > > > > > > > > > > > > > > >> On Tue, Aug 11, 2020 at 10:35 AM Xinbin Huang < > > bin.huan...@gmail.com> > > > > wrote: > > > > >> > > > > >> Hi Yu, > > > > >> > > > > >> Thank you so much for taking on this. I was fairly distracted > > > previously > > > > >> and I didn't have the time to update the proposal. In fact, after > > > > >> discussing with Ash, Kaxil and Daniel, the direction of this AIP > has > > > > been > > > > >> changed to favor the concept of TaskGroup instead of rewriting > > > > >> SubDagOperator (though it may may sense to deprecate SubDag in a > > > future > > > > >> date.). > > > > >> > > > > >> Your PR is amazing and it has implemented the desire features. I > > think > > > > we > > > > >> can focus on your new PR instead. Do you mind updating the AIP > based > > > on > > > > >> what you have done in your PR? > > > > >> > > > > >> Best, > > > > >> Bin > > > > >> > > > > >> > > > > >>> On Tue, Aug 11, 2020 at 7:11 AM Yu Qian <yuqian1...@gmail.com> > > > wrote: > > > > >>> > > > > >>> Hi, all, I've added the basic UI changes to my proposed > > > implementation > > > > of > > > > >>> TaskGroup as UI grouping concept: > > > > >>> https://github.com/apache/airflow/pull/10153 > > > > >>> > > > > >>> I think Chris had a pretty good specification of TaskGroup so i'm > > > > quoting > > > > >>> it here. The only thing I don't fully agree with is the > restriction > > > > >>> "... **cannot* > > > > >>> have dependencies between a Task in a TaskGroup and either a* > > > > >>> * Task in a different TaskGroup or a Task not in any group*". I > > > think > > > > >>> this is over restrictive. Since TaskGroup is a UI concept, tasks > > can > > > > have > > > > >>> dependencies on tasks in other TaskGroup or not in any TaskGroup. > > In > > > my > > > > >> PR, > > > > >>> this is allowed. The graph edges will update accordingly when > > > > TaskGroups > > > > >>> are expanded/collapsed. TaskGroup is only helping to make the UI > > look > > > > >> less > > > > >>> crowded. Under the hood, everything is still a DAG of tasks and > > edges > > > > so > > > > >>> things work normally. Here's a screenshot > > > > >>> < > > > > >>> > > > > >> > > > > > > > > > > https://raw.githubusercontent.com/yuqian90/airflow/gif_for_demo/airflow/www/static/screen-shot-short.gif > > > > >>>> > > > > >>> of the UI interaction. > > > > >>> > > > > >>> > > > > >>> > > > > >>> > > > > >>> > > > > >>> > > > > >>> > > > > >>> > > > > >>> > > > > >>> > > > > >>> > > > > >>> * - Tasks can be added to a TaskGroup - You *can* have > > > dependencies > > > > >>> between Tasks in the same TaskGroup, but *cannot* have > > dependencies > > > > >>> between a Task in a TaskGroup and either a Task in a different > > > > >> TaskGroup > > > > >>> or a Task not in any group - You *can* have dependencies > between > > a > > > > >>> TaskGroup and either other TaskGroups or Tasks not in any group > > - > > > > The > > > > >>> UI will by default render a TaskGroup as a single "object", but > > > which > > > > >> you > > > > >>> expand or zoom into in some way - You'd need some way to > > determine > > > > what > > > > >>> the "status" of a TaskGroup was at least for UI display > purposes* > > > > >>> > > > > >>> > > > > >>> Regarding Jake's comment, I agree it's possible to implement the > > > > >> "retrying > > > > >>> tasks in a group" pattern he mentioned as an optional feature of > > > > >> TaskGroup > > > > >>> although that may go against having TaskGroup as a pure UI > concept. > > > For > > > > >> the > > > > >>> motivating example Jake provided, I suggest implementing both > > > > >>> SubmitLongRunningJobTask and PollJobStatusSensor in a single > > > operator. > > > > It > > > > >>> can do something like BaseSensorOperator.execute() does in > > > "reschedule" > > > > >>> mode, i.e. it first executes some code to submit the long running > > job > > > > to > > > > >>> the external service, and store the state (e.g. in XCom). Then > > > > reschedule > > > > >>> itself. Subsequent runs then pokes for the completion state. > > > > >>> > > > > >>> > > > > >>> On Thu, Aug 6, 2020 at 2:08 AM Jacob Ferriero > > > > >> <jferri...@google.com.invalid > > > > >>>> > > > > >>> wrote: > > > > >>> > > > > >>>> I really like this idea of a TaskGroup container as I think this > > > will > > > > >> be > > > > >>>> much easier to use than SubDag. > > > > >>>> > > > > >>>> I'd like to propose an optional behavior for special retry > > mechanics > > > > >> via > > > > >>> a > > > > >>>> TaskGroup.retry_all property. > > > > >>>> This way I could use TaskGroup to replace my favorite use of > > SubDag > > > > for > > > > >>>> atomically retrying tasks of the pattern "act on external state > > then > > > > >>>> reschedule poll until desired state reached". > > > > >>>> > > > > >>>> Motivating use case I have for a SubDag is very simple two task > > > group > > > > >>>> [SubmitLongRunningJobTask >> PollJobStatusSensor]. > > > > >>>> I use SubDag is because it gives me an easy way to retry the > > > > >>> SubmitJobTask > > > > >>>> if something about the PollJobSensor fails. > > > > >>>> This pattern would be really nice for jobs that are expected to > > run > > > a > > > > >>> long > > > > >>>> time (because we can use sensor can use reschedule mode freeing > up > > > > >> slots) > > > > >>>> but might fail for a retryable reason. > > > > >>>> However, using SubDag to meet this use case defeats the purpose > > > > because > > > > >>>> SubDag infamously > > > > >>>> < > > > > >>>> > > > > >>> > > > > >> > > > > > > > > > > https://medium.com/@team_24989/fixing-subdagoperator-deadlock-in-airflow-6c64312ebb10 > > > > >>>>> > > > > >>>> blocks a "controller" slot for the entire duration. > > > > >>>> This may feel like a cyclic behavior but reality it is very > common > > > for > > > > >> a > > > > >>>> single operator to submit job / wait til done. > > > > >>>> We could use this case refactor many operators (e.g. BQ, > Dataproc, > > > > >>>> Dataflow) to be implemented as TaskGroup[SubmitTask >> PollTask] > > > with > > > > >> an > > > > >>>> optional reschedule mode if user knows that this job may take a > > long > > > > >>> time. > > > > >>>> > > > > >>>> I'd be happy to the development work on adding this specific > retry > > > > >>> behavior > > > > >>>> to TaskGroup once the base concept is implemented if others in > the > > > > >>>> community would find this a useful feature. > > > > >>>> > > > > >>>> Cheers, > > > > >>>> Jake > > > > >>>> > > > > >>>> On Tue, Aug 4, 2020 at 10:07 AM Jarek Potiuk < > > > > jarek.pot...@polidea.com > > > > >>> > > > > >>>> wrote: > > > > >>>> > > > > >>>>> All for it :) . I think we are getting closer to have regular > > > > >> planning > > > > >>>> and > > > > >>>>> making some structured approach to 2.0 and starting task force > > for > > > it > > > > >>>> soon, > > > > >>>>> so I think this should be perfectly fine to discuss and even > > start > > > > >>>>> implementing what's beyond as soon as we make sure that we are > > > > >>>> prioritizing > > > > >>>>> 2.0 work. > > > > >>>>> > > > > >>>>> J, > > > > >>>>> > > > > >>>>> > > > > >>>>> On Tue, Aug 4, 2020 at 12:09 PM Yu Qian <yuqian1...@gmail.com> > > > > >> wrote: > > > > >>>>> > > > > >>>>>> Hi Jarek, > > > > >>>>>> > > > > >>>>>> I agree we should not change the behaviour of the existing > > > > >>>> SubDagOperator > > > > >>>>>> till Airflow 2.1. Is it okay to continue the discussion about > > > > >>> TaskGroup > > > > >>>>> as > > > > >>>>>> a brand new concept/feature independent from the existing > > > > >>>> SubDagOperator? > > > > >>>>>> In other words, shall we add TaskGroup as a UI grouping > concept > > > > >> like > > > > >>>> Ash > > > > >>>>>> suggested, and not touch SubDagOperator atl all. Whenever we > are > > > > >>> ready > > > > >>>>> with > > > > >>>>>> TaskGroup, we then deprecate SubDagOperator in Airflow 2.1. > > > > >>>>>> > > > > >>>>>> I really like Ash's idea of simplifying the SubDagOperator > idea > > > > >> into > > > > >>> a > > > > >>>>>> simple UI grouping concept. I think Xinbin's idea of > > "reattaching > > > > >> all > > > > >>>> the > > > > >>>>>> tasks to the root DAG" is the way to go. And I see James > pointed > > > > >> out > > > > >>> we > > > > >>>>>> need some helper functions to simplify dependencies setting of > > > > >>>> TaskGroup. > > > > >>>>>> Xinbin put up a pretty elegant example in his PR > > > > >>>>>> <https://github.com/apache/airflow/pull/9243>. I think having > > > > >>>> TaskGroup > > > > >>>>> as > > > > >>>>>> a UI concept should be a relatively small change. We can > > simplify > > > > >>>>> Xinbin's > > > > >>>>>> PR further. So I put up this alternative proposal here: > > > > >>>>>> https://github.com/apache/airflow/pull/10153 > > > > >>>>>> > > > > >>>>>> I have not done any UI changes due to lack of experience with > > web > > > > >> UI. > > > > >>>> If > > > > >>>>>> anyone's interested, please take a look at the PR. > > > > >>>>>> > > > > >>>>>> Qian > > > > >>>>>> > > > > >>>>>> On Mon, Jun 22, 2020 at 5:15 AM Jarek Potiuk < > > > > >>> jarek.pot...@polidea.com > > > > >>>>> > > > > >>>>>> wrote: > > > > >>>>>> > > > > >>>>>>> Similar point here to the other ideas that are popping up. > > Maybe > > > > >> we > > > > >>>>>> should > > > > >>>>>>> just focus on completing 2.0 and make all discussions about > > > > >> further > > > > >>>>>>> improvements to 2.1? While those are important discussions > (and > > > > >> we > > > > >>>>> should > > > > >>>>>>> continue them in the near future !) I think at this point > > > > >> focusing > > > > >>>> on > > > > >>>>>>> delivering 2.0 in its current shape should be our focus now ? > > > > >>>>>>> > > > > >>>>>>> J. > > > > >>>>>>> > > > > >>>>>>> On Thu, Jun 18, 2020 at 6:35 PM Xinbin Huang < > > > > >>> bin.huan...@gmail.com> > > > > >>>>>>> wrote: > > > > >>>>>>> > > > > >>>>>>>> Hi Daniel > > > > >>>>>>>> > > > > >>>>>>>> I agree that the TaskGroup should have the same API as a DAG > > > > >>> object > > > > >>>>>>> related > > > > >>>>>>>> to task dependencies, but it will not have anything related > to > > > > >>>> actual > > > > >>>>>>>> execution or scheduling. > > > > >>>>>>>> I will update the AIP according to this over the weekend. > > > > >>>>>>>> > > > > >>>>>>>>> We could even make a “DAGTemplate” object s.t. when you > > > > >> import > > > > >>>> the > > > > >>>>>>> object > > > > >>>>>>>> you can import it with parameters to determine the shape of > > the > > > > >>>> DAG. > > > > >>>>>>>> > > > > >>>>>>>> Can you elaborate a bit more on this? Does it serve a > similar > > > > >>>> purpose > > > > >>>>>> as > > > > >>>>>>> a > > > > >>>>>>>> DAG factory function? > > > > >>>>>>>> > > > > >>>>>>>> > > > > >>>>>>>> > > > > >>>>>>>> On Thu, Jun 18, 2020 at 9:13 AM Daniel Imberman < > > > > >>>>>>> daniel.imber...@gmail.com > > > > >>>>>>>>> > > > > >>>>>>>> wrote: > > > > >>>>>>>> > > > > >>>>>>>>> Hi Bin, > > > > >>>>>>>>> > > > > >>>>>>>>> Why not give the TaskGroup the same API as a DAG object > (e.g. > > > > >>> the > > > > >>>>>>> bitwise > > > > >>>>>>>>> operator fro task dependencies). We could even make a > > > > >>>> “DAGTemplate” > > > > >>>>>>>> object > > > > >>>>>>>>> s.t. when you import the object you can import it with > > > > >>> parameters > > > > >>>>> to > > > > >>>>>>>>> determine the shape of the DAG. > > > > >>>>>>>>> > > > > >>>>>>>>> > > > > >>>>>>>>> On Wed, Jun 17, 2020 at 8:54 PM, Xinbin Huang < > > > > >>>>> bin.huan...@gmail.com > > > > >>>>>>> > > > > >>>>>>>>> wrote: > > > > >>>>>>>>> The TaskGroup will not take schedule interval as a > parameter > > > > >>>>> itself, > > > > >>>>>>> and > > > > >>>>>>>> it > > > > >>>>>>>>> depends on the DAG where it attaches to. In my opinion, the > > > > >>>>> TaskGroup > > > > >>>>>>>> will > > > > >>>>>>>>> only contain a group of tasks with interdependencies, and > the > > > > >>>>>> TaskGroup > > > > >>>>>>>>> behaves like a task. It doesn't contain any > > > > >>> execution/scheduling > > > > >>>>>> logic > > > > >>>>>>>>> (i.e. schedule_interval, concurrency, max_active_runs etc.) > > > > >>> like > > > > >>>> a > > > > >>>>>> DAG > > > > >>>>>>>>> does. > > > > >>>>>>>>> > > > > >>>>>>>>>> For example, there is the scenario that the schedule > > > > >> interval > > > > >>>> of > > > > >>>>>> DAG > > > > >>>>>>> is > > > > >>>>>>>>> 1 hour and the schedule interval of TaskGroup is 20 min. > > > > >>>>>>>>> > > > > >>>>>>>>> I am curious why you ask this. Is this a use case that you > > > > >> want > > > > >>>> to > > > > >>>>>>>> achieve? > > > > >>>>>>>>> > > > > >>>>>>>>> Bin > > > > >>>>>>>>> > > > > >>>>>>>>> On Wed, Jun 17, 2020 at 7:59 PM 蒋晓峰 < > > > > >> thanosxnicho...@gmail.com > > > > >>>> > > > > >>>>>> wrote: > > > > >>>>>>>>> > > > > >>>>>>>>>> Hi Bin, > > > > >>>>>>>>>> Using TaskGroup, Is the schedule interval of TaskGroup the > > > > >>> same > > > > >>>>> as > > > > >>>>>>> the > > > > >>>>>>>>>> parent DAG? My main concern is whether the schedule > > > > >> interval > > > > >>> of > > > > >>>>>>>> TaskGroup > > > > >>>>>>>>>> could be different with that of the DAG? For example, > there > > > > >>> is > > > > >>>>> the > > > > >>>>>>>>> scenario > > > > >>>>>>>>>> that the schedule interval of DAG is 1 hour and the > > > > >> schedule > > > > >>>>>> interval > > > > >>>>>>>> of > > > > >>>>>>>>>> TaskGroup is 20 min. > > > > >>>>>>>>>> > > > > >>>>>>>>>> Cheers, > > > > >>>>>>>>>> Nicholas > > > > >>>>>>>>>> > > > > >>>>>>>>>> On Thu, Jun 18, 2020 at 10:30 AM Xinbin Huang < > > > > >>>>>> bin.huan...@gmail.com > > > > >>>>>>>> > > > > >>>>>>>>>> wrote: > > > > >>>>>>>>>> > > > > >>>>>>>>>>> Hi Nicholas, > > > > >>>>>>>>>>> > > > > >>>>>>>>>>> I am not sure about the old behavior of SubDagOperator, > > > > >>> maybe > > > > >>>>> it > > > > >>>>>>> will > > > > >>>>>>>>>> throw > > > > >>>>>>>>>>> an error? But in the original proposal, the subdag's > > > > >>>>>>>> schedule_interval > > > > >>>>>>>>>> will > > > > >>>>>>>>>>> be ignored. Or if we decide to use TaskGroup to replace > > > > >>>> SubDag, > > > > >>>>>>> there > > > > >>>>>>>>>> will > > > > >>>>>>>>>>> be no subdag schedule_interval. > > > > >>>>>>>>>>> > > > > >>>>>>>>>>> Bin > > > > >>>>>>>>>>> > > > > >>>>>>>>>>> On Wed, Jun 17, 2020 at 6:21 PM 蒋晓峰 < > > > > >>>> thanosxnicho...@gmail.com > > > > >>>>>> > > > > >>>>>>>> wrote: > > > > >>>>>>>>>>> > > > > >>>>>>>>>>>> Hi Bin, > > > > >>>>>>>>>>>> Thanks for your good proposal. I was confused whether > > > > >> the > > > > >>>>>>> schedule > > > > >>>>>>>>>>>> interval of SubDAG is different from that of the parent > > > > >>>> DAG? > > > > >>>>> I > > > > >>>>>>> have > > > > >>>>>>>>>>>> discussed with Jiajie Zhong about the schedule interval > > > > >>> of > > > > >>>>>>> SubDAG. > > > > >>>>>>>> If > > > > >>>>>>>>>> the > > > > >>>>>>>>>>>> SubDagOperator has a different schedule interval, what > > > > >>> will > > > > >>>>>>> happen > > > > >>>>>>>>> for > > > > >>>>>>>>>>> the > > > > >>>>>>>>>>>> scheduler to schedule the parent DAG? > > > > >>>>>>>>>>>> > > > > >>>>>>>>>>>> Regards, > > > > >>>>>>>>>>>> Nicholas Jiang > > > > >>>>>>>>>>>> > > > > >>>>>>>>>>>> On Thu, Jun 18, 2020 at 8:04 AM Xinbin Huang < > > > > >>>>>>>> bin.huan...@gmail.com> > > > > >>>>>>>>>>>> wrote: > > > > >>>>>>>>>>>> > > > > >>>>>>>>>>>>> Thank you, Max, Kaxil, and everyone's feedback! > > > > >>>>>>>>>>>>> > > > > >>>>>>>>>>>>> I have rethought about the concept of subdag and task > > > > >>>>>> groups. I > > > > >>>>>>>>> think > > > > >>>>>>>>>>> the > > > > >>>>>>>>>>>>> better way to approach this is to entirely remove > > > > >>> subdag > > > > >>>>> and > > > > >>>>>>>>>> introduce > > > > >>>>>>>>>>>> the > > > > >>>>>>>>>>>>> concept of TaskGroup, which is a container of tasks > > > > >>> along > > > > >>>>>> with > > > > >>>>>>>>> their > > > > >>>>>>>>>>>>> dependencies *without execution/scheduling logic as a > > > > >>>> DAG*. > > > > >>>>>> The > > > > >>>>>>>>> only > > > > >>>>>>>>>>>>> purpose of it is to group a list of tasks, but you > > > > >>> still > > > > >>>>> need > > > > >>>>>>> to > > > > >>>>>>>>> add > > > > >>>>>>>>>> it > > > > >>>>>>>>>>>> to > > > > >>>>>>>>>>>>> a DAG for execution. > > > > >>>>>>>>>>>>> > > > > >>>>>>>>>>>>> Here is a small code snippet. > > > > >>>>>>>>>>>>> > > > > >>>>>>>>>>>>> ``` > > > > >>>>>>>>>>>>> class TaskGroup: > > > > >>>>>>>>>>>>> """ > > > > >>>>>>>>>>>>> A TaskGroup contains a group of tasks. > > > > >>>>>>>>>>>>> > > > > >>>>>>>>>>>>> If default_args is missing, it will take default args > > > > >>>> from > > > > >>>>>> the > > > > >>>>>>>>>> DAG. > > > > >>>>>>>>>>>>> """ > > > > >>>>>>>>>>>>> def __init__(self, group_id, default_args): > > > > >>>>>>>>>>>>> pass > > > > >>>>>>>>>>>>> > > > > >>>>>>>>>>>>> > > > > >>>>>>>>>>>>> """ > > > > >>>>>>>>>>>>> You can add tasks to a task group similar to adding > > > > >>> tasks > > > > >>>>> to > > > > >>>>>> a > > > > >>>>>>>> DAG > > > > >>>>>>>>>>>>> > > > > >>>>>>>>>>>>> This can be declared in a separate file from the dag > > > > >>> file > > > > >>>>>>>>>>>>> """ > > > > >>>>>>>>>>>>> download_group = TaskGroup(group_id='download', > > > > >>>>>>>>>>>> default_args=default_args) > > > > >>>>>>>>>>>>> download_group.add_task(task1) > > > > >>>>>>>>>>>>> task2.dag = download_group > > > > >>>>>>>>>>>>> > > > > >>>>>>>>>>>>> with download_group: > > > > >>>>>>>>>>>>> task3 = DummyOperator(task_id='task3') > > > > >>>>>>>>>>>>> > > > > >>>>>>>>>>>>> [task, task2] >> task3 > > > > >>>>>>>>>>>>> > > > > >>>>>>>>>>>>> > > > > >>>>>>>>>>>>> """Add it to a DAG for execution""" > > > > >>>>>>>>>>>>> with DAG(dag_id='start_download_dag', > > > > >>>>>>> default_args=default_args, > > > > >>>>>>>>>>>>> schedule_interval='@daily', ...) as dag: > > > > >>>>>>>>>>>>> start = DummyOperator(task_id='start') > > > > >>>>>>>>>>>>> start >> download_group > > > > >>>>>>>>>>>>> # this is equivalent to > > > > >>>>>>>>>>>>> # start >> [task, task2] >> task3 > > > > >>>>>>>>>>>>> ``` > > > > >>>>>>>>>>>>> > > > > >>>>>>>>>>>>> With this, we can still reuse a group of tasks and > > > > >> set > > > > >>>>>>>> dependencies > > > > >>>>>>>>>>>> between > > > > >>>>>>>>>>>>> them; it avoids the boilerplate code from using > > > > >>>>>> SubDagOperator, > > > > >>>>>>>> and > > > > >>>>>>>>>> we > > > > >>>>>>>>>>>> can > > > > >>>>>>>>>>>>> declare dependencies as `task >> task_group >> task`. > > > > >>>>>>>>>>>>> > > > > >>>>>>>>>>>>> User migration wise, we can introduce it before > > > > >> Airflow > > > > >>>> 2.0 > > > > >>>>>> and > > > > >>>>>>>>> allow > > > > >>>>>>>>>>>>> gradual transition. Then we can decide if we still > > > > >> want > > > > >>>> to > > > > >>>>>> keep > > > > >>>>>>>> the > > > > >>>>>>>>>>>>> SubDagOperator or simply remove it. > > > > >>>>>>>>>>>>> > > > > >>>>>>>>>>>>> Any thoughts? > > > > >>>>>>>>>>>>> > > > > >>>>>>>>>>>>> Cheers, > > > > >>>>>>>>>>>>> Bin > > > > >>>>>>>>>>>>> > > > > >>>>>>>>>>>>> > > > > >>>>>>>>>>>>> On Wed, Jun 17, 2020 at 7:37 AM Maxime Beauchemin < > > > > >>>>>>>>>>>>> maximebeauche...@gmail.com> wrote: > > > > >>>>>>>>>>>>> > > > > >>>>>>>>>>>>>> +1, proposal looks good. > > > > >>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>> The original intention was really to have tasks > > > > >>> groups > > > > >>>>> and > > > > >>>>>> a > > > > >>>>>>>>>>>> zoom-in/out > > > > >>>>>>>>>>>>> in > > > > >>>>>>>>>>>>>> the UI. The original reasoning was to reuse the DAG > > > > >>>>> object > > > > >>>>>>>> since > > > > >>>>>>>>> it > > > > >>>>>>>>>>> is > > > > >>>>>>>>>>>> a > > > > >>>>>>>>>>>>>> group of tasks, but as highlighted here it does > > > > >>> create > > > > >>>>>>>> underlying > > > > >>>>>>>>>>>>>> confusions since a DAG is much more than just a > > > > >> group > > > > >>>> of > > > > >>>>>>> tasks. > > > > >>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>> Max > > > > >>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>> On Mon, Jun 15, 2020 at 2:43 AM Poornima Joshi < > > > > >>>>>>>>>>>>> joshipoornim...@gmail.com> > > > > >>>>>>>>>>>>>> wrote: > > > > >>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>> Thank you for your email. > > > > >>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>> On Sat, Jun 13, 2020 at 12:18 AM Xinbin Huang < > > > > >>>>>>>>>>> bin.huan...@gmail.com > > > > >>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>> wrote: > > > > >>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>>> - *Unpack SubDags during dag parsing*: This > > > > >>>>>> rewrites > > > > >>>>>>>> the > > > > >>>>>>>>>>>>>>>> *DagBag.bag_dag* > > > > >>>>>>>>>>>>>>>>>> method to unpack subdag while parsing, and > > > > >> it > > > > >>>>> will > > > > >>>>>>>> give a > > > > >>>>>>>>>>>> flat > > > > >>>>>>>>>>>>>>>>>> structure at > > > > >>>>>>>>>>>>>>>>>> the task level > > > > >>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>> The serialized_dag representation already > > > > >> does > > > > >>>>> this I > > > > >>>>>>>>> think. > > > > >>>>>>>>>> At > > > > >>>>>>>>>>>>> least > > > > >>>>>>>>>>>>>>> if > > > > >>>>>>>>>>>>>>>>> I've understood your idea here correctly. > > > > >>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>> I am not sure about serialized_dag > > > > >>> representation, > > > > >>>>> but > > > > >>>>>> at > > > > >>>>>>>>> least > > > > >>>>>>>>>>> it > > > > >>>>>>>>>>>>> will > > > > >>>>>>>>>>>>>>>> still keep the subdag entry in the DAG table? > > > > >> In > > > > >>> my > > > > >>>>>>>> proposal > > > > >>>>>>>>> as > > > > >>>>>>>>>>>> also > > > > >>>>>>>>>>>>> in > > > > >>>>>>>>>>>>>>> the > > > > >>>>>>>>>>>>>>>> draft PR, the idea is to *extract the tasks > > > > >> from > > > > >>>> the > > > > >>>>>>> subdag > > > > >>>>>>>>> and > > > > >>>>>>>>>>> add > > > > >>>>>>>>>>>>>> them > > > > >>>>>>>>>>>>>>>> back to the root_dag. *So the runtime DAG graph > > > > >>>> will > > > > >>>>>> look > > > > >>>>>>>>>> exactly > > > > >>>>>>>>>>>> the > > > > >>>>>>>>>>>>>>>> same as without subdag but with metadata > > > > >> attached > > > > >>>> to > > > > >>>>>>> those > > > > >>>>>>>>>>>> sections. > > > > >>>>>>>>>>>>>>> These > > > > >>>>>>>>>>>>>>>> metadata will be later on used to render in the > > > > >>> UI. > > > > >>>>> So > > > > >>>>>>>> after > > > > >>>>>>>>>>>> parsing > > > > >>>>>>>>>>>>> ( > > > > >>>>>>>>>>>>>>>> *DagBag.process_file()*), it will just output > > > > >> the > > > > >>>>>>> *root_dag > > > > >>>>>>>>>>>> *instead > > > > >>>>>>>>>>>>> of > > > > >>>>>>>>>>>>>>> *root_dag + > > > > >>>>>>>>>>>>>>>> subdag + subdag + nested subdag* etc. > > > > >>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>> - e.g. section-1-* will have metadata > > > > >>>>>>>>>> current_group=section-1, > > > > >>>>>>>>>>>>>>>> parent_group=<the-root-dag-id> (welcome for > > > > >>> naming > > > > >>>>>>>>>>> suggestions), > > > > >>>>>>>>>>>>> the > > > > >>>>>>>>>>>>>>>> reason for parent_group is that we can have > > > > >>> nested > > > > >>>>>> group > > > > >>>>>>>> and > > > > >>>>>>>>>>>> still > > > > >>>>>>>>>>>>>> be > > > > >>>>>>>>>>>>>>>> able to capture the dependency. > > > > >>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>> Runtime DAG: > > > > >>>>>>>>>>>>>>>> [image: image.png] > > > > >>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>> While at the UI, what we see would be something > > > > >>>> like > > > > >>>>>> this > > > > >>>>>>>> by > > > > >>>>>>>>>>>>> utilizing > > > > >>>>>>>>>>>>>>> the > > > > >>>>>>>>>>>>>>>> metadata, and then we can expand or zoom into > > > > >> in > > > > >>>> some > > > > >>>>>>> way. > > > > >>>>>>>>>>>>>>>> [image: image.png] > > > > >>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>> The benefits I can see is that: > > > > >>>>>>>>>>>>>>>> 1. We don't need to deal with the extra > > > > >>> complexity > > > > >>>> of > > > > >>>>>>>> SubDag > > > > >>>>>>>>>> for > > > > >>>>>>>>>>>>>>> execution > > > > >>>>>>>>>>>>>>>> and scheduling. It will be the same as not > > > > >> using > > > > >>>>>> SubDag. > > > > >>>>>>>>>>>>>>>> 2. Still have the benefits of modularized and > > > > >>>>> reusable > > > > >>>>>>> dag > > > > >>>>>>>>> code > > > > >>>>>>>>>>> and > > > > >>>>>>>>>>>>>>>> declare dependencies between them. And with the > > > > >>> new > > > > >>>>>>>>>>> SubDagOperator > > > > >>>>>>>>>>>>> (see > > > > >>>>>>>>>>>>>>> AIP > > > > >>>>>>>>>>>>>>>> or draft PR), we can use the same dag_factory > > > > >>>>> function > > > > >>>>>>> for > > > > >>>>>>>>>>>>> generating 1 > > > > >>>>>>>>>>>>>>>> dag, a lot of dynamic dags, or used for SubDag > > > > >>> (in > > > > >>>>> this > > > > >>>>>>>> case, > > > > >>>>>>>>>> it > > > > >>>>>>>>>>>> will > > > > >>>>>>>>>>>>>>> just > > > > >>>>>>>>>>>>>>>> extract all underlying tasks and append to the > > > > >>> root > > > > >>>>>> dag). > > > > >>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>> - Then it gets to the idea of replacing subdag > > > > >>>> with a > > > > >>>>>>>>>> simpler > > > > >>>>>>>>>>>>>> concept > > > > >>>>>>>>>>>>>>>> by Ash: the proposed change basically drains > > > > >> out > > > > >>>> the > > > > >>>>>>>>>> contents > > > > >>>>>>>>>>>> of > > > > >>>>>>>>>>>>> a > > > > >>>>>>>>>>>>>>> SubDag > > > > >>>>>>>>>>>>>>>> and becomes more like > > > > >>>>>>>>>>>> ExtractSubdagTasksAndAppendToRootdagOperator > > > > >>>>>>>>>>>>>>> (forgive > > > > >>>>>>>>>>>>>>>> me about the crazy name..). In this case, it is > > > > >>>> still > > > > >>>>>>>>>>> necessary > > > > >>>>>>>>>>>> to > > > > >>>>>>>>>>>>>>> keep the > > > > >>>>>>>>>>>>>>>> concept of subdag as it is nothing more than a > > > > >>>> name? > > > > >>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>> That's why the TaskGroup idea comes up. Thanks > > > > >>>> Chris > > > > >>>>>>> Palmer > > > > >>>>>>>>> for > > > > >>>>>>>>>>>>> helping > > > > >>>>>>>>>>>>>>>> conceptualize the functionality of TaskGroup, I > > > > >>>> will > > > > >>>>>> just > > > > >>>>>>>>> paste > > > > >>>>>>>>>>> it > > > > >>>>>>>>>>>>>> here. > > > > >>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>> - Tasks can be added to a TaskGroup > > > > >>>>>>>>>>>>>>>>> - You *can* have dependencies between Tasks > > > > >> in > > > > >>>> the > > > > >>>>>> same > > > > >>>>>>>>>>>> TaskGroup, > > > > >>>>>>>>>>>>>> but > > > > >>>>>>>>>>>>>>>>> *cannot* have dependencies between a Task in > > > > >> a > > > > >>>>>>> TaskGroup > > > > >>>>>>>>>> and > > > > >>>>>>>>>>>>>> either a > > > > >>>>>>>>>>>>>>>>> Task in a different TaskGroup or a Task not > > > > >> in > > > > >>>> any > > > > >>>>>>> group > > > > >>>>>>>>>>>>>>>>> - You *can* have dependencies between a > > > > >>> TaskGroup > > > > >>>>> and > > > > >>>>>>>>>> either > > > > >>>>>>>>>>>>> other > > > > >>>>>>>>>>>>>>>>> TaskGroups or Tasks not in any group > > > > >>>>>>>>>>>>>>>>> - The UI will by default render a TaskGroup > > > > >> as > > > > >>> a > > > > >>>>>> single > > > > >>>>>>>>>>>> "object", > > > > >>>>>>>>>>>>>> but > > > > >>>>>>>>>>>>>>>>> which you expand or zoom into in some way > > > > >>>>>>>>>>>>>>>>> - You'd need some way to determine what the > > > > >>>>> "status" > > > > >>>>>>> of a > > > > >>>>>>>>>>>>> TaskGroup > > > > >>>>>>>>>>>>>>> was > > > > >>>>>>>>>>>>>>>>> at least for UI display purposes > > > > >>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>> I agree with Chris: > > > > >>>>>>>>>>>>>>>> - From the backend's view (scheduler & > > > > >>> executor), I > > > > >>>>>> think > > > > >>>>>>>>>>> TaskGroup > > > > >>>>>>>>>>>>>>> should > > > > >>>>>>>>>>>>>>>> be ignored during execution. (unless we decide > > > > >> to > > > > >>>>>>> implement > > > > >>>>>>>>>> some > > > > >>>>>>>>>>>>>> metadata > > > > >>>>>>>>>>>>>>>> operations that allows start/stop a group of > > > > >>> tasks > > > > >>>>>> etc.) > > > > >>>>>>>>>>>>>>>> - From the UI's View, it should be able to pick > > > > >>> up > > > > >>>>> the > > > > >>>>>>>>>> individual > > > > >>>>>>>>>>>>>> tasks' > > > > >>>>>>>>>>>>>>>> status and then determine the TaskGroup's > > > > >> status > > > > >>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>> Bin > > > > >>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>> On Fri, Jun 12, 2020 at 10:28 AM Daniel > > > > >> Imberman > > > > >>> < > > > > >>>>>>>>>>>>>>>> daniel.imber...@gmail.com> wrote: > > > > >>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>> I hadn’t thought about using the `>>` operator > > > > >>> to > > > > >>>>> tie > > > > >>>>>>> dags > > > > >>>>>>>>>>>> together > > > > >>>>>>>>>>>>>> but > > > > >>>>>>>>>>>>>>> I > > > > >>>>>>>>>>>>>>>>> think that sounds pretty great! I wonder if we > > > > >>>> could > > > > >>>>>>>>>> essentially > > > > >>>>>>>>>>>>> write > > > > >>>>>>>>>>>>>>> in > > > > >>>>>>>>>>>>>>>>> the ability to set dependencies to all > > > > >>>> starter-tasks > > > > >>>>>> for > > > > >>>>>>>>> that > > > > >>>>>>>>>>> DAG. > > > > >>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>> I’m personally ok with SubDag being a mostly > > > > >> UI > > > > >>>>>> concept. > > > > >>>>>>>> It > > > > >>>>>>>>>>>> doesn’t > > > > >>>>>>>>>>>>>> need > > > > >>>>>>>>>>>>>>>>> to execute separately, you’re just adding more > > > > >>>> tasks > > > > >>>>>> to > > > > >>>>>>>> the > > > > >>>>>>>>>>> queue > > > > >>>>>>>>>>>>> that > > > > >>>>>>>>>>>>>>> will > > > > >>>>>>>>>>>>>>>>> be executed when there are resources > > > > >> available. > > > > >>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>> via Newton Mail [ > > > > >>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>> > > > > >>>>>>>>>>>>> > > > > >>>>>>>>>>>> > > > > >>>>>>>>>>> > > > > >>>>>>>>>> > > > > >>>>>>>>> > > > > >>>>>>>> > > > > >>>>>>> > > > > >>>>>> > > > > >>>>> > > > > >>>> > > > > >>> > > > > >> > > > > > > > > > > https://cloudmagic.com/k/d/mailapp?ct=dx&cv=10.0.50&pv=10.14.6&source=email_footer_2 > > > > >>>>>>>>>>>>>>>>> ] > > > > >>>>>>>>>>>>>>>>> On Fri, Jun 12, 2020 at 9:45 AM, Chris Palmer > > > > >> < > > > > >>>>>>>>>>> ch...@crpalmer.com > > > > >>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>> wrote: > > > > >>>>>>>>>>>>>>>>> I agree that SubDAGs are an overly complex > > > > >>>>>> abstraction. > > > > >>>>>>> I > > > > >>>>>>>>>> think > > > > >>>>>>>>>>>> what > > > > >>>>>>>>>>>>>> is > > > > >>>>>>>>>>>>>>>>> needed/useful is a TaskGroup concept. On a > > > > >> high > > > > >>>>> level > > > > >>>>>> I > > > > >>>>>>>>> think > > > > >>>>>>>>>>> you > > > > >>>>>>>>>>>>> want > > > > >>>>>>>>>>>>>>>>> this > > > > >>>>>>>>>>>>>>>>> functionality: > > > > >>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>> - Tasks can be added to a TaskGroup > > > > >>>>>>>>>>>>>>>>> - You *can* have dependencies between Tasks in > > > > >>> the > > > > >>>>>> same > > > > >>>>>>>>>>> TaskGroup, > > > > >>>>>>>>>>>>> but > > > > >>>>>>>>>>>>>>>>> *cannot* have dependencies between a Task in a > > > > >>>>>> TaskGroup > > > > >>>>>>>> and > > > > >>>>>>>>>>>> either > > > > >>>>>>>>>>>>> a > > > > >>>>>>>>>>>>>>>>> Task in a different TaskGroup or a Task not in > > > > >>> any > > > > >>>>>> group > > > > >>>>>>>>>>>>>>>>> - You *can* have dependencies between a > > > > >>> TaskGroup > > > > >>>>> and > > > > >>>>>>>> either > > > > >>>>>>>>>>> other > > > > >>>>>>>>>>>>>>>>> TaskGroups or Tasks not in any group > > > > >>>>>>>>>>>>>>>>> - The UI will by default render a TaskGroup > > > > >> as a > > > > >>>>>> single > > > > >>>>>>>>>>> "object", > > > > >>>>>>>>>>>>> but > > > > >>>>>>>>>>>>>>>>> which you expand or zoom into in some way > > > > >>>>>>>>>>>>>>>>> - You'd need some way to determine what the > > > > >>>> "status" > > > > >>>>>> of > > > > >>>>>>> a > > > > >>>>>>>>>>>> TaskGroup > > > > >>>>>>>>>>>>>> was > > > > >>>>>>>>>>>>>>>>> at least for UI display purposes > > > > >>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>> Not sure if it would need to be a top level > > > > >>> object > > > > >>>>>> with > > > > >>>>>>>> its > > > > >>>>>>>>>> own > > > > >>>>>>>>>>>>>> database > > > > >>>>>>>>>>>>>>>>> table and model or just another attribute on > > > > >>>> tasks. > > > > >>>>> I > > > > >>>>>>>> think > > > > >>>>>>>>>> you > > > > >>>>>>>>>>>>> could > > > > >>>>>>>>>>>>>>>>> build > > > > >>>>>>>>>>>>>>>>> it in a way such that from the schedulers > > > > >> point > > > > >>> of > > > > >>>>>> view > > > > >>>>>>> a > > > > >>>>>>>>> DAG > > > > >>>>>>>>>>> with > > > > >>>>>>>>>>>>>>>>> TaskGroups doesn't get treated any > > > > >> differently. > > > > >>> So > > > > >>>>> it > > > > >>>>>>>> really > > > > >>>>>>>>>>> just > > > > >>>>>>>>>>>>>>> becomes > > > > >>>>>>>>>>>>>>>>> a > > > > >>>>>>>>>>>>>>>>> shortcut for setting dependencies between sets > > > > >>> of > > > > >>>>>> Tasks, > > > > >>>>>>>> and > > > > >>>>>>>>>>>> allows > > > > >>>>>>>>>>>>>> the > > > > >>>>>>>>>>>>>>> UI > > > > >>>>>>>>>>>>>>>>> to simplify the render of the DAG structure. > > > > >>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>> Chris > > > > >>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>> On Fri, Jun 12, 2020 at 12:12 PM Dan Davydov > > > > >>>>>>>>>>>>>>> <ddavy...@twitter.com.invalid > > > > >>>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>> wrote: > > > > >>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>>> Agree with James (and think it's actually > > > > >> the > > > > >>>> more > > > > >>>>>>>>> important > > > > >>>>>>>>>>>> issue > > > > >>>>>>>>>>>>>> to > > > > >>>>>>>>>>>>>>>>> fix), > > > > >>>>>>>>>>>>>>>>>> but I am still convinced Ash' idea is the > > > > >>> right > > > > >>>>> way > > > > >>>>>>>>> forward > > > > >>>>>>>>>>>> (just > > > > >>>>>>>>>>>>> it > > > > >>>>>>>>>>>>>>>>> might > > > > >>>>>>>>>>>>>>>>>> require a bit more work to deprecate than > > > > >>> adding > > > > >>>>>>> visual > > > > >>>>>>>>>>> grouping > > > > >>>>>>>>>>>>> in > > > > >>>>>>>>>>>>>>> the > > > > >>>>>>>>>>>>>>>>>> UI). > > > > >>>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>>> There was a previous thread about this FYI > > > > >>> with > > > > >>>>> more > > > > >>>>>>>>> context > > > > >>>>>>>>>>> on > > > > >>>>>>>>>>>>> why > > > > >>>>>>>>>>>>>>>>> subdags > > > > >>>>>>>>>>>>>>>>>> are bad and potential solutions: > > > > >>>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>> > > > > >>>>>>> > > > > >> https://www.mail-archive.com/dev@airflow.apache.org/msg01202.html > > > > >>>>>>>>>>>>>> . A > > > > >>>>>>>>>>>>>>>>>> solution I outline there to Jame's problem > > > > >> is > > > > >>>> e.g. > > > > >>>>>>>>> enabling > > > > >>>>>>>>>>> the > > > > >>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>> operator > > > > >>>>>>>>>>>>>>>>>> for Airflow operators to work with DAGs as > > > > >>>> well. I > > > > >>>>>> see > > > > >>>>>>>>> this > > > > >>>>>>>>>>>> being > > > > >>>>>>>>>>>>>>>>> separate > > > > >>>>>>>>>>>>>>>>>> from Ash' solution for DAG grouping in the > > > > >> UI > > > > >>>> but > > > > >>>>>> one > > > > >>>>>>> of > > > > >>>>>>>>> the > > > > >>>>>>>>>>> two > > > > >>>>>>>>>>>>>> items > > > > >>>>>>>>>>>>>>>>>> required to replace all existing subdag > > > > >>>>>> functionality. > > > > >>>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>>> I've been working with subdags for 3 years > > > > >> and > > > > >>>>> they > > > > >>>>>>> are > > > > >>>>>>>>>>> always a > > > > >>>>>>>>>>>>>> giant > > > > >>>>>>>>>>>>>>>>> pain > > > > >>>>>>>>>>>>>>>>>> to use. They are a constant source of user > > > > >>>>> confusion > > > > >>>>>>> and > > > > >>>>>>>>>>>> breakages > > > > >>>>>>>>>>>>>>>>> during > > > > >>>>>>>>>>>>>>>>>> upgrades. Would love to see them gone :). > > > > >>>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>>> On Fri, Jun 12, 2020 at 11:11 AM James > > > > >> Coder < > > > > >>>>>>>>>>>> jcode...@gmail.com> > > > > >>>>>>>>>>>>>>>>> wrote: > > > > >>>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>>>> I'm not sure I totally agree it's just a > > > > >> UI > > > > >>>>>>> concept. I > > > > >>>>>>>>> use > > > > >>>>>>>>>>> the > > > > >>>>>>>>>>>>>>> subdag > > > > >>>>>>>>>>>>>>>>>>> operator to simplify dependencies too. If > > > > >>> you > > > > >>>>>> have a > > > > >>>>>>>>> group > > > > >>>>>>>>>>> of > > > > >>>>>>>>>>>>>> tasks > > > > >>>>>>>>>>>>>>>>> that > > > > >>>>>>>>>>>>>>>>>>> need to finish before another group of > > > > >> tasks > > > > >>>>>> start, > > > > >>>>>>>>> using > > > > >>>>>>>>>> a > > > > >>>>>>>>>>>>> subdag > > > > >>>>>>>>>>>>>>> is > > > > >>>>>>>>>>>>>>>>> a > > > > >>>>>>>>>>>>>>>>>>> pretty quick way to set those dependencies > > > > >>>> and I > > > > >>>>>>> think > > > > >>>>>>>>>> also > > > > >>>>>>>>>>>> make > > > > >>>>>>>>>>>>>> it > > > > >>>>>>>>>>>>>>>>>> easier > > > > >>>>>>>>>>>>>>>>>>> to follow the dag code. > > > > >>>>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>>>> On Fri, Jun 12, 2020 at 9:53 AM Kyle > > > > >> Hamlin > > > > >>> < > > > > >>>>>>>>>>>>> hamlin...@gmail.com> > > > > >>>>>>>>>>>>>>>>> wrote: > > > > >>>>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>>>>> I second Ash’s grouping concept. > > > > >>>>>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>>>>> On Fri, Jun 12, 2020 at 5:10 AM Ash > > > > >>>>>> Berlin-Taylor > > > > >>>>>>> < > > > > >>>>>>>>>>>>>> a...@apache.org > > > > >>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>>>> wrote: > > > > >>>>>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>>>>>> Question: > > > > >>>>>>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>>>>>> Do we even need the SubDagOperator > > > > >>>> anymore? > > > > >>>>>>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>>>>>> Would removing it entirely and just > > > > >>>>> replacing > > > > >>>>>> it > > > > >>>>>>>>> with > > > > >>>>>>>>>> a > > > > >>>>>>>>>>> UI > > > > >>>>>>>>>>>>>>>>> grouping > > > > >>>>>>>>>>>>>>>>>>>>> concept be conceptually simpler, less > > > > >> to > > > > >>>> get > > > > >>>>>>>> wrong, > > > > >>>>>>>>>> and > > > > >>>>>>>>>>>>> closer > > > > >>>>>>>>>>>>>>> to > > > > >>>>>>>>>>>>>>>>>> what > > > > >>>>>>>>>>>>>>>>>>>>> users actually want to achieve with > > > > >>>> subdags? > > > > >>>>>>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>>>>>> With your proposed change, tasks in > > > > >>>> subdags > > > > >>>>>>> could > > > > >>>>>>>>>> start > > > > >>>>>>>>>>>>>> running > > > > >>>>>>>>>>>>>>> in > > > > >>>>>>>>>>>>>>>>>>>>> parallel (a good change) -- so should > > > > >> we > > > > >>>> not > > > > >>>>>>> also > > > > >>>>>>>>> just > > > > >>>>>>>>>>>>>>> _enitrely_ > > > > >>>>>>>>>>>>>>>>>>> remove > > > > >>>>>>>>>>>>>>>>>>>>> the concept of a sub dag and replace > > > > >> it > > > > >>>> with > > > > >>>>>>>>> something > > > > >>>>>>>>>>>>>> simpler. > > > > >>>>>>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>>>>>> Problems with subdags (I think. I > > > > >>> haven't > > > > >>>>> used > > > > >>>>>>>> them > > > > >>>>>>>>>>>>>> extensively > > > > >>>>>>>>>>>>>>> so > > > > >>>>>>>>>>>>>>>>>> may > > > > >>>>>>>>>>>>>>>>>>>>> be wrong on some of these): > > > > >>>>>>>>>>>>>>>>>>>>> - They need their own dag_id, but it > > > > >>>> has(?) > > > > >>>>> to > > > > >>>>>>> be > > > > >>>>>>>> of > > > > >>>>>>>>>> the > > > > >>>>>>>>>>>>> form > > > > >>>>>>>>>>>>>>>>>>>>> `parent_dag_id.subdag_id`. > > > > >>>>>>>>>>>>>>>>>>>>> - They need their own > > > > >> schedule_interval, > > > > >>>> but > > > > >>>>>> it > > > > >>>>>>>> has > > > > >>>>>>>>> to > > > > >>>>>>>>>>>> match > > > > >>>>>>>>>>>>>> the > > > > >>>>>>>>>>>>>>>>>> parent > > > > >>>>>>>>>>>>>>>>>>>> dag > > > > >>>>>>>>>>>>>>>>>>>>> - Sub dags can be paused on their own. > > > > >>>> (Does > > > > >>>>>> it > > > > >>>>>>>> make > > > > >>>>>>>>>>> sense > > > > >>>>>>>>>>>>> to > > > > >>>>>>>>>>>>>> do > > > > >>>>>>>>>>>>>>>>>> this? > > > > >>>>>>>>>>>>>>>>>>>>> Pausing just a sub dag would mean the > > > > >>> sub > > > > >>>>> dag > > > > >>>>>>>> would > > > > >>>>>>>>>>> never > > > > >>>>>>>>>>>>>>>>> execute, so > > > > >>>>>>>>>>>>>>>>>>>>> the SubDagOperator would fail too. > > > > >>>>>>>>>>>>>>>>>>>>> - You had to choose the executor to > > > > >>>>> operator a > > > > >>>>>>>>> subdag > > > > >>>>>>>>>>> with > > > > >>>>>>>>>>>>> -- > > > > >>>>>>>>>>>>>>>>> always > > > > >>>>>>>>>>>>>>>>>> a > > > > >>>>>>>>>>>>>>>>>>>>> bit of a kludge. > > > > >>>>>>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>>>>>> Thoughts? > > > > >>>>>>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>>>>>> -ash > > > > >>>>>>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>>>>>> On Jun 12 2020, at 12:01 pm, Ash > > > > >>>>>> Berlin-Taylor < > > > > >>>>>>>>>>>>>> a...@apache.org> > > > > >>>>>>>>>>>>>>>>>> wrote: > > > > >>>>>>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>>>>>>> Workon sub-dags is much needed, I'm > > > > >>>>> excited > > > > >>>>>> to > > > > >>>>>>>> see > > > > >>>>>>>>>> how > > > > >>>>>>>>>>>>> this > > > > >>>>>>>>>>>>>>>>>>> progresses. > > > > >>>>>>>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>>>>>>>> - *Unpack SubDags during dag > > > > >>> parsing*: > > > > >>>>> This > > > > >>>>>>>>>> rewrites > > > > >>>>>>>>>>>> the > > > > >>>>>>>>>>>>>>>>>>>>> *DagBag.bag_dag* > > > > >>>>>>>>>>>>>>>>>>>>>>> method to unpack subdag while > > > > >>> parsing, > > > > >>>>> and > > > > >>>>>> it > > > > >>>>>>>>> will > > > > >>>>>>>>>>>> give a > > > > >>>>>>>>>>>>>>> flat > > > > >>>>>>>>>>>>>>>>>>>>>>> structure at > > > > >>>>>>>>>>>>>>>>>>>>>>> the task level > > > > >>>>>>>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>>>>>>> The serialized_dag representation > > > > >>>> already > > > > >>>>>> does > > > > >>>>>>>>> this > > > > >>>>>>>>>> I > > > > >>>>>>>>>>>>> think. > > > > >>>>>>>>>>>>>>> At > > > > >>>>>>>>>>>>>>>>>> least > > > > >>>>>>>>>>>>>>>>>>>> if > > > > >>>>>>>>>>>>>>>>>>>>>> I've understood your idea here > > > > >>>> correctly. > > > > >>>>>>>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>>>>>>> -ash > > > > >>>>>>>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>>>>>>> On Jun 12 2020, at 9:51 am, Xinbin > > > > >>>> Huang < > > > > >>>>>>>>>>>>>>> bin.huan...@gmail.com > > > > >>>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>>>>> wrote: > > > > >>>>>>>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>>>>>>>> Hi everyone, > > > > >>>>>>>>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>>>>>>>> Sending a message to everyone and > > > > >>>> collect > > > > >>>>>>>>> feedback > > > > >>>>>>>>>> on > > > > >>>>>>>>>>>> the > > > > >>>>>>>>>>>>>>>>> AIP-34 > > > > >>>>>>>>>>>>>>>>>> on > > > > >>>>>>>>>>>>>>>>>>>>>>> rewriting SubDagOperator. This was > > > > >>>>>> previously > > > > >>>>>>>>>> briefly > > > > >>>>>>>>>>>>>>>>> mentioned in > > > > >>>>>>>>>>>>>>>>>>> the > > > > >>>>>>>>>>>>>>>>>>>>>>> discussion about what needs to be > > > > >>> done > > > > >>>>> for > > > > >>>>>>>>> Airflow > > > > >>>>>>>>>>> 2.0, > > > > >>>>>>>>>>>>> and > > > > >>>>>>>>>>>>>>>>> one of > > > > >>>>>>>>>>>>>>>>>>> the > > > > >>>>>>>>>>>>>>>>>>>>>>> ideas is to make SubDagOperator > > > > >>> attach > > > > >>>>>> tasks > > > > >>>>>>>> back > > > > >>>>>>>>>> to > > > > >>>>>>>>>>>> the > > > > >>>>>>>>>>>>>> root > > > > >>>>>>>>>>>>>>>>> DAG. > > > > >>>>>>>>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>>>>>>>> This AIP-34 focuses on solving > > > > >>>>>> SubDagOperator > > > > >>>>>>>>>> related > > > > >>>>>>>>>>>>>> issues > > > > >>>>>>>>>>>>>>> by > > > > >>>>>>>>>>>>>>>>>>>>> reattaching > > > > >>>>>>>>>>>>>>>>>>>>>>> all tasks back to the root dag > > > > >> while > > > > >>>>>>> respecting > > > > >>>>>>>>>>>>>> dependencies > > > > >>>>>>>>>>>>>>>>>> during > > > > >>>>>>>>>>>>>>>>>>>>>>> parsing. The original grouping > > > > >> effect > > > > >>>> on > > > > >>>>>> the > > > > >>>>>>> UI > > > > >>>>>>>>>> will > > > > >>>>>>>>>>> be > > > > >>>>>>>>>>>>>>>>> achieved > > > > >>>>>>>>>>>>>>>>>>>> through > > > > >>>>>>>>>>>>>>>>>>>>>>> grouping related tasks by metadata. > > > > >>>>>>>>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>>>>>>>> This also makes the dag_factory > > > > >>>> function > > > > >>>>>> more > > > > >>>>>>>>>>> reusable > > > > >>>>>>>>>>>>>>> because > > > > >>>>>>>>>>>>>>>>> you > > > > >>>>>>>>>>>>>>>>>>>> don't > > > > >>>>>>>>>>>>>>>>>>>>>>> need to have parent_dag_name and > > > > >>>>>>> child_dag_name > > > > >>>>>>>>> in > > > > >>>>>>>>>>> the > > > > >>>>>>>>>>>>>>> function > > > > >>>>>>>>>>>>>>>>>>>>> signature > > > > >>>>>>>>>>>>>>>>>>>>>>> anymore. > > > > >>>>>>>>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>>>>>>>> Changes proposed: > > > > >>>>>>>>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>>>>>>>> - *Unpack SubDags during dag > > > > >>> parsing*: > > > > >>>>> This > > > > >>>>>>>>>> rewrites > > > > >>>>>>>>>>>> the > > > > >>>>>>>>>>>>>>>>>>>>> *DagBag.bag_dag* > > > > >>>>>>>>>>>>>>>>>>>>>>> method to unpack subdag while > > > > >>> parsing, > > > > >>>>> and > > > > >>>>>> it > > > > >>>>>>>>> will > > > > >>>>>>>>>>>> give a > > > > >>>>>>>>>>>>>>> flat > > > > >>>>>>>>>>>>>>>>>>>>>>> structure at > > > > >>>>>>>>>>>>>>>>>>>>>>> the task level > > > > >>>>>>>>>>>>>>>>>>>>>>> - *Simplify SubDagOperator*: The > > > > >> new > > > > >>>>>>>>> SubDagOperator > > > > >>>>>>>>>>>> acts > > > > >>>>>>>>>>>>>>> like a > > > > >>>>>>>>>>>>>>>>>>>>>>> container and most of the original > > > > >>>>> methods > > > > >>>>>>> are > > > > >>>>>>>>>>> removed. > > > > >>>>>>>>>>>>> The > > > > >>>>>>>>>>>>>>>>>>>>>>> signature is > > > > >>>>>>>>>>>>>>>>>>>>>>> also changed to *subdag_factory > > > > >> *with > > > > >>>>>>>>> *subdag_args > > > > >>>>>>>>>>> *and > > > > >>>>>>>>>>>>>>>>>>>>> *subdag_kwargs*. > > > > >>>>>>>>>>>>>>>>>>>>>>> This is similar to the > > > > >> PythonOperator > > > > >>>>>>>> signature. > > > > >>>>>>>>>>>>>>>>>>>>>>> - *Add a TaskGroup model and add > > > > >>>>>>> current_group > > > > >>>>>>>> & > > > > >>>>>>>>>>>>>> parent_group > > > > >>>>>>>>>>>>>>>>>>>>> attributes > > > > >>>>>>>>>>>>>>>>>>>>>>> to BaseOperator*: This metadata is > > > > >>> used > > > > >>>>> to > > > > >>>>>>>> group > > > > >>>>>>>>>>> tasks > > > > >>>>>>>>>>>>> for > > > > >>>>>>>>>>>>>>>>>>>>>>> rendering at > > > > >>>>>>>>>>>>>>>>>>>>>>> UI level. It may potentially extend > > > > >>>>> further > > > > >>>>>>> to > > > > >>>>>>>>>> group > > > > >>>>>>>>>>>>>>> arbitrary > > > > >>>>>>>>>>>>>>>>>>> tasks > > > > >>>>>>>>>>>>>>>>>>>>>>> outside the context of subdag to > > > > >>> allow > > > > >>>>>>>>> group-level > > > > >>>>>>>>>>>>>> operations > > > > >>>>>>>>>>>>>>>>>>> (i.e. > > > > >>>>>>>>>>>>>>>>>>>>>>> stop/trigger a group of task within > > > > >>> the > > > > >>>>>> dag) > > > > >>>>>>>>>>>>>>>>>>>>>>> - *Webserver UI for SubDag*: > > > > >> Proposed > > > > >>>> UI > > > > >>>>>>>>>> modification > > > > >>>>>>>>>>>> to > > > > >>>>>>>>>>>>>>> allow > > > > >>>>>>>>>>>>>>>>>>>>>>> (un)collapse a group of tasks for a > > > > >>>> flat > > > > >>>>>>>>> structure > > > > >>>>>>>>>> to > > > > >>>>>>>>>>>>> pair > > > > >>>>>>>>>>>>>>> with > > > > >>>>>>>>>>>>>>>>>>> the > > > > >>>>>>>>>>>>>>>>>>>>> first > > > > >>>>>>>>>>>>>>>>>>>>>>> change instead of the original > > > > >>>>> hierarchical > > > > >>>>>>>>>>> structure. > > > > >>>>>>>>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>>>>>>>> Please see related documents and > > > > >> PRs > > > > >>>> for > > > > >>>>>>>> details: > > > > >>>>>>>>>>>>>>>>>>>>>>> AIP: > > > > >>>>>>>>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>> > > > > >>>>>>>>>>>>> > > > > >>>>>>>>>>>> > > > > >>>>>>>>>>> > > > > >>>>>>>>>> > > > > >>>>>>>>> > > > > >>>>>>>> > > > > >>>>>>> > > > > >>>>>> > > > > >>>>> > > > > >>>> > > > > >>> > > > > >> > > > > > > > > > > https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-34+Rewrite+SubDagOperator > > > > >>>>>>>>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>>>>>>>> Original Issue: > > > > >>>>>>>>>>>>>>> https://github.com/apache/airflow/issues/8078 > > > > >>>>>>>>>>>>>>>>>>>>>>> Draft PR: > > > > >>>>>>>>>>> https://github.com/apache/airflow/pull/9243 > > > > >>>>>>>>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>>>>>>>> Please let me know if there are any > > > > >>>>> aspects > > > > >>>>>>>> that > > > > >>>>>>>>>> you > > > > >>>>>>>>>>>>>>>>>> agree/disagree > > > > >>>>>>>>>>>>>>>>>>>>>>> with or > > > > >>>>>>>>>>>>>>>>>>>>>>> need more clarification (especially > > > > >>> the > > > > >>>>>> third > > > > >>>>>>>>>> change > > > > >>>>>>>>>>>>>>> regarding > > > > >>>>>>>>>>>>>>>>>>>>> TaskGroup). > > > > >>>>>>>>>>>>>>>>>>>>>>> Any comments are welcome and I am > > > > >>>> looking > > > > >>>>>>>> forward > > > > >>>>>>>>>> to > > > > >>>>>>>>>>>> it! > > > > >>>>>>>>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>>>>>>>> Cheers > > > > >>>>>>>>>>>>>>>>>>>>>>> Bin > > > > >>>>>>>>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>>>>> -- > > > > >>>>>>>>>>>>>>>>>>>> Kyle Hamlin > > > > >>>>>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>> -- > > > > >>>>>>>>>>>>>>> Thanks & Regards > > > > >>>>>>>>>>>>>>> Poornima > > > > >>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>> > > > > >>>>>>>>>>>>> > > > > >>>>>>>>>>>> > > > > >>>>>>>>>>> > > > > >>>>>>>>>> > > > > >>>>>>>> > > > > >>>>>>> > > > > >>>>>>> > > > > >>>>>>> -- > > > > >>>>>>> > > > > >>>>>>> Jarek Potiuk > > > > >>>>>>> Polidea <https://www.polidea.com/> | Principal Software > > Engineer > > > > >>>>>>> > > > > >>>>>>> M: +48 660 796 129 <+48%20660%20796%20129> <+48660796129 > > > > >>>>> <+48%20660%20796%20129>> > > > > >>>>>>> [image: Polidea] <https://www.polidea.com/> > > > > >>>>>>> > > > > >>>>>> > > > > >>>>> > > > > >>>>> > > > > >>>>> -- > > > > >>>>> > > > > >>>>> Jarek Potiuk > > > > >>>>> Polidea <https://www.polidea.com/> | Principal Software > Engineer > > > > >>>>> > > > > >>>>> M: +48 660 796 129 <+48%20660%20796%20129> <+48660796129 > > > > >>>>> <+48%20660%20796%20129>> > > > > >>>>> [image: Polidea] <https://www.polidea.com/> > > > > >>>>> > > > > >>>> > > > > >>>> > > > > >>>> -- > > > > >>>> > > > > >>>> *Jacob Ferriero* > > > > >>>> > > > > >>>> Strategic Cloud Engineer: Data Engineering > > > > >>>> > > > > >>>> jferri...@google.com > > > > >>>> > > > > >>>> 617-714-2509 > > > > >>>> > > > > >>> > > > > >> > > > > > > > > > > > > > >