Ah, I only saw your answer now, Austin - sorry for that; I am catching up only now after a deep dive into several issues.
TL;DR: this is just a much longer explanation of why I think it should start outside (without ruling out the possibility that we will bring it in later).

Pablo, Austin,

> Sounds somewhat like a question of whether to grow the tent of contributors, committers, pmc of what is deemed to be "Airflow" (capital "A" and in)? Or err towards things manageable for the existing committers, pmc? With more things deemed not-in, would adding new blood to the project be more difficult?

No, not really. Airflow consists of the core "Airflow" - where most contributors and committers commit their code - and it has optional providers. Airflow Providers are an optional feature of Airflow, and I think our goal is to have not only Airflow "community" managed providers, for things that are useful to a vast number of users, but also to build a thriving ecosystem of people who build their own providers - simply because Airflow is so popular and is the "backbone" of data processing orchestration. We have plenty of room for anyone who wants to contribute to either of them.

And conceptually, there is no need for "less popular" providers to be part of the community. We already have 70+ community managed providers that are rather popular. There are also many providers that people develop outside of Airflow - some of them, if they gain popularity, might eventually be contributed to the community; others, especially ones for "niche" services, are likely better off staying outside. I think that is a pretty natural choice.

The existing community providers IMHO are a HUGE asset of Airflow. The fact that when you start using Airflow it comes with AWS, Google, Databricks and 70+ other integrations that you know you can rely on - because they are maintained by the community - is a huge selling point for Airflow as a platform of choice.

But there is a law of diminishing returns: the more "less important" providers we add to the community code, the less value they bring and the more maintenance burden they cause. And I personally think we have reached the level where we MUST weigh both the value and the maintenance burden when we decide whether to accept new code. Not everyone realises it, but code is more often a liability than an asset. Accepting new code is not "benefits only" - it often slows you down, limits what you can do, and carries the risk of angry users flooding you with issues when your change breaks their workflow.

We have already refused a few code donations in the past on those grounds - most notably, CWL (Common Workflow Language) wanted to donate their integration and we declined, because the cost of maintaining it would have far outgrown the benefits. We decided that if CWL is interested in maintaining the integration, it is better that it stays with them than that we take over its maintenance. The most important aspect is that this was a deliberate decision, taken after a long debate and after considering various voices (you can look it up in the devlist).

The tests that Pablo mentions are rather similar to the CWL case. We might or might not choose to accept them into the community, but there is literally no problem with starting outside; once we see how useful the framework is and how much of a burden it brings, we can decide whether to accept it or not.
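For anyone who has not opened Pablo's links quoted below: the proposal is, roughly, a given/when/then style checker over the DAG structure. Purely to illustrate the shape, here is a toy stand-in - this is not Pablo's actual API, it only models the default all_success trigger rule, and all the names in it are invented:

```
# Toy stand-in for the proposed invariant checker - illustration only.
# Simplifications: every task is assumed to use the default all_success
# trigger rule, and a task with no declared outcome counts as failed.
class _Scenario:
    def __init__(self, dag):
        self.dag = dag
        self.succeeded = {}  # task_id -> declared outcome in this scenario

    def when(self, task_id, succeeds):
        self.succeeded[task_id] = succeeds
        return self

    and_ = when  # alias so chained assertions read naturally

    def will_run(self, task_id):
        # Under all_success, a task runs iff every direct upstream task
        # itself runs and succeeds in the declared scenario.
        task = self.dag.get_task(task_id)
        return all(
            self.succeeded.get(up.task_id, False) and self.will_run(up.task_id)
            for up in task.upstream_list
        )


def given(dag):
    return _Scenario(dag)


# Usage, mirroring Pablo's example below:
#   assert given(the_dag).when("task_1", True).and_("task_2", True).will_run("task_3")
```

The real framework would have to model trigger rules, branching and failure propagation faithfully - and that is exactly the part that is tied to core internals and would need ongoing maintenance as the core evolves.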
In the case of CWL, when they approached us it was already working - not just a POC, but something they had iterated on and developed, because they found it super useful to integrate and run CWL workflows on Airflow. But it was Airflow 1.10 only at the time, and it would have required quite an effort to make it work with Airflow 2.0 and to keep up with newer releases. We just felt it was not worth taking that extra burden onto the community; it would slow us down.

Of course, one could say that this might make Pablo and others less interested in developing it, knowing that it might not make it into the "community managed code". But if that is the only motivation, then it is a bad motivation in the first place. If we know something is useful and we see that the community will benefit, that should be the main motivator. If it turns out to be useful and of great value, we will accept it; if not, maybe it was not a great idea in the first place. The bottom line is that what we are talking about is one or a few people interested in making things better for the community, risking their time and effort, versus the whole team of maintainers committing to something whose maintenance cost and benefit are difficult to assess. I think if you believe in the idea, taking on the cost of building and showing something like that outside initially is a good approach - at least to the point where we can see it, see how it can be used, and see what it can bring.

Side comment - we are gearing up to split the providers out of the "airflow core" technically (i.e. to put them in separate repositories). I spoke about some of the challenges this involves with a number of people. One of them is that once we split the repo, there might be a perception that separate repos make people less inclined to contribute. I think we will be able to avoid that, but it will require really good communication and building some kind of "umbrella" experience, so that people who only contribute to a provider do not feel like lesser "Airflow contributors". This is actually one of the most important problems to solve in the whole split: how do we keep people "Airflow" contributors even when they contribute to separate repositories? I already have a few ideas, but let's leave that for a separate discussion.

I have also spoken to a few ASF old-timers already (I have been preparing for the split for about a year now, talking to a number of people, and I am going to have a lot of discussions about it at ApacheCon in New Orleans in a month), and I even heard voices that 3rd-party integrations should not be part of the ASF code in the first place. I do not agree with that, actually, but you can see how opinions can vary here. I would rather think now about how to keep the current "popular" providers in, and keep the community around them when we split, than about growing the number of providers when we do not see a clear "need" from the wider community to bring a provider in.

And knowing that the split is coming - would it make a huge difference whether a new provider is added as a new "apache/airflow-xxxx" repo or kept as "xxxx/airflow-provider"? The same goes for the "test" frameworks. It is likely that by the time such a framework gets contributed, we will already have the multi-repo structure in place, and bringing it in might simply be easier - just literally transferring the repo and plugging in our CI. I am preparing very solid ground now to be able to do that. So starting separately might be the way to go first.
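To make the "plugging in our CI" part a bit more concrete: an externally developed package can already protect itself with a small compatibility suite run in its own CI against a matrix of Airflow versions. A rough sketch of the kind of checks I mean - assuming pytest, and purely illustrative rather than a real contract:

```
# Compatibility smoke tests for an externally maintained Airflow tool.
# Run in CI under a matrix of Airflow releases (e.g. separate jobs that
# install apache-airflow==2.2.*, 2.3.*, 2.4.* before invoking pytest).
import importlib.metadata

from packaging.version import Version


def test_airflow_version_is_supported():
    # Fail fast if CI installed a release the tool does not claim to
    # support, so breakage surfaces on release day rather than at users'.
    installed = Version(importlib.metadata.version("apache-airflow"))
    assert installed >= Version("2.0.0")


def test_public_surface_the_tool_relies_on():
    # Import only the public names the tool depends on - anything deeper
    # is an implicit contract with the core that can break silently.
    from airflow.models import DAG  # noqa: F401
    from airflow.models.baseoperator import BaseOperator  # noqa: F401
```

If the package later moves "in", the same suite can simply become part of the core PR checks - which is the "harness" difference between in and out that comes up again in the quoted discussion below.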
I also think it has nothing to do with growing the number of contributors and committers. I am not sure if you are aware of it, Austin, but Airflow has by far the biggest number of contributors out of all Apache Software Foundation projects. By FAR. We are the TOP 1 project. We bypassed Apache Spark in November 2021 (both projects had ca. 1800 contributors on GitHub at the time); the numbers today are 1829 for Spark and 2180 for Airflow. We are FLYING when it comes to growing our contributor base. We are also continuously adding new committers and PMC members.

J.

On Wed, Aug 10, 2022 at 9:35 PM Austin Bennett <[email protected]> wrote:

> If much is out, rather than in, is there a different pool from where you will draw contributors and eventually committers/pmc?
>
> Sounds somewhat like a question of whether to grow the tent of contributors, committers, pmc of what is deemed to be "Airflow" (capital "A" and in)? Or err towards things manageable for the existing committers, pmc? With more things deemed not-in, would adding new blood to the project be more difficult?
>
> On Sat, Aug 6, 2022, 9:02 AM Jarek Potiuk <[email protected]> wrote:
>
>> What do you think, Pablo, about the "being out" vs. "being in" the official repo?
>>
>> On Thu, Jul 28, 2022 at 3:51 PM Jarek Potiuk <[email protected]> wrote:
>>
>>> Anyone :) ?
>>>
>>> On Mon, Jul 18, 2022 at 10:38 AM Jarek Potiuk <[email protected]> wrote:
>>>
>>>> I would love to hear what others think about the "in/out" approach - mine is just the line of thought I've been exploring over the last few months, in which I formed my own views about providers, maintenance, the incentives of entities maintaining open-source projects, and especially the expectations this creates among users. But those are just my thoughts, and I'd love to hear what others think.
>>>>
>>>> On Mon, Jul 18, 2022 at 10:33 AM Jarek Potiuk <[email protected]> wrote:
>>>>
>>>>> I had some thoughts about it - this is also connected with the recent discussions about mixed governance for providers, and I think it's worth using this discussion to set some rules and "boundaries" on when, how, and especially why we want to accept some contributions, while other contributions are better off outside.
>>>>>
>>>>> We are about to start thinking (and discussing) more seriously about how to split the Airflow providers off airflow. And I think we can split off more than providers - this might be a good candidate to be a standalone, but still community-maintained, package. If we are going to solve the problem of splitting airflow into N packages, one more package does not matter. And it would nicely solve "version independence". We could even make it airflow 2.0+ compliant if we want.
>>>>>
>>>>> So I think the question of "is it tied to a specific airflow version or not" does not really prevent us from making it part of the community - those two things are not related (if we are going to have more repositories anyway).
>>>>>
>>>>> The important part is really how "self-servicing" we can make it, how we make sure it stays relevant for future versions of Airflow, and who does that - namely, who has the incentive and "responsibility" to maintain it. I am sure we will add more features to Airflow DAGs and simplify the way DAGs are written over time, and the test harness will have to adapt to that.
>>>>>
>>>>> There are pros and cons of having such a standalone package "in the community/ASF project" and "out of it".
>>>>> We have a good example (from similar kinds of tools/utils) in the past that we can learn from (and maybe Bas can share more insights):
>>>>>
>>>>> https://github.com/BasPH/pylint-airflow - a pylint plugin for Airflow DAGs
>>>>>
>>>>> Initially that was "sponsored" by GoDataDriven, where Bas worked, and I think this is where it was born. That made sense, as it was likely also useful for the customers of GoDataDriven (here I am guessing). But apparently GoDataDriven's incentive wound down, and it turned out that its usefulness was not that big either (also, I think we all in the Python community learned that Pylint is more of a distraction than a real help - we dumped Pylint eventually). The plugin was not maintained beyond some 1.10 versions, and the tool is all but defunct now. Which is perfectly understandable.
>>>>>
>>>>> In this case there is (I think) no risk of a "pylint"-like problem, but the question of maintenance and adaptation to future versions of Airflow remains.
>>>>>
>>>>> I think there is one big difference between something that is "in ASF repos" and "out":
>>>>>
>>>>> * if we make it a standalone package in the "asf airflow community", we will have some obligation, and our users will have some expectation, that we maintain it. We can add a test harness (regardless of whether it lives in the airflow repository or in a separate one) to make sure that new airflow "core" changes do not break it, and we can fail our PRs if they do - basically making "core" maintainers take care of this problem rather than delegating to someone else the job of reacting to core changes (this is what has to happen with providers, I believe, even if we split them into a separate repo). I think anything that we as the ASF community release should have such harnesses - making sure that whatever we release and make available to our users works together.
>>>>>
>>>>> * if it is outside the "ASF community", someone else will have to react to "core airflow" changes. We will not do it in the community, we will not pay attention, and such an "external tool" might break at any time because we introduced a change in a part of the core that the external tool implicitly relied on.
>>>>>
>>>>> For me, the question of whether something should be in or out should be based on:
>>>>>
>>>>> * is it really useful for the community as a whole? -> if yes, we should consider it
>>>>> * is it strongly tied to the core of airflow, in the sense of relying on internals that might change easily? -> if not, there is no need to bring it in; it can easily be maintained outside by anyone
>>>>> * if it is strongly tied to the core -> is there someone (a person, an organisation) who wants to take on the burden of maintaining it and has an incentive to do so for quite some time? -> if yes, great, let them do that!
>>>>> * if it is strongly tied -> do we, as "core airflow maintainers", want to take on the burden of keeping it updated together with the core? -> if yes, we should bring it in
>>>>>
>>>>> If we have a strongly tied tool that we do not want to maintain in the core, and there is no entity who would like to do it, then I think the idea should be dropped :).
>>>>>
>>>>> J.
>>>>>
>>>>> On Mon, Jul 18, 2022 at 1:52 AM Ping Zhang <[email protected]> wrote:
>>>>>
>>>>>> Hi Pablo,
>>>>>>
>>>>>> Wow, I really love this idea.
>>>>>> This will greatly enrich the airflow ecosystem.
>>>>>>
>>>>>> I agree with Ash - it is better to have it as a standalone package. And we can use this framework to write airflow core invariant tests, so that we can run them on every airflow release to guarantee no regressions.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Ping
>>>>>>
>>>>>> On Sun, Jul 17, 2022 at 1:09 PM Pablo Estrada <[email protected]> wrote:
>>>>>>
>>>>>>> Understood!
>>>>>>>
>>>>>>> TL;DR: I propose a testing framework where users can check for 'DAG execution invariants' or 'DAG execution expectations' given certain task outcomes.
>>>>>>>
>>>>>>> As DAGs grow in complexity, it can become difficult to reason about their runtime behavior in many scenarios. Users may want to lay out rules, in the form of tests, that verify DAG execution results. For example:
>>>>>>>
>>>>>>> - If any of my database_backup_* tasks fails, I want to ensure that at least one email_alert_* task will run.
>>>>>>> - If my 'check_authentication' task fails, I want to ensure that the whole DAG will fail.
>>>>>>> - If any of my DataflowOperator tasks fails, I want to ensure that a PubsubOperator downstream will always run.
>>>>>>>
>>>>>>> These sorts of invariants don't need the DAG to be executed, but they are pretty hard to test today: staging environments can't check every possible runtime outcome.
>>>>>>>
>>>>>>> In this framework, users would define unit tests like this:
>>>>>>>
>>>>>>> ```
>>>>>>> def test_my_example_dag():
>>>>>>>     the_dag = models.DAG(
>>>>>>>         'the_basic_dag',
>>>>>>>         schedule_interval='@daily',
>>>>>>>         start_date=DEFAULT_DATE,
>>>>>>>     )
>>>>>>>
>>>>>>>     with the_dag:
>>>>>>>         op1 = EmptyOperator(task_id='task_1')
>>>>>>>         op2 = EmptyOperator(task_id='task_2')
>>>>>>>         op3 = EmptyOperator(task_id='task_3')
>>>>>>>
>>>>>>>     op1 >> op2 >> op3
>>>>>>>     # DAG invariant: if task_1 and task_2 succeed, then task_3 will always run
>>>>>>>     assert_that(
>>>>>>>         given(the_dag)\
>>>>>>>             .when(task('task_1'), succeeds())\
>>>>>>>             .and_(task('task_2'), succeeds())\
>>>>>>>             .then(task('task_3'), runs()))
>>>>>>> ```
>>>>>>>
>>>>>>> This is a very simple example - and not a great one, because it only duplicates the DAG logic - but you can see more examples in my draft PR <https://github.com/apache/airflow/pull/25112/files#diff-b1f30afa38d247f9204790392ab6888b04288603ac4d38154d05e6c5b998cf85R28-R82>[1] and in my draft AIP <https://docs.google.com/document/d/1priak1uiJTXP1F9K5B8XS8qmeRbJ8trYLvE4k2aBY5c/edit#heading=h.atmk0p7fmv7g>[2].
>>>>>>>
>>>>>>> I started writing up an AIP in a Google doc[2] which y'all can check. It's very close to what I have written here :)
>>>>>>>
>>>>>>> LMK what y'all think. I am also happy to publish this as a separate library if y'all wanna be cautious about adding it directly to Airflow.
>>>>>>> -P.
>>>>>>>
>>>>>>> [1] https://github.com/apache/airflow/pull/25112/files#diff-b1f30afa38d247f9204790392ab6888b04288603ac4d38154d05e6c5b998cf85R28-R82
>>>>>>> [2] https://docs.google.com/document/d/1priak1uiJTXP1F9K5B8XS8qmeRbJ8trYLvE4k2aBY5c/edit#
>>>>>>>
>>>>>>> On Sun, Jul 17, 2022 at 2:13 AM Jarek Potiuk <[email protected]> wrote:
>>>>>>>
>>>>>>>> Yep. Just outline your proposal on devlist, Pablo :).
>>>>>>>>
>>>>>>>> On Sun, Jul 17, 2022 at 10:35 AM Ash Berlin-Taylor <[email protected]> wrote:
>>>>>>>> >
>>>>>>>> > Hi Pablo,
>>>>>>>> >
>>>>>>>> > Could you describe at a high level what you are thinking of? It's entirely possible it doesn't need any changes to core Airflow, or isn't significant enough to need an AIP.
>>>>>>>> >
>>>>>>>> > Thanks,
>>>>>>>> > Ash
>>>>>>>> >
>>>>>>>> > On 17 July 2022 07:43:54 BST, Pablo Estrada <[email protected]> wrote:
>>>>>>>> >>
>>>>>>>> >> Hi there!
>>>>>>>> >> I would like to start a discussion of an idea that I had for a testing framework for airflow.
>>>>>>>> >> I believe the first step would be to write up an AIP - so could I have access to write a new one on the cwiki?
>>>>>>>> >>
>>>>>>>> >> Thanks!
>>>>>>>> >> -P.
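P.S. Since "external providers" came up a few times above: staying outside is also cheap technically. Airflow discovers third-party providers through a Python entry point, so an external package can plug in without any change to the community code. A minimal sketch of the shape - the "acme" names are invented for illustration, and the exact metadata keys are described in the community providers documentation:

```
# acme_provider/__init__.py - a hypothetical, minimal third-party provider.
#
# The package advertises itself through the "apache_airflow_provider"
# entry point in its packaging metadata, e.g. in setup.cfg:
#
#   [options.entry_points]
#   apache_airflow_provider =
#       provider_info = acme_provider:get_provider_info


def get_provider_info():
    # Airflow calls this to learn about the provider; real providers also
    # register connection types, extra links, hooks, etc. in this dict.
    return {
        "package-name": "acme-airflow-provider",
        "name": "Acme",
        "description": "Hooks and operators for the (hypothetical) Acme service.",
        "versions": ["1.0.0"],
    }
```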
