Anyone :) ?

On Mon, Jul 18, 2022 at 10:38 AM Jarek Potiuk <[email protected]> wrote:
I would love to hear what others think about the "in/out" approach - mine is just the line of thought I've been exploring over the last few months about providers, maintenance, the incentives of entities maintaining open-source projects, and especially the expectations this creates among users. But those are just my thoughts, and I'd love to hear what others think about it.

On Mon, Jul 18, 2022 at 10:33 AM Jarek Potiuk <[email protected]> wrote:

I had some thoughts about it - this is also connected with the recent discussions about mixed governance for providers, and I think it's worth using this discussion to set some rules and "boundaries" on when, how, and especially why we want to accept some contributions, while other contributions are better kept outside.

We are about to start thinking (and discussing) more seriously about how to split the providers off Airflow. And I think we can split off more than providers - this might be a good candidate for a standalone, but still community-maintained, package. If we are going to solve the problem of splitting Airflow into N packages, one more package does not matter. It would also nicely solve "version independence". We could even make it Airflow 2.0+ compatible if we want.

So I think the question of "is it tied to a specific Airflow version or not" does not really prevent us from making it part of the community - those two are not related (if we are going to have more repositories anyway).

The important part is really how "self-servicing" we can make it, how we make sure it stays relevant for future versions of Airflow, and who does that - namely, who has the incentive and "responsibility" to maintain it. I am sure we will add more features to Airflow DAGs and simplify the way DAGs are written over time, and the test harness will have to adapt to that.

There are pros and cons to having such a standalone package "in the community/ASF project" versus "out of it". We have a good example (of a similar kind of tool/util) from the past that we can learn from (and maybe Bas can share more insights):

https://github.com/BasPH/pylint-airflow - a Pylint plugin for Airflow DAGs

Initially it was "sponsored" by GoDataDriven, where Bas worked, and I think that is where it was born. That made sense, as it was likely also useful for GoDataDriven's customers (here I am guessing). But apparently GoDataDriven's incentives wound down, and it turned out the plugin was not as useful as hoped (I also think we in the Python community learned that Pylint is more of a distraction than a real help - we dumped Pylint eventually), and the plugin was not maintained beyond some 1.10 versions. The tool is all but defunct now, which is perfectly understandable.

In this case there is (I think) no risk of a "pylint"-like problem, but the question of maintenance and adaptation to future versions of Airflow remains.

I think there is one big difference between something that is "in the ASF repos" and "out":

* if we make it a standalone package in the "ASF Airflow community", we will have some obligation, and expectations from our users, to maintain it.
We can add some test harness (regardless of whether it lives in the Airflow repository or in a separate one) to make sure that new Airflow "core" changes will not break it, and we can fail our PRs if they do (a rough sketch of such a check is at the end of this message) - basically making "core" maintainers take care of this problem rather than delegating it to someone else to react to core changes (this is what has to happen with providers, I believe, even if we split them into a separate repo). I think anything that we as the ASF community release should have such harnesses - making sure that whatever we release and make available to our users works together.

* if it is outside of the "ASF community", someone else will have to react to "core Airflow" changes. We will not do it in the community, we will not pay attention, and such an "external tool" might break at any time because we introduced a change in a part of the core that the external tool implicitly relied on.

For me, the question of whether something should be in or out should be based on:

* is it really useful for the community as a whole? -> if yes, we should consider it
* is it strongly tied to the core of Airflow, in the sense of relying on internals that might change easily? -> if not, there is no need to bring it in; it can easily be maintained outside by anyone
* if it is strongly tied to the core -> is there someone (a person or organisation) who wants to take on the burden of maintaining it and has an incentive to do so for quite some time? -> if yes, great, let them do that!
* if it is strongly tied, do we as "core Airflow maintainers" want to take on the burden of keeping it updated together with the core? -> if yes, we should bring it in

If we have a strongly tied tool that we do not want to maintain in the core and there is no entity who would like to do it, then I think this idea should be dropped :).
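Coming back to the harness idea above - a rough sketch of what one such compatibility check could look like, assuming a plain pytest suite run against Airflow's main branch (the test names and the specific assertions here are illustrative, not an existing harness):

```
# Illustrative compatibility checks a harness could run on every core PR,
# failing fast if internals the standalone testing package relies on change shape.
import inspect

from airflow.models.dag import DAG
from airflow.models.taskinstance import TaskInstance


def test_dag_still_accepts_expected_constructor_args():
    # The testing framework builds DAGs programmatically, so these DAG.__init__
    # parameters need to stay available (parameter names as of Airflow 2.x).
    params = inspect.signature(DAG.__init__).parameters
    for name in ("dag_id", "schedule_interval", "start_date"):
        assert name in params, f"DAG.__init__ no longer accepts '{name}'"


def test_task_instance_still_exposes_state():
    # Invariant assertions inspect task outcomes via TaskInstance.state.
    assert hasattr(TaskInstance, "state")
```

Running checks like these in the core CI would surface a breaking change in the PR that introduces it, rather than after a release.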
J.

On Mon, Jul 18, 2022 at 1:52 AM Ping Zhang <[email protected]> wrote:

Hi Pablo,

Wow, I really love this idea. This will greatly enrich the Airflow ecosystem.

I agree with Ash that it is better to have it as a standalone package. And we can use this framework to write Airflow core invariant tests, so that we can run them on every Airflow release to guarantee no regressions.

Thanks,

Ping

On Sun, Jul 17, 2022 at 1:09 PM Pablo Estrada <[email protected]> wrote:

Understood!

TL;DR: I propose a testing framework where users can check 'DAG execution invariants' or 'DAG execution expectations' given certain task outcomes.

As DAGs grow in complexity, it can become difficult to reason about their runtime behavior in every scenario. Users may want to lay out rules, in the form of tests, that verify DAG execution results. For example:

- If any of my database_backup_* tasks fails, I want to ensure that at least one email_alert_* task will run.
- If my 'check_authentication' task fails, I want to ensure that the whole DAG will fail.
- If any of my DataflowOperator tasks fails, I want to ensure that a PubsubOperator downstream will always run.

These sorts of invariants don't require the DAG to actually be executed, but they are pretty hard to test today: staging environments can't check every possible runtime outcome.

In this framework, users would define unit tests like this:

```
import datetime

from airflow import models
from airflow.operators.empty import EmptyOperator

# The helpers (assert_that, given, task, and_, succeeds, runs) come from the
# proposed testing framework; see the draft PR [1] for their definitions.

DEFAULT_DATE = datetime.datetime(2022, 1, 1)  # any fixed date works here


def test_my_example_dag():
    the_dag = models.DAG(
        'the_basic_dag',
        schedule_interval='@daily',
        start_date=DEFAULT_DATE,
    )

    with the_dag:
        op1 = EmptyOperator(task_id='task_1')
        op2 = EmptyOperator(task_id='task_2')
        op3 = EmptyOperator(task_id='task_3')

        op1 >> op2 >> op3

    # DAG invariant: if task_1 and task_2 succeed, then task_3 will always run.
    assert_that(
        given(the_dag)
        .when(task('task_1'), succeeds())
        .and_(task('task_2'), succeeds())
        .then(task('task_3'), runs()))
```

This is a very simple example - and it's not great, because it only duplicates the DAG logic - but you can see more examples in my draft PR [1] and in my draft AIP [2].
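As another illustration, the second expectation from the list above ('check_authentication' failing should fail the whole DAG) might be written in the same style as the sketch below. This is only a guess at how the API could be extended - it reuses the imports and DEFAULT_DATE from the example above, and fails() and dag() are assumed names that do not appear in the draft PR:

```
def test_auth_failure_fails_the_whole_dag():
    with models.DAG(
        'auth_guarded_dag',
        schedule_interval='@daily',
        start_date=DEFAULT_DATE,
    ) as the_dag:
        check_auth = EmptyOperator(task_id='check_authentication')
        backup = EmptyOperator(task_id='database_backup_1')

        check_auth >> backup

    # Hypothetical matchers: fails() and dag() are assumed, not part of the draft API.
    assert_that(
        given(the_dag)
        .when(task('check_authentication'), fails())
        .then(dag(), fails()))
```

The exact matcher vocabulary is an open question for the AIP; the point is only to show the shape of the assertion.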
I started writing up an AIP in a Google doc [2], which y'all can check. It's very close to what I have written here : )

LMK what y'all think. I am also happy to publish this as a separate library if y'all wanna be cautious about adding it directly to Airflow.
-P.

[1] https://github.com/apache/airflow/pull/25112/files#diff-b1f30afa38d247f9204790392ab6888b04288603ac4d38154d05e6c5b998cf85R28-R82
[2] https://docs.google.com/document/d/1priak1uiJTXP1F9K5B8XS8qmeRbJ8trYLvE4k2aBY5c/edit#

On Sun, Jul 17, 2022 at 2:13 AM Jarek Potiuk <[email protected]> wrote:

Yep. Just outline your proposal on the devlist, Pablo :).

On Sun, Jul 17, 2022 at 10:35 AM Ash Berlin-Taylor <[email protected]> wrote:

Hi Pablo,

Could you describe at a high level what you are thinking of? It's entirely possible it doesn't need any changes to core Airflow, or isn't significant enough to need an AIP.

Thanks,
Ash

On 17 July 2022 07:43:54 BST, Pablo Estrada <[email protected]> wrote:

Hi there!

I would like to start a discussion of an idea I had for a testing framework for Airflow. I believe the first step would be to write up an AIP - so could I have access to write a new one on the cwiki?

Thanks!
-P.