What do you think, Pablo, about "being out" vs. "being in" the
official repo?

On Thu, Jul 28, 2022 at 3:51 PM Jarek Potiuk <[email protected]> wrote:

> Anyone :) ?
>
> On Mon, Jul 18, 2022 at 10:38 AM Jarek Potiuk <[email protected]> wrote:
>
>> I would love to hear what others think about the "in/out" approach - mine
>> is just the line of thought I've been exploring over the last few months,
>> thinking about providers, maintenance, the incentives of entities
>> maintaining open-source projects, and especially the expectations this
>> creates in our users. But those are just my thoughts and I'd love to hear
>> what others think about it.
>>
>> On Mon, Jul 18, 2022 at 10:33 AM Jarek Potiuk <[email protected]> wrote:
>>
>>> I had some thoughts about it - this is also connected with recent
>>> discussions about mixed governance for providers, and I think it's worth
>>> using this discussion to set some rules and "boundaries" on when,
>>> how, and especially why we want to accept some contributions, while for
>>> other contributions it's better for them to stay outside.
>>>
>>> We are about to start thinking (and discussing) more seriously about how
>>> to split Airflow providers off Airflow. And I think we can split off more
>>> than providers - this might be a good candidate to be a standalone, but
>>> still community-maintained, package. If we are going to solve the problem
>>> of splitting Airflow into N packages, one more package does not matter.
>>> And it would nicely solve "version independence". We could even make it
>>> Airflow 2.0+ compatible if we want.
>>>
>>> So I think the question of "is it tied to a specific Airflow version or
>>> not" does not really prevent us from making it part of the community -
>>> those two are not related (if we are going to have more repositories
>>> anyway).
>>>
>>> The important part is really how "self-servicing" we can make it, how we
>>> make sure it stays relevant for future versions of Airflow, and who does
>>> that work - namely, who has the incentive and "responsibility" to
>>> maintain it. I am sure we will add more features to Airflow DAGs and
>>> simplify the way DAGs are written over time, and the test harness will
>>> have to adapt to that.
>>>
>>> There are pros and cons to having such a standalone package "in the
>>> community/ASF project" vs. "out of it". We have a good example (of a
>>> similar kind of tool/util) from the past that we can learn from (and
>>> maybe Bas can share more insights):
>>>
>>> https://github.com/BasPH/pylint-airflow - pylint plugin for Airflow DAGs
>>>
>>> Initially that was "sponsored" by GoDataDriven, where Bas worked, and I
>>> think this is where it was born. That made sense, as it was likely also
>>> useful for the customers of GoDataDriven (here I am guessing). But
>>> apparently GoDataDriven's incentives wound down, and it turned out that
>>> the tool was not as useful as hoped (also, I think we in the Python
>>> community learned that Pylint is more of a distraction than a real
>>> help - we dumped Pylint eventually), and the plugin was not maintained
>>> beyond some versions of 1.10. The tool is all but defunct now. Which is
>>> perfectly understandable.
>>>
>>> In this case there is (I think) no risk of a "pylint" like problem, but
>>> the question of maintenance and adaptation to future versions of Airflow
>>> remains.
>>>
>>> I think there is one big difference between something that is "in ASF
>>> repos" and something that is "out":
>>>
>>> * if we make it a standalone package in the "ASF Airflow community", we
>>> will have some obligation, and expectations from our users, to maintain
>>> it. We can add a test harness (regardless of whether it lives in the
>>> airflow repository or in a separate one) to make sure that new Airflow
>>> "core" changes will not break it (and we can fail our PRs if they do) -
>>> basically making "core" maintainers take care of this problem rather
>>> than delegating it to someone else to react to core changes (this is
>>> what has to happen with providers, I believe, even if we split them into
>>> a separate repo). I think anything that we release as the ASF community
>>> should have such harnesses - making sure that whatever we release and
>>> make available to our users works together.
>>>
>>> * if it is outside of the "ASF community", someone else will have to
>>> react to "core airflow" changes. We will not do it in the community, we
>>> will not pay attention, and such an "external tool" might break at any
>>> time because we introduced a change in a part of the core that the
>>> external tool implicitly relied on.
>>>
>>> For me, the question of whether something should be in or out should be
>>> based on:
>>>
>>> * is it really useful for the community as a whole? -> if yes, we should
>>> consider it
>>> * is it strongly tied to the core of Airflow, in the sense of relying on
>>> some internals that might change easily? -> if not, there is no need to
>>> bring it in; it can easily be maintained outside by anyone
>>> * if it is strongly tied to the core -> is there someone (a person or
>>> organisation) who wants to take on the burden of maintaining it and has
>>> an incentive to do so for quite some time? -> if yes, great, let them do
>>> that!
>>> * if it is strongly tied, do we want to take on the burden as "core
>>> Airflow maintainers" of keeping it updated together with the core? -> if
>>> yes, we should bring it in
>>>
>>> If we have a strongly tied tool that we do not want to maintain in the
>>> core and there is no entity who would like to do it, then I think this idea
>>> should be dropped :).
>>>
>>> J.
>>>
>>>
>>> On Mon, Jul 18, 2022 at 1:52 AM Ping Zhang <[email protected]> wrote:
>>>
>>>> Hi Pablo,
>>>>
>>>> Wow, I really love this idea. This will greatly enrich the airflow
>>>> ecosystem.
>>>>
>>>> I agree with Ash that it is better to have it as a standalone package.
>>>> And we can use this framework to write Airflow core invariant tests, so
>>>> that we can run them on every Airflow release to guarantee no
>>>> regressions.
>>>>
>>>> Thanks,
>>>>
>>>> Ping
>>>>
>>>>
>>>> On Sun, Jul 17, 2022 at 1:09 PM Pablo Estrada
>>>> <[email protected]> wrote:
>>>>
>>>>> Understood!
>>>>>
>>>>> TL;DR: I propose a testing framework where users can check for 'DAG
>>>>> execution invariants' or 'DAG execution expectations' given certain task
>>>>> outcomes.
>>>>>
>>>>> As DAGs grow in complexity, it can become difficult to reason about
>>>>> their runtime behavior across many scenarios. Users may want to lay
>>>>> out rules, in the form of tests, that verify DAG execution results.
>>>>> For example:
>>>>>
>>>>> - If any of my database_backup_* tasks fails, I want to ensure that at
>>>>> least one email_alert_* task will run.
>>>>> - If my 'check_authentication' task fails, I want to ensure that the
>>>>> whole DAG will fail.
>>>>> - If any of my DataflowOperator tasks fails, I want to ensure that a
>>>>> PubsubOperator downstream will always run.
>>>>>
>>>>> These sorts of invariants don't require the DAG to actually be
>>>>> executed; yet they are pretty hard to test today: staging environments
>>>>> can't check every possible runtime outcome.
>>>>>
>>>>> In this framework, users would define unit tests like this:
>>>>>
>>>>> ```
>>>>> def test_my_example_dag():
>>>>>     the_dag = models.DAG(
>>>>>         'the_basic_dag',
>>>>>         schedule_interval='@daily',
>>>>>         start_date=DEFAULT_DATE,
>>>>>     )
>>>>>
>>>>>     with the_dag:
>>>>>         op1 = EmptyOperator(task_id='task_1')
>>>>>         op2 = EmptyOperator(task_id='task_2')
>>>>>         op3 = EmptyOperator(task_id='task_3')
>>>>>
>>>>>         op1 >> op2 >> op3
>>>>>
>>>>>     # DAG invariant: if task_1 and task_2 succeed, then task_3 will
>>>>>     # always run
>>>>>     assert_that(
>>>>>         given(the_dag)
>>>>>             .when(task('task_1'), succeeds())
>>>>>             .and_(task('task_2'), succeeds())
>>>>>             .then(task('task_3'), runs()))
>>>>> ```
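[Editor's note: as an illustration of the idea above, a given/when/then checker like this can be prototyped without executing any DAG at all, by simulating task outcomes over the dependency graph. The sketch below is hypothetical: the class and method names are assumptions, not an existing Airflow API, and it models only the default "all upstream succeeded" trigger rule, whereas real Airflow trigger rules are much richer.]

```python
# Hypothetical sketch of a fluent DAG-invariant checker. It does not
# use Airflow at all: the DAG is modeled as a plain mapping of
# task_id -> list of upstream task_ids, and a task is assumed to run
# iff all of its upstream tasks ran and succeeded (i.e. only the
# default "all_success" trigger rule is simulated).

class Scenario:
    def __init__(self, upstream_map):
        self.upstream = upstream_map   # task_id -> [upstream task_ids]
        self.outcomes = {}             # task_id -> "success" / "failed"

    def when(self, task_id, outcome):
        """Record a simulated outcome for a task; returns self for chaining."""
        self.outcomes[task_id] = outcome
        return self

    and_ = when                        # alias for readable chaining

    def runs(self, task_id):
        """True iff task_id would run under the simulated outcomes."""
        if self.outcomes.get(task_id) == "failed":
            return False
        # A task runs only if every upstream task also runs (and did
        # not fail); tasks with no upstream always run.
        return all(self.runs(u) for u in self.upstream.get(task_id, []))


# Usage: the task_1 >> task_2 >> task_3 chain from the example above.
deps = {"task_1": [], "task_2": ["task_1"], "task_3": ["task_2"]}

happy = Scenario(deps).when("task_1", "success").and_("task_2", "success")
print(happy.runs("task_3"))   # True: all upstream tasks succeeded

broken = Scenario(deps).when("task_1", "failed")
print(broken.runs("task_3"))  # False: the failure propagates downstream
```

A real implementation would instead walk an actual `DAG` object's task graph and honor each task's trigger rule, but the fluent shape of the API can stay the same.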
>>>>>
>>>>> This is a very simple example - and it's not great, because it only
>>>>> duplicates the DAG logic - but you can see more examples in my draft
>>>>> PR [1] and in my draft AIP [2].
>>>>>
>>>>> I started writing up an AIP in a Google doc[2] which y'all can check.
>>>>> It's very close to what I have written here : )
>>>>>
>>>>> LMK what y'all think. I am also happy to publish this as a separate
>>>>> library if y'all wanna be cautious about adding it directly to Airflow.
>>>>> -P.
>>>>>
>>>>> [1]
>>>>> https://github.com/apache/airflow/pull/25112/files#diff-b1f30afa38d247f9204790392ab6888b04288603ac4d38154d05e6c5b998cf85R28-R82
>>>>> [2]
>>>>> https://docs.google.com/document/d/1priak1uiJTXP1F9K5B8XS8qmeRbJ8trYLvE4k2aBY5c/edit#
>>>>>
>>>>>
>>>>> On Sun, Jul 17, 2022 at 2:13 AM Jarek Potiuk <[email protected]> wrote:
>>>>>
>>>>>> Yep. Just outline your proposal on devlist, Pablo :).
>>>>>>
>>>>>> On Sun, Jul 17, 2022 at 10:35 AM Ash Berlin-Taylor <[email protected]>
>>>>>> wrote:
>>>>>> >
>>>>>> > Hi Pablo,
>>>>>> >
>>>>>> > Could you describe at a high level what you are thinking of? It's
>>>>>> entirely possible it doesn't need any changes to core Airflow, or isn't
>>>>>> significant enough to need an AIP.
>>>>>> >
>>>>>> > Thanks,
>>>>>> > Ash
>>>>>> >
>>>>>> > On 17 July 2022 07:43:54 BST, Pablo Estrada
>>>>>> <[email protected]> wrote:
>>>>>> >>
>>>>>> >> Hi there!
>>>>>> >> I would like to start a discussion of an idea that I had for a
>>>>>> testing framework for airflow.
>>>>>> >> I believe the first step would be to write up an AIP - so could I
>>>>>> have access to write a new one on the cwiki?
>>>>>> >>
>>>>>> >> Thanks!
>>>>>> >> -P.
>>>>>>
>>>>>
