Anyone :) ?

On Mon, Jul 18, 2022 at 10:38 AM Jarek Potiuk <[email protected]> wrote:

> I would love to hear what others think about the "in/out" approach - mine
> is just the line of thought I've been exploring during the last few months
> about providers, maintenance, the incentives of entities maintaining
> open-source projects, and especially the expectations it creates for our
> users. But those are just my thoughts and I'd love to hear what others
> think about it.
>
> On Mon, Jul 18, 2022 at 10:33 AM Jarek Potiuk <[email protected]> wrote:
>
>> I had some thoughts about it - this is also connected with recent
>> discussions about mixed governance for providers, and I think it's worth
>> using this discussion to set some rules and "boundaries" on when, how,
>> and especially why we want to accept some contributions, while other
>> contributions are better kept outside.
>>
>> We are about to start more seriously thinking (and discussing) about how
>> to split Airflow providers off Airflow. And I think we can split off more
>> than providers - this might be a good candidate to be a standalone, but
>> still community-maintained, package. If we are going to solve the problem
>> of splitting Airflow into N packages, one more package does not matter.
>> And it would nicely solve "version independence". We could even make it
>> Airflow 2.0+ compatible if we want.
>>
>> So I think the question of "is it tied to a specific Airflow version or
>> not" does not really prevent us from making it part of the community -
>> those two are not related (if we are going to have more repositories
>> anyway).
>>
>> The important part is really how "self-servicing" we can make it, how we
>> make sure it stays relevant with future versions of Airflow, and who does
>> that - namely, who has the incentive and "responsibility" to maintain it.
>> I am sure we will add more features to Airflow DAGs and simplify the way
>> DAGs are written over time, and the test harness will have to adapt to
>> that.
>>
>> There are pros and cons of having such a standalone package "in the
>> community/ASF project" and "out of it". We have a good example (from
>> similar kinds of tools/utils) in the past that we can learn from (and
>> maybe Bas can share more insights).
>>
>> https://github.com/BasPH/pylint-airflow - pylint plugin for Airflow DAGs
>>
>> Initially it was "sponsored" by GoDataDriven, where Bas worked, and I
>> think this is where it was born. That made sense, as it was likely also
>> useful for the customers of GoDataDriven (here I am guessing). But
>> apparently GoDataDriven's incentives wound down and it turned out that
>> its usefulness was not as big as expected (also, I think we all in the
>> Python community learned that Pylint is more of a distraction than real
>> help - we dumped Pylint eventually), and the plugin was not maintained
>> beyond some versions of 1.10. The tool is all but defunct now. Which is
>> perfectly understandable.
>>
>> In this case there is (I think) no risk of a "pylint"-like problem, but
>> the question of maintenance and adaptation to future versions of Airflow
>> remains.
>>
>> I think there is one big difference between something that is "in ASF
>> repos" and "out":
>>
>> * if we make it a standalone package in the "ASF Airflow community" - we
>> will have some obligation and expectations from our users to maintain it.
>> We can add some test harness (regardless of whether it will be in the
>> airflow repository or in a separate one) to make sure that new airflow
>> "core" changes will not break it - and we can fail our PRs if they do,
>> basically making "core" maintainers take care of this problem rather than
>> delegating it to someone else to react to core changes (this is what has
>> to happen with providers, I believe, even if we split them to a separate
>> repo). I think anything that we as the ASF community release should have
>> such harnesses - making sure that whatever we release and make available
>> to our users works together.
>>
>> * if it is outside of the "ASF community", someone will have to react to
>> "core airflow" changes. We will not do it in the community, we will not
>> pay attention, and such an "external tool" might break at any time because
>> we introduced a change in a part of the core that the external tool
>> implicitly relied on.
>>
>> For me, the question of whether something should be in or out should be
>> based on:
>>
>> * is it really useful for the community as a whole? -> if yes, we should
>> consider it
>> * is it strongly tied to the core of Airflow, in the sense of relying on
>> some internals that might change easily? -> if not, there is no need to
>> bring it in; it can easily be maintained outside by anyone
>> * if it is strongly tied to the core -> is there someone (a person or an
>> organisation) who wants to take the burden of maintaining it and has an
>> incentive to do it for quite some time? -> if yes, great, let them do
>> that!
>> * if it is strongly tied, do we want to take the burden as "core airflow
>> maintainers" of keeping it updated together with the core? -> if yes, we
>> should bring it in
>>
>> If we have a strongly tied tool that we do not want to maintain in the
>> core and there is no entity who would like to do it, then I think this idea
>> should be dropped :).
>>
>> J.
>>
>>
>> On Mon, Jul 18, 2022 at 1:52 AM Ping Zhang <[email protected]> wrote:
>>
>>> Hi Pablo,
>>>
>>> Wow, I really love this idea. This will greatly enrich the airflow
>>> ecosystem.
>>>
>>> I agree with Ash, it is better to have it as a standalone package. And
>>> we can use this framework to write Airflow core invariant tests, so that
>>> we can run them on every Airflow release to guarantee no regressions.
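>>>
>>> For example (just a rough sketch reusing the assertion API from Pablo's
>>> example below - the fails() outcome is my guess and is not something in
>>> the draft), a core invariant around trigger rules could look like:
>>>
>>> ```
>>> import datetime
>>>
>>> from airflow import models
>>> from airflow.operators.empty import EmptyOperator
>>> from airflow.utils.trigger_rule import TriggerRule
>>> # assert_that, given, task, fails and runs would come from the proposed
>>> # framework; fails() is hypothetical and not in the draft PR.
>>>
>>>
>>> def test_all_done_trigger_rule_invariant():
>>>     with models.DAG(
>>>         'core_trigger_rule_dag',
>>>         schedule_interval='@daily',
>>>         start_date=datetime.datetime(2022, 1, 1),
>>>     ) as dag:
>>>         upstream = EmptyOperator(task_id='upstream')
>>>         cleanup = EmptyOperator(
>>>             task_id='cleanup', trigger_rule=TriggerRule.ALL_DONE)
>>>         upstream >> cleanup
>>>
>>>     # Core invariant: a task with trigger_rule=ALL_DONE must still run
>>>     # even when its upstream task fails.
>>>     assert_that(
>>>         given(dag)
>>>             .when(task('upstream'), fails())
>>>             .then(task('cleanup'), runs()))
>>> ```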
>>>
>>> Thanks,
>>>
>>> Ping
>>>
>>>
>>> On Sun, Jul 17, 2022 at 1:09 PM Pablo Estrada <[email protected]>
>>> wrote:
>>>
>>>> Understood!
>>>>
>>>> TL;DR: I propose a testing framework where users can check for 'DAG
>>>> execution invariants' or 'DAG execution expectations' given certain task
>>>> outcomes.
>>>>
>>>> As DAGs grow in complexity, it might become difficult to reason about
>>>> their runtime behavior in many scenarios. Users may want to lay out
>>>> rules in the form of tests that can verify DAG execution results.
>>>> For example:
>>>>
>>>> - If any of my database_backup_* tasks fails, I want to ensure that at
>>>> least one email_alert_* task will run.
>>>> - If my 'check_authentication' task fails, I want to ensure that the
>>>> whole DAG will fail.
>>>> - If any of my DataflowOperator tasks fails, I want to ensure that a
>>>> PubsubOperator downstream will always run.
>>>>
>>>> These sorts of invariants don't require the DAG to actually be executed;
>>>> but in fact, they are pretty hard to test today: staging environments
>>>> can't check every possible runtime outcome.
>>>>
>>>> In this framework, users would define unit tests like this:
>>>>
>>>> ```
>>>> import datetime
>>>>
>>>> from airflow import models
>>>> from airflow.operators.empty import EmptyOperator
>>>> # assert_that, given, task, succeeds and runs come from the proposed
>>>> # testing framework.
>>>>
>>>> # An arbitrary fixed start date for the test DAG.
>>>> DEFAULT_DATE = datetime.datetime(2022, 1, 1)
>>>>
>>>>
>>>> def test_my_example_dag():
>>>>     the_dag = models.DAG(
>>>>         'the_basic_dag',
>>>>         schedule_interval='@daily',
>>>>         start_date=DEFAULT_DATE,
>>>>     )
>>>>
>>>>     with the_dag:
>>>>         op1 = EmptyOperator(task_id='task_1')
>>>>         op2 = EmptyOperator(task_id='task_2')
>>>>         op3 = EmptyOperator(task_id='task_3')
>>>>
>>>>         op1 >> op2 >> op3
>>>>
>>>>     # DAG invariant: if task_1 and task_2 succeed, then task_3 will
>>>>     # always run.
>>>>     assert_that(
>>>>         given(the_dag)
>>>>             .when(task('task_1'), succeeds())
>>>>             .and_(task('task_2'), succeeds())
>>>>             .then(task('task_3'), runs()))
>>>> ```
>>>>
>>>> This is a very simple example - and it's not great, because it only
>>>> duplicates the DAG logic - but you can see more examples in my draft
>>>> PR [1] and in my draft AIP [2].
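>>>>
>>>> As a rough sketch of a richer case (nothing I have implemented in the
>>>> draft yet - a fails() outcome and wildcard task matching would still
>>>> need to be designed), the first invariant from the list above might
>>>> read something like:
>>>>
>>>> ```
>>>> # Sketch only: fails() and any_task_matching() do not exist in the
>>>> # draft PR; they illustrate what the API could grow into.
>>>> # backup_dag is a hypothetical DAG containing database_backup_* and
>>>> # email_alert_* tasks.
>>>> assert_that(
>>>>     given(backup_dag)
>>>>         .when(any_task_matching('database_backup_*'), fails())
>>>>         .then(any_task_matching('email_alert_*'), runs()))
>>>> ```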
>>>>
>>>> I started writing up an AIP in a Google doc[2] which y'all can check.
>>>> It's very close to what I have written here : )
>>>>
>>>> LMK what y'all think. I am also happy to publish this as a separate
>>>> library if y'all wanna be cautious about adding it directly to Airflow.
>>>> -P.
>>>>
>>>> [1]
>>>> https://github.com/apache/airflow/pull/25112/files#diff-b1f30afa38d247f9204790392ab6888b04288603ac4d38154d05e6c5b998cf85R28-R82
>>>> [2]
>>>> https://docs.google.com/document/d/1priak1uiJTXP1F9K5B8XS8qmeRbJ8trYLvE4k2aBY5c/edit#
>>>>
>>>>
>>>> On Sun, Jul 17, 2022 at 2:13 AM Jarek Potiuk <[email protected]> wrote:
>>>>
>>>>> Yep. Just outline your proposal on devlist, Pablo :).
>>>>>
>>>>> On Sun, Jul 17, 2022 at 10:35 AM Ash Berlin-Taylor <[email protected]>
>>>>> wrote:
>>>>> >
>>>>> > Hi Pablo,
>>>>> >
>>>>> > Could you describe at a high level what you are thinking of? It's
>>>>> entirely possible it doesn't need any changes to core Airflow, or isn't
>>>>> significant enough to need an AIP.
>>>>> >
>>>>> > Thanks,
>>>>> > Ash
>>>>> >
>>>>> > On 17 July 2022 07:43:54 BST, Pablo Estrada
>>>>> <[email protected]> wrote:
>>>>> >>
>>>>> >> Hi there!
>>>>> >> I would like to start a discussion of an idea that I had for a
>>>>> testing framework for airflow.
>>>>> >> I believe the first step would be to write up an AIP - so could I
>>>>> have access to write a new one on the cwiki?
>>>>> >>
>>>>> >> Thanks!
>>>>> >> -P.
>>>>>
>>>>
