Re: Call with Nielsen team demoing their DAG debugging feature

2024-06-13 Thread Jarek Potiuk
Also, I would like to draw the attention of those who were interested in the
subject to two related PRs:

* https://github.com/apache/airflow/pull/40010 by jannisko (sorry if you
see this - I do not know your real name :) ) -> where you can run a dag test
while skipping (or actually marking as success) some tasks (for example
sensors)
* https://github.com/apache/airflow/pull/40205 by Vincent -> where you can
run a dag test using an executor rather than `_run_raw_task`
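To make the first idea concrete, here is a minimal sketch in plain Python. It is a toy model and not Airflow code: `Task`, `run_dag_test` and the `success_pattern` argument are hypothetical names chosen for this illustration, and the actual option added by the PR may look different. It only shows the pattern of recording matching tasks (for example sensors) as successful without executing them.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Task:
    """Hypothetical stand-in for an operator: an id plus a callable."""
    task_id: str
    run: Callable[[], None]


def run_dag_test(tasks: List[Task], success_pattern: Optional[str] = None) -> Dict[str, str]:
    """Run tasks in order; tasks whose id matches the pattern are not
    executed and are simply recorded as successful."""
    matcher = re.compile(success_pattern) if success_pattern else None
    states: Dict[str, str] = {}
    for task in tasks:
        if matcher and matcher.fullmatch(task.task_id):
            states[task.task_id] = "success (marked, not executed)"
            continue
        task.run()
        states[task.task_id] = "success"
    return states


def wait_for_file() -> None:
    # A real sensor would block here; the test run should never call it.
    raise RuntimeError("sensor should not run in this test")


executed: List[str] = []
tasks = [
    Task("wait_for_file_sensor", wait_for_file),
    Task("transform", lambda: executed.append("transform")),
]
states = run_dag_test(tasks, success_pattern=r".*_sensor")
```

Matching on the task id keeps the sketch small; a real implementation could just as well match on operator type.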

I think the debug feature that Nielsen showcased on the call falls into the
same pattern of "make our `airflow dags test` more powerful", and it would be
great if we could incorporate a similar pattern, where you can recreate the
task context from an already executed dag_run, as part of the
`airflow dags test` CLI command and the `dag.test()` method.
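A rough sketch of what "recreate the task context from an already executed dag_run" could mean is below. It is plain Python under stated assumptions, not the Nielsen tooling and not an Airflow API: `DagRunRecord`, `build_context` and `replay_task` are hypothetical names, and the context keys are only loosely modelled on what a task usually receives.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from types import MappingProxyType
from typing import Any, Callable, Mapping


@dataclass(frozen=True)
class DagRunRecord:
    """Hypothetical snapshot of an already executed DAG run."""
    dag_id: str
    run_id: str
    logical_date: datetime
    conf: Mapping[str, Any] = field(default_factory=dict)


def build_context(record: DagRunRecord, task_id: str) -> Mapping[str, Any]:
    """Rebuild a read-only, task-level context from the stored run."""
    return MappingProxyType({
        "dag_id": record.dag_id,
        "run_id": record.run_id,
        "task_id": task_id,
        "logical_date": record.logical_date,
        "ds": record.logical_date.strftime("%Y-%m-%d"),
        "params": dict(record.conf),
    })


def replay_task(record: DagRunRecord, task_id: str,
                fn: Callable[[Mapping[str, Any]], Any]) -> Any:
    """Call the task function with the recreated context.
    Nothing is written back to the stored run."""
    return fn(build_context(record, task_id))


record = DagRunRecord(
    dag_id="sales_report",
    run_id="scheduled__2024-06-01T00:00:00+00:00",
    logical_date=datetime(2024, 6, 1, tzinfo=timezone.utc),
    conf={"region": "emea"},
)


def extract(context: Mapping[str, Any]) -> str:
    return f"{context['params']['region']}/{context['ds']}"


result = replay_task(record, "extract", extract)
```

The frozen record and the read-only mapping stand in for the "no changes saved to the DB" property mentioned on the call: the replayed task can read the old run but cannot alter it.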

This is also, I think, a good opportunity to enhance the documentation and
explain all those patterns for debugging a DAG in
https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/debug.html#testing-dags-with-dag-test
- for now the documentation is rather bare-bones, but it would be great if
we explained some of the best practices and use cases for how DAG debugging
might be done using the `airflow dags test` command.

I think other people may have their own patterns for testing DAGs that
they could contribute here - both as documentation updates and maybe as new
features of our existing `airflow dags test` command.

WDYT?

J.


On Thu, Jun 13, 2024 at 10:29 PM Stefan Krawczyk  wrote:

> +1 for the recording please.
>
> On Thu, Jun 13, 2024, 1:26 PM Jarek Potiuk  wrote:
>
> > Just summarizing the call:
> >
> > * we had demo from Nielsen showing their debugging feature
> > * It's based on environments created in the research environment where
> > users can run DAGs and debug individual tasks - basically replaying and
> > debugging the tasks based on existing DAG runs but without saving any
> > changes to state of the dag in the DB
> > * pretty useful thing is the way how they can use existing DAG run to
> > recreate the context of execution based on existing dag run
> > * Nielsen team used it with Airflow 1.10 and they will look into how the
> > new `dag.test()` feature from Airflow 2.5 can be plugged into it and come
> > back to it
> > * nice thing is that they hooked it up with VSCode plugin where they can
> > easily do all that within VSCode and can debug it out-of-the-box
> > * possibly they could generalise it either as a "what could be done by
> > others" description or maybe even having VSCode-from-airflow
> out-of-the-box
> > (the latter was my brainstorming idea).
> >
> > I have a recording - I do not want to publish it on the public devlist,
> but
> > If anyone is interested - let me know and I will share.
> >
> > J.
> >
> >
> > On Thu, Jun 13, 2024 at 6:49 AM Albert Okiri 
> > wrote:
> >
> > > Hi Jarek, I'm interested in joining this call.
> > >
> > > Regards,
> > > Albert.
> > >
> > > On Thu, 13 Jun 2024, 07:43 Poorvi Rohidekar, <
> > > poorvirohidekar@gmail.com>
> > > wrote:
> > >
> > > > Hi Jarek,
> > > >
> > > > I'd be interested in joining this call.
> > > >
> > > > Regards,
> > > > Poorvi
> > > >
> > > > On Tue, 11 Jun 2024 at 21:42, Jarek Potiuk  wrote:
> > > >
> > > > > And yes. I will check with them about recording :)
> > > > >
> > > > > On Tue, Jun 11, 2024 at 4:41 PM Jarek Potiuk 
> > wrote:
> > > > >
> > > > > > I think I have to warn Nielsen team that we are going to have a
> big
> > > > crowd
> > > > > > :)
> > > > > >
> > > > > > On Tue, Jun 11, 2024 at 4:40 PM Bishundeo, Rajeshwar
> > > > > >  wrote:
> > > > > >
> > > > > >> This sounds exciting! I would like to join as well.
> > > > > >>
> > > > > >> -- Rajesh
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >> On 2024-06-10, 11:12 PM, "Ephraim Anierobi" <
> > > > ephraimanier...@gmail.com
> > > > > >> > wrote:
> > > > > >>
> > > > > >>
> > > > > >> Hi Jarek,
> > > > > >>
> > > > > >>
> > > > > >> Awesome!! I’m interested to join too
> > > > > >>
> > > > > >>
> > > > > >> Ephraim
> > > > > >>
> > > > > >>
> > > > > > >> On Mon, 10 Jun 2024 at 23:41, Mehta, Shubham wrote

RE: [DISCUSS] AIP-69 Remote Executor

2024-06-13 Thread Scheffler Jens (XC-AS/EAE-ADA-T)
Hi Airflow Devs,

After today's Airflow 3 planning and the "sneak preview" of Ash's AIP-72, I took 
some time today to update the AIP-69 Remote Executor with more technical 
details, as I tried a PoC implementation.

The AIP document is now updated, and I am calling for a second round of 
feedback / discussion at 
https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-69+Remote+Executor

Also, as requested, I shared my PoC code (which is now actually able to get the 
first task called via HTTP). I hope this gives some insight into the direction 
I am thinking of. Feedback is welcome here as well: 
https://github.com/apache/airflow/pull/40224

Direct feedback on the previous email: when I tried to look into where task 
execution could be cut, I saw a lot of spaghetti, and I am now convinced that 
AIP-69 must, in today's world, build on top of AIP-44, should not reinvent the 
wheel, and will in future greatly benefit from AIP-72. I intend for us to align 
across these three AIPs so that the pieces fit together. But this also means 
that not all targets can be met in a first release, and dependencies very 
probably can only be cut off after AIP-72.

The implementation target is to put most of the logic into the provider 
package; as you can see in the draft PR, only minimal changes are needed in 
the core.

Best regards

Jens Scheffler

Alliance: Enabler - Tech Lead (XC-AS/EAE-ADA-T)
Robert Bosch GmbH | Hessbruehlstraße 21 | 70565 Stuttgart-Vaihingen | GERMANY | 
www.bosch.com
Tel. +49 711 811-91508 | Mobil +49 160 90417410 | jens.scheff...@de.bosch.com


-Original Message-
From: Bolke de Bruin  
Sent: Sunday, May 19, 2024 1:14 PM
To: dev@airflow.apache.org
Subject: Re: [VOTE] AIP-69 Remote Executor

Hi Jens,

I've read your proposal and I think I understand where you want to take it.
To me the gist is:

"Allow running airflow agents behind corporate firewalls that use a pull model 
to run tasks and do not accept incoming connections".

I agree with the other commenters that the current proposal is not strong 
enough. Particularly, because it does not establish a relationship with other 
proposals and established ways of working. For example, you mention that the 
agent should communicate over HTTPS. If you would connect this to AIP-44, which 
imho only covers part of what you are doing, I would at least expect you to 
mention GRPC and how it would extend the message format of AIP-44. 
Architecturally I would like to see a discussion why it would be embedded in 
the webserver (spoiler: I don't think it should).

As Jarek is mentioning, you are not clear whether you expect tasks to be able 
to communicate with the central service or this is done by the agent on behalf 
of the tasks. In a corporate environment I would expect the latter, which puts 
it a lot closer to future isolation.

In addition, you are not documenting how an agent is deciding whether it can 
run a certain task and how it establishes a 'lock' to make sure no race 
condition happens. How do you recover from failures? Did you consider something 
like using Temporal here?
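To illustrate the pull model and the 'lock' question raised above, here is a minimal in-memory sketch. It is not the AIP-69 design: `CentralQueue`, `claim_next`, `report` and `agent_loop` are hypothetical names, and a `threading.Lock` stands in for whatever atomic claim the real central service would use (for example a database row lock). Failure recovery is deliberately left out.

```python
import threading
from typing import Dict, List, Optional, Tuple


class CentralQueue:
    """Hypothetical in-memory stand-in for the central service."""

    def __init__(self, task_ids: List[str]) -> None:
        self._lock = threading.Lock()
        self._state: Dict[str, str] = {t: "queued" for t in task_ids}
        self._owner: Dict[str, str] = {}

    def claim_next(self, agent_id: str) -> Optional[str]:
        """Atomically hand one queued task to exactly one agent."""
        with self._lock:
            for task_id, state in self._state.items():
                if state == "queued":
                    self._state[task_id] = "running"
                    self._owner[task_id] = agent_id
                    return task_id
        return None

    def report(self, agent_id: str, task_id: str, ok: bool) -> None:
        """Record the outcome; only the agent holding the claim may report."""
        with self._lock:
            if self._owner.get(task_id) != agent_id:
                raise PermissionError("task is claimed by another agent")
            self._state[task_id] = "success" if ok else "failed"


def agent_loop(queue: CentralQueue, agent_id: str,
               done: List[Tuple[str, str]]) -> None:
    """Pull model: the agent only makes outgoing calls and never listens."""
    while (task_id := queue.claim_next(agent_id)) is not None:
        done.append((agent_id, task_id))  # stand-in for running the task
        queue.report(agent_id, task_id, ok=True)


queue = CentralQueue([f"task_{i}" for i in range(20)])
done: List[Tuple[str, str]] = []
threads = [
    threading.Thread(target=agent_loop, args=(queue, f"agent_{n}", done))
    for n in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
claimed = sorted(task_id for _, task_id in done)
```

Because the state check and the state change happen under one lock, two agents can never claim the same task, which is the race condition the question is about.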

So while I understand that some things are left open to implementation, now 
there are too many questions still open and those influence the direction.

Cheers
Bolke

On Sun, 19 May 2024 at 00:37, Jarek Potiuk  wrote:

> > P.S.: I'd like to take your feedback seriously; the AIP process
> > description in Confluence just says: "Once you or someone else feels
> > like there’s a rough consensus on the idea and there’s no strong
> > opposition, you can move your proposal to the Vote phase." - neither
> > this nor the structure template mentions that a technical spec or PR
> > must be provided prior to a vote. If you feel that an AIP should include
> > this, then I assume the contribution docs need to be adjusted.
>
> Just one - comment here (I am at Pycon US now). I think it's hard to 
> describe in detail the requirements because it will be more of a 
> case-by-case, but I think you should simply compare this with other 
> AIPs where a lot of details have been worked out before voting - often 
> with code snippets, or detailed description of the APIs or communication 
> involved.
> It's just difficult to understand what exactly changes it will bring - 
> will it use AIP-44 or not? - not clear, will it change how the 
> heartbeat will work ? How are we going to handle retries, what the API 
> proposal will be in general (snippets?). Even if it is not a final 
> version, but first "pass" of how roughly all those things will look 
> like - even if not working, what changes it will impose on each 
> component is usually what you will find in recently created AIPs. I 
> think a LOT depends on the impl

Re: [DISCUSS] Restore the SQL server backend

2024-06-13 Thread James Duong
I haven’t looked that much into the Airflow code yet, but the backend code must 
already be pluggable since it supports multiple databases.

What if we were to maintain a separate repo and artifact for just an extension 
for the SQL Server backend? This seems like it’d be low-impact on the Airflow 
community from a maintenance perspective, but keeps MSSQL users using the 
official Airflow artifacts. (Maybe just have the Airflow documentation link to 
the extension).

From: Kaxil Naik 
Date: Thursday, June 13, 2024 at 9:53 AM
To: dev@airflow.apache.org 
Subject: Re: [DISCUSS] Restore the SQL server backend
The cost of maintaining it in the Airflow repo (with CI/CD, GH issues etc.) is,
unfortunately, just too high.

On Thu, 13 Jun 2024 at 17:11, James Duong 
wrote:

> Hi Jarek,
>
> Thanks for your response.
>
> I would prefer to make this work part of the main Airflow repository.
>
> My previous experience with maintaining forks of Apache projects has
> always been that it fragments the community unnecessarily. Users of a
> company-specific fork might ask for features (not just relating to MSSQL)
> that could benefit the community at large. The larger community can miss
> out on insights from the users and maintainers of the fork.
>
> There is a greater cost both getting updates from the main repo and
> upstreaming improvements to the main repo as the codebase diverges further.
>
> I could see confusion from end users if there are multiple sets of
> artifacts for different forks of Airflow about which one to get.
>
> What are your thoughts on this?
>
> From: Jarek Potiuk 
> Date: Monday, June 3, 2024 at 9:32 PM
> To: dev@airflow.apache.org 
> Cc: james.du...@improving.com.invalid 
> Subject: Re: [DISCUSS] Restore the SQL server backend
> I am not sure if you read what I wrote with full understanding.
>
> To be perfectly honest - If you secure enough resources, I think *STILL* it
> will be better if you maintain your own fork and apply necessary changes
> and offer it commercially to anyone who needs it. This is way easier for
> the community, and better for you commercially  - and if you are **really**
> committed for a long term to do MSSQL, then you should have no problem in
> maintaining the fork.
>
> On Mon, Jun 3, 2024 at 11:15 PM James Duong
>  wrote:
>
> > Thanks for all of your feedback and discussion.
> >
> > The interest and usage from the enterprise MSSQL community is very large
> -
> > it's unfortunate that numbers are difficult to gather.
> >
> > In terms of the support - I hear you that it should not be limited to
> only
> > CI improvements and PR support and a more active role needs to be taken.
> I
> > am working on a plan that would provide the necessary involvement in the
> > community.
> >
> > Please allow me some time to see what is possible.
> >
> > From: Wei Lee 
> > Date: Friday, May 31, 2024 at 8:45 AM
> > To: dev@airflow.apache.org 
> > Cc: james.du...@improving.com.invalid
> > Subject: Re: [DISCUSS] Restore the SQL server backend
> > I agree with Jed and the following comments. If my memory serves me
> right,
> > this topic has been discussed a few times in the past. 5% doesn't seem
> very
> > convincing. Even if it's biased, I'm still not persuaded that there are a
> > large number of users that are worth the community's effort. And Jarek
> > pointed out a great solution for forking Airflow and adding MSSQL support
> > to it.
> >
> > Best,
> > Wei
> >
> > > On May 31, 2024, at 7:50 PM, Elad Kalif  wrote:
> > >
> > > I agree with Jarek
> > >
> > > I am a bit worried about the mental model of this proposal as you are
> > > offering to deliver a feature but you are not offering being a
> community
> > > member.
> > > I had a lot of frustration with the MsSQL backend tests, it really
> caused
> > > me pain as a contributor. According to your mental model - will you
> > > actively review community PRs, triage Airflow issues and offer guidance
> > and
> > > help when needed about MsSQL or will the maintainers have to track
> these
> > > problems and actively tag you/your team for assistance?
> > >
> > > Let me give an example: User opens a Github issue about HA scheduler.
> > Will
> > > your team participate in the issue triage? Or do you expect the
> community
> > > to triage the issue and only after some discussion when it turns out
> that
> > > it's MsSQL specific issue then we need to notify you?
> > >
> > > On Fri, May 31, 2024 at 10:05 AM Jarek Potiuk 
> wrote:
> > >
> > >>> We also understand and are ready to address the concerns stated in
> the
> > >> vote about support and resolving CI issues
> > >>
> > >> Hello James,
> > >>
> > >> Could you please explain how exactly are you planning to help a number
> > of
> > >> maintainers who are working on developing new feature to make sure
> > >> they know and realise unobvious consequences of some of the DB changes
> > they
> > >> might have when some of the features of MYSQL are causing - for
> example
> > >> heavy slowdo

Re: Call with Nielsen team demoing their DAG debugging feature

2024-06-13 Thread Stefan Krawczyk
+1 for the recording please.

On Thu, Jun 13, 2024, 1:26 PM Jarek Potiuk  wrote:

> Just summarizing the call:
>
> * we had demo from Nielsen showing their debugging feature
> * It's based on environments created in the research environment where
> users can run DAGs and debug individual tasks - basically replaying and
> debugging the tasks based on existing DAG runs but without saving any
> changes to state of the dag in the DB
> * pretty useful thing is the way how they can use existing DAG run to
> recreate the context of execution based on existing dag run
> * Nielsen team used it with Airflow 1.10 and they will look into how the
> new `dag.test()` feature from Airflow 2.5 can be plugged into it and come
> back to it
> * nice thing is that they hooked it up with VSCode plugin where they can
> easily do all that within VSCode and can debug it out-of-the-box
> * possibly they could generalise it either as a "what could be done by
> others" description or maybe even having VSCode-from-airflow out-of-the-box
> (the latter was my brainstorming idea).
>
> I have a recording - I do not want to publish it on the public devlist, but
> If anyone is interested - let me know and I will share.
>
> J.
>
>
> On Thu, Jun 13, 2024 at 6:49 AM Albert Okiri 
> wrote:
>
> > Hi Jarek, I'm interested in joining this call.
> >
> > Regards,
> > Albert.
> >
> > On Thu, 13 Jun 2024, 07:43 Poorvi Rohidekar, <
> > poorvirohidekar@gmail.com>
> > wrote:
> >
> > > Hi Jarek,
> > >
> > > I'd be interested in joining this call.
> > >
> > > Regards,
> > > Poorvi
> > >
> > > On Tue, 11 Jun 2024 at 21:42, Jarek Potiuk  wrote:
> > >
> > > > And yes. I will check with them about recording :)
> > > >
> > > > On Tue, Jun 11, 2024 at 4:41 PM Jarek Potiuk 
> wrote:
> > > >
> > > > > I think I have to warn Nielsen team that we are going to have a big
> > > crowd
> > > > > :)
> > > > >
> > > > > On Tue, Jun 11, 2024 at 4:40 PM Bishundeo, Rajeshwar
> > > > >  wrote:
> > > > >
> > > > >> This sounds exciting! I would like to join as well.
> > > > >>
> > > > >> -- Rajesh
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >> On 2024-06-10, 11:12 PM, "Ephraim Anierobi" <
> > > ephraimanier...@gmail.com
> > > > >> > wrote:
> > > > >>
> > > > >>
> > > > >> Hi Jarek,
> > > > >>
> > > > >>
> > > > >> Awesome!! I’m interested to join too
> > > > >>
> > > > >>
> > > > >> Ephraim
> > > > >>
> > > > >>
> > > > > >> On Mon, 10 Jun 2024 at 23:41, Mehta, Shubham wrote:
> > > > >>
> > > > >>
> > > > >> > This is great, thanks for working with them to share this with
> the
> > > > >> > community. Interested to join as well.
> > > > >> >
> > > > >> > Shubham
> > > > >> >
> > > > > >> > On 2024-06-07, 11:56 PM, "Jarek Potiuk" wrote:
> > > > >> >
> > > > >> >
> > > > >> > Hello here,
> > > > >> >
> > > > >> >
> > > > >> > At PyCon US I met a few people from Nielsen who had developed
> > > > internally
> > > > >> > tooling for IDE/Python debugger integrated debugging of Airflow
> > > DAGs.
> > > > >> >
> > > > >> >
> > > > >> > They are thrilled with the opportunity of sharing what they've
> > done
> > > > and
> > > > >> > possibly maybe even bringing it to Airflow. As one of the
> Airflow
> > 3
> > > > >> > workstreams I am particularly interested in is to "*Simplify the
> > > > >> learning
> > > > >> > curve*" [1] - this sounds pretty interesting in th

Re: Call with Nielsen team demoing their DAG debugging feature

2024-06-13 Thread Jarek Potiuk
Just summarizing the call:

* We had a demo from Nielsen showing their debugging feature
* It's based on environments created in the research environment where
users can run DAGs and debug individual tasks - basically replaying and
debugging the tasks based on existing DAG runs, but without saving any
changes to the state of the DAG in the DB
* A pretty useful thing is the way they can use an existing DAG run to
recreate the context of execution
* The Nielsen team used it with Airflow 1.10, and they will look into how the
new `dag.test()` feature from Airflow 2.5 can be plugged into it and come
back to it
* A nice thing is that they hooked it up with a VSCode plugin, so they can
easily do all that within VSCode and debug it out-of-the-box
* Possibly they could generalise it, either as a "what could be done by
others" description or maybe even as VSCode-from-airflow out-of-the-box
(the latter was my brainstorming idea).

I have a recording - I do not want to publish it on the public devlist, but
if anyone is interested, let me know and I will share it.

J.


On Thu, Jun 13, 2024 at 6:49 AM Albert Okiri  wrote:

> Hi Jarek, I'm interested in joining this call.
>
> Regards,
> Albert.
>
> On Thu, 13 Jun 2024, 07:43 Poorvi Rohidekar, <
> poorvirohidekar@gmail.com>
> wrote:
>
> > Hi Jarek,
> >
> > I'd be interested in joining this call.
> >
> > Regards,
> > Poorvi
> >
> > On Tue, 11 Jun 2024 at 21:42, Jarek Potiuk  wrote:
> >
> > > And yes. I will check with them about recording :)
> > >
> > > On Tue, Jun 11, 2024 at 4:41 PM Jarek Potiuk  wrote:
> > >
> > > > I think I have to warn Nielsen team that we are going to have a big
> > crowd
> > > > :)
> > > >
> > > > On Tue, Jun 11, 2024 at 4:40 PM Bishundeo, Rajeshwar
> > > >  wrote:
> > > >
> > > >> This sounds exciting! I would like to join as well.
> > > >>
> > > >> -- Rajesh
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >> On 2024-06-10, 11:12 PM, "Ephraim Anierobi" <
> > ephraimanier...@gmail.com
> > > >> > wrote:
> > > >>
> > > >>
> > > >> Hi Jarek,
> > > >>
> > > >>
> > > >> Awesome!! I’m interested to join too
> > > >>
> > > >>
> > > >> Ephraim
> > > >>
> > > >>
> > > > >> On Mon, 10 Jun 2024 at 23:41, Mehta, Shubham wrote:
> > > >>
> > > >>
> > > >> > This is great, thanks for working with them to share this with the
> > > >> > community. Interested to join as well.
> > > >> >
> > > >> > Shubham
> > > >> >
> > > > >> > On 2024-06-07, 11:56 PM, "Jarek Potiuk" wrote:
> > > >> >
> > > >> >
> > > >> > Hello here,
> > > >> >
> > > >> >
> > > >> > At PyCon US I met a few people from Nielsen who had developed
> > > internally
> > > >> > tooling for IDE/Python debugger integrated debugging of Airflow
> > DAGs.
> > > >> >
> > > >> >
> > > >> > They are thrilled with the opportunity of sharing what they've
> done
> > > and
> > > >> > possibly maybe even bringing it to Airflow. As one of the Airflow
> 3
> > > >> > workstreams I am particularly interested in is to "*Simplify the
> > > >> learning
> > > >> > curve*" [1] - this sounds pretty interesting in this context.
> > > >> >
> > > >> >
> > > >> > I will have a call with them next week - immediately after the
> > > Airflow 3
> > > >> > dev call (Thu, 13th of June, 6pm CEST), where they will demo what
> > they
> > > >> have
> > > >> > - so if you would like to join it - let me know, I will invite you
> > to
> > > >> the
> > > >> > call.
> > > >> >
> > > >> >
> > > >> > [1]
> > > >> >
> > > >> >
> > > >>
> > >
> >
> https

Re: [DISCUSS] Restore the SQL server backend

2024-06-13 Thread James Duong
Hi Jarek,

Thanks for your response.

I would prefer to make this work part of the main Airflow repository.

My previous experience with maintaining forks of Apache projects has always 
been that it fragments the community unnecessarily. Users of a company-specific 
fork might ask for features (not just relating to MSSQL) that could benefit the 
community at large. The larger community can miss out on insights from the 
users and maintainers of the fork.

There is a greater cost both getting updates from the main repo and upstreaming 
improvements to the main repo as the codebase diverges further.

I could see confusion from end users if there are multiple sets of artifacts 
for different forks of Airflow about which one to get.

What are your thoughts on this?

From: Jarek Potiuk 
Date: Monday, June 3, 2024 at 9:32 PM
To: dev@airflow.apache.org 
Cc: james.du...@improving.com.invalid 
Subject: Re: [DISCUSS] Restore the SQL server backend
I am not sure if you read what I wrote with full understanding.

To be perfectly honest - If you secure enough resources, I think *STILL* it
will be better if you maintain your own fork and apply necessary changes
and offer it commercially to anyone who needs it. This is way easier for
the community, and better for you commercially  - and if you are **really**
committed for a long term to do MSSQL, then you should have no problem in
maintaining the fork.

On Mon, Jun 3, 2024 at 11:15 PM James Duong
 wrote:

> Thanks for all of your feedback and discussion.
>
> The interest and usage from the enterprise MSSQL community is very large -
> it's unfortunate that numbers are difficult to gather.
>
> In terms of the support - I hear you that it should not be limited to only
> CI improvements and PR support and a more active role needs to be taken. I
> am working on a plan that would provide the necessary involvement in the
> community.
>
> Please allow me some time to see what is possible.
>
> From: Wei Lee 
> Date: Friday, May 31, 2024 at 8:45 AM
> To: dev@airflow.apache.org 
> Cc: james.du...@improving.com.invalid 
> Subject: Re: [DISCUSS] Restore the SQL server backend
> I agree with Jed and the following comments. If my memory serves me right,
> this topic has been discussed a few times in the past. 5% doesn't seem very
> convincing. Even if it's biased, I'm still not persuaded that there are a
> large number of users that are worth the community's effort. And Jarek
> pointed out a great solution for forking Airflow and adding MSSQL support
> to it.
>
> Best,
> Wei
>
> > On May 31, 2024, at 7:50 PM, Elad Kalif  wrote:
> >
> > I agree with Jarek
> >
> > I am a bit worried about the mental model of this proposal as you are
> > offering to deliver a feature but you are not offering being a community
> > member.
> > I had a lot of frustration with the MsSQL backend tests, it really caused
> > me pain as a contributor. According to your mental model - will you
> > actively review community PRs, triage Airflow issues and offer guidance
> and
> > help when needed about MsSQL or will the maintainers have to track these
> > problems and actively tag you/your team for assistance?
> >
> > Let me give an example: User opens a Github issue about HA scheduler.
> Will
> > your team participate in the issue triage? Or do you expect the community
> > to triage the issue and only after some discussion when it turns out that
> > it's MsSQL specific issue then we need to notify you?
> >
> > On Fri, May 31, 2024 at 10:05 AM Jarek Potiuk  wrote:
> >
> >>> We also understand and are ready to address the concerns stated in the
> >> vote about support and resolving CI issues
> >>
> >> Hello James,
> >>
> >> Could you please explain how exactly are you planning to help a number
> of
> >> maintainers who are working on developing new feature to make sure
> >> they know and realise unobvious consequences of some of the DB changes
> they
> >> might have when some of the features of MYSQL are causing - for example
> >> heavy slowdown of  inserts because of rebalancing B-TREES on UUID index
> for
> >> databases (that unlike Postgres and MariaDB) lack native UUID support
> (see
> >> . How would you help with discovering similar type of issues see here
> >> https://lists.apache.org/thread/7235o1bc3w4694sw8q9m4p58g3tdcjj7
> >>
> >> Could you please explain how many people, effort and dedicated resources
> >> (i.e. continuous testing of stability and performance you are going to
> >> spend on fixing those)?
> >>
> >> IMHO. If you see a LOT of users that want MsSQL support - you are
> >> absolutely free to spend those money, effort and resources on making a
> fork
> >> of Airflow with MsSQL support and charge a premium for that (and a large
> >> one). That seems like a very good business model to make if you see a
> lot
> >> of interest there.
> >>
> >> This is all perfectly fine according to our licence and community would
> be
> >> really thankful for someone who would take the burden of maintaining
> MSSQL
>

Re: [DISCUSS] Restore the SQL server backend

2024-06-13 Thread Jarek Potiuk
Yep, what Kaxil said, and a separate repo does not change anything. Anything
that we (the Apache Airflow PMC) release is something that we (the Apache
Airflow PMC) maintain and contribute to, and only committers can approve and
merge the code - so no matter if it is a separate repo, the cost is the
same.

The community already decided that we do not want to bear the cost.
That's done. If there is someone who wants to bear the cost (and then get
it back by charging their customers for this extra service maybe), they are
free to do so, and what you are proposing is that we again start bearing
the cost. Which we already decided not to do (by voting).

I think the discussion has been long and exhaustive enough, so if someone wants
to revert that decision, they would have to start a vote thread, just
following the regular process we have in Apache projects:
https://www.apache.org/foundation/voting#votes-on-code-modification - but
taking into account the quite clear "no" from a few people (committers), it's
not going to pass. From the discussion above it looks like there will be a
VETO from a few people, and even one VETO is enough for it not to be
accepted.

J.

On Thu, Jun 13, 2024 at 7:01 PM James Duong
 wrote:

> I haven’t looked that much into the Airflow code yet, but the backend code
> must already be pluggable since it supports multiple databases.
>
> What if we were to maintain a separate repo and artifact for just an
> extension for the SQL Server backend? This seems like it’d be low-impact on
> the Airflow community from a maintenance perspective, but keeps MSSQL users
> using the official Airflow artifacts. (Maybe just have the Airflow
> documentation link to the extension).
>
> From: Kaxil Naik 
> Date: Thursday, June 13, 2024 at 9:53 AM
> To: dev@airflow.apache.org 
> Subject: Re: [DISCUSS] Restore the SQL server backend
> The cost of maintaining it in the Airflow repo (with CI/CD, GH issues etc.) is,
> unfortunately, just too high.
>
> On Thu, 13 Jun 2024 at 17:11, James Duong wrote:
>
> > Hi Jarek,
> >
> > Thanks for your response.
> >
> > I would prefer to make this work part of the main Airflow repository.
> >
> > My previous experience with maintaining forks of Apache projects has
> > always been that it fragments the community unnecessarily. Users of a
> > company-specific fork might ask for features (not just relating to MSSQL)
> > that could benefit the community at large. The larger community can miss
> > out on insights from the users and maintainers of the fork.
> >
> > There is a greater cost both getting updates from the main repo and
> > upstreaming improvements to the main repo as the codebase diverges
> further.
> >
> > I could see confusion from end users if there are multiple sets of
> > artifacts for different forks of Airflow about which one to get.
> >
> > What are your thoughts on this?
> >
> > From: Jarek Potiuk 
> > Date: Monday, June 3, 2024 at 9:32 PM
> > To: dev@airflow.apache.org 
> > Cc: james.du...@improving.com.invalid
> > Subject: Re: [DISCUSS] Restore the SQL server backend
> > I am not sure if you read what I wrote with full understanding.
> >
> > To be perfectly honest - If you secure enough resources, I think *STILL*
> it
> > will be better if you maintain your own fork and apply necessary changes
> > and offer it commercially to anyone who needs it. This is way easier for
> > the community, and better for you commercially  - and if you are
> **really**
> > committed for a long term to do MSSQL, then you should have no problem in
> > maintaining the fork.
> >
> > On Mon, Jun 3, 2024 at 11:15 PM James Duong
> >  wrote:
> >
> > > Thanks for all of your feedback and discussion.
> > >
> > > The interest and usage from the enterprise MSSQL community is very
> large
> > -
> > > it's unfortunate that numbers are difficult to gather.
> > >
> > > In terms of the support - I hear you that it should not be limited to
> > only
> > > CI improvements and PR support and a more active role needs to be
> taken.
> > I
> > > am working on a plan that would provide the necessary involvement in
> the
> > > community.
> > >
> > > Please allow me some time to see what is possible.
> > >
> > > From: Wei Lee 
> > > Date: Friday, May 31, 2024 at 8:45 AM
> > > To: dev@airflow.apache.org 
> > > Cc: james.du...@improving.com.invalid
> > > Subject: Re: [DISCUSS] Restore the SQL server backend
> > > I agree with Jed and the following comments. If my memory serves me
> > right,
> > > this topic has been discussed a few times in the past. 5% doesn't seem
> > very
> > > convincing. Even if it's biased, I'm still not persuaded that there
> are a
> > > large number of users that are worth the community's effort. And Jarek
> > > pointed out a great solution for forking Airflow and adding MSSQL
> support
> > > to it.
> > >
> > > Best,
> > > Wei
> > >
> > > > On May 31, 2024, at 7:50 PM, Elad Kalif  wrote:
> > > >
> > > > I agree with Jarek
> > > >
> > > > I am a bit worried about th

Re: [DISCUSS] Restore the SQL server backend

2024-06-13 Thread Kaxil Naik
The cost of maintaining it in the Airflow repo (with CI/CD, GH issues etc.) is,
unfortunately, just too high.

On Thu, 13 Jun 2024 at 17:11, James Duong wrote:

> Hi Jarek,
>
> Thanks for your response.
>
> I would prefer to make this work part of the main Airflow repository.
>
> My previous experience with maintaining forks of Apache projects has
> always been that it fragments the community unnecessarily. Users of a
> company-specific fork might ask for features (not just relating to MSSQL)
> that could benefit the community at large. The larger community can miss
> out on insights from the users and maintainers of the fork.
>
> There is a greater cost both getting updates from the main repo and
> upstreaming improvements to the main repo as the codebase diverges further.
>
> I could see confusion from end users if there are multiple sets of
> artifacts for different forks of Airflow about which one to get.
>
> What are your thoughts on this?
>
> From: Jarek Potiuk 
> Date: Monday, June 3, 2024 at 9:32 PM
> To: dev@airflow.apache.org 
> Cc: james.du...@improving.com.invalid 
> Subject: Re: [DISCUSS] Restore the SQL server backend
> I am not sure if you read what I wrote with full understanding.
>
> To be perfectly honest - If you secure enough resources, I think *STILL* it
> will be better if you maintain your own fork and apply necessary changes
> and offer it commercially to anyone who needs it. This is way easier for
> the community, and better for you commercially  - and if you are **really**
> committed for a long term to do MSSQL, then you should have no problem in
> maintaining the fork.
>
> On Mon, Jun 3, 2024 at 11:15 PM James Duong
>  wrote:
>
> > Thanks for all of your feedback and discussion.
> >
> > The interest and usage from the enterprise MSSQL community is very large
> -
> > it's unfortunate that numbers are difficult to gather.
> >
> > In terms of the support - I hear you that it should not be limited to
> only
> > CI improvements and PR support and a more active role needs to be taken.
> I
> > am working on a plan that would provide the necessary involvement in the
> > community.
> >
> > Please allow me some time to see what is possible.
> >
> > From: Wei Lee 
> > Date: Friday, May 31, 2024 at 8:45 AM
> > To: dev@airflow.apache.org 
> > Cc: james.du...@improving.com.invalid
> > Subject: Re: [DISCUSS] Restore the SQL server backend
> > I agree with Jed and the following comments. If my memory serves me
> right,
> > this topic has been discussed a few times in the past. 5% doesn't seem
> very
> > convincing. Even if it's biased, I'm still not persuaded that there are a
> > large number of users that are worth the community's effort. And Jarek
> > pointed out a great solution for forking Airflow and adding MSSQL support
> > to it.
> >
> > Best,
> > Wei
> >
> > > On May 31, 2024, at 7:50 PM, Elad Kalif  wrote:
> > >
> > > I agree with Jarek
> > >
> > > I am a bit worried about the mental model of this proposal as you are
> > > offering to deliver a feature but you are not offering being a
> community
> > > member.
> > > I had a lot of frustration with the MsSQL backend tests, it really
> caused
> > > me pain as a contributor. According to your mental model - will you
> > > actively review community PRs, triage Airflow issues and offer guidance
> > and
> > > help when needed about MsSQL or will the maintainers have to track
> these
> > > problems and actively tag you/your team for assistance?
> > >
> > > Let me give an example: User opens a Github issue about HA scheduler.
> > Will
> > > your team participate in the issue triage? Or do you expect the
> community
> > > to triage the issue and only after some discussion when it turns out
> that
> > > it's MsSQL specific issue then we need to notify you?
> > >
> > > On Fri, May 31, 2024 at 10:05 AM Jarek Potiuk 
> wrote:
> > >
> > >>> We also understand and are ready to address the concerns stated in
> the
> > >> vote about support and resolving CI issues
> > >>
> > >> Hello James,
> > >>
> > >> Could you please explain how exactly are you planning to help a number
> > of
> > >> maintainers who are working on developing new feature to make sure
> > >> they know and realise unobvious consequences of some of the DB changes
> > they
> > >> might have when some of the features of MYSQL are causing - for
> example
> > >> heavy slowdown of  inserts because of rebalancing B-TREES on UUID
> index
> > for
> > >> databases (that unlike Postgres and MariaDB) lack native UUID support
> > (see
> > >> . How would you help with discovering similar type of issues see here
> > >> https://lists.apache.org/thread/7235o1bc3w4694sw8q9m4p58g3tdcjj7
> > >>
> > >> Could you please explain how many people, effort and dedicated
> resources
> > >> (i.e. continuous testing of stability and performance you are going to
> > >> spend on fixing those)?
> > >>
> > >> IMHO. If you see a LOT of users that want MsSQL support - you are
> > >> absolutely free to spen

Re: [DISCUSS] common.compat provider (WAS: Common.util provider?)

2024-06-13 Thread Jarek Potiuk
> The only known problem with that idea is that the common code has to live
"forever" - as long as someone can use the older providers (or older
Airflow version).

Here Jakub is right - it's only "older providers", not "older Airflow" - the
provider will have "apache-airflow>=2.7.0", so an older Airflow version is not
a problem. It is a problem for older provider versions. But I think this
back-compat code should be essentially "frozen", and we should (we can
automate it) make sure that when we bump the min Airflow version, we stop
using that old code. Having the old compatibility code lying in a separate
provider is a very small issue compared to having the same code copied across
multiple providers for 2-3 years, like it would be in the case of
https://github.com/apache/airflow/pull/39530
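
To make the "polyfill" idea a bit more concrete, here is a minimal sketch of
the fallback-import pattern it implies. This is only an illustration, not the
code from the PR: the helper name `import_first_available` and the module
`hypothetical_newer_core_module` are made up, and the demo uses the standard
library `json` module so it runs without Airflow installed. In a provider, the
candidate list might be "airflow.openlineage" followed by
"airflow.providers.common.compat" (both paths taken from this thread).

```python
import importlib
from typing import Any


def import_first_available(name: str, *module_paths: str) -> Any:
    """Return attribute `name` from the first module in `module_paths` that provides it.

    Hypothetical helper for illustration only; not an Airflow API.
    """
    for module_path in module_paths:
        try:
            module = importlib.import_module(module_path)
        except ImportError:
            # Module is not installed in this environment; try the next candidate.
            continue
        if hasattr(module, name):
            return getattr(module, name)
    raise ImportError(f"{name!r} not found in any of: {', '.join(module_paths)}")


# Demo with stdlib modules: the first candidate does not exist,
# so the lookup falls back to `json`.
dumps = import_first_available("dumps", "hypothetical_newer_core_module", "json")
print(dumps({"ok": True}))  # prints {"ok": true}
```

Once the minimum supported Airflow version contains the code natively, the
fallback entry can simply be dropped from the candidate list, which is the
"stop using that old code" step mentioned above.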

Does anyone have something against it? I think it very much fulfills the
polyfill idea (unless someone can come up with a better approach). I will
start a LAZY CONSENSUS in a few days if I do not hear more about it :)

J.



On Mon, Jun 10, 2024 at 4:06 PM Amogh Desai wrote:

> I like the idea of having a common.compat provider too. (common.util was
> actually kinda confusing)
>
> * when providers get >= airflow 2.10 - we change them to import from
> > `airflow.openlineage` rather than from "airflow.providers.common.compat".
> >
> I am not so sure why this is the case. Can you elaborate?
>
>
> Thanks & Regards,
> Amogh Desai
>
>
> On Mon, Jun 10, 2024 at 4:15 PM Jakub Dardziński 
> wrote:
>
> > As the author of https://github.com/apache/airflow/pull/39530 I love the
> > idea.
> >
> > * when providers get >= airflow 2.10 - we change them to import from
> > > `airflow.openlineage` rather than from
> "airflow.providers.common.compat".
> > >
> > What's the reasoning behind that? How would Airflow core release impact
> > providers dependencies?
> >
> > On Mon, 10 Jun 2024 at 11:08, Maciej Obuchowski wrote:
> >
> > > I think it's a good solution.
> > > The only known problem with that idea is that the common code has to
> live
> > > "forever" - as long as someone can use the older providers (or older
> > > Airflow version).
> > > The solution would be to introduce some explicit deprecation or
> > versioning
> > > for provider dependencies - but that's not really possible due to lack
> of
> > > constraints
> > > for optional dependencies.
> > >
> > > On Sat, 8 Jun 2024 at 22:00, Jarek Potiuk wrote:
> > >
> > > > I have an idea about that one, and probably that one will fulfill the
> > > > "polyfill" approach discussed earlier.
> > > >
> > > > I think we should not name the provider "common.util" but
> > > "common.compat" -
> > > > because all the code that we need to put there is really about
> keeping
> > > > compatibility.
> > > >
> > > > For example look here https://github.com/apache/airflow/pull/39530
> > > >
> > > > We have a need to have a "compatibility" code somewhere that a number
> > of
> > > > providers could use in case we want to keep some backwards
> > compatibility.
> > > >
> > > > So having a "common.compat" provider would likely nicely full-fill
> the
> > > > polyfill approach - It should only contain the code that we aim to
> keep
> > > > backwards compatibility
> > > >
> > > > Example for https://github.com/apache/airflow/pull/39530
> > > >
> > > > * we add the complex compatibility code (see
> > > > https://github.com/apache/airflow/pull/39530#issuecomment-2145670785
> )
> > in
> > > > the "common.compat" provider - and to airflow.openlineage in this
> case
> > > > * we import it from there in all providers that need it (this will
> > > > automatically add dependency)
> > > > * when providers get >= airflow 2.10 - we change them to import from
> > > > `airflow.openlineage` rather than from
> > "airflow.providers.common.compat".
> > > >
> > > > We could apply similar approach for other "compatibility" code
> > > >
> > > > J.
> > > >
> > > >
> > > >
> > > >
> > > > On Thu, Apr 11, 2024 at 10:22 AM Jarek Potiuk 
> > wrote:
> > > >
> > > > > > Any other ideas or suggestions here? Can someone explain what the
> > > > > > "polyfill" approach would look like, maybe? How do we imagine this
> > > > > > working?
> > > > >
> > > > > Just to continue this discussion - another example.
> > > > >
> > > > > Small thing that David wanted to add for changes in some sql
> > providers:
> > > > >
> > > > > > @contextmanager
> > > > > > def suppress_and_warn(*exceptions: type[BaseException]):
> > > > > >     """Context manager that suppresses the given exceptions and logs a warning message."""
> > > > > >     try:
> > > > > >         yield
> > > > > >     except exceptions as e:
> > > > > >         warnings.warn(f"Exception suppressed: {e}\n{traceback.format_exc()}", category=UserWarning)
> > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/airflow/pull/38707/files#diff-6e1b2f961cb951d05d66d2d814ef5f6d8f8bf8f43c40fb5d40e27a031fed8dd7R115
> > > > >
> > > > > This is a small thing - but adding it in 

Proposal for Enhanced Data Awareness in Airflow

2024-06-13 Thread Constance Martineau
Hi Airflow Dev Community!

I am excited to share a new proposal written by TP and me, titled "Enhanced
Data Awareness in Airflow", that I believe will significantly advance our
capabilities in data orchestration.

The proposal aims to bridge the gap between task management and data
management within Airflow by integrating enhanced data awareness features.
This evolution unlocks Airflow's ability to make informed orchestration
decisions based on the actual data that is produced/manipulated by Airflow and
to provide actionable insights about the data as it moves through workflows,
ultimately improving data reliability and data quality.

Key highlights of the proposal include:

   - *Introducing Assets:* Redefining datasets as assets, allowing for more
   comprehensive data management and better alignment with modern data
   engineering practices.
   - *Progressive Adoptability:* Ensuring that enhancements can be
   integrated incrementally without disrupting existing workflows.
   - *Handling Incremental Load Strategies:* Providing first-class support
   for incremental processes to provide visibility on data freshness, set the
   stage for targeted backfills, and ultimately improve data reliability.

For more details, please refer to the attached document. I am eager to hear
your thoughts and feedback on this proposal, as well as any suggestions for
improvement. We will follow up with a set of formal AIPs.

Constance
-- 

Constance Martineau

Senior Product Manager

Email: consta...@astronomer.io

Time zone: US Eastern (EST UTC-5 / EDT UTC-4)


Re: [Airflow 3] The Airflow actors

2024-06-13 Thread Kaxil Naik
Awesome, thanks Elad

On Thu, 13 Jun 2024 at 07:50, Jarek Potiuk  wrote:

> Nice one.
>
> On Thu, Jun 13, 2024 at 8:36 AM Elad Kalif  wrote:
>
> > Hello everyone,
> >
> > As discussed in the last Airflow meeting I published a doc about the
> > Airflow actors
> > https://cwiki.apache.org/confluence/x/1gmTEg
> >
> >
> > Thanks,
> > Elad
> >
>


[REMINDER] Airflow 3 Dev call today

2024-06-13 Thread Kaxil Naik
Hello all,

Just a reminder that we will have our dev call today at 4 PM BST (3 PM
GMT/UTC | 11 AM EDT | 8 AM PDT).

*Proposed Agenda*:
1) Check in on the action items from the last call
2) Discussion: How do we collaborate and discuss efficiently (e.g. Email,
cwiki, Google docs)?
3) Discussion: Airflow Actors doc[1] (Elad)
4) Discussion: Task Context aka Task SDK (Ash)
5) Discussion: Enhanced Data Awareness in Airflow (Constance & TP)
6) Things to follow up via mailing list & Agenda for the next call

Please try to read Elad's document [1] before the meeting if possible. Elad
will also present the document on the call and give folks ~5 minutes to
read it through.

If anyone wants to add an item to the agenda, please add a comment here [2].

A summary of the call will be sent to the mailing list and also posted on
the wiki [2].
I have updated the main page for Airflow 3 on the confluence wiki [3] with
the principles & guidelines agreed last week. I will keep that page
up-to-date as we progress.

If anyone wants to get early feedback on an AIP they are planning for
Airflow 3, please let me know, and I can add it to the agenda for the dev
call. This is not meant to replace the discussion after the AIP is
published on the mailing list but instead to accelerate the early feedback
so that the AIP author can iterate accordingly.

Regards,
Kaxil

[1] https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Actors
[2]
https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+3+Dev+call%3A+Meeting+Notes
[3] https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+3.0