Re: Naming things: What should the imports in dag files for DAG etc. be?

2024-08-30 Thread Constance Martineau
I'm partial to having everything that we expect users to use be importable
from `airflow`, but would love to hear other people's thoughts.

On Fri, Aug 30, 2024 at 5:48 AM Ash Berlin-Taylor  wrote:

> Hi everyone,
>
> It’s time to have another discussion about everyone's favourite
> discussion - naming things!
>
> Tl;dr if you have all of AIP-72 and its implications loaded in your head
> already:
>
> ##
> Where should DAG, TaskGroup, Labels, decorators etc for authoring be
> imported from inside the DAG files? Similarly for DagRun, TaskInstance etc.
> (these likely won’t be created directly by users, but just used for
> reference docs/type hints/editor completion)
> ##
>
> Assuming most people don’t fall into that category, read on :)
>
> Right now users import things into their DAG files from a few places.
> Some/most of these are (now) documented in
> https://airflow.apache.org/docs/apache-airflow/stable/public-airflow-interface.html
>
> ```
> from airflow import DAG
> from airflow.decorators import task, task_group
> from airflow.utils.task_group import TaskGroup
> from airflow.utils.edgemodifier import Label # For adding labels between
> nodes on graph
> ```
>
> The following packages are linked to from that doc too, so I guess they
> are considered quasi-public:
>
>
> airflow.exceptions
> airflow.models.dag
> airflow.models.dagbag
> airflow.models.param
> airflow.models.dagrun
> airflow.models.connection
> airflow.models.variable
> airflow.models.xcom
> airflow.utils.state
> airflow.hooks
>
> So as part of my work on AIP-72/Task Execution interface and SDK I want to
> tidy these up and “unify” the imports.
>
> My thinking is as follows:
>
> 1. Users should never import things from airflow.models (and in Airflow 3
> it will be impossible to do so outside of compatibility shims)
> 2. “TaskGroup” and the state enums should not be imported by users from
> utils (More generally I don’t like “utils” as a namespace/package as I find
> it’s where code just gets dumped, but that’s a separate point.)
>
>
> On the subject of Hooks, I think we should consider moving
> `get_connection` off of BaseHook (it’ll be implemented totally
> differently behind an API anyway) on to a class method on Connection.
>
> So now to the crux of the naming debate, and repeating the question from
> the top:
>
> Where should DAG, TaskGroup, Labels, decorators etc for authoring be
> imported from inside the DAG files? Similarly for DagRun, TaskInstance
> (these two likely won’t be created directly by users, but just used for
> reference docs/type hints)
>
> We don’t have to worry about breaking things or needing every dag to be
> re-written as I already have a way of maintaining backwards-compatibility
> via a shim, so please think of this as “Given a Greenfield, where
> should these imports live for our users”/“What makes most sense to see in
> DAG files”.
>
> I have some rough ideas but would like to get other people's views here
> first.
>
> Cheers,
> Ash
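The backwards-compatibility shim Ash mentions could plausibly lean on PEP 562's module-level `__getattr__`. A rough, self-contained sketch follows — the `airflow.sdk` target path and the stand-in return values are purely hypothetical, not decided names:

```python
import sys
import types
import warnings

# Hypothetical mapping of moved names: old attribute -> proposed new home.
_MOVED = {
    "DAG": "airflow.sdk",
    "TaskGroup": "airflow.sdk",
}

def _module_getattr(name):
    """PEP 562 hook: resolve old import paths, warning on each use."""
    if name in _MOVED:
        warnings.warn(
            f"Importing {name!r} from 'airflow' is deprecated; use "
            f"'from {_MOVED[name]} import {name}' instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        # A real shim would import and return the relocated class here;
        # a string stands in for it in this sketch.
        return f"<{name} from {_MOVED[name]}>"
    raise AttributeError(name)

# Demonstrate the mechanism on a stand-in module rather than airflow itself.
shim = types.ModuleType("airflow_shim_demo")
shim.__getattr__ = _module_getattr
sys.modules["airflow_shim_demo"] = shim
```

With something like this in `airflow/__init__.py`, `from airflow import DAG` would keep working while steering users toward the new location.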


Re: [DISCUSS] AIP-78 scheduler-managed backfill

2024-07-10 Thread Constance Martineau
Seems valid for default behaviour, but if I backfill for a year and realize
there was something wrong with the code, I don't want to manually fail each
dag run that is running. How about a force kill option?

On Wed, Jul 10, 2024 at 9:28 AM Daniel Standish
 wrote:

> Yup that's true @Tzu-ping Chung  .  There will need to
> be
> something in the database.  I think a natural choice for the behavior
> would be like pausing a dag -- anything already scheduled would continue to
> run but nothing new would be scheduled.
>
> On Tue, Jul 9, 2024 at 7:08 PM Tzu-ping Chung 
> wrote:
>
> > How does the user cancel or pause the entire backfill process? The
> > proposal only says this should be possible, but does not touch on how
> > exactly.
> >
> > My intuition while reading the document was to have a flag on
> BackfillRun,
> > but that does not seem to be the case in your illustrative code.
> >
> > TP
> >
> >
> > > On Jul 9, 2024, at 22:12, Daniel Standish
> >  wrote:
> > >
> > > I put up a draft AIP for scheduler-managed backfill here:
> > >
> >
> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-78+Scheduler-managed+backfill
> > >
> > > Quick summary:
> > >
> > > TLDR: move backfill from CLI process to the scheduler
> > >
> > > Backfill currently is a CLI-only feature that in effect runs a
> scheduler
> > > locally in the CLI process.  We don't have good visibility of backfill
> > jobs
> > > in the web UI, and users without CLI access cannot access the feature.
> > > Additionally, it's not ideal to have a "second scheduler" from a
> project
> > > maintenance perspective.
> > >
> > > This AIP focuses specifically on moving management of backfill jobs to
> > the
> > > scheduler.  This will take something away from users.  Previously you
> > could
> > > run backfill in local mode which would not only schedule the backfill
> > > locally but run all the tasks locally as well.  This will go away.  And
> > the
> > > scheduler will of course have more to do, to the extent that backfill
> is
> > > used.  The scheduler will become somewhat more complex since it will
> have
> > > to manage backfill runs too.
> > >
> > > There are some interactions with other AIPs.  E.g. backfill is
> > > fundamentally about data completeness.  And the data awareness AIPs may
> > > change what that can mean in Airflow.
> > >
> > > I look forward to your feedback.
> > >
> > > Thanks
> >
> >
> >
> >
>
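TP's flag-on-BackfillRun intuition, Daniel's pause-like semantics, and the force-kill option asked for above could be combined in a sketch like this — class and field names are made up for illustration, and AIP-78's actual model may differ:

```python
from dataclasses import dataclass
from enum import Enum

class BackfillState(Enum):
    RUNNING = "running"
    PAUSED = "paused"        # nothing new scheduled; in-flight runs finish
    CANCELLED = "cancelled"

@dataclass
class BackfillRun:
    """Hypothetical scheduler-side record for one backfill."""
    dag_id: str
    state: BackfillState = BackfillState.RUNNING

    def pause(self) -> None:
        # Pause-like semantics: stop scheduling new dag runs only.
        self.state = BackfillState.PAUSED

    def cancel(self, force: bool = False) -> str:
        # force=True is the "force kill": fail in-flight dag runs
        # instead of letting them finish.
        self.state = BackfillState.CANCELLED
        return "fail in-flight dag runs" if force else "let in-flight runs finish"
```

The scheduler loop would then simply skip any backfill whose state is not RUNNING.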


Re: [DISCUSS][AIP-38 Modern Web Application]

2024-06-26 Thread Constance Martineau
I love it and 100% agree. Thinking "Dag Groups", where you can group dags
(static & dynamic) into a subfolder. Tags are great for filtering, but they
aren't a replacement for dirs, especially at a large scale. We have some
deployments with 20k dags and, as designed today, the UI is not navigable at
that scale. This could help.

On Wed, Jun 26, 2024 at 9:04 AM Blain David  wrote:

> Besides the new interface and getting rid of FAB in Airflow 3.0, a cool
> and handy feature would be the ability to group multiple DAGs so you could
> order them by, say, domain or whatever grouping you want to achieve.
> Okay, you can achieve the same with filtering, and maybe we could use
> that feature to achieve the grouping, but it would still make the UI more
> convenient to use, especially if you have to manage multiple dynamic DAGs
> which are related to the same domain. It would be nice if you could
> create a group which always applies the filtering in a stateful manner.  Or
> we could opt to really implement a dedicated grouping mechanism so that you
> could, for example, specify in your DAG which group it belongs to.  What do
> you guys think?  I would be willing to help and contribute of course.
>
>
>
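The directory-style grouping described above could start from information Airflow already has — each dag's file location. A toy sketch (dag ids and paths invented):

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Invented dag_id -> fileloc pairs, relative to the dags folder.
dags = {
    "sales_daily": "sales/daily.py",
    "sales_hourly": "sales/hourly.py",
    "ml_train": "ml/train.py",
}

def group_by_dir(dags: dict) -> dict:
    """Group dag ids by the top-level directory of their dag file."""
    groups = defaultdict(list)
    for dag_id, fileloc in dags.items():
        groups[PurePosixPath(fileloc).parts[0]].append(dag_id)
    return dict(groups)
```

A stateful UI "Dag Group" could then amount to persisting one of these keys as a saved filter.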


Re: [DISCUSS] Number of queries to Airflow database in "DAG File Processing Stats"

2024-06-14 Thread Constance Martineau
I love the idea. If we were to store it in the DB, would we keep a history,
or only the latest stats from the most recent dag parsing loop? DAG parsing
by default is every 30s, right?

On Fri, Jun 14, 2024 at 6:53 AM Jarek Potiuk  wrote:

> > I think we still need to enable the ability for DAGs at parse time to
> access Variables.
>
> It's actually DISCOURAGED ;) to access Variables at parse time - though we
> have an experimental feature to make it efficient, and we discussed whether
> to treat it as "good practice", but there was strong opposition to that.
> So I am not actually sure what our position on it will be.
>
> On Fri, Jun 14, 2024 at 12:50 PM Ash Berlin-Taylor  wrote:
>
> >
> >
> > > On 14 Jun 2024, at 10:22, Jarek Potiuk  wrote:
> > >
> > >  think in the future of Airflow 3 where
> > > we will have task isolation, having `0` for all the DAGs will be a
> > > prerequisite for switching to "task isolation" mode and this could be
> > > actually verified in a migration tool.
> >
> > I think we still need to enable the ability for DAGs at parse time to
> > access Variables.
> >
> > Or at least I am not proposing we remove that ability. (I wouldn’t be
> > against it, but I was planning on continuing to support that for now)
> >
> > -ash
>
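The parse-time vs. run-time distinction being debated here can be shown with a stub in place of `Variable.get` — the stub is not Airflow's API; each call stands in for one metadata-DB query, which at top level repeats on every parsing loop (default ~30s):

```python
# Stub illustrating parse-time vs run-time Variable access.
calls = {"count": 0}

def variable_get(key):
    # Stand-in for Variable.get; each call represents one metadata-DB query.
    calls["count"] += 1
    return f"value-of-{key}"

# Discouraged: top-level access runs on every parse of the dag file.
parse_time_value = variable_get("my_key")

# Preferred: defer the lookup until the task actually executes.
def my_task():
    return variable_get("my_key")
```

In real dags the same deferral is available via templating, e.g. `{{ var.value.my_key }}`, which is rendered only at task run time.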


Proposal for Enhanced Data Awareness in Airflow

2024-06-13 Thread Constance Martineau
Hi Airflow Dev Community!

I am excited to share a new proposal written by TP and me titled "Enhanced
Data Awareness in Airflow
<https://docs.google.com/document/d/1Sra65yjbAIZ2mZIbSUL9YMPrW73ltDEPWTCD4J3j2hQ/edit#heading=h.f9eh19p4yqfw>"
that I believe will significantly advance our capabilities in data
orchestration.

The proposal aims to bridge the gap between task management and data
management within Airflow by integrating enhanced data awareness features.
This evolution unlocks Airflow's ability to make informed orchestration
decisions based on the actual data produced and manipulated by Airflow, and
to provide actionable insights about the data as it moves through workflows,
ultimately improving data reliability and data quality.

Key highlights of the proposal include:

   - *Introducing Assets:* Redefining datasets as assets, allowing for more
   comprehensive data management and better alignment with modern data
   engineering practices.
   - *Progressive Adoptability:* Ensuring that enhancements can be
   integrated incrementally without disrupting existing workflows.
   - *Handling Incremental Load Strategies:* Providing first-class support
   for incremental processes to provide visibility on data freshness, set the
   stage for targeted backfills, and ultimately improve data reliability.
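The "targeted backfills" in the last bullet can be sketched as re-running only the partitions of an asset that are missing, rather than a whole date range — everything here is illustrative, not the proposal's API:

```python
from datetime import date, timedelta

def missing_partitions(start: date, end: date, existing: set) -> list:
    """Return the daily partitions in [start, end] not yet materialized."""
    days = (end - start).days
    return [
        start + timedelta(days=d)
        for d in range(days + 1)
        if start + timedelta(days=d) not in existing
    ]
```

A data-aware scheduler could then target backfill dag runs at exactly those dates.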

For more details, please refer to the attached document. I am eager to hear
your thoughts and feedback on this proposal, as well as any suggestions for
improvement. We will follow up with a set of formal AIPs.

Constance
-- 

Constance Martineau

Senior Product Manager

Email: consta...@astronomer.io

Time zone: US Eastern (EST UTC-5 / EDT UTC-4)


Re: Call with Nielsen team demoing their DAG debugging feature

2024-06-10 Thread Constance Martineau
Hello again,

Given all the enthusiasm - assuming Nielsen is ok with this - what if
someone recorded the meeting so that it could be shared with those who are
interested?

Constance

On Mon, Jun 10, 2024 at 1:48 PM Constance Martineau 
wrote:

> Hi Jarek,
>
> Same :)
>
> Thanks,
> Constance
>
> On Mon, Jun 10, 2024 at 9:57 AM Amogh Desai 
> wrote:
>
>> Hello Jarek,
>>
>> Please add me to the invite as well.
>>
>> Thanks & Regards,
>> Amogh Desai
>>
>>
>> On Mon, Jun 10, 2024 at 11:22 AM Abhishek Bhakat
>>  wrote:
>>
>> > Hi Jarek,
>> >
>> > I would like to join as well, please.
>> >
>> > Thanks,
>> > Avi
>> >
>> > On Sat, Jun 8, 2024 at 3:32 PM Buğra Öztürk 
>> > wrote:
>> >
>> > > Hello Jarek,
>> > >
>> > > Thanks for sharing! It sounds very interesting. I would like to join.
>> > Could
>> > > you please forward to me as well?
>> > >
>> > > Thanks!
>> > >
>> > > On Sat, 8 Jun 2024, 17:28 Jed Cunningham, 
>> > > wrote:
>> > >
>> > > > Interesting. Can you forward to me as well Jarek? Thanks!
>> > > >
>> > >
>> >
>>
>


Re: [VOTE] May 2024 PR of the Month

2024-05-28 Thread Constance Martineau
While I know that #39336 was a lot of work and big from a dev perspective
(thanks @Daniel Standish !), my vote goes to
#39650 as task-level CPU and memory metrics are a long-standing feature
request.

On Tue, May 28, 2024 at 1:42 PM Jarek Potiuk  wrote:

> #39336 hands down
>
> On Tue, May 28, 2024 at 7:02 PM Briana Okyere
>  wrote:
>
> > Hey All,
> >
> > It’s once again time to vote for the PR of the Month!
> >
> > With the help of the `get_important_pr_candidates` script in dev/stats,
> > we've identified the following candidates:
> >
> > PR #39510: Add Scarf based telemetry <
> > https://github.com/apache/airflow/pull/39510>
> >
> > PR #39513: Run unit tests with airflow installed from packages <
> > https://github.com/apache/airflow/pull/39513>
> >
> > PR #39365: Fix the pinecone system test <
> > https://github.com/apache/airflow/pull/39365>
> >
> > PR #39336: Scheduler to handle incrementing of try_number
> > 
> >
> > PR #39650: Add metrics about task CPU and memory usage <
> > https://github.com/apache/airflow/pull/39650>
> >
> > Please reply to this thread with your selection or offer your own
> > nominee(s).
> >
> > Voting will close on Friday May 31st at 9 AM PST. The winner(s) will be
> > featured in the next issue of the Airflow newsletter.
> >
> > Also, if there’s an article or event that you think should be included in
> > this or a future issue of the newsletter, please drop me a line at <
> > briana.oky...@astronomer.io>
> >
> > --
> > Briana Okyere
> > Community Manager
> > Astronomer
> >
>


Re: [DISCUSS] AIP-63, AIP-64, and AIP-65: DAG Versioning

2024-05-28 Thread Constance Martineau
Agreed.  When Jed and team wrote the AIP, we intentionally limited the
scope to DAGs since the AIPs were already really large, but the intention
is to extend the concept to datasets.

Funny that you bring up point #2. A few of us met last week to talk about
DAG Versioning, and that use-case came up. Not only should you be allowed
to declare the state of each version, you should also be able to pick a
version for normally scheduled runs that is not necessarily the most recent
(for example the most recent version tagged as prod), while also running
other versions ad hoc, such as the draft version that may have just been
deployed. Like Kaxil said, this will be covered by AIP-66.

On Tue, May 28, 2024 at 5:52 AM Kaxil Naik  wrote:

> Yes to both the questions below, @Elad Kalif. The upcoming Data-Awareness
> AIPs should cover the first one, and the 2nd should be covered by
> AIP-66 once it is out of draft.
>
> 1. Should datasets be also versioned?
> > 2. Should we support executing more than 1 DAG version at a given time?
>
>
> On Tue, 28 May 2024 at 10:07, Elad Kalif  wrote:
>
> > I have a general question (maybe somehow related to the DAG Bundle
> > concept introduced in the AIPs).
> > The way I see it, DAGs are tightly coupled with Datasets. Tasks take a
> > dependency on a dataset and/or produce a dataset.
> > We are focused on the versions of the code (DAG), but to make this play
> > nicely we should consider also applying versions to datasets.
> > Granted, not every change to DAG code means a change in dataset version,
> > but we should consider if we want to leave datasets versionless.
> >
> > I previously worked with some data products that allow versioning of
> tables
> > and it was really nice! It enabled the concept of Data Contract (treating
> > tables much like you treat API) and it made things much easier.
> > I sometimes even had two versions of the same workflow running: one for
> > the new version and one for the deprecated version, thus allowing my
> > customers the flexibility to migrate between the table versions before
> > the deprecated version is discontinued.
> >
> > I am raising two main questions here:
> > 1. Should datasets be also versioned?
> > 2. Should we support executing more than 1 DAG version at a given time?
> > (allowing the user to declare a Draft/Production/Deprecated/Deleted
> > state for each version.)
> >
> > On Wed, Mar 6, 2024 at 1:58 AM Jed Cunningham 
> > wrote:
> >
> > > Hello everyone!
> > >
> > > I'm excited to start a discussion around DAG Versioning in Airflow.
> It's
> > > been the most requested feature in the last 3 community surveys!
> > >
> > > AIP-63: DAG Versioning
> > > <
> > >
> >
> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-63%3A+DAG+Versioning
> > > >
> > >
> > > As this topic quickly becomes rather large, I've made AIP-63 an
> umbrella
> > > AIP and split the specifics into separate AIPs:
> > >
> > > AIP-64: Keep TaskInstance try history
> > > <
> > >
> >
> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-64%3A+Keep+TaskInstance+try+history
> > > >
> > > AIP-65: Improve DAG history in UI
> > > <
> > >
> >
> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-65%3A+Improve+DAG+history+in+UI
> > > >
> > > [WIP] AIP-66: Execution of specific DAG code versions
> > > <
> > >
> >
> https://cwiki.apache.org/confluence/display/AIRFLOW/%5BWIP%5D+AIP-66%3A+Execution+of+specific+DAG+versions
> > > >
> > >
> > > AIP-64 and AIP-65 are ready to be discussed in depth, while AIP-66 is
> > there
> > > to provide an intentionally high level vision of what we may want to
> > tackle
> > > before Airflow's "DAG versioning" story is complete.
> > >
> > > Thanks,
> > > Jed
> > >
> >
>


Re: [VOTE] Proposal for adding Telemetry via Scarf

2024-05-09 Thread Constance Martineau
+1 non-binding

On Thu, May 9, 2024 at 7:17 AM Tomasz Urbaszek  wrote:

> +1 binding
>
> On Thu, 9 May 2024 at 12:40, Andrey Anshin 
> wrote:
>
> > +1 binding
> >
> >
> >
> >
> >
> > On Thu, 9 May 2024 at 13:25, Wei Lee  wrote:
> >
> > > Got it. Thanks Jarek for pointing out!
> > >
> > > Best,
> > > Wei
> > >
> > > > On May 9, 2024, at 3:59 PM, Ankit Chaurasia 
> > wrote:
> > > >
> > > > +1 non-binding
> > > >
> > > > *Ankit Chaurasia*
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > On Thu, May 9, 2024 at 11:16 AM Aritra Basu <
> aritrabasu1...@gmail.com>
> > > > wrote:
> > > >
> > > >> +1 non-binding
> > > >>
> > > >> --
> > > >> Regards,
> > > >> Aritra Basu
> > > >>
> > > >> On Thu, May 9, 2024, 10:36 AM Amogh Desai  >
> > > >> wrote:
> > > >>
> > > >>> +1 binding
> > > >>>
> > > >>> Thanks & Regards,
> > > >>> Amogh Desai
> > > >>>
> > > >>>
> > > >>> On Thu, May 9, 2024 at 10:29 AM Jarek Potiuk 
> > wrote:
> > > >>>
> > >  Short reminder and correction :).
> > > 
> > >  Wei Lee - as a committer, your vote is binding for any votes
> except
> > >  releases. Releases are special - they are a legal act of the
> > >  Apache Software Foundation, so when you vote on releases, only PMC
> > > >> votes
> > >  are binding.
> > > 
> > >  Generally "releasing software" is what the ASF Foundation does.
> The
> > >  foundation does not create software, only releases it ("for the
> > public
> > >  good").
> > >  The PMC member is an official role in the ASF bylaws
> > >  https://www.apache.org/foundation/bylaws.html  (Apache Airflow
> is a
> > > >>> PMC).
> > >  This is according to Delaware laws - that's where Foundation is
> > > >>> registered
> > >  as https://en.wikipedia.org/wiki/501(c)(3)_organization
> non-profit
> > >  organisation,
> > > 
> > >  This allows us to release software without the fear that someone
> > will
> > > >> sue
> > >  us personally if they are harmed by it, because if - as PMC
> members
> > -
> > > >> we
> > >  follow ASF rules (minimum 3 PMC members, reproducibility check,
> > > >>> signatures,
> > >  etc.) ASF indemnifies us personally from any harm done to anyone
> > using
> > > >>> that
> > >  released software.
> > > 
> > >  But all the other decisions in Airflow are voted by committers:
> > > 
> https://github.com/apache/airflow?tab=readme-ov-file#voting-policy
> > -
> > > >> so
> > >  committer votes are binding (except releases).
> > > 
> > >  J.
> > > 
> > >  On Thu, May 9, 2024 at 6:46 AM Jarek Potiuk 
> > wrote:
> > > 
> > > > +1 (binding)
> > > >
> > > > On Thu, May 9, 2024 at 4:47 AM Wei Lee 
> > wrote:
> > > >
> > > >> +1 non-binding
> > > >>
> > > >> Best,
> > > >> Wei
> > > >>
> > > >>> On May 9, 2024, at 10:39 AM, Phani Kumar <
> > > >> phani.ku...@astronomer.io
> > >  .INVALID>
> > > >> wrote:
> > > >>>
> > > >>> +1 binding, looking forward to add Scarf
> > > >>>
> > > >>> On Thu, May 9, 2024 at 7:42 AM Kaxil Naik <
> kaxiln...@apache.org>
> > >  wrote:
> > > >>>
> > >  Hi all,
> > > 
> > >  Discussion thread:
> > > 
> > https://lists.apache.org/thread/7f6qyr8w2n8w34g63s7ybhzphgt8h43m
> > > 
> > >  I would like to officially call for a vote to add Scarf as a
> > >  Telemetry
> > >  tool. Some other things:
> > > 
> > >   - Opt-in by default
> > >   - Explicit documentation that we collect the telemetry data
> > >   - Opt-out via airflow.cfg and env var
> > >   - Works for air-gapped environments
> > >   - Initial access to only PMC members
> > > 
> > >  I have created a free account on Scarf and added it to the
> > shared
> > > >> 1password
> > >  (only PMC members have access to it). For now, I am just
> playing
> > >  around
> > >  with how the information can be shown.
> > > 
> > >  I have a draft PR:
> https://github.com/apache/airflow/pull/39510
> > > >>> that
> > >  collects some basic info, adds docs & tests.
> > > 
> > >  Looking forward to releasing this for Airflow 2.10.
> > > 
> > >  Consider this my +1 binding vote.
> > > 
> > >  The vote will last until 04:20 GMT/UTC on May 16, 2024, and
> > until
> > > >>> at
> > >  least 3 binding votes have been cast.
> > > 
> > >  Please vote accordingly:
> > > 
> > >  [ ] + 1 approve
> > >  [ ] + 0 no opinion
> > >  [ ] - 1 disapprove with the reason
> > > 
> > >  Only votes from PMC members and committers are binding, but
> > other
> > > >> members
> > >  of the community are encouraged to check the AIP and vote with
> > >  "(non-binding)".
> > > 
> > >  Regards,
> > >  Kaxil
> > > 
> 

Re: [HUGE DISCUSSION] Airflow3 and tactical (Airflow 2) vs strategic (Airflow 3) approach

2024-05-07 Thread Constance Martineau
>
> My candidates (and yes, some are bold):
>
> * *Drop MySQL*. If we have a single thing that makes us avoid our schema
> and DB migration - this is the case. Let's choose Postgres 15+ and use some
> of the great features there. This will also enable much faster async SQL
> implementation and a number of other optimisations - not to mention cutting
> every single change in development and testing time by literally half. And
> we should not look back to adding MySQL.
> * *Drop Celery/Sequential Executor* and start with Local + K8S only (and
> AWS/Google others can continue developing theirs of course in parallel and
> continue Hybrid executor work). Later - we figure out a better solution to
> support "small" tasks using some new K8S features and possibly non-k8s
> solutions (Ray-based?)
> * *Cut Connection and Variable Management from DB/UI*. Leave only Secrets
> Management. Later when we have a 100% extensible React UI, we can add a
> "local DB secrets manager" add-on
> * *Choose a single way for DAG storage that will support versioning from
> day one*. Bear in mind we can add others later. Bolke's idea of using
> FSspec is an interesting one, we should see if it is feasible.
> * *Drop FAB completely (including custom plugins) and invest in
> implementing Auth Manager based on a dedicated, external solution*
> (KeyCloak
> that we've discussed before as a likely candidate)
> * *Leave Providers with Airflow 2 and add tests to make sure they are
> Airflow 3 future-compatible *- develop a way where we continue development
> and contributions for Providers with Airflow 2 and add complete tests to
> run them with Airflow 3. This way we can continue developing Provider
> features independently, and make them work for Airflow 2 (and continue
> adding features for Airflow 2 users alongside Airflow 2 bugfixes), while
> also gradually fix any Airflow3 incompatibilities and instead of
> "back-compatibility" tests make provider "forward-compatibility" tests so
> that future Providers are tested and work on Airflow 3. Also it will make
> it easiest to continue Airflow 2 (bugfixes) + Providers tested without
> investing in changing the current CI / test harness.
> * *Simplify Test Harness for Airflow 3 from the start *- without providers
> and 790+ dependencies, we could vastly simplify Airflow3 testing (basically
> make CI jobs from scratch) using mostly standard Python tooling (while we
> can continue making use of the current test harness for Airflow 2 +
> Providers and extend it with Airflow 3 future-compatibility tests). That
> means Breeze would be only staying in Airflow 2 + Providers repo as we
> should be able to achieve most of what we have there with local venv/
> tooling (especially with uv as underlying tooling).
>
> 2) *I think we only add very few new "important" features. *Absolute
> minimum to make Airflow 3 appealing and add them only in Airflow 3:
> versioning, multi-team, pluggable UI should only be Airflow 3 - it makes no
> sense to invest into Airflow 2 if we already know Airflow 3 is coming -
> that generally triples effort needed to get them out. We should drop new
> features development in Airflow 2. This will give users incentive to move
> to 3 if the new features will be worth it. Even paying
> compatibility/migration price.
>
> Versioning, for example: I believe if we decide to go only with Airflow 3
> and cut some of the above (Postgres only, Single versioning DAG storage) we
> can make bolder decisions in versioning and support simpler models from the
> get go (and deliver it faster). And we should add only a few - but
> important - features that our users clearly asked for and focus on
> delivering Airflow 3 as soon as possible (instead of Airflow 2.10 or 2.11).
> Similarly - multi-team can be simplified if we cut things from the list
> above and have Task isolation as first-class citizens in Airflow (and the
> only option).
>
> My candidates very much concur with the list shared by Kaxil in the doc +
> I'd add multi-team (but simplified thanks to the cuts). But here I would
> mostly defer to the Astronomer, Google, and AWS teams to define collectively
> what is the absolute minimum set of features that would get the "target"
> part of their customers happy. And ONLY do that.
>
> So in short - I think the big part of our discussion should be what we are
> ready to drop when we start airflow 3 and be very bold. Once we know we
> should figure out the absolute minimum of things that we can add that will
> benefit a significant part of our users (and make use of increased speed
> because we dropped things).
>
> J.
>
>
> On Mon, May 6, 2024 at 8:40 PM Constance Martineau
>  wrote:

Re: [HUGE DISCUSSION] Airflow3 and tactical (Airflow 2) vs strategic (Airflow 3) approach

2024-05-06 Thread Constance Martineau
Hi Michal,

Thanks for your thoughts on the Airflow 3 proposal. I appreciate your
concerns about the migration overhead for our users with a major new
version and see the appeal in your suggestion to integrate many of the
proposed changes into Airflow 2 through separate AIPs. It’s a valid point
and certainly aligns with the value of making incremental improvements.

However, after looking closely at the enhancements outlined for Airflow 3,
I'm convinced they warrant a new major release. Here’s why:

   1. *Core Architectural Changes:* We’re looking at foundational changes
   with Airflow 3—like redefining task priorities, separating task definition
   and task execution, and new AIPs like DAG versioning. remote execution
   and restricting database access from workers. These aren’t just incremental
   improvements but major shifts that will set the stage for the next decade
   of Airflow’s architecture. Grouping these changes into a major release will
   help us make these transitions more cleanly and with fewer constraints from
   past decisions.
   2. *Code Clean-Up*: Our main branch has accumulated over 140 deprecated
   issues, and this will only grow if we continue without a major cleanup.
   This makes it increasingly difficult to implement new features effectively
   while maintaining backward compatibility. A major release allows us to
   address these issues head-on, reducing technical debt and paving the way
   for a more robust platform.
   3. *Managing Breaking Changes:* Let’s take the example of restricting
   database access from workers. It’s a necessary move for better security and
   also potentially scalability reasons (reduces DB load). Many users have
   workflows that interact with the DB, either by using raw sql or by
   leveraging a session object. We could implement this feature in Airflow 2
   and avoid breaking existing workflows by continuing to have the old
   standard mode as default - much of the work is already done - but that
   would mean supporting both the new secure mode and the old standard mode
   indefinitely and design new features with the assumption that most will
   continue using the old standard mode. With Airflow 3, we can make secure
   mode the default or even the only option, simplifying implementation and
   future development. This is just one example where it is feasible to
   implement in Airflow 2, but is better if we release it under the context of
   Airflow 3.
   4. *Future-Proofing for New Features:* Airflow 3 will open up
   possibilities for handling workflows beyond batch processing. Features like
   real-time DAG execution through API and multi-language task support are big
   steps forward, significantly expanding Airflow’s utility.
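Point 3 above can be caricatured with stand-ins for the two execution modes — none of this is Airflow's real API, just the shape of the breaking change:

```python
class SecureModeViolation(RuntimeError):
    """Raised when a task tries to touch the metadata DB in secure mode."""

METADATA_DB = {"variables": {"my_key": "my_value"}}
SECURE_MODE = True  # hypothetical Airflow 3 default (or only) mode

def db_session():
    # Stand-in for the Session object some Airflow 2 workflows grab directly.
    if SECURE_MODE:
        raise SecureModeViolation("workers may not access the metadata DB")
    return METADATA_DB

def api_get_variable(key):
    # The sanctioned path: an API owned by the server side does the DB work.
    return METADATA_DB["variables"][key]
```

Keeping both branches alive indefinitely is exactly the dual-maintenance cost the paragraph argues against; a major release lets the `db_session` branch be removed outright.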


While integrating these updates into Airflow 2 might look less disruptive
initially, the scale and nature of the required changes really support a
move to Airflow 3. It’s not just about adding new features; it’s about
setting up Airflow so that it continues to remain relevant for the next ten
years.

Constance

On Mon, May 6, 2024 at 2:10 PM Ash Berlin-Taylor  wrote:

> There's a lot of technical debt hiding in Airflow, especially the
> scheduler that makes it harder and harder to efficiently add new features.
>
> At some point, very soon, we are going to have to remove some very
> infrequently used back compat shims that negatively affect performance.
> Without doing that the pace at which we can realistically add some of the
> more exciting features tends towards zero. Developer speed of contributors
> is a factor here too!
>
> So while we are still using SemVer, that necessitates v3.
>
> Ash
>
> On 6 May 2024 15:30:49 BST, "Michał Modras" 
> wrote:
> >+1 to Jens's & Bolke's points here and in the doc
> >
> >I agree we should work on clarifying the directions we would like Airflow
> >to go. Introducing a new major Airflow version is a massive overhead for
> >users, who would need to plan for migrations, onboarding the new Airflow
> >(with a slightly different architecture), etc., and effectively Airflow 2
> >would live in parallel for a long time.
> >
> >Personally, I think most of the points in Kaxil's/Vikram's doc are
> valuable
> >projects of their own, and I could imagine all of them being delivered as
> >separate AIPs within Airflow 2 (surely new minor versions of Airflow 2). I
> >am not sure if the scope of changes and the goal we want to achieve is a)
> >clear enough b) broad enough to call for a new major version.
> >
> >Best,
> >Michal
> >
> >On Sun, May 5, 2024 at 10:10 AM Scheffler Jens (XC-AS/EAE-ADA-T)
> > wrote:
> >
> >> Thanks for the document write-up, Kaxil. I assume this is mostly a
> vision
> >> statement.
> >>
> >> Looking forward for a larger addendum where we can collect things that
> we
> >> all can vote and agree on as targets.
> >>
> >> As I started earlier with a confluence page and it seems this is not
> >> accessible to all, shall we convert this to a Google Doc for better
> >> collab

Re: [DISCUSS] Rename channels on slack

2024-02-08 Thread Constance Martineau
> Maybe we restrict who can post in development for a
> period of time with a message directing folks to the right places?

As long as we don't make it committer-only. If you're contributing
something and want some help/feedback, it's not welcoming to find out that
you're restricted from the development/contribution channel.

On Thu, Feb 8, 2024 at 12:23 PM Ferruzzi, Dennis
 wrote:

> I'm all for the shorter names.  I can never understand why so many people
> ask the questions they do in #development; its purpose seems pretty obvious
> to me, but perhaps a rename is good for it; I'm -0 on that one.  I also
> agree that #troubleshooting is clear and nothing is gained from making it
> longer.  For the #best-practices one, I like the concept.  I wonder if
> something like #configuration-questions (I know, it's not short...) may be
> a broader category serving the same purpose?
>
>
>
>  - ferruzzi
>
>
> 
> From: Jed Cunningham 
> Sent: Thursday, February 8, 2024 8:36 AM
> To: dev@airflow.apache.org
> Subject: RE: [EXTERNAL] [COURRIEL EXTERNE] [DISCUSS] Rename channels on
> slack
>
>
> Sounds good to me. We already have much more niche channels than best
> practices would be. Worst case no one uses it and we can purge it down the
> road, no harm no foul.
>
> One thing we should consider is not renaming development, but starting with
> a fresh channel for contributing. There are nearly 14k people in
> development today. Maybe we restrict who can post in development for a
> period of time with a message directing folks to the right places?
>


Re: [DISCUSS] Rename channels on slack

2024-02-08 Thread Constance Martineau
+1 for #contributing and leaving #troubleshooting. Shorter names in slack
are nice where possible.

No strong opinion on the actual names. Agree that #development needs to be
renamed to something more obvious though.

On Thu, Feb 8, 2024 at 9:30 AM Vincent Beck  wrote:

> I am +1 on renaming these channels because, as said, most messages in
> #development have nothing to do with development.
>
> Though, I would just rename #development to #contributing. To me,
> #troubleshooting is already a good name and clear. But this is only my
> personal opinion. I am not against the names Jarek suggested.
>
> On 2024/02/08 11:54:34 Jarek Potiuk wrote:
> > Hey here,
> >
> > The number of "troubleshooting/best-practices" questions we have in the
> > #development channel on Slack has reached the level where we have more of
> > those than discussion about Airflow development.
> >
> > There were a few Slack proposals that we should change the names, but it's
> > quite a significant change for everyone, so we need to discuss and have
> > [LAZY CONSENSUS] here.
> >
> > My proposal is to rename them like that (but I am totally open to other
> > ideas):
> >
> > #development -> #contributing-to-airflow
> > #troubleshooting -> #troubleshooting-questions
> >
> > And I propose to create a new channel:
> >
> > #best-practices
> >
> > There users will be able to discuss best practices on how to use Airflow
> > (it's not troubleshooting, it's more "please advise how I should do
> > that").
> > We should make some announcements there, update our community pages and
> > welcome messages on slack to mention this channel.
> >
> > WDYT?
> >
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org
> For additional commands, e-mail: dev-h...@airflow.apache.org
>
>


Re: Idea for Discussion: custom TI dependencies

2024-02-02 Thread Constance Martineau
Not missing anything. It is mainly used for deferrable operators, but it feels
like an acceptable place to run the user-defined/non-Airflow portion of
this, and carry out the custom dependency checks. Assuming of course that
these checks are meant to check external conditions and will never need
access to Airflow's DB. Once the condition is met, then the scheduler could
carry on scheduling the task.

I like the idea, and it was just a thought on how to get around the
security concern Pierre brought up.
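Constance's suggestion maps onto Airflow's triggerer pattern: an async coroutine polls the external condition without holding a worker slot, and emits an event once the condition is satisfied. Below is a framework-free sketch of that polling loop using only the standard library; `external_condition_met` is a hypothetical stand-in for a real probe (a GPU-availability check, an external API health check), not an Airflow API.

```python
import asyncio

async def external_condition_met() -> bool:
    # Hypothetical stand-in for a real check, e.g. "is a GPU free?"
    # or "is the external system's API ready?". Replace with a real probe.
    return True

async def wait_for_condition(poll_interval: float = 0.01) -> str:
    """Poll an external condition without blocking a worker slot,
    mirroring how a trigger would yield an event back to the
    scheduler once the condition is satisfied."""
    while True:
        if await external_condition_met():
            return "condition_met"  # analogous to yielding a TriggerEvent
        await asyncio.sleep(poll_interval)

if __name__ == "__main__":
    print(asyncio.run(wait_for_condition()))  # prints: condition_met
```

Once the event fires, the scheduler could carry on scheduling the task, keeping user code out of the scheduler process itself.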

On Fri, Feb 2, 2024 at 4:35 PM Xiaodong (XD) DENG 
wrote:

> Thanks both for your inputs!
>
>
> Hi Pierre,
>
> I think the key difference here is: by doing this, we are not allowing
> Airflow “users” to run their code in scheduler. We are only allowing
> Airflow “Admins” to deploy a plugin to run in scheduler, just the same as
> dag_policy/task_policy/task_instance_mutation_hook/pod_mutation_hook.
>
> So I do not think this violates our current preference in terms of
> security.
>
>
> Hi Constance,
>
> I thought the trigger is mainly for deferrable operator cases? It’s quite
> different scenario from what I’m trying to cover here IMHO.
> Did I miss anything? Please let me know.
>
>
> Thanks again! Looking forward to more questions/comments!
>
>
> XD
>
>
> > On Feb 2, 2024, at 13:29, Constance Martineau 
> > 
> wrote:
> >
> > Naive question: Instead of running the code on the scheduler - could the
> > condition check be delegated to the triggerer?
> >
> > On Fri, Feb 2, 2024 at 2:33 PM Pierre Jeambrun 
> > wrote:
> >
> >> But maybe it’s time to reconsider that :), curious to see what others
> >> think.
> >>
> >> On Fri 2 Feb 2024 at 20:30, Pierre Jeambrun 
> wrote:
> >>
> >>> I like the idea and I understand that it might help in some use cases.
> >>>
> >>> The first concern that I have is that it would allow user code to run
> in
> >>> the scheduler, if I understand correctly. This would have big
> >> implications
> >>> in terms of security and how our security model works. (For instance
> the
> >>> scheduler is a trusted component and has direct access to the DB,
> AIP-44
> >>> assumption)
> >>>
> >>> If I remember correctly this is a route that we specifically tried to
> >> stay
> >>> away from.
> >>>
> >>> On Fri 2 Feb 2024 at 20:03, Xiaodong (XD) DENG
>  >>>
> >>> wrote:
> >>>
> >>>> Hi folks,
> >>>>
> >>>> I’m writing to share my thought regarding the possibility of
> supporting
> >>>> “custom TI dependencies”.
> >>>>
> >>>> Currently we maintain the dependency check rules under
> >>>> “airflow.ti_deps.deps". They cover dependency checks such as whether a
> >>>> pool slot is available, whether concurrency allows it, TI trigger rules,
> >>>> and whether the state is valid, and they play an essential role in the
> >>>> scheduling process.
> >>>>
> >>>> One idea was brought up in our team's internal discussion: why
> shouldn’t
> >>>> we support custom TI dependencies?
> >>>>
> >>>> In details: just like the cluster policies
> >>>>
> (dag_policy/task_policy/task_instance_mutation_hook/pod_mutation_hook),
> >>>> if we let users add their own dependency checks as custom classes (also
> >>>> placed under airflow_local_settings.py), it will give them much higher
> >>>> flexibility in TI scheduling. These custom TI deps should be added as
> >>>> additions to the existing default deps (not replacing or removing any of
> >>>> them).
> >>>>
> >>>> For example: similar to check for pool availability/concurrency, the
> job
> >>>> may need to check for user’s infra-specific conditions, like if a GPU
> is
> >>>> available right now (instead of competing with other jobs randomly),
> or
> >> if
> >>>> an external system API is ready to be called (otherwise wait a bit ).
> >> And a
> >>>> lot more other possibilities.
> >>>>
> >>>> Why won’t cluster policies help here? task_instance_mutation_hook is
> >>>> executed in a “worker”, not in the DAG file processor, just before the TI
> >>>> is executed. What we are trying to gain control over here, though, is the
> >>>> scheduling process (based on custom rules, deciding whether the TI state
> >>>> should be updated so it can be scheduled for execution).
> >>>>
> >>>> I would love to know how the community finds this idea before we start
> >>>> to implement anything. Any question/suggestion would be greatly
> >>>> appreciated.
> >>>> Many thanks!
> >>>>
> >>>>
> >>>> XD
> >>>>
> >>>>
> >>>>
> >>
>
>
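XD's proposal above boils down to extra pluggable checks evaluated during scheduling, alongside the built-in deps under `airflow.ti_deps.deps`. A simplified, framework-free sketch of the shape such a custom dep might take; the class `GpuAvailableDep` and the injected `free_gpus` probe are hypothetical illustrations, not existing Airflow API:

```python
from dataclasses import dataclass
from typing import Callable, Iterator

@dataclass
class TIDepStatus:
    """Mimics the passing/failing status objects that built-in
    dependency checks yield during scheduling."""
    dep_name: str
    passed: bool
    reason: str

class GpuAvailableDep:
    """Hypothetical custom dependency: a TI is only schedulable
    when the external infrastructure reports a free GPU."""

    NAME = "GPU Available"

    def __init__(self, free_gpus: Callable[[], int]):
        # The probe is injected, keeping the check decoupled from
        # Airflow's DB (the security concern raised in this thread).
        self._free_gpus = free_gpus

    def get_dep_statuses(self) -> Iterator[TIDepStatus]:
        n = self._free_gpus()
        if n > 0:
            yield TIDepStatus(self.NAME, True, f"{n} GPU(s) free")
        else:
            yield TIDepStatus(self.NAME, False, "no free GPU, defer scheduling")

    def is_met(self) -> bool:
        # Evaluated in addition to the default deps, never replacing them.
        return all(s.passed for s in self.get_dep_statuses())

if __name__ == "__main__":
    dep = GpuAvailableDep(free_gpus=lambda: 2)
    print(dep.is_met())  # prints: True
```

As proposed, such classes would be registered via airflow_local_settings.py by an admin, the same deployment path as the existing cluster policies.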


Re: Idea for Discussion: custom TI dependencies

2024-02-02 Thread Constance Martineau
Naive question: Instead of running the code on the scheduler - could the
condition check be delegated to the triggerer?

On Fri, Feb 2, 2024 at 2:33 PM Pierre Jeambrun 
wrote:

> But maybe it’s time to reconsider that :), curious to see what others
> think.
>
> On Fri 2 Feb 2024 at 20:30, Pierre Jeambrun  wrote:
>
> > I like the idea and I understand that it might help in some use cases.
> >
> > The first concern that I have is that it would allow user code to run in
> > the scheduler, if I understand correctly. This would have big
> implications
> > in terms of security and how our security model works. (For instance the
> > scheduler is a trusted component and has direct access to the DB, AIP-44
> > assumption)
> >
> > If I remember correctly this is a route that we specifically tried to
> stay
> > away from.
> >
> > On Fri 2 Feb 2024 at 20:03, Xiaodong (XD) DENG  >
> > wrote:
> >
> >> Hi folks,
> >>
> >> I’m writing to share my thought regarding the possibility of supporting
> >> “custom TI dependencies”.
> >>
> >> Currently we maintain the dependency check rules under
> >> “airflow.ti_deps.deps". They cover dependency checks such as whether a
> >> pool slot is available, whether concurrency allows it, TI trigger rules,
> >> and whether the state is valid, and they play an essential role in the
> >> scheduling process.
> >>
> >> One idea was brought up in our team's internal discussion: why shouldn’t
> >> we support custom TI dependencies?
> >>
> >> In details: just like the cluster policies
> >> (dag_policy/task_policy/task_instance_mutation_hook/pod_mutation_hook),
> >> if we let users add their own dependency checks as custom classes (also
> >> placed under airflow_local_settings.py), it will give them much higher
> >> flexibility in TI scheduling. These custom TI deps should be added as
> >> additions to the existing default deps (not replacing or removing any of
> >> them).
> >>
> >> For example: similar to check for pool availability/concurrency, the job
> >> may need to check for user’s infra-specific conditions, like if a GPU is
> >> available right now (instead of competing with other jobs randomly), or
> if
> >> an external system API is ready to be called (otherwise wait a bit ).
> And a
> >> lot more other possibilities.
> >>
> >> Why won’t cluster policies help here? task_instance_mutation_hook is
> >> executed in a “worker”, not in the DAG file processor, just before the TI
> >> is executed. What we are trying to gain control over here, though, is the
> >> scheduling process (based on custom rules, deciding whether the TI state
> >> should be updated so it can be scheduled for execution).
> >>
> >> I would love to know how the community finds this idea before we start to
> >> implement anything. Any question/suggestion would be greatly appreciated.
> >> Many thanks!
> >>
> >>
> >> XD
> >>
> >>
> >>
>


Re: [DISCUSSION] Enhanced Multi-Tenant Dataset Management in Airflow: Potential First Steps

2024-01-29 Thread Constance Martineau
I've had a few conversations with Astronomer customers within the past few
days who are looking for an approved way to create datasets outside of the
dag parsing process. They are already - or are considering - using some
sort of custom process similar to what Steve suggested in the github
discussion
<https://github.com/apache/airflow/discussions/36723#discussioncomment-8243269>.


Given those conversations and the feedback from PRs, Github Discussions,
and this dev thread, I appreciate that there's a need that Airflow isn't
filling today. To gather more support, we need a proper answer about how we
will deal with clashes between the imperative and declarative approaches. As
a Product Manager - I do not have the skillset to figure this out on my own
- but would be happy to work with someone in the community on this.

On Thu, Jan 25, 2024 at 9:58 AM Eduardo Nicastro 
wrote:

> Thanks, Potiuk, for highlighting the importance of aligning new features
> with Airflow's roadmap. I agree we need to be cautious about expanding
> dataset functionalities in ways that might conflict with existing or
> planned features. However, this approach doesn't necessarily transform
> Airflow into a 'dataset metadata storage' but rather enhances its role as a
> centralized orchestrator, making datasets more visible and manageable.
>
> Tornike G., you raise a valid concern about mixing declarative and
> imperative approaches. We need to think carefully about how API-created
> datasets would coexist with those defined in DAG files. However, in my
> opinion, this is a natural transition that will likely become necessary as
> Airflow is used in increasingly diverse environments and organizations, a
> shift that seems inevitable.
>
> Constance M., your perspective on enabling API/UI management for datasets
> is spot-on. It adds a layer of flexibility and visibility that's crucial
> for modern data orchestration, aligning well with Airflow's goals of being
> a comprehensive workflow platform without overstepping its primary
> functions.
>
> To add my perspective, echoing some of what I posted in the GH discussion (
> https://github.com/apache/airflow/discussions/36723): Data-aware
> scheduling
> was a transformative step for Airflow because it acknowledged data as the
> primary workflow trigger. This proposal is essentially an extension of that
> concept, further decoupling Airflow from the assumption that only DAGs can
> influence datasets. I also believe it aligns with the modern data
> engineering practices where workflows are increasingly driven by data
> events and think this is particularly interesting for larger organizations
> where datasets frequently span across various systems and teams.
>
>
> On Wed, Jan 24, 2024 at 8:53 PM Tornike Gurgenidze <
> togur...@freeuni.edu.ge>
> wrote:
>
> > What I meant by update/delete operations was referring to Dataset objects
> > themselves, not DatasetEvents. I also see no issue in allowing dataset
> > changes to be registered externally. I admit that deleting datasets is
> > probably irrelevant as even now they are not deleted, but instead
> orphaned
> > after reference counting, but U in CRUD is still very much relevant imho.
> > There's a field called extra in DatasetModel for example which has no use
> > inside airflow, but it still might be used from user code in all sorts of
> > ways.
> >
> > I'm not saying it's impossible for these interfaces to coexist if you
> > isolate them from one another, especially when multiple dag-processors
> > already do something similar for dags even now (isolating sets of objects
> > from one another using processor_subdir value), it just feels unnatural
> to
> > have a declarative (dag code) and imperative (API/UI) interfaces for
> > interacting with one type of objects.
> >
> > On Wed, Jan 24, 2024 at 11:35 PM Constance Martineau
> >  wrote:
> >
> > > You're right. I didn't mean to say that the Connections and Datasets
> > > facilitate the same thing - they don't. I meant that Connections are
> also
> > > "useless" if no task is using that Connection - but we allow them to be
> > > created independently of dags. From that angle - I don't see how
> allowing
> > > Datasets to be created independently is any different.
> > >
> > > Also happy to hear from others about this.
> > >
> > > On Wed, Jan 24, 2024 at 1:55 PM Jarek Potiuk  wrote:
> > >
> > > > I'd love to hear what others - especially those who are involved in
> > > dataset
> > > > creation and discussion more than me. I personally believe t

Re: [PROPOSE] Add A Code of Conduct for Slack and Meetups

2024-01-26 Thread Constance Martineau
Indeed, not having groups is a limitation of free slack. Maybe the
compromise is bookmarking the individuals who belong to that committee
somewhere?

On Fri, Jan 26, 2024 at 1:26 PM Briana Okyere
 wrote:

> I'm with you on having a committee take it over- although I'm not sure how
> we can ensure folks can anonymously submit a post as "breaking the
> guidelines" if they cannot DM an individual.
>
> On Fri, Jan 26, 2024 at 10:16 AM Constance Martineau
>  wrote:
>
> > Wow Briana! This is fantastic, what a great idea! I added a few comments.
> >
> > I also had a similar question as Jarek that I think merits a discussion:
> > Should we have a committee or group to handle reported guideline
> > violations? If we single out one person to report violations to, we'll
> have
> > to continuously update the guidelines whenever someone else takes up the
> > job.
> >
> > On Thu, Jan 25, 2024 at 5:26 PM Briana Okyere
> >  wrote:
> >
> > > Hey All,
> > >
> > > While we currently have a Code of Conduct from the ASF, we do not have
> > > one that is tailored to Airflow Slack and our in-person Meetups.
> > >
> > > I propose we expand on our current Code of Conduct for these additional
> > > places where community members communicate.
> > >
> > > For Slack, this would be a great way to acquaint folks with the space
> > when
> > > they join. As of now, there is no "onboarding" for when members join
> > Slack.
> > > This can leave people confused about how to engage. The same applies to
> > > in-person Meetups.
> > >
> > > I invite everyone to take a look and let me know your thoughts: <
> > >
> > >
> >
> https://docs.google.com/document/d/1OfV2-UkoAVp_u8-DswZO9xNt3w14olgo4DujCiRi_Z4/edit?usp=sharing
> > > >
> > >
> > > --
> > > Briana Okyere
> > > Community Manager
> > > *Astronomer Inc.*
> > >
> >
>


Re: [PROPOSE] Add A Code of Conduct for Slack and Meetups

2024-01-26 Thread Constance Martineau
Wow Briana! This is fantastic, what a great idea! I added a few comments.

I also had a similar question as Jarek that I think merits a discussion:
Should we have a committee or group to handle reported guideline
violations? If we single out one person to report violations to, we'll have
to continuously update the guidelines whenever someone else takes up the
job.

On Thu, Jan 25, 2024 at 5:26 PM Briana Okyere
 wrote:

> Hey All,
>
> While we currently have a Code of Conduct from the ASF, we do not have
> one that is tailored to Airflow Slack and our in-person Meetups.
>
> I propose we expand on our current Code of Conduct for these additional
> places where community members communicate.
>
> For Slack, this would be a great way to acquaint folks with the space when
> they join. As of now, there is no "onboarding" for when members join Slack.
> This can leave people confused about how to engage. The same applies to
> in-person Meetups.
>
> I invite everyone to take a look and let me know your thoughts: <
>
> https://docs.google.com/document/d/1OfV2-UkoAVp_u8-DswZO9xNt3w14olgo4DujCiRi_Z4/edit?usp=sharing
> >
>
> --
> Briana Okyere
> Community Manager
> *Astronomer Inc.*
>


Re: [DISCUSSION] Enhanced Multi-Tenant Dataset Management in Airflow: Potential First Steps

2024-01-24 Thread Constance Martineau
You're right. I didn't mean to say that the Connections and Datasets
facilitate the same thing - they don't. I meant that Connections are also
"useless" if no task is using that Connection - but we allow them to be
created independently of dags. From that angle - I don't see how allowing
Datasets to be created independently is any different.

Also happy to hear from others about this.

On Wed, Jan 24, 2024 at 1:55 PM Jarek Potiuk  wrote:

> I'd love to hear what others - especially those who are involved in dataset
> creation and discussion more than me. I personally believe that
> conceptually connections and datasets are as far from each other as
> possible (I have no idea where the similarities of connections - which are
> essentially static configuration of credentials) and datasets (which are
> dynamic reflection of data being passed live between tasks) comes from. The
> only similarity I see is that they are both stored by Airflow in some table
> (and even not that if you use SecretsManager). So comparing those two is an
> apple to pear comparison if you ask me.
>
> But (despite my 4 years experience of creating Airflow) my actual
> experience with Datasets is limited, I've been mainly observing what was
> going on, so I would love to hear what those who created (and continue to
> think about future of) the datasets :).
>
> J,
>
> On Wed, Jan 24, 2024 at 7:27 PM Constance Martineau
>  wrote:
>
> > Right. That is why I was trying to make a distinction in the PR and in
> this
> > discussion between CRUD-ing Dataset Objects/Definitions vs creating and
> > deleting Dataset Events from the queue. Happy to standardize on whatever
> > terminology to make sure things are understood and we can have a
> productive
> > conversation.
> >
> > For Dataset Events - creating, reading and deleting them via API is IMHO
> > not controversial.
> > - For creating: This has been discussed in various places, and the
> > endpoint could be used to trigger dependent dags
> > - For deleting: It is easy for DAGs with multiple upstream dependencies
> to
> > go out of sync, and there is no way to recover from that without
> > manipulating the DB directly. See here
> > <https://github.com/apache/airflow/discussions/36618> and here
> > <
> >
> https://forum.astronomer.io/t/airflow-datasets-can-they-be-cleared-or-reset/2801
> > >
> >
> > For CRUD-ing Dataset Definitions via API:
> >
> > > IMHO Airflow should only manage it's own entities and at most it should
> > > emit events (dataset listeners, openlineage API) to inform others about
> > > state changes of things that Airflow manages, but it should not be
> abused
> > > to store "other" datasets, that Airflow DAGs know nothing about.
> >
> >
> > I disagree that it is an abuse. If I as an internal data producer
> publish a
> > dataset that I expect internal Airflow users to use, it is not abusing
> > Airflow to create a dataset and make it visible in Airflow. At some point
> > in the near future, users will start referencing them in their dags -
> it's
> > just a sequencing question. We don't enforce connections being tied to a
> > dag - and conceptually - this is no different. It is also no different
> than
> > adding the definition as part of a dag file and having that dataset show
> up
> > in the dataset list, without forcing it to be a task output as part of a
> > dag. The only valid reason to not allow it IMHO is because they were
> > designed to be defined within a dag file, similar to a dag, and we don't
> > want to deal with the impediment I laid out.
> >
> > On Wed, Jan 24, 2024 at 12:45 PM Jarek Potiuk  wrote:
> >
> > > On Wed, Jan 24, 2024 at 5:33 PM Constance Martineau
> > >  wrote:
> > >
> > > > I also think it makes sense to allow people to create/update/delete
> > > > Datasets via the API and eventually UI. Even if the dataset is not
> > > > initially connected to a DAG, it's nice to be able to see in one
> place
> > > all
> > > > the datasets and ML models that my dags can leverage. We allow people
> > to
> > > > create Connections and Variables via the API and UI without forcing
> > users
> > > > to use them as part of a task or dag. This isn't any different from
> > that
> > > > aspect.
> > > >
> > > > Airflow has some objects that can
> > > > > be created by a dag processor (Dags, Datasets) and others that can
> > > > > be created with API/UI (Connections, Variables)

Re: [DISCUSSION] Enhanced Multi-Tenant Dataset Management in Airflow: Potential First Steps

2024-01-24 Thread Constance Martineau
Right. That is why I was trying to make a distinction in the PR and in this
discussion between CRUD-ing Dataset Objects/Definitions vs creating and
deleting Dataset Events from the queue. Happy to standardize on whatever
terminology to make sure things are understood and we can have a productive
conversation.

For Dataset Events - creating, reading and deleting them via API is IMHO
not controversial.
- For creating: This has been discussed in various places, and the
endpoint could be used to trigger dependent dags
- For deleting: It is easy for DAGs with multiple upstream dependencies to
go out of sync, and there is no way to recover from that without
manipulating the DB directly. See here
<https://github.com/apache/airflow/discussions/36618> and here
<https://forum.astronomer.io/t/airflow-datasets-can-they-be-cleared-or-reset/2801>
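A "create dataset event" endpoint of the kind discussed here would presumably accept the dataset URI plus optional extra metadata. Below is a minimal sketch of building and validating such a request body; the endpoint path and field names are assumptions for illustration, not a settled Airflow API:

```python
import json
from typing import Optional

API_PATH = "/api/v1/datasets/events"  # hypothetical endpoint path

def build_dataset_event_payload(dataset_uri: str,
                                extra: Optional[dict] = None) -> str:
    """Build the JSON body for a hypothetical POST that registers an
    externally produced dataset update, which the scheduler could then
    use to trigger dags consuming that dataset."""
    if not dataset_uri:
        raise ValueError("dataset_uri is required")
    return json.dumps({"dataset_uri": dataset_uri, "extra": extra or {}})

if __name__ == "__main__":
    print(build_dataset_event_payload("s3://warehouse/orders.parquet",
                                      extra={"rows": 1000}))
```

A matching "delete" call against the event queue would cover the out-of-sync recovery case described above, without touching Dataset definitions themselves.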

For CRUD-ing Dataset Definitions via API:

> IMHO Airflow should only manage it's own entities and at most it should
> emit events (dataset listeners, openlineage API) to inform others about
> state changes of things that Airflow manages, but it should not be abused
> to store "other" datasets, that Airflow DAGs know nothing about.


I disagree that it is an abuse. If I as an internal data producer publish a
dataset that I expect internal Airflow users to use, it is not abusing
Airflow to create a dataset and make it visible in Airflow. At some point
in the near future, users will start referencing them in their dags - it's
just a sequencing question. We don't enforce connections being tied to a
dag - and conceptually - this is no different. It is also no different than
adding the definition as part of a dag file and having that dataset show up
in the dataset list, without forcing it to be a task output as part of a
dag. The only valid reason to not allow it IMHO is because they were
designed to be defined within a dag file, similar to a dag, and we don't
want to deal with the impediment I laid out.

On Wed, Jan 24, 2024 at 12:45 PM Jarek Potiuk  wrote:

> On Wed, Jan 24, 2024 at 5:33 PM Constance Martineau
>  wrote:
>
> > I also think it makes sense to allow people to create/update/delete
> > Datasets via the API and eventually UI. Even if the dataset is not
> > initially connected to a DAG, it's nice to be able to see in one place
> all
> > the datasets and ML models that my dags can leverage. We allow people to
> > create Connections and Variables via the API and UI without forcing users
> > to use them as part of a task or dag. This isn't any different from that
> > aspect.
> >
> > Airflow has some objects that can
> > > be created by a dag processor (Dags, Datasets) and others that can be
> > > created with API/UI (Connections, Variables)
> >
> >
> A comment from my side. I think there is a big conceptual difference here
> that you yourself noticed - DAG code - via DAGProcessor - creates DAG and
> DataSets, and UI/API can allow to create and modify Connections/Variables
> that are then USED (but never created) by DAG code. This is why while I see
> no fundamental security blocker with "Creating" Datasets via API - it
> definitely feels out-of-place to be able to manage them via API.
>
> And following the discussion from the PR -  Yes, we should discuss create,
> update and delete differently. Because conceptually they are NOT typical
> CRUD (which the Connection / Variables API UI is).
> I think there is a huge difference between "Updating" and "Deleting"
> datasets via the API and the `UD` in CRUD:
>
> * Updating dataset does not actually "update" its definition, it informs
> those who listen on dataset that it has changed. No more, no less.
> Typically when you have a CRUD operation, you pass the same data in "C" and
> "U" - but in our case those two operations are different and serve
> different purposes
> * Deleting the dataset is also not what "D" in CRUD is - in this case it is
> mostly a "retention". And there are some very specific things here. Should
> we delete a dataset that some of the DAGs still have as input/output ? IMHO
> - absolutely not. But how do we know that? If we have only DAGs,
> implicitly creating Datasets by declaring whether they are used or not we
> can easily know that by reference counting. But when we allow the creation
> of the datasets via API - it's no longer that obvious and the number of
> cases to handle gets really big.
>
> After seeing the comments and discussion - I believe it's not a good idea
> to allow external Dataset creations, the use case does not justify it IMHO.
>
> Why ?
>
> We do not want Airflow to become a "dataset metadata storage" that you can
> que

Re: [DISCUSSION] Enhanced Multi-Tenant Dataset Management in Airflow: Potential First Steps

2024-01-24 Thread Constance Martineau
export dataset updates
> to
> >> > make
> >> > > it
> >> > > >> possible to trigger DAGs consuming from a Dataset across tenants.
> >> > > >>
> >> > > >> Context
> >> > > >> Below I will give some context about our current situation and
> >> > solution
> >> > > >> we have in place and propose a new workflow that would be more
> >> > > efficient.
> >> > > >> To be able to implement this new workflow we would need a way to
> >> > export
> >> > > >> Dataset updates as mentioned.
> >> > > >>
> >> > > >> Current Workflow
> >> > > >> In our organization, we're dealing with multiple Airflow tenants,
> >> > let's
> >> > > >> say Tenant 1 and Tenant 2, as examples. To synchronize Dataset A
> >> > across
> >> > > >> these tenants, we currently have a complex setup:
> >> > > >>
> >> > > >>1. Containers run on a schedule to export metadata to CosmosDB
> >> > (these
> >> > > >>will be replaced by the listener).
> >> > > >>2. Additional scheduled containers pull data from CosmosDB and
> >> > write
> >> > > >>it to a shared file system, enabling generated DAGS to read it
> >> and
> >> > > mirror a
> >> > > >>dataset across tenants.
> >> > > >>
> >> > > >>
> >> > > >> Proposed Workflow
> >> > > >> Here's a breakdown of our proposed workflow:
> >> > > >>
> >> > > >>1. Cross-Tenant Dataset Interaction: We have Dags in Tenant 1
> >> > > >>producing Dataset A. We need a mechanism to trigger all Dags
> >> > > consuming
> >> > > >>Dataset A in Tenant 2. This interaction is crucial for our
> data
> >> > > pipeline's
> >> > > >>efficiency and synchronicity.
> >> > > >>2. Dataset Listener Implementation: Our approach involves
> >> > > >>implementing a Dataset listener that programmatically creates
> >> > > Dataset A in
> >> > > >>all tenants where it's not present (like Tenant 2) and export
> >> > Dataset
> >> > > >>updates when they happen. This would trigger an update on all
> >> Dags
> >> > > >>consuming from that Dataset.
> >> > > >>3. Standardized Dataset Names: We plan to use standardized
> >> dataset
> >> > > >>names, which makes sense since a URI is its identifier and
> >> > > uniqueness is a
> >> > > >>logical requirement.
> >> > > >>
> >> > > >> [image: image.png]
> >> > > >>
> >> > > >> Why This Matters:
> >> > > >>
> >> > > >>- It offers a streamlined, automated way to manage datasets
> >> across
> >> > > >>different Airflow instances.
> >> > > >>- It aligns with a need for efficient, interconnected
> workflows
> >> in
> >> > a
> >> > > >>multi-tenant environment.
> >> > > >>
> >> > > >>
> >> > > >> I invite the community to discuss:
> >> > > >>
> >> > > >>- Are there alternative methods within Airflow's current
> >> framework
> >> > > >>that could achieve similar goals?
> >> > > >>- Any insights or experiences that could inform our approach?
> >> > > >>
> >> > > >> Your feedback and suggestions are invaluable, and I look forward
> >> to a
> >> > > >> collaborative discussion.
> >> > > >>
> >> > > >> Best Regards,
> >> > > >> Eduardo Nicastro
> >> > > >>
> >> > > >
> >> > >
> >> >
> >>
> >
>
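Stripped of Airflow specifics, the proposed cross-tenant dataset listener is an observer: on each dataset change in the producing tenant, ensure the dataset exists in every other tenant and re-emit the update there so consuming dags can fire. A framework-free sketch of that mirroring logic; the `Tenant` and `DatasetMirror` classes are illustrative, not Airflow API:

```python
class Tenant:
    """Stand-in for one Airflow deployment's dataset state."""
    def __init__(self, name: str):
        self.name = name
        self.datasets: set = set()  # known dataset URIs
        self.events: list = []      # dataset updates received

class DatasetMirror:
    """Mirrors dataset updates from one tenant to all others,
    as in the proposed cross-tenant listener."""
    def __init__(self, tenants):
        self.tenants = tenants

    def on_dataset_changed(self, source: Tenant, uri: str) -> None:
        for tenant in self.tenants:
            if tenant is source:
                continue
            tenant.datasets.add(uri)   # create Dataset A where absent
            tenant.events.append(uri)  # export the update; consumers fire

if __name__ == "__main__":
    t1, t2 = Tenant("tenant-1"), Tenant("tenant-2")
    mirror = DatasetMirror([t1, t2])
    mirror.on_dataset_changed(t1, "s3://shared/dataset-a")
    print(t2.datasets)  # {'s3://shared/dataset-a'}
```

The standardized (URI-based) dataset names in the proposal are what make this mirroring safe: the URI is the identity, so "create if absent" is idempotent across tenants.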


-- 

Constance Martineau
Senior Product Manager

Email: consta...@astronomer.io
Time zone: US Eastern (EST UTC-5 / EDT UTC-4)


<https://www.astronomer.io/>


Re: [VOTE] January 2024 PR of the Month

2024-01-22 Thread Constance Martineau
+1 #22253

The PR was opened in March 2022, and was finally merged last week! I admire
the author's persistence in getting this merged in, and think the
simplifications to the interface make the Operator more user-friendly for
our Data Science users.

On Mon, Jan 22, 2024 at 1:29 PM Briana Okyere
 wrote:

> Hey All,
>
> It’s once again time to vote for the PR of the Month.
>
> With the help of the `get_important_pr_candidates` script in dev/stats,
> we've identified the following candidates:
>
> PR #36513: Include plugins in the architecture diagrams.
> <https://github.com/apache/airflow/pull/36513>
>
> PR #32867: Sanitize the conn_id to disallow potential script execution. <
> https://github.com/apache/airflow/pull/32867>
>
> PR #22253: Add SparkKubernetesOperator crd implementation.
> <https://github.com/apache/airflow/pull/22253>
>
> PR #36171: Implement AthenaSQLHook.
> <https://github.com/apache/airflow/pull/36171>
>
> PR #36537: Standardize airflow build process and switch to Hatchling build
> backend. <https://github.com/apache/airflow/pull/36537>
>
> Please reply to this thread with your selection or offer your own
> nominee(s).
>
> Voting will close on Jan. 26th at 1 PM PST. The winner(s) will be featured
> in the next issue of the Airflow newsletter.
>
> Also, if there’s an article or event that you think should be included in
> this or a future issue of the newsletter, please drop me a line at <
> briana.oky...@astronomer.io>.
>
> --
> Briana Okyere
> Community Manager
> *Astronomer*
>




Re: [VOTE] New Airflow Community Provider: Teradata

2024-01-16 Thread Constance Martineau
+1 non binding

On Tue, Jan 16, 2024 at 1:53 PM Vincent Beck  wrote:

> +1 binding. Makes sense to me.
>
> On 2024/01/16 18:21:39 Jarek Potiuk wrote:
> > +1 binding
> >
> > On Tue, Jan 16, 2024 at 6:20 PM Phani Kumar
> >  wrote:
> >
> > > +1 non binding
> > >
> > > On Tue, Jan 16, 2024 at 10:43 PM K Mallam, Sunil
> > >  wrote:
> > >
> > > > Hello Airflow Community,
> > > >
> > > > Thank you very much for your comments/feedback on the Discussion
> Thread.
> > > >
> > > > I’m creating this voting thread for Teradata to be Airflow’s new
> > > community
> > > > provider.
> > > >
> > > > We have one 1 binding vote from Kaxil and I request the other
> community
> > > > members to share their votes.
> > > >
> > > > Below are initial implementation links -
> > > >
> > > > Implementation:
> > > >
> > >
> https://github.com/Teradata/airflow/tree/td_develop/airflow/providers/teradata
> > > >
> > > > Documentation:
> > > >
> > >
> https://github.com/Teradata/airflow/tree/td_develop/docs/apache-airflow-providers-teradata
> > > >
> > > > Unit Tests:
> > > >
> > >
> https://github.com/Teradata/airflow/tree/td_develop/tests/providers/teradata
> > > >
> > > > System Tests:
> > > >
> > >
> https://github.com/Teradata/airflow/tree/td_develop/tests/system/providers/teradata
> > > > System Tests Dashboard: https://teradata.github.io/airflow/
> > > >
> > >
> >
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org
> For additional commands, e-mail: dev-h...@airflow.apache.org
>
>



Re: [VOTE] October 2023 PR of the Month

2023-10-31 Thread Constance Martineau
Oh, I'm very sorry. I had forgotten about 34729. Apologies, but I'll be
changing my vote to that.

On Tue, Oct 31, 2023 at 1:08 PM Hussein Awala  wrote:

> I vote for 34729 <https://github.com/apache/airflow/pull/34729>, which
> implemented Airflow FileSystem.
>
> On Tue, Oct 31, 2023 at 6:00 PM Jed Cunningham 
> wrote:
>
> > The new OpenSearch provider gets my vote - 34705.
> >
>




Re: [VOTE] October 2023 PR of the Month

2023-10-31 Thread Constance Martineau
Hi Briana,

My vote is for 34784. I love that we are resolving some inconsistencies
within the AWS provider package

Constance

On Mon, Oct 30, 2023 at 1:19 PM Briana Okyere
 wrote:

> Hey All,
>
> It’s once again time to vote for the PR of the Month.
>
> Please note that if there is no clear 'winner', we can include multiple PRs
> in the newsletter per our recent vote: <
> https://lists.apache.org/thread/hro5vkq87q6scscr1yh0s4onk6sgt63z>
>
> With the help of the `get_important_pr_candidates` script in dev/stats,
> we've identified the following candidates:
>
> PR #34317 by @vincbeck: Use auth manager `is_authorized_` APIs to check
> user permissions in Rest API. <
> https://github.com/apache/airflow/pull/34317>
>
>  PR #34349 by @vandonr-amz: Let auth managers provide their own API
> endpoints <https://github.com/apache/airflow/pull/34349>
>
>  PR #34705 by @cjames23: Add Open Search Provider. <
> https://github.com/apache/airflow/pull/34705>
>
>  PR #35146 by @amoghrajesh: Permitting airflow kerberos to run in different
> modes <https://github.com/apache/airflow/pull/35146>
>
>  PR #34784 by @Taragolis: Implements `AwsBaseOperator` and `AwsBaseSensor`
> <
> https://github.com/apache/airflow/pull/34784>
>
> Please reply to this thread with your selection or offer your own
> nominee(s).
>
> Voting will close on November 1st at 9 am PST. The winner(s) will be
> featured in the next issue of the Airflow newsletter.
>
> Also, if there’s an article or event that you think should be included in
> this or a future issue, please drop me a line at <
> briana.oky...@astronomer.io>.
>
> --
> Briana Okyere
> Community Manager
> Email: briana.oky...@astronomer.io
> Mobile: +1 415.713.9943
> Time zone: US Pacific UTC
>
> <https://www.astronomer.io/>
>




Re: [VOTE] (extended) on AIP-50 (part 2) to finalize it

2023-05-23 Thread Constance Martineau
Hello,

I think Option B is reasonable. +1 for B, non-binding.

Constance

On Mon, May 22, 2023 at 4:54 PM Scheffler Jens (XC-DX/ETV5)
 wrote:

> Hi Airflow-Developers,
>
> It is not democracy if nobody makes a vote. I don't want to be a
> "dictator" but still propose Option B.
> As nobody responded until today, I am extending the vote until tomorrow,
> 23. May 2023 22:00 CEST for everybody expressing a binding or non-binding
> opinion.
>
> Mit freundlichen Grüßen / Best regards
>
> Jens Scheffler
>
> Deterministik open Loop (XC-DX/ETV5)
> Robert Bosch GmbH | Hessbruehlstraße 21 | 70565 Stuttgart-Vaihingen |
> GERMANY | www.bosch.com
> Tel. +49 711 811-91508 | Mobil +49 160 90417410 |
> jens.scheff...@de.bosch.com
>
> Sitz: Stuttgart, Registergericht: Amtsgericht Stuttgart, HRB 14000;
> Aufsichtsratsvorsitzender: Prof. Dr. Stefan Asenkerschbaumer;
> Geschäftsführung: Dr. Stefan Hartung,
> Dr. Christian Fischer, Dr. Markus Forschner, Stefan Grosch, Dr. Markus
> Heyn, Dr. Tanja Rückert
>
> -Original Message-
> From: Scheffler Jens (XC-DX/ETV5) 
> Sent: Donnerstag, 18. Mai 2023 21:47
> To: dev@airflow.apache.org
> Subject: [VOTE] on AIP-50 (part 2) to finalize it
>
> Hi Developers,
>
> The implementation of AIP-50 went into Airflow 2.6.0 and I am proud that
> 80% of the implementation proposal made it to the release!
>
> During implementation there was a bit of discussion on part 2 of the AIP
> Proposal. I’d like to close the implementation and to prevent a discussion
> (and wasted effort) for raising the final PR I’d like to call for a vote
> for Part 2 of the implementation across devlist. Please let me know your
> preferred option for part 2 of AIP-50:
>
> AIP-50 Part 2 Option A) Keep the Trigger button like today, close the
> AIP-50
> AIP-50 Part 2 Option B) Trigger Parameter Form is displayed if the DAG has
> params defined, else it is skipped
> AIP-50 Part 2 Option C) A global configuration option defines the behavior
> of the Trigger button (Like implemented and reverted in previous PR)
> AIP-50 Part 2 Option D) Trigger button behavior can be defined per DAG
> (Like originally proposed in AIP-50)
>
> Details of the implementation options and a comparison are documented in
> CWIKI:
> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-50+Trigger+DAG+UI+Extension+with+Flexible+User+Form+Concept#AIP50TriggerDAGUIExtensionwithFlexibleUserFormConcept-ImplementationPart2
>
> As there is an extended weekend in a couple of countries due to Ascension
> Day I’d collect the votes until May 22nd, 22:00 CEST. Majority wins.
>
> My non-binding vote is Option B)
>
> Mit freundlichen Grüßen / Best regards
>
> Jens Scheffler
>
> Deterministik open Loop (XC-DX/ETV5)
> Robert Bosch GmbH | Hessbruehlstraße 21 | 70565 Stuttgart-Vaihingen |
> GERMANY | www.bosch.com Tel. +49 711 811-91508 | Mobil +49 160 90417410 |
> jens.scheff...@de.bosch.com<mailto:jens.scheff...@de.bosch.com>
>
>
>




Re: [VOTE] AIP-50 Trigger DAG UI Extension with Flexible User Form Concept

2022-12-23 Thread Constance Martineau
I somehow missed the discussion for this earlier. I can't comment on
implementation, but the feature itself is really cool and such a useful
addition! This is a bit presumptuous since there hasn't been a vote yet,
but I'm looking forward to seeing this officially part of the project :)

On Fri, Dec 23, 2022 at 6:26 AM Scheffler Jens (XC-DX/ETV5)
 wrote:

> Hi Airflow Developers,
>
>
>
> sorry, new to the process after discussion we had in previous emails in
> https://lists.apache.org/thread/kxkctcbh9drfw065dgvr673zl0xyfl3r and the
> Confluence in
> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-50+Trigger+DAG+UI+Extension+with+Flexible+User+Form+Concept
> I collected some non-binding (positive) feedback so far. I am a bit late to
> ask the devlist for a VOTE to progress.
>
>
>
> In October I went ahead and created some PoC for an implementation in
> https://github.com/apache/airflow/pull/27063.
>
> After investing some more time over the last days, the AIP-50 proposal in
> the PR is now complete, so you can check out the branch and preview it
> there. Only 821 LoC added for a cool feature 😃
>
> (Of course knowing it is still on FAB, but implementation can be converted
> to React later similar when AIP-38 progresses)
>
>
>
> Hope the proposal is accepted; happy to get feedback.
>
>
>
> Mit freundlichen Grüßen / Best regards
>
> *Jens Scheffler*
>
> Deterministik open Loop (XC-DX/ETV5)
> Robert Bosch GmbH | Hessbruehlstraße 21 | 70565 Stuttgart-Vaihingen | GERMANY
> | www.bosch.com
> Tel. +49 711 811-91508 | Mobil +49 160 90417410 | Threema / Threema Work:
> KKTVR3F4 | jens.scheff...@de.bosch.com
>
> Sitz: Stuttgart, Registergericht: Amtsgericht Stuttgart, HRB 14000;
> Aufsichtsratsvorsitzender: Prof. Dr. Stefan Asenkerschbaumer;
> Geschäftsführung: Dr. Stefan Hartung,
> Dr. Christian Fischer, Filiz Albrecht, Dr. Markus Forschner, Dr. Markus
> Heyn, Rolf Najork
> ​
>


-- 

Constance Martineau
Product Manager

Email: consta...@astronomer.io
Time zone: US Eastern (EST UTC-5 / EDT UTC-4)


<https://www.astronomer.io/>


Re: [VOTE] November 2022 PR of the Month

2022-11-25 Thread Constance Martineau
Oh that one is really nice. +1 for
https://github.com/apache/airflow/pull/26457


On Fri, Nov 25, 2022 at 1:17 PM Kaxil Naik  wrote:

> +1 for https://github.com/apache/airflow/pull/26457
>
> On Tue, 22 Nov 2022 at 19:27, Jarek Potiuk  wrote:
>
>> +1 for 26457 too
>>
>> On Tue, Nov 22, 2022 at 5:44 PM Bas Harenslak 
>> wrote:
>> >
>> > +1 for 26457
>> >
>> > Bas
>> >
>> > On 22 Nov 2022, at 17:31, Jeambrun Pierre 
>> wrote:
>> >
>> > Hello,
>> >
>> > I vote for https://github.com/apache/airflow/pull/26457. Not sure why
>> it wasn't selected by our heuristic, with 2k additions, 140 comments, 9
>> participants, and even one #protm tag 🤔. (I will try to take a look)
>> >
>> > Best,
>> >
>> > Le mar. 22 nov. 2022 à 16:40, John Thomas 
>> > 
>> a écrit :
>> >>
>> >> Another month passes, and another newsletter approaches!
>> Heuristically, we have the below PRs nominated:
>> >>
>> >> Please vote by selecting the most important, interesting, or impactful
>> PR from the list below (or an entirely new one!) and replying with the
>> number
>> >>
>> >> Voting will close on 11/29 at 8:15a PT.
>> >>
>> >> [27540] Allow datasets to be used in taskflow
>> >> https://github.com/apache/airflow/pull/27540
>> >>
>> >> [27597] Add max_wait for exponential_backoff in BaseSensor
>> >> https://github.com/apache/airflow/pull/27597
>> >>
>> >> [27506] Fix mini scheduler expansion of mapped task
>> >> https://github.com/apache/airflow/pull/27506
>> >>
>> >> [27526] Clean backcompat code kpo
>> >> https://github.com/apache/airflow/pull/27526
>> >>
>> >> Regards,
>> >> John
>> >
>> >
>>
>



Re: [VOTE] September 2022 PR of the Month

2022-09-27 Thread Constance Martineau
Mine is the performance improvement PR for `airflow dag test` command.
Plus, you can set breakpoints :)

https://github.com/apache/airflow/pull/26400

On Tue, Sep 27, 2022 at 2:32 PM Ferruzzi, Dennis
 wrote:

> I'd second this one.  https://github.com/apache/airflow/pull/23592 no
> more 'as dag' is really nice.
>
>
> --
> *From:* Bas Harenslak 
> *Sent:* Tuesday, September 27, 2022 11:11 AM
> *To:* dev@airflow.apache.org
> *Subject:* RE: [EXTERNAL][VOTE] September 2022 PR of the Month
>
>
>
> My vote goes to https://github.com/apache/airflow/pull/23592 (no more “as
> dag” needed)
>
>
> On 27 Sep 2022, at 19:52, Daniel Standish <
> daniel.stand...@astronomer.io.INVALID> wrote:
>
> One vote for https://github.com/apache/airflow/pull/26400 (improved test
> command)
>
> On Tue, Sep 27, 2022 at 10:50 AM Jed Cunningham 
> wrote:
>
>> My write-in is ExternalPythonOperator:
>> https://github.com/apache/airflow/pull/25780
>>
>
>



Re: [VOTE] August 2022 PR of the Month

2022-08-31 Thread Constance Martineau
Mine is 25888 :) The docker-compose warnings were really off-putting and
unfriendly. If someone is evaluating a new tool, Quick Starts are one of
the first places they go to and it's a bad first impression. Prioritizing
Airflow Standalone is a huge improvement.

On Wed, Aug 31, 2022 at 9:30 AM Collin McNulty 
wrote:

> I vote for 25610.
>
> On Wed, Aug 31, 2022 at 7:12 AM Kaxil Naik  wrote:
>
>> I have a tied vote for [25610] and [25888]. 0.5 each :)
>>
>> [*25610*] Grid logs for mapped instances and *[25888] *Prefer the local
>> Quick Start in docs
>>
>>
>> On Mon, 29 Aug 2022 at 14:38, Jarek Potiuk  wrote:
>>
>>> [25888] Prefer the local Quick Start in docs
>>> https://github.com/apache/airflow/pull/25888
>>>
>>> Why ?
>>>
>>> I think this is a step in the right direction when it comes to
>>> communication with our users.
>>>
>>> With this one, we start to analyse how users are looking at our docs
>>> (this change was driven by our doc page analytics). But it's a bit more
>>> than that - we also deliberately engineer their "experience" (especially
>>> for the first-time users this time).
>>> We simply start to use docs as deliberate guidance where we would like
>>> to lead our users and take into account the "uses" of Airflow that we want
>>> to promote.
>>>
>>> We need more of those and more deliberate rather than accidental doc
>>> decisions.
>>>
>>> J.
>>>
>>>
>>>
>>> On Mon, Aug 29, 2022 at 3:20 PM Michael Robinson
>>>  wrote:
>>>
>>>> Hey devlist!
>>>>
>>>> It’s time again to select a PR of the Month for the Airflow newsletter.
>>>> The candidate PRs below have been selected using the
>>>> `get_important_pr_candidates.py` script in airflow/dev/stats, which we’re
>>>> continuing to tweak. (Most recently, we added comments and reactions in
>>>> linked issues to the score calculation.)
>>>>
>>>> Please vote by selecting the most important, interesting, or impactful
>>>> PR from the list below (or an entirely new one!) and replying with the
>>>> number. Voting will close on 8/31 at 6:15 AM PT.
>>>>
>>>> [25509] Possibility to document DAG with a separated Markdown file
>>>> https://github.com/apache/airflow/pull/25509
>>>>
>>>> [25888] Prefer the local Quick Start in docs
>>>> https://github.com/apache/airflow/pull/25888
>>>>
>>>> [25788] Properly check the existence of missing mapped TIs
>>>> https://github.com/apache/airflow/pull/25788
>>>>
>>>> [25610] Grid logs for mapped instances
>>>> https://github.com/apache/airflow/pull/25610
>>>>
>>>> [25857] Add `RedshiftCreateClusterSnapshotOperator`
>>>> https://github.com/apache/airflow/pull/25857
>>>>
>>>> --
>
> Collin McNulty
> Lead Airflow Engineer
>
> Email: col...@astronomer.io 
> Time zone: US Central (CST UTC-6 / CDT UTC-5)
>
>
> <https://www.astronomer.io/>
>




Re: [Proposal] Creating DAG through the REST api

2022-08-11 Thread Constance Martineau
> 1. Accept DAGs only from trusted parties. Airflow already supports
>>> pluggable authentication modules where strong authentication such as
>>> Kerberos can be used.
>>> 2. Execute DAG code as the API identity, i.e. A DAG created through the
>>> API service will have run_as_user set to be the API identity.
>>> 3. To enforce data access control on DAGs, the API identity should also
>>> be used to access the data warehouse.
>>>
>>> We shared a demo based on a prototype implementation in the summit and
>>> some details are described in this ppt
>>> <https://drive.google.com/file/d/1luDGvWRA-hwn2NjPoobis2SL4_UNYfcM/view>,
>>> and would love to get feedback and comments from the community about this
>>> initiative.
>>>
>>> thanks
>>> Mocheng
>>>
>>



Re: Implicit DAG registration

2022-04-27 Thread Constance Martineau
Am intrigued. Curious about the dynamic DAG pattern, where you create the
DAG object in a create_dag function and add the DAG to globals. Would this
new way prevent someone from modifying the dag object within the function,
or returning it?
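For reference, the pattern in question, with a stand-in DAG class so the sketch runs without Airflow installed: the factory both mutates and returns the DAG object before the caller publishes it via globals(). All names here are illustrative.

```python
# Sketch of the dynamic-DAG pattern: a factory builds each DAG and the
# caller registers it by assigning into the module's globals().
# `DAG` is a stand-in class, not the real airflow.models.DAG.
class DAG:
    def __init__(self, dag_id):
        self.dag_id = dag_id


def create_dag(dag_id, config):
    dag = DAG(dag_id)
    dag.config = config  # the function can still modify the object...
    return dag           # ...and return it, as the question notes


for name in ("reports_daily", "reports_weekly"):
    globals()[name] = create_dag(name, config={"owner": "data-team"})

print(sorted(d for d in globals() if d.startswith("reports_")))
```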

> On Apr 27, 2022, at 11:20 AM, Ferruzzi, Dennis  
> wrote:
> 
> I don't know what it would take under the hood, but I'm intrigued.  From a 
> user perspective, anything to make the DAG DRYer is a win IMHO.
> 
> 
> From: Malthe 
> Sent: Wednesday, April 27, 2022 1:49 AM
> To: dev@airflow.apache.org
> Subject: [EXTERNAL] Implicit DAG registration
> 
> 
> 
> 
> DAGs must be registered at the module top-level. During dagbag
> processing we have:
> 
> top_level_dags = ((o, m) for m in mods for o in
> m.__dict__.values() if isinstance(o, DAG))
> 
> It makes sense of course – we can't have DAGs floating in space, they
> need an anchor.
> 
> Or do they?
> 
> It would be entirely possible for the DAG constructor to register
> itself with some global registry, obviating the following pattern:
> 
> with DAG(...) as dag:
>...
> 
> I think most users would prefer:
> 
> with DAG(...):
>...
> 
> Since there is generally speaking no need for the alias.
> 
> Downsides? You have less control of what DAGs are made available – but
> really, does it make sense to create a DAG object only to drop it
> immediately after?
> 
> I volunteer to implement this if there's a positive feedback.
> 
> Cheers



Re: Make first dag run optional when catchup is False

2022-03-22 Thread Constance Martineau
I agree with Collin; he said it a lot better than what I was in the process
of writing. Execution Date -> Data Intervals was an improvement, but even with
this change, it's still difficult to understand. I think doing different
things depending on whether the start_date exists or not will add to that
complexity. I do like the idea of it being optional, and I think if you
were to go that route, the plan you proposed is the right one, but I
prefer the explicitness of the args in this case.

What is making me uncomfortable is that DAG authors do not necessarily have
a software engineering background, nor are all of them Airflow experts.
Also, lots of people who are not DAG authors interact with Airflow directly
and indirectly, and they also need to be able to easily reason about when
DAGs are supposed to run, and what period each run encompasses. I'm
picturing a support engineer getting paged for a missed SLA on a new
report and trying to reason this behaviour out from the DAG code.

On Tue, Mar 22, 2022 at 3:11 PM Collin McNulty 
wrote:

> I like the idea of supporting start_date=None, but that absolutely should
> not mean that we interpret start_date as “now”. start_date=now is one of
> the most common ways to shoot yourself in the foot writing DAGs. I think
> interpreting start_date=None as “don’t do any sort of catchup and run the
> next time you’re able” makes some amount of sense, but I like Philippe’s
> idea a little more. Specifically, it seems like bool is simply not a
> correct type for catchup, as we can describe at least 3 behaviors that make
> sense. What if we change the default type to string, and support bool as a
> legacy at least until 3.0?
>
> Catchup="all" (or True): run all intervals. Make "all" the default.
> Catchup="none" : do not run any past interval
> Catchup="last" (or False) run only the most recent interval
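The tri-state proposal could be sketched like this, with the legacy bools mapped onto the strings for backwards compatibility (names are illustrative, not Airflow internals):

```python
# Sketch of the proposed tri-state catchup with bool back-compat.
def resolve_catchup(catchup) -> str:
    """Map legacy bool values onto the proposed string states."""
    if catchup is True:
        return "all"   # run every missed interval
    if catchup is False:
        return "last"  # run only the most recent interval
    if catchup in ("all", "none", "last"):
        return catchup
    raise ValueError(f"invalid catchup value: {catchup!r}")


def intervals_to_schedule(missed_intervals, catchup) -> list:
    """Select which missed intervals get DAG runs."""
    mode = resolve_catchup(catchup)
    if mode == "all":
        return list(missed_intervals)
    if mode == "last":
        return list(missed_intervals)[-1:]
    return []  # "none": skip every past interval


missed = ["2022-03-19", "2022-03-20", "2022-03-21"]
print(intervals_to_schedule(missed, False))   # → ['2022-03-21']
print(intervals_to_schedule(missed, "none"))  # → []
```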
>
> On Tue, Mar 22, 2022 at 1:15 PM Daniel Standish
>  wrote:
>
>> There's some wiggliness here because of Airflow's behavior of actually
>> *running* the dag at the end of the interval rather than the start.  So
>> if we have start_date=None, then we default the start date to *now,* then
>> maybe to be consistent, the first run needs to be not 00:00 tomorrow but
>> 00:00 the next day.  The oddness is amplified when you consider a monthly
>> dag, where if you deploy now, start date is now, first schedulable run is
>> next month, therefore first run _more_ than a month away.  To fix this I
>> think we need to add support in our timetables for running at the start of
>> the interval instead of the end -- and I think this is something that
>> timetables were introduced to support anyway.
>>
>>
>>



Re: Make first dag run optional when catchup is False

2022-03-21 Thread Constance Martineau
>>>> I'd be in favor of adjusting this behavior either permanently or by
>>>> a configuration.
>>>> >>>
>>>> >>> On Fri, Mar 4, 2022 at 3:00 PM Philippe Lanoe
>>>>  wrote:
>>>> >>>>
>>>> >>>> Hello Daniel,
>>>> >>>>
>>>> >>>> Thank you for your answer. In your example, as I experienced, the
>>>> first run would not be 2010-01-01 but 2022-03-03, 00:00:00 (it is currently
>>>> March 4 - 21:00 here), which is the execution date corresponding to the
>>>> start of the previous data interval, but the result is the same: an
>>>> undesired dag run. (For instance, in case of cron schedule '00 22 * * *',
>>>> one dagrun would be started immediately with execution date of 2022-03-02,
>>>> 22:00:00)
>>>> >>>>
>>>> >>>> I also agree with you that it could be categorized as a bug and I
>>>> would also vote for a fix.
>>>> >>>>
>>>> >>>> Would be great to have the feedback of others on this.
>>>> >>>>
>>>> >>>> On Fri, Mar 4, 2022 at 6:17 PM Daniel Standish
>>>>  wrote:
>>>> >>>>>
>>>> >>>>> You are saying, when you turn on for the first time a dag with
>>>> e.g. @daily schedule, and catchup = False, if start date is 2010-01-01,
>>>> then it would run first the 2010-01-01 run, then the current run (whatever
>>>> yesterday is)?  That sounds familiar.
>>>> >>>>>
>>>> >>>>> Yeah I don't like that behavior.  I agree that, as you say, it's
>>>> not the intuitive behavior.  Seems it could reasonably be categorized as a
>>>> bug.  I'd prefer we just "fix" it rather than making it configurable.  But
>>>> some might have concerns re backcompat.
>>>> >>>>>
>>>> >>>>> What do others think?
>>>> >>>>>
>>>> >>>>>
>>>>
>>> --
>>>
>>> Collin McNulty
>>> Lead Airflow Engineer
>>>
>>> Email: col...@astronomer.io 
>>> Time zone: US Central (CST UTC-6 / CDT UTC-5)
>>>
>>>
>>> <https://www.astronomer.io/>
>>>
>> --
>
> Collin McNulty
> Lead Airflow Engineer
>
> Email: col...@astronomer.io 
> Time zone: US Central (CST UTC-6 / CDT UTC-5)
>
>
> <https://www.astronomer.io/>
>




Re: SSIS

2021-06-29 Thread Constance Martineau
Hi Neeku,

I used to work in an environment that heavily relied on MSSQL Server and
SSIS. Among other things, we used Airflow to orchestrate the SSIS jobs when
moving to Airflow. While there is no specific "SSIS" package, assuming you
are using odbc drivers, there is an odbc provider (
https://airflow.apache.org/docs/apache-airflow-providers-odbc/stable/_api/airflow/providers/odbc/hooks/odbc/index.html).
You can use this or the PythonOperator in conjunction with pyodbc to run
the same stored procs or sql statements that your SSIS jobs are running.
Some refactoring will be required, but it is worth it. Your mileage may
vary, but after a data scientist moved all his workloads to Airflow, he
said that the time he spent debugging his pipelines went down 50%.
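A self-contained sketch of the refactor shape, with sqlite3 standing in for pyodbc so it runs anywhere: each former SSIS step becomes a callable that runs SQL through a DB-API connection. In the actual DAG, the connection would come from the ODBC provider's OdbcHook (or pyodbc directly) inside a PythonOperator, and the statements would be your existing stored-proc calls.

```python
# Sketch: SSIS steps refactored into DB-API calls. sqlite3 is a stand-in
# for pyodbc/OdbcHook here so the example is runnable without a server.
import sqlite3


def run_step(conn, sql, params=()):
    """One former SSIS step: execute a statement and commit."""
    cur = conn.execute(sql, params)
    conn.commit()
    return cur


conn = sqlite3.connect(":memory:")
run_step(conn, "CREATE TABLE sales (day TEXT, amount REAL)")
run_step(conn, "INSERT INTO sales VALUES (?, ?)", ("2021-06-29", 100.0))
rows = run_step(conn, "SELECT * FROM sales").fetchall()
print(rows)  # → [('2021-06-29', 100.0)]
```

Against MSSQL the statement would be something like `EXEC dbo.usp_load_sales ?` (procedure name hypothetical), with each step wrapped in its own Airflow task so failures surface per-step in the UI.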

Hope this helps,
Constance

On Tue, Jun 29, 2021 at 5:04 AM Neeku Endhuku Nenu cheppanu <
idk.050...@gmail.com> wrote:

> Thank you for your response...
>
> On Tue, 29 Jun, 2021, 2:17 pm Jarek Potiuk,  wrote:
>
>> I think there are no ready operators for SSIS. You can build your own
>> (and contribute them back maybe) - it's actually not that difficult if you
>> have a Python API, or some command line interface (which you can invoke via
>> Bash Operator for example).
>>
>> I might be biased, and I do not know much about SSIS but from a quick
>> search I think people treat SSIS as an expensive competitor to Airflow and
>> mostly abandon it in favour of Airflow and Python:
>>
>>-
>>
>> https://medium.com/kabbage-engineering/running-airflow-at-kabbage-d39ebc101778
>>- https://www.stitchdata.com/vs/ssis/airflow/
>>-
>>
>> https://towardsdatascience.com/3-reasons-why-im-ditching-ssis-for-python-ee129fa127b5
>>
>> So maybe instead of integrating it with Airflow, consider replacing the
>> use of SSIS with Airflow.
>>
>> Of course it's a narrow, biased, 2 minutes search in Google so don't take
>> it for granted, do your own research and decide.
>>
>> J.
>>
>>
>>
>>
>> On Tue, Jun 29, 2021 at 10:36 AM Neeku Endhuku Nenu cheppanu <
>> idk.050...@gmail.com> wrote:
>>
>>> Thank you for your response.
>>>
>>> SSIS -  SQL server integration services..
>>>
>>> We are trying to schedule these SSIS jobs in Airflow...
>>>
>>> We don't find any article about this...
>>>
>>> Is it possible to schedule these SSIS jobs in Airflow??
>>>
>>>
>>> Please help me out..
>>>
>>>
>>> Many Thanks,
>>> Krishna v.
>>>
>>> On Tue, 29 Jun, 2021, 1:46 pm Ash Berlin-Taylor,  wrote:
>>>
 What is SSIS?

 What have you tried already?

 What error are your getting?

 -ash

 On 29 June 2021 09:02:43 BST, Neeku Endhuku Nenu cheppanu <
 idk.050...@gmail.com> wrote:
>
> Please help me out on this..
>
> On Thu, 24 Jun, 2021, 4:13 pm Neeku Endhuku Nenu cheppanu, <
> idk.050...@gmail.com> wrote:
>
>> Hi team,
>>
>> Is it possible to schedule a SSIS package's in Airflow??
>>
>> Please let me know, if it possible...
>>
>>
>> Many thanks,
>> Krishna v.
>>
>
>>
>> --
>> +48 660 796 129
>>
>


Re: [VOTE] AIP-38: Modern Web Application

2021-03-03 Thread Constance Martineau
+1 (non-binding)

On Wed, Mar 3, 2021 at 10:31 AM Ryan Hamilton
 wrote:

> Team,
>
> This email calls for a vote on the project proposed in AIP-38:
>
>
> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-38+Modern+Web+Application
>
> Discussion thread:
>
>
> https://lists.apache.org/thread.html/rd27b9ded4dcff38ef31e30236c5bd34830805fe6e8fd31bc1df75b8f%40%3Cdev.airflow.apache.org%3E
>
> Upon acceptance, we will initiate recruitment amongst the community to
> participate in the proposed UI SIG that will conduct the “Information
> Architecture & Design Process” outlined in the AIP.
>
> This vote will last for 5 days until 2021-03-08 15:30 UTC, and until at
> least 3 votes have been cast.
>
>
> https://www.timeanddate.com/worldclock/fixedtime.html?msg=AIP-38+Voting+Deadline&iso=20210308T1030&p1=414
>
> Consider this my +1 (binding).
>
>
> Reminder: committer and PMC votes are both binding on AIP votes. All
> community members are encouraged to vote.
>
> Cheers,
>
> Ryan Hamilton and Brent Bovenzi
>