> In my opinion, we can simply add an optional `manifest` field (or another suitable name). I don’t think we need to introduce a new table via DbManager; an additional field for storing metadata about the external state (such as prefix and object versions for all dags in the bundle, in the case of S3DagBundle) should suffice. We could introduce a new parent subclass, such as `RemoteDagBundle` or `ObjectStoreDagBundle`, in the common provider to define the structure for serializing and deserializing the `manifest` field.
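Just so we are talking about the same shape of thing, here is a very rough sketch of how I read the quoted proposal - every name below (`ObjectStoreDagBundle`, `BundleManifest`, the individual fields) is a placeholder rather than an agreed API, and the real parent class would of course extend the common provider's `BaseDagBundle`:

```python
# Sketch only: all names here are placeholders, not an agreed API.
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field


@dataclass
class BundleManifest:
    """Snapshot of the external state behind one bundle version."""

    format_version: int  # bumped whenever the manifest schema changes
    prefix: str  # e.g. the S3 prefix the bundle was built from
    object_versions: dict[str, str] = field(default_factory=dict)  # object key -> version id

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, raw: str) -> BundleManifest:
        data = json.loads(raw)
        # an older/newer provider can translate between format_versions here,
        # which is what keeps provider upgrades and downgrades painless
        return cls(**data)


class ObjectStoreDagBundle:  # in reality this would extend the common BaseDagBundle
    """Hypothetical shared parent for bundles that mirror an object store locally."""

    def serialize_manifest(self, manifest: BundleManifest) -> str:
        return manifest.to_json()

    def deserialize_manifest(self, raw: str) -> BundleManifest:
        return BundleManifest.from_json(raw)
```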
This is a good solution. It goes along with the idea of a "generic" solution that does not need an "amazon specific" table and DB manager. If the manifest serialized field can be used for all other "bundles" (even if the manifest format itself is specific to the S3 bundle), I am very happy with that solution. One thing to consider (but this is entirely up to the S3 bundle implementation) is handling versioning of such a manifest during serialization/deserialization, to allow downgrading and upgrading the provider seamlessly.

On Fri, Jul 18, 2025 at 5:56 AM Zhe You Liu <jason...@apache.org> wrote:
> Sorry for the late response.
>
> Both approaches work for me; I just wanted to share my opinion as we settle on a final decision.
>
> From my perspective, the DagBundle acts as a client that pulls external state and stores only the version identifier in the Airflow metadata DB.
>
> For example, with GitDagBundle, the Git repository serves as the external storage. The GitDagBundle pulls DAG files locally and stores the commit hash as the `version` field in `DagBundleModel.version`.
>
> 1. If we choose to store the manifest in the Airflow metadata DB:
>
> In my opinion, we can simply add an optional `manifest` field (or another suitable name). I don't think we need to introduce a new table via DbManager; an additional field for storing metadata about the external state (such as prefix and object versions for all dags in the bundle, in the case of S3DagBundle) should suffice. We could introduce a new parent subclass, such as `RemoteDagBundle` or `ObjectStoreDagBundle`, in the common provider to define the structure for serializing and deserializing the `manifest` field.
>
> 2. If we decide to store the manifest outside the Airflow metadata DB:
>
> We will need to clarify:
>
> a) The required parameters for all DagBundles that pull DAGs from object storage. Based on the discussion above, we would need the `conn_id`, `bucket`, and `prefix` for the manifest file.
>
> b) The interface for calculating the bundle version based on the external state or DAG content hash.
>
> Here is a concrete example of how the manifest could be stored: https://github.com/apache/airflow/pull/46621#issuecomment-3078208467
>
> Thank you all for the insightful discussion!
>
> Best,
> Jason
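(On point 2b above: I picture that interface as something this small - a sketch only, with made-up names, mirroring the way GitDagBundle effectively uses the commit hash as the version:)

```python
# Sketch of point 2b only -- class and method names are made up, not an agreed interface.
from __future__ import annotations

import hashlib
import json


class ObjectStorageBundleVersioning:
    """What any bundle pulling DAGs from object storage would need to implement."""

    def snapshot_external_state(self) -> dict[str, str]:
        """Return e.g. {object key: object version id} for everything under the bundle's prefix."""
        raise NotImplementedError

    def calculate_bundle_version(self) -> str:
        """Deterministic version derived from the external state, analogous to a Git commit hash."""
        state = self.snapshot_external_state()
        return hashlib.sha256(json.dumps(state, sort_keys=True).encode()).hexdigest()[:12]
```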
> On 2025/07/10 21:56:31 "Oliveira, Niko" wrote:
> > Thanks for the reply Jarek :)
> >
> > Indeed we have different philosophies about this so we will certainly keep going in circles about where to draw the line on making things easy and enjoyable to use, whether to intentionally add friction or not, etc, etc.
> >
> > I think if we have optional paths to take and it's not immensely harder we should err on the side of making OSS Airflow as good as it can be, despite whatever managed services we have in the community. I'm not sure where it has come from recently, but this new push to make Airflow intentionally hard to use so that managed services stay in business is a bit unsettling. We're certainly not asking for that, and those around that I've chatted to (since I'm now seeing this mentioned frequently) are also not asking for this. I'm curious where this new pressure is coming from and why you feel it recently.
> >
> > But regardless of the curiosity above, I'll return to the drawing board and see what else can be done for this particular problem. If there are other Bundle types who need to solve the same problem perhaps we can find a more acceptable implementation in Airflow core to support this. And if not, I'll proceed with externalizing the storage of the S3 Bundle version metadata outside of Airflow.
> >
> > Cheers,
> > Niko
> >
> > ________________________________
> > From: Jarek Potiuk <ja...@potiuk.com>
> > Sent: Wednesday, July 9, 2025 11:59:06 PM
> > To: dev@airflow.apache.org
> > Subject: RE: [EXT] S3 Dag Bundle Versions and DB Manager
> >
> > > To me, I'm always working from a user perspective. My goal is to make their lives easier, their deployments easier, the product the most enjoyable for them to use. To me, the best user experience is that they should enable bundle versioning and it should just work with as little or no extra steps and with as little infra as possible, and with the fewest possible pitfalls for them to fall into. From a user perspective, they've already provisioned a database for airflow metadata, why is this portion of metadata leaking out to other forms of external storage? Now this is another resource they need to be aware of and manage the lifecycle of (or allow us write access into their accounts to manage for them).
> >
> > *TL;DR; I think our goal in open-source is to have a frictionless and "out of the box" experience only for basic cases, but not for more complex deployments.*
> >
> > It's a long read if you want to read it .. so beware :).
> >
> > I think that is an important "optimization goal" for sure, to provide a frictionless and enjoyable experience - but I think it's one of many goals that sometimes conflict with long-term open-source project sustainability, and it's very important to clarify which "user" we are talking about.
> >
> > To be honest, I am not sure that our goal should be "airflow should work out of the box in case of integration with external services in production" if it complicates our code and makes it service-dependent - and as Jens noticed, if we can come up with a "generic" thing that can be reusable across multiple services, we can invest more in making it work "out of the box", but if you anyhow need to integrate and make it work with an external service, it adds very little "deployment complexity" to use another piece of the service - and this is basically the job of the deployment manager anyway.
> >
> > The "just work" goal as I see it should only cover those individual users who want to try and use airflow in its basic form and "standalone" configuration - not "deployment managers".
> >
> > I think yes - our goal should be to make things extremely easy for users who want to use airflow in its basic form where things should **just work**.
Like "docker run -it apache/airflow standalone" - this is what > > currently **just works**, 0 configuration, 0 work for external > > integrations, and we even had a discussion that we could make it "low > > production ready" (which I think we could - just implement automated > > backup/recovery of sqlite db and maybe document mounting a folder with > DAGs > > and db, better handling of logs rather than putting them as mixed output > on > > stdout and we are practically done). But when you add "S3" as the dag > > storage you already need to make a lot of decisions - mostly about > service > > accounts, security, access, versioning, backup of the s3 objects, etc. > etc. > > And that's not a "standalone user' case - that is a "deployment manager" > > work (where "deployment manager" is a role - not necessarily title of the > > job you have. > > > > I think - and that is a bit of philosophical - but I've been talking > about > > it to Maciek Obuchowski yesterday - that there is a pretty clear boundary > > of what open-source solutions delivers and it should match expectations > of > > people using it. Maintainers and community developing open-source should > > mostly deliver a working, generic solutions that are extendable with > > various deployment options and we should make it possible for those > > deployments to happen - and provide building blocks for them. But it's > > "deployment manager" work to make sure to put things together and make it > > works. And we should not do it "for them". It's their job to figure out > how > > to configure and set-up things, make backups, set security boundaries > etc. > > - we should make it possible, document the options, document security > model > > and make it "easy" to configure things - but there should not be an > > expectation from the deploiyment manager that it "just works". > > > > And I think your approach is perfectly fine - but only for "managed > > services" - there, indeed manage service user's expectations can be that > > things "just work" and they are willing to pay for it with real money, > > rather than their time and effort to make it so. And there I think, those > > who deliver such a service should have the "just work" as primary goal - > > also because users will have such expectations - because they actually > pay > > for it to "just work". Not so much for open-source product - where "just > > work" often involves complexity, additional maintenance overhead and > making > > opinionated decisions on "how it just works". For those "managed service" > > teams - "just work" is very much a primary goal. But for "open source > > community" - having such a goal is actually not good - it's dangerous > > because it might result in wrong expectations from the users. If we start > > making airflow "just works" in all kinds of deployment with zero work > from > > the users who want to deploy it in production and at scale, they will > > expect it to happen for everything - why don't we have automated log > > trimming, why don't we have automated backup of the Database, why don't > we > > auto vacuum the db, why don't we provide one-click deployment option on > > AWS. GCS. Azure, why don't we provide DDOS protection in our webserver, > why > > don't we ..... you name it. > > > > That's a bit of philosophy - those are the same assumptions and goals > that > > I had in mind when designing multi-team - and there it's also why we had > > different views - I just feel that some level of friction is a "property" > > of open-source product. 
> >
> > Also a bit of the "business" side - this is also "good" for those who provide managed services and Airflow, to keep a sustainable open-source business model working - because what people are paying them for is precisely to "remove the friction". If we take the "frictionless user experience" goal to the extreme, Airflow would essentially be killed IMHO. Imagine if Airflow were frictionless for all kinds of deployments and had "everything" working out of the box. There would be no business for any of the managed services (because users would not need to pay for it). Then we would only have users who expect things to "just work" and most of them would not even think about contributing back. And there would be no managed services people (like you) whose job is paid for by the services - or people like me who work with and get money from several of those - which would basically slow active development and maintenance of Airflow to a halt - because even if we had a lot of people willing to contribute, maintainers would have very little of their own time to keep things running. There is a fine balance that we keep now between the open-source project and its stakeholders, and open-source product "friction" is an important property that the balance is built on.
> >
> > J.
> >
> > On Wed, Jul 9, 2025 at 9:21 PM Oliveira, Niko <oniko...@amazon.com.invalid> wrote:
> >
> > > To me, I'm always working from a user perspective. My goal is to make their lives easier, their deployments easier, the product the most enjoyable for them to use. To me, the best user experience is that they should enable bundle versioning and it should just work with as little or no extra steps and with as little infra as possible, and with the fewest possible pitfalls for them to fall into. From a user perspective, they've already provisioned a database for airflow metadata, why is this portion of metadata leaking out to other forms of external storage? Now this is another resource they need to be aware of and manage the lifecycle of (or allow us write access into their accounts to manage for them).
> > >
> > > Ultimately, we should not be afraid of doing sometimes difficult work to make a good product for our users, it's for them in the end :)
> > >
> > > However, I see your perspectives as well, making our code and DB management more complex is more work and complication for us. And from the feedback so far I'm outvoted, so I'm happy as always to disagree and commit, and do as you wish :)
> > >
> > > Thanks for the feedback everyone!
> > >
> > > Cheers,
> > > Niko
> > >
> > > ________________________________
> > > From: Jens Scheffler <j_scheff...@gmx.de.INVALID>
> > > Sent: Wednesday, July 9, 2025 12:07:08 PM
> > > To: dev@airflow.apache.org
> > > Subject: RE: [EXT] S3 Dag Bundle Versions and DB Manager
> > >
> > > My 2ct on the discussions are similar to the opinions before.
> > >
> > > From my Edge3 experience, migrating a DB from a provider - even if technically enabled - is a bit of a pain. It adds a lot of boilerplate, you need to consider that your provider should also still be compatible with AF2 (I assume), and once a user wants to downgrade it is a bit of manual effort to downgrade the DB as well.
> > >
> > > As long as we are not adding a generic Key/Value store to core (similar to Variables but for general-purpose internal use, not exposed to users - but then, in case of troubleshooting, how to manage/admin it?) I would also see it like Terraform - a secondary bucket for state is cheap and convenient. Yes, write access would be needed, but only for Airflow. And as it is separated from the rest, it should not be a general security harm... just a small deployment complexity. And I assume versioning is optional. So there is no requirement to have it on by default, and if a user wants to move to/enable versioning then just the state bucket would need to be added to the Bundle config?
> > >
> > > TLDR: I would favor a bucket; else, if DB is the choice, then a common solution in core might be easier than DB handling in the provider. But I would also not block any other option, just from the point of complexity I'd not favor provider-specific DB tables.
> > >
> > > Jens
> > >
> > > On 09.07.25 19:57, Jarek Potiuk wrote:
> > > > What about the DynamoDB idea? What you are trying to trade off is "writing to airflow metadata DB" with "writing to another DB" really. So yes it is - another thing you will need to have access to write to - other than the Airflow DB, but it's really the question whether the boundaries should be "Everything writable should be in Airflow" vs. "Everything writable should be in the "cloud" that the integration is about".
> > > >
> > > > Yes - it makes the management using S3 versioning a bit more "write-y" - but on the other hand it does allow us to confine the complexity to a pure "amazon" provider - with practically 0 impact on Airflow core and the airflow DB. Which I really like, to be honest.
> > > >
> > > > And yes, "co-location" is also my goal. And I think this is a perfect way to explain as well why it is better to keep "S3 versioning" close to "S3" and not to Airflow - especially since there will be a lot of "S3-specific" things in the state that are not easy to abstract and make "common" for other Airflow versioning implementations.
> > > >
> > > > You can think about it this way:
> > > >
> > > > Airflow has already done its job with abstractions - versioning changes and metadata DB are implemented in the Airflow DB. If there are any missing pieces in the abstraction that will be usable across multiple implementations of versioning, we should - of course - add them to the Airflow metadata DB - in a way that they can be used by those different implementations. But the code to manage and use them should be in airflow-core.
> > > > If there is anything specific for the implementation of the S3 / Amazon integration -> it should be implemented independently from the Airflow Metadata DB.
> > > > There are many complexities in managing and upgrading the core DB and we should not use the db for provider-specific things. The discussion about shared code and isolation is interesting in this context. Because I think, as we go deeper and deeper in this direction, we might get to the point (and we are already more or less there) where NO (regular) providers are needed by whatever CLI or tooling we will need to manage the Metadata DB. FAB and Edge are currently exceptions - but they are by no means "regular" providers.
> > > >
> > > > So I'd say - if while designing/implementing S3 versioning you see that part of the implementation can be abstracted away, added to the core and used by other implementations - 100% - let's add it to the core. But only then. If it is something that only the Amazon provider and S3 need - let's make it use Amazon **whatever** as backing storage.
> > > >
> > > > I would even say - talk to the Google team and try to come up with an abstraction that can be used for versioning in both S3 and GCS, agree on it, and let's see if this abstraction should find its way to the core. That would be my proposal.
> > > >
> > > > J.
> > > >
> > > > On Wed, Jul 9, 2025 at 7:37 PM Oliveira, Niko <oniko...@amazon.com.invalid> wrote:
> > > >
> > > >> Thanks for engaging folks!
> > > >>
> > > >> I don't love the idea of using another bucket. For one, this means Airflow needs write access to S3 which is not ideal; some users/customers are very sensitive about ever allowing write access to things. And two, you will commonly get issues with a design that leaks state into customer-managed accounts/resources: they may delete the bucket not knowing what it is, or they may not migrate it to a new account or region if they ever move. I think it's best for the data to be stored transparently to the user and co-located with the data it strongly relates to (i.e. the dag runs that are associated with those bundle versions).
> > > >>
> > > >> Is using DB Manager completely unacceptable these days? What are folks' thoughts on that?
> > > >>
> > > >> Cheers,
> > > >> Niko
> > > >>
> > > >> ________________________________
> > > >> From: Jarek Potiuk <ja...@potiuk.com>
> > > >> Sent: Wednesday, July 9, 2025 6:23:54 AM
> > > >> To: dev@airflow.apache.org
> > > >> Subject: RE: [EXT] S3 Dag Bundle Versions and DB Manager
> > > >>
> > > >>> Another option also would be using a DynamoDB table? That also supports snapshots and I feel it works very well with state management.
> > > >>
> > > >> Yep that would also work.
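(For what it's worth, the DynamoDB variant really is tiny - a sketch only, with a made-up table name, and assuming a table keyed on bundle name + bundle version already exists:)

```python
# Sketch only: table name and attribute names are made up; assumes a DynamoDB table
# with partition key "bundle_name" and sort key "bundle_version" already exists.
import json

import boto3

table = boto3.resource("dynamodb").Table("airflow_bundle_manifests")


def save_manifest(bundle_name: str, bundle_version: str, manifest: dict) -> None:
    # one item per bundle version, with the manifest stored as a JSON blob
    table.put_item(
        Item={
            "bundle_name": bundle_name,
            "bundle_version": bundle_version,
            "manifest": json.dumps(manifest),
        }
    )


def load_manifest(bundle_name: str, bundle_version: str) -> dict:
    item = table.get_item(
        Key={"bundle_name": bundle_name, "bundle_version": bundle_version}
    )["Item"]
    return json.loads(item["manifest"])
```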
> > > >> Anything "Amazon" to keep state would do. I think that it should be our "default" approach that if we have to keep state and the state is connected with a specific provider's implementation, it's best not to keep the state in Airflow, but in the "integration" that the provider works with, if possible. We cannot do it in the "generic" case because we do not know what "integrations" the user has - but since this is the "provider's" functionality, using anything else that the given integration provides makes perfect sense.
> > > >>
> > > >> J.
> > > >>
> > > >> On Wed, Jul 9, 2025 at 3:12 PM Pavankumar Gopidesu <gopidesupa...@gmail.com> wrote:
> > > >>
> > > >>> Agree, another s3 bucket also works here.
> > > >>>
> > > >>> Another option also would be using a DynamoDB table? That also supports snapshots and I feel it works very well with state management.
> > > >>>
> > > >>> Pavan
> > > >>>
> > > >>> On Wed, Jul 9, 2025 at 2:06 PM Jarek Potiuk <ja...@potiuk.com> wrote:
> > > >>>
> > > >>>> One of the options would be to use a similar approach to the one terraform uses - i.e. use dedicated "metadata" state storage in a DIFFERENT s3 bucket than the DAG files. Since we know there must be an S3 available (obviously) - it seems not too excessive to assume that there might be another bucket, independent of the DAG bucket, where the state is stored - the same bucket (and a dedicated connection id) could even be used to store state for multiple S3 dag bundles - each Dag bundle could have a dedicated object storing the state. The metadata is not huge, so continuously reading and replacing it should not be an issue.
> > > >>>>
> > > >>>> What's nice about it - this single object could even **actually** use S3 versioning to keep historical state - to optimize things and potentially keep a log of changes.
> > > >>>>
> > > >>>> J.
> > > >>>>
> > > >>>> On Wed, Jul 9, 2025 at 3:01 AM Oliveira, Niko <oniko...@amazon.com.invalid> wrote:
> > > >>>>
> > > >>>>> Hey folks,
> > > >>>>>
> > > >>>>> tl;dr I'd like to get some thoughts on a proposal to use DB Manager for S3 Dag Bundle versioning.
> > > >>>>>
> > > >>>>> The initial commit for S3 Dag Bundles was recently merged [1] but it lacks Bundle versioning (since this isn't trivial with something like S3, like it is with Git). The proposed solution involves building a snapshot of the S3 bucket at the time each Bundle version is created, noting the version of all the objects in the bucket (using S3's native bucket versioning feature), creating a manifest to store those versions, and then giving that whole manifest itself some unique id/version/uuid.
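(Just to anchor the snapshot/manifest part of the quoted proposal in code - roughly this, I believe; a sketch only, with the bucket and prefix as placeholders and a uuid for the manifest id, as described:)

```python
# Sketch only: builds the per-bundle-version manifest from S3's native bucket versioning.
from __future__ import annotations

import uuid

import boto3


def build_manifest(bucket: str, prefix: str) -> dict:
    """Snapshot the current version id of every object under the prefix."""
    s3 = boto3.client("s3")
    objects: dict[str, str] = {}
    for page in s3.get_paginator("list_object_versions").paginate(Bucket=bucket, Prefix=prefix):
        for version in page.get("Versions", []):
            if version["IsLatest"]:  # only the live version of each object
                objects[version["Key"]] = version["VersionId"]
    return {
        "manifest_id": uuid.uuid4().hex,  # the bundle version handed back to Airflow
        "prefix": prefix,
        "objects": objects,  # object key -> S3 version id
    }
```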
> > > >>>>> These manifests now need to be stored somewhere for future use/retrieval. The proposal is to use the Airflow database, via the DB Manager feature. Other options include using the local filesystem to store them (but this obviously won't work in Airflow's distributed architecture) or the S3 bucket itself (but this requires write access to the bucket, and we will always be at the mercy of the user accidentally deleting/modifying the manifests as they try to manage the lifecycle of their bucket; they should not need to be aware of or need to account for this metadata). So the Airflow DB works nicely as a persistent and internally accessible location for this data.
> > > >>>>>
> > > >>>>> But I'm aware of the complexities of using the DB Manager and the discussion we had during the last dev call about providers vending DB tables (concerning migrations and ensuring smooth upgrades or downgrades of the schema). So I wanted to reach out to see what folks thought. I have talked to Jed, the Bundle Master (tm), and we haven't come up with anything else that solves the problem as cleanly, so the DB Manager is still my top choice. I think what we go with will pave the way for other Bundle providers of a similar type as well, so it's worth thinking deeply about this decision.
> > > >>>>>
> > > >>>>> Let me know what you think and thanks for your time!
> > > >>>>>
> > > >>>>> Cheers,
> > > >>>>> Niko
> > > >>>>>
> > > >>>>> [1] https://github.com/apache/airflow/pull/46621