Does anyone have any more comments? I think I will make a PR soon describing the "Public API" of Airflow (as a work in progress), and we can discuss the details in the PR.
Also, the 'common.api' approach does not need to be the "chosen" one to automate the API description - this is just an example.

Just so we all know - over the last few months I have been using "common.sql" as a "testbed" for various problems and approaches that involve common code and exposing it to multiple users in the context of Airflow. We have already learned some super valuable lessons (and there are quite a few PRs, issues and discussions I can refer to when we discuss potentially splitting out and separating providers in the future, as good examples of what "*might*" happen and what we should take care of if/when we split).

So for me the current approach with MyPy, stubgen and the way we are going to keep "API" changes in check is an experiment that we **might** apply to Airflow if we find the approach useful. We do not have to decide now, and we do not even have to implement anything in order to define what we **think** the Airflow API is. I hope we can do the definition first and look at the automation when we get some lessons from common.sql (which is a cool example because it's small but it evolves quickly enough to get some learnings - including some failures we learn from).

This would be my current approach now:

* start defining what Public API is
* learn more about "keeping API in-check" from common.sql
* see how we can improve automation around the API check

J.

On Mon, Dec 5, 2022 at 8:58 PM Oliveira, Niko <oniko...@amazon.com.invalid> wrote:

> 1) users "peace of mind" as top priority: clarity of what they can
> expect from Airflow, and avoiding surprises when upgrading
> 2) targeting minimal disruption to user's workflows (though we might
> never reach absolute 100%)
> 3) making it easy for contributors and maintainers to decide on
> breaking/non-breaking behaviours
>
> Yupp, I agree, this is an accurate encapsulation of the issues at hand.
>
> My proposal to work on documenting our approach for our users (and
> for maintainers) in a single page: "What is Airflow Public API?" and
> what users can expect.
>
> I think this is actually a very important piece we've been missing. From
> the SemVer RFC itself it says:
>
> "*For this system to work, you first need to declare a public API. This
> may consist of documentation or be enforced by the code itself. Regardless,
> it is important that this API be clear and precise. Once you identify your
> public API, you communicate changes to it with specific increments to your
> version number.*"
>
> So as difficult as I think it will be to accurately describe and automate
> what the Airflow public API is, I think it's a very useful project to
> undertake. Perhaps even codifying it in an AIP.
> At the moment we consider even the deepest/smallest "private" helper
> function within util provider code to be public. This level of public API
> makes iterating and maintaining the code very laborious. So I definitely
> think this is worth the effort.
> I'll need to have a closer look at that PR, but the exact technical
> details can certainly be hammered out later.
>
> Cheers,
> Niko
>
> ------------------------------
> *From:* Jarek Potiuk <ja...@potiuk.com>
> *Sent:* Saturday, December 3, 2022 1:25 AM
> *To:* dev@airflow.apache.org
> *Subject:* RE: [EXTERNAL][DISCUSSION] Assessing what is a breaking change
> for Airflow (SemVer context)
>
> CAUTION: This email originated from outside of the organization. Do not
> click links or open attachments unless you can confirm the sender and know
> the content is safe.
>
> Sorry for not following up on this for a bit - it's been hectic these
> days for me.
> I think valid points were made, and from the tone of
> those I feel that all of us who participated have the same sense of what
> is important:
>
> 1) users "peace of mind" as top priority: clarity of what they can
> expect from Airflow, and avoiding surprises when upgrading
> 2) targeting minimal disruption to user's workflows (though we might
> never reach absolute 100%)
> 3) making it easy for contributors and maintainers to decide on
> breaking/non-breaking behaviours
>
> I think there is a main blocker to all of those (also mentioned in the
> discussion above):
>
> We are extremely cautious about any change because there is a lack of
> agreement/expectations with our users on what is supposed to be the
> "public API".
>
> # Proposal
>
> My proposal is to work on documenting our approach for our users (and
> for maintainers) in a single page: "What is Airflow Public API?" and
> what users can expect.
>
> There are certain areas where we can define rules and either automate
> or document (or both) our statement about what is the "public" API and
> (more importantly) what is clearly NOT, all in a single-page document.
> It should also be accompanied (where possible) with some
> automation and tooling that would help us to express it in detail (and
> help our users to validate if they are conforming to the "public
> API").
>
> We won't solve it very quickly, but once we start doing it, it might
> turn out that it's not that long of a process in fact. And if we start
> it now - in a few months we might be in a different place.
>
> # Some concrete actions we might take
>
> 1) On the "Code" level - we can start to define the API that is
> considered as "public" and add verification of those for our users.
> We could implement a similar solution to what I proposed to common.sql
> https://github.com/apache/airflow/pull/27962 (where I followed Ash's
> idea to use MyPy stubgen and pre-commits to flag changes to it, and
> where we harness MyPy capabilities to control how the API is used). I
> believe that we could apply a similar solution to all providers and
> eventually even all parts of core, to make it very clear which part of
> the Airflow API is public and which is not. I think MyPy and
> strong-ish typing is taking the Python world by storm, and we could
> use it as a standard way of communicating to those who use Airflow as
> a library, which parts are "public".
>
> Having .pyi files as part of our packages with "hidden" parts that are
> not supposed to be exposed, seems to be not only a nice communication
> tool but also has support for all kinds of tooling from day 0 for
> our users (IDE integrations, automations to check if the right API is
> used etc.). We could even easily provide guidelines for the users
> "Here is how you can check if you are using Airflow code properly".
> Not 100% foolproof but much better than anything else I can imagine.
>
> Also having it in place will allow the providers to be finally
> split into separate repositories - and we could use MyPy checks
> rather than running the full test suite with the Providers to verify
> if changes in Airflow do not break Providers. That would finally make
> it possible to loosen the coupling we have between Providers and
> Airflow (currently we basically run the whole suite of tests to be certain
> things are working - but we could simply run providers with MyPy
> checks if we have proper .pyi files: not the same confidence but very,
> very close).
>
> 2) On the DB level - we already have "AIP-44" as the foundation of
> telling the users - those are the "Airflow" you can do "this" when you
> write your DAGs.
> Direct DB access will be forbidden and we can
> specifically communicate to the users "do not use DB any more" and we
> can even work out warnings when our users do. We could even make it a
> default behaviour later to block direct access by default (but that is
> likely only in Airflow 3).
>
> 3) On the UI level - we could simply explain that UI changes are
> exempt from the "no removal" policy. We might simply treat all the UI
> changes as non-breaking by default and loosen our strictness there.
> This would be very close to the Chrome/Firefox example by Bolke - I
> think UI changes are not breaking in the sense that you have to fix
> your code that uses it, it requires simply changing user's habits.
> We've already done this. That would simply be acknowledging the
> approach we already used when TreeView was replaced by GridView.
>
> 4) Airflow also has a few non-code interfaces that are considered
> as part of the platform: statsd metrics is one of them. I can't think
> of any more but maybe there are more. We could simply make an
> inventory and discuss our approach on those ONCE and document it. This
> will avoid discussions, discussions, discussions, and let our users
> have some clear expectations and maintainers making quick decisions
> when approving (or not) PRs.
>
> # Question
>
> Does it sound like a good plan? Is it worth making such an effort? Or
> maybe what we have as status-quo is "good enough" and that would be a
> waste of effort? WDYT?
>
> J.
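
For illustration only, here is a minimal sketch of the "flag changes to the public API" idea discussed in this thread. It is not the MyPy stubgen / pre-commit setup from the common.sql PR: it uses only Python's standard `ast` module, it assumes that every top-level function or class without a leading underscore is "public", and the helper names `public_api` and `api_changes` are hypothetical.

```python
import ast


def public_api(source: str) -> dict[str, str]:
    """Map each public top-level function/class name to a one-line signature.

    Assumption for this sketch: a name is public when it has no leading
    underscore. A real policy may be stricter or more explicit.
    """
    api: dict[str, str] = {}
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            if not node.name.startswith("_"):
                api[node.name] = f"def {node.name}({ast.unparse(node.args)})"
        elif isinstance(node, ast.ClassDef):
            if not node.name.startswith("_"):
                api[node.name] = f"class {node.name}"
    return api


def api_changes(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """List removed or changed public names; additions are not reported."""
    changes = []
    for name, signature in sorted(old.items()):
        if name not in new:
            changes.append(f"removed: {signature}")
        elif new[name] != signature:
            changes.append(f"changed: {signature} -> {new[name]}")
    return changes


# Hypothetical "before" and "after" versions of a small module.
OLD_SOURCE = """
def run(sql): ...
def fetch_all(cursor): ...
def _helper(): ...
"""

NEW_SOURCE = """
def run(sql, limit=10): ...
def _helper(x): ...
"""

for line in api_changes(public_api(OLD_SOURCE), public_api(NEW_SOURCE)):
    print(line)
```

A check like this could compare the current code against a committed snapshot, for example in a pre-commit hook. Every reported line still needs a human decision: a changed signature is not necessarily a breaking change, and private names such as `_helper` are deliberately ignored.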