Hi folks,

thanks for joining the discussion. I'd like to share some ideas on how
certain concerns could be addressed:

Ingo:
> In general I think breaking up the big repo would be a good move with many
> benefits (which you have outlined already). One concern would be how to
> proceed with our docs / examples if we were to really separate out all
> connectors.
>

I don't see any issue at all with either option. You'd just have to update
the connector dependency for blog posts and starter examples.
Each connector page should provide specific examples itself.
Note that I would keep File Source/Sink in the main repo as they don't add
dependencies on their own. Formats and Filesystem may be externalized at a
much later point, after we have gained more knowledge on how to build a
real ecosystem with connectors.


> 1. More real-life examples would essentially now depend on external
> projects. Particularly if hosted outside the ASF, this would feel somewhat
> odd. Or to put it differently, if flink-connector-foo is not part of Flink
> itself, should the Flink Docs use it for any examples?
>
Why not? We also have blog posts that use external dependencies.

> 2. Generation of documentation (config options) wouldn't be possible unless
> the docs depend on these external projects, which would create weird
> version dependency cycles (Flink 1.X's docs depend on flink-connector-foo
> 1.X which depends on Flink 1.X).
>
Connector-specific config options should only appear on the connector
pages. So we need to incorporate config option generation into the
connector template.


> 3. Documentation would inevitably be much less consistent when split across
> many repositories.
>
Fair point. If we use the same template as the Flink Web UI for connectors,
we could embed subpages directly in the main documentation. If we allow
that for all connectors, it would actually be less fragmented than now,
where some connectors are only described in Bahir or on external pages.


> As for your approaches, how would (A) allow hosting personal / company
> projects if only Flink committers can write to it?
>
That's entirely independent. In both options, and even now, there are
several connectors living on other pages. They are currently only findable
through a search engine, and we should fix that anyhow. See [1] for an
example of how Kafka Connect is doing it.

> Connectors may receive some sort of quality seal
>
> This sounds like a lot of work and process, and could easily become a
> source of frustration.
>
Yes, this is definitely some effort, but strictly less than maintaining the
connector in the community, as it only requires an occasional review.


Chesnay:
> What I'm concerned about, and which we never really covered in past
> discussions about split repositories, are
> a) ways to share infrastructure (e.g., CI/release utilities/codestyle)
>
I'd provide a common GitHub connector template that contains everything.
That of course means making these things public.

> b) testing
>
See below

> c) documentation integration
>
See Ingo's response.

>
> Particularly for b) we still lack any real public utilities.
> Even fundamental things such as the MiniClusterResource are not
> annotated in any way.
> I would argue that we need to sort this out before a split can happen.
> We've seen with the flink-benchmarks repo and recent discussions how
> easily things can break.
>
Yes, I agree, but given that we already have connectors outside of the main
repo, the situation can only improve. By moving the connectors out, we are
actually forced to provide a level playing field for everyone and thus
really enable the community to contribute connectors.
We also plan to finish the connector testing framework in 1.15.

> Related to that, there is the question on how Flink is then supposed to
> ensure that things don't break. My impression is that we heavily rely on
> the connector tests to that end at the moment.
> Similarly, what connector (version) would be used for examples (like the
> WordCount which reads from Kafka) or (e2e) tests that want to read
> something other than a file? You end up with this circular dependency
> which are always troublesome.
>
I agree that we must avoid any kind of circular dependency. There are a
couple of options that we will probably mix:
* Move connector specific e2e tests into connector repo
* Have nightly builds on connector repo and collect results in some
overview.
* React on failures, especially if several connectors fail at once.
* Have an e2e repo/module in Flink that has cross-connector tests etc.
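
As a sketch of what such a nightly build could look like (the workflow
name, schedule, and snapshot version here are my assumptions, not an
agreed-upon setup), each connector repo could run a scheduled CI job
against the latest Flink snapshot and report into the shared overview:

```yaml
# Hypothetical .github/workflows/nightly.yml in a connector repo
name: nightly
on:
  schedule:
    - cron: "0 3 * * *"   # once per night
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      # Build and test against the latest upstream Flink snapshot so that
      # breaking API changes surface in the overview the next morning.
      - run: mvn -B verify -Dflink.version=1.15-SNAPSHOT
```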

> As for the repo structure, I would think that a single one could
> work quite well (because having 10+ connector repositories is just a
> mess), but currently I wouldn't set it up as a single project.
> I would rather have something like N + 1 projects (one for each
> connectors + a shared testing project) which are released individually
> as required, without any snapshot dependencies in-between.
> Then 1 branch for each major Flink version (again, no snapshot
> dependencies). Individual connectors can be released at any time against
> any of the latest bugfix releases, which due to lack of binaries (and
> python releases) would be a breeze.
>
This sounds like a good idea, but it's not entirely clear how independent
releases would work. I guess we just bump versions independently?
The only thing it wouldn't solve is how we can give fine-grained
contributor permissions, but maybe that's your intent.
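
To illustrate what independent version bumps could look like from a user's
perspective (the coordinates and version numbers below are purely
hypothetical), the connector would carry its own version plus a suffix for
the Flink major version it targets:

```xml
<!-- Hypothetical coordinates: the connector is released on its own
     cadence, with a suffix encoding the Flink major version it was
     built against. -->
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-connector-kafka</artifactId>
  <version>3.2.0-1.15</version>
</dependency>
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-streaming-java</artifactId>
  <version>1.15.2</version>
  <scope>provided</scope>
</dependency>
```

That way a bugfix release of the connector never has to wait for a Flink
release, and vice versa.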

I don't like the idea of moving existing connectors out of the Apache
> organization. At the very least, not all of them. While some are
> certainly ill-maintained (e.g., Cassandra) where it would be neat if
> external projects could maintain them, others (like Kafka) are not and
> quite fundamental to actually using Flink.
>
I would like to avoid treating some connectors differently from others by
design. In reality, we can assume that some connectors will receive more
love than others. However, if we already treat some connectors "better"
than others, we may run into a vicious cycle where the "bad" ones never
improve.
Nevertheless, I'd also be fine to just start with some of them and move
others later.


David:
> - How exactly are we going to maintain the high quality standard of the
> connectors?
>
Which high quality standard? Neither many of Flink's connectors nor Bahir's
connectors are shining examples of well-maintained connectors, imho. So I'd
like to get a more specific example of what you think will happen
quality-wise. You have gained some experience in Beam and on other
occasions.

> - What would the connector release cycle look like? Is this going to
> affect the Flink release cycle?
>
Ideally, they are completely separate.


> - How would the documentation process / generation look like?
>
Documentation is generated on release on the connector page and deep-linked
from the Flink connector page (see response to Ingo).


> - Not all of the connectors rely solely on the Stable APIs. Moving them
> outside of the Flink code-base will make any refactoring on the Flink side
> significantly more complex, as it potentially needs to be reflected in all
> connectors. There are some possible solutions, such as Gradle's included
> builds, but we're far away from that. How are we planning to address this?
>
The goal is to have stable APIs. We already got complaints about connector
API stability in 1.14 exactly because we do not see the impact of our API
changes. The Source interface is now Public, and the Sink interface needs
to become Public asap; then this concern is hopefully resolved.

- How would we develop connectors against unreleased Flink version? Java
> snapshots have many limits when used for the cross-repository development.
>
I was actually betting on snapshot versions. What are the limits?
Obviously, we can only release a 1.15 connector after 1.15 is
released.
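
For the development phase, I'd expect the connector repo to simply depend
on the upstream snapshot (a sketch; the artifact choice is mine, the
snapshot repository is the standard Apache one):

```xml
<!-- Depend on the unreleased Flink version during development only;
     the actual connector release happens after Flink 1.15 is out. -->
<repositories>
  <repository>
    <id>apache-snapshots</id>
    <url>https://repository.apache.org/content/repositories/snapshots</url>
  </repository>
</repositories>
<dependencies>
  <dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-core</artifactId>
    <version>1.15-SNAPSHOT</version>
  </dependency>
</dependencies>
```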


> - With appropriate tooling, this whole thing is achievable even with the
> single repository that we already have. It's just a matter of having a more
> fine-grained build / release process. Have you tried to research this
> option?
>
The central point, decoupled release processes and maintainership, seems to
be impossible in one repo under the ASF. Could you elaborate on how you
would approach it?


Leonard:
> I only have one concern. Once we migrate these connectors to external
> projects, how can we ensure them with high quality? All the built-in
> connectors of Flink are developed or reviewed by the committers.
>
Same as to David, I'm not sure we have the same view on the quality of the
existing connectors.


> The reported connector bugs from JIRA and mailing lists are quickly
> fixed currently; how does the Flink community ensure the development rhythm
> of the connector after the move? In other words, are these connectors still
> first-class citizens of the Flink community? If so, how do we guarantee that?
>
We currently have 667 open issues that are connector related [2], of which
443 were created in 2020 or earlier. So, I wouldn't say connector bugs are
quickly fixed at all, and I would certainly not say connectors are
currently first-class citizens of Flink. I personally try to change that,
but it takes a long time.
My goal is to get more contributions from the community by making it much
easier to engage. Ideally, each connector is a self-managed GitHub project
where everyone can contribute in a much quicker fashion. There is a risk of
fragmentation, but it can't be much worse than the current state, where
it's really hard to find a maintainer for a particular connector in the
main Flink repo (who knows Cassandra and could fix something or review a
PR? PubSub? HBase? Nifi?).

> Recently, I have maintained a series of cdc connectors in the Flink CDC
> project. My feeling is that it is not easy to develop and maintain
> connectors. Contributors to the Flink CDC project have done some approaches
> in this way, such as building connector integration tests [2], document
> management [3]. Personally, I don’t have a strong tendency to move the
> built-in connectors out or keep them. If the final decision of this thread
> discussion  turns out to move out, I’m happy to share our experience and
> provide help in the new connector project.
>
I think that's the main goal of moving connectors out: If we are going to
maintain _some_ connectors outside Flink, we will make sure that all the
tooling is there. Developing connectors shouldn't just be easy for the
Flink devs; it should be easy for everyone.

TL;DR I think the main misconception in the four question blocks is that we
somehow have good connector quality or that things are working fine as is.
No, it's a mess. We need to change something, and ideally in a way that
scales beyond what we have now.
I'm personally fine with the one-external-repo approach, and if we can
streamline the release process, it may even work under the ASF umbrella.
But we should do whatever we can to get more folks working on connectors
and have faster releases.
Releases need to be decoupled from the Flink release cycle such that users
actually contribute patches to solve their issues. If we still depend on
upstream Flink, most users will not see their fixes within a year (release
+ trying it out on the user's side + upgrade), and that's a terrible
incentive.

[1]
https://www.confluent.io/hub/dariobalinzo/kafka-connect-elasticsearch-source
[2]
https://issues.apache.org/jira/issues/?jql=project%20%3D%20FLINK%20AND%20component%20in%20(%22Connectors%20%2F%20Cassandra%22%2C%20%22Connectors%20%2F%20Common%22%2C%20%22Connectors%20%2F%20ElasticSearch%22%2C%20%22Connectors%20%2F%20FileSystem%22%2C%20%22Connectors%20%2F%20Google%20Cloud%20PubSub%22%2C%20%22Connectors%20%2F%20Hadoop%20Compatibility%22%2C%20%22Connectors%20%2F%20HBase%22%2C%20%22Connectors%20%2F%20Hive%22%2C%20%22Connectors%20%2F%20JDBC%22%2C%20%22Connectors%20%2F%20Kafka%22%2C%20%22Connectors%20%2F%20Kinesis%22%2C%20%22Connectors%20%2F%20Nifi%22%2C%20%22Connectors%20%2F%20ORC%22%2C%20%22Connectors%2F%20RabbitMQ%22%2C%20FileSystems)%20AND%20status%20%3D%20Open&startIndex=200


On Mon, Oct 18, 2021 at 1:00 PM David Morávek <d...@apache.org> wrote:

> We are mostly talking about the freedom this would bring to the connector
> authors, but we still don't have answers for the important topics:
>
> - How exactly are we going to maintain the high quality standard of the
> connectors?
> - What would the connector release cycle look like? Is this going to
> affect the Flink release cycle?
> - How would the documentation process / generation look like?
> - Not all of the connectors rely solely on the Stable APIs. Moving them
> outside of the Flink code-base will make any refactoring on the Flink side
> significantly more complex, as it potentially needs to be reflected in all
> connectors. There are some possible solutions, such as Gradle's included
> builds, but we're far away from that. How are we planning to address this?
> - How would we develop connectors against unreleased Flink version? Java
> snapshots have many limits when used for the cross-repository development.
> - With appropriate tooling, this whole thing is achievable even with the
> single repository that we already have. It's just a matter of having a more
> fine-grained build / release process. Have you tried to research this
> option?
>
> I'd personally strongly suggest against moving the connectors out of the
> ASF umbrella. The ASF brings legal guarantees, hard gained trust of the
> users and high quality standards to the table. I still fail to see any good
> reason for giving this up. Also this decision would be hard to reverse,
> because it would most likely require a new donation to the ASF (would this
> require a consent from all contributors as there is no clear ownership?).
>
> Best,
> D.
>
>
> On Mon, Oct 18, 2021 at 12:12 PM Qingsheng Ren <renqs...@gmail.com> wrote:
>
>> Thanks for driving this discussion Arvid! I think this will be one giant
>> leap for Flink community. Externalizing connectors would give connector
>> developers more freedom in developing, releasing and maintaining, which can
>> attract more developers for contributing their connectors and expand the
>> Flink ecosystems.
>>
>> Considering the position for hosting connectors, I prefer to use an
>> individual organization outside Apache umbrella. If we keep all connectors
>> under Apache, I think there’s not quite difference comparing keeping them
>> in the Flink main repo. Connector developers still require permissions from
>> Flink committers to contribute, and release process should follow Apache
>> rules, which are against our initial motivations of externalizing
>> connectors.
>>
>> Using an individual GitHub organization will maximize the freedom provided
>> to developers. An ideal structure in my mind would be like "
>> github.com/flink-connectors/flink-connector-xxx". The newly established
>> flink-extended org might be another choice, but considering the amount of
>> connectors, I prefer to use an individual org for connectors to avoid
>> flushing other repos under flink-extended.
>>
>> In the meantime, we need to provide a well-established standard /
>> guideline for contributing connectors, including CI, testing, docs (maybe
>> we can’t provide resources for running them, but we should give enough
>> guide on how to set one up) to keep the high quality of connectors. I’m
>> happy to help build these fundamental bricks. Also, since the Kafka
>> connector is widely used among Flink users, we can make it a “model” of
>> how to build and contribute a well-qualified connector into Flink
>> ecosystem, and we can still use this trusted one for Flink E2E tests.
>>
>> Again I believe this will definitely boost the expansion of Flink
>> ecosystem. Very excited to see the progress!
>>
>> Best,
>>
>> Qingsheng Ren
>> On Oct 15, 2021, 8:47 PM +0800, Arvid Heise <ar...@apache.org>, wrote:
>> > Dear community,
>> > Today I would like to kickstart a series of discussions around creating
>> an external connector repository. The main idea is to decouple the release
>> cycle of Flink with the release cycles of the connectors. This is a common
>> approach in other big data analytics projects and seems to scale better
>> than the current approach. In particular, it will yield the following
>> changes.
>> >  • Faster releases of connectors: New features can be added more
>> quickly, bugs can be fixed immediately, and we can have faster security
>> patches in case of direct or indirect (through dependencies) security
>> flaws.
>> • New features can be added to old Flink versions: If the connector
>> API didn’t change, the same connector jar may be used with different Flink
>> versions. Thus, new features can also immediately be used with older Flink
>> versions. A compatibility matrix on each connector page will help users to
>> find suitable connector versions for their Flink versions.
>> • More activity and contributions around connectors: If we ease the
>> contribution and development process around connectors, we will see faster
>> development and also more connectors. Since that heavily depends on the
>> chosen approach discussed below, more details will be shown there.
>> • An overhaul of the connector page: In the future, all known connectors
>> will be shown on the same page in a similar layout independent of where
>> they reside. They could be hosted on external project pages (e.g., Iceberg
>> and Hudi), on some company page, or may stay within the main Flink
>> repository. Connectors may receive some sort of quality seal such that
>> users can quickly assess the production-readiness, and we could also add
>> which community/company promises which kind of support.
>> • If we take (some) connectors out of Flink, Flink CI will be faster
>> and Flink devs will experience fewer build instabilities (which mostly
>> come from connectors). That would also speed up Flink development.
>> > Now I’d first like to collect your viewpoints on the ideal state. Let’s
>> first recap which approaches we currently have:
>> >  • We have half of the connectors in the main Flink repository.
>> Relatively few of them have received updates in the past couple of
>> months.
>> • Another large chunk of connectors is in Apache Bahir, which recently
>> has seen its first release in 3 years.
>> • There are a few other (Apache) projects that maintain a Flink
>> connector, such as Apache Iceberg, Apache Hudi, and Pravega.
>> • A few connectors are listed on company-related repositories, such as
>> Apache Pulsar on StreamNative and CDC connectors on Ververica.
>> > My personal observation is that having a repository per connector seems
>> to increase the activity on a connector as it’s easier to maintain. For
>> example, in Apache Bahir all connectors are built against the same Flink
>> version, which may not be desirable when certain APIs change; for example,
>> SinkFunction will eventually be deprecated and removed, but the new Sink
>> interface may gain more features.
>> > Now, I'd like to outline different approaches. All approaches will
>> allow you to host your connector on any kind of personal, project, or
>> company repository. We still want to provide a default place where users
>> can contribute their connectors and hopefully grow a community around it.
>> The approaches are:
>> >  1. Create a mono-repo under the Apache umbrella where all connectors
>> will reside, for example, github.com/apache/flink-connectors. That
>> repository needs to follow its rules: no GitHub issues, no Dependabot or
>> similar tools, and a strict manual release process. It would be under the
>> Flink community, such that Flink committers can write to that repository
>> but no-one else.
>> 2. Create a GitHub organization with small repositories, for example
>> github.com/flink-connectors. Since it’s not under the Apache umbrella, we
>> are free to use whatever process we deem best (up to a future discussion).
>> Each repository can have a shared list of maintainers + connector-specific
>> committers. We can provide more automation. We may even allow different
>> licenses to incorporate things like a connector to Oracle that cannot be
>> released under the ASL.
>> 3. ??? <- please provide your additional approaches
>> > In both cases, we will provide opinionated module/repository templates
>> based on a connector testing framework and guidelines. Depending on the
>> approach, we may need to enforce certain things.
>> > I’d like to first focus on what the community would ideally seek and
>> minimize the discussions around legal issues, which we would discuss later.
>> For now, I’d also like to postpone the discussion if we move all or only a
>> subset of connectors from Flink to the new default place as it seems to be
>> orthogonal to the fundamental discussion.
>> > PS: If the external repository for connectors is successful, I’d also
>> like to move out other things like formats, filesystems, and metric
>> reporters in the far future. So I’m actually aiming for
>> github.com/(apache/)flink-packages. But again this discussion is
>> orthogonal to the basic one.
>> > PPS: Depending on the chosen approach, there may be synergies with the
>> recently approved flink-extended organization.
>>
>
