Re: [DISCUSS] Connectors in Apache Gobblin

Chris Li Wed, 24 Mar 2021 18:38:01 -0700

Hi JB, Shirshanka,

Thanks for sharing your insights!


On release cycles: if we move the connectors into a sub-repo, they will
generally be updated more frequently than core, so separate release cycles
should be more manageable. To facilitate that, I have already made Gobblin
core an external dependency in my code.

Having a catalog is a great point! Once we start decoupling release cycles, a
catalog will be essential; otherwise, the combinations of different releases
would be hard for end users to understand. Consider, for example, what happens
if we add a sub-repo for new state store managers in the future.

One question on having a sub-repo for each connector: with the generic
multistage connector, we don’t expect as many connectors as before. For
example, Salesforce, Zuora, and Google Ingestion are all replaced by it. In
that case, shall we keep connectors in a single sub-repo? I do see that Kafka
created sub-repos for the JDBC and Elasticsearch connectors, etc., but those
repos look extremely small. The corresponding practice for us would be to make
the “protocols” from the multistage connector into sub-repos, and those
sub-repos would then also have just a couple of Java classes each.

Regards,
Chris


> Hi Chris,
>   Thanks for this proposal! I think we have had quite a few issues with our
> monolithic repository and I think it has hindered the development and
> maintenance of new connectors.
>   JB makes some good points that are worth considering.
>
>   My 2c:
>    I think separating out the connectors into a separate repo, and in fact
> supporting multiple repos that can contain separate connectors is probably
> going to be my vote.
>    This will help us also clarify the "public API" of the Gobblin framework
> versus internal details that many connectors probably depend on today.
>
>  I would rather follow the Kafka Connect model: the core framework has
> APIs and is versioned independently from connector implementations, which
> can live in other repositories. Implementations should feature in the
> "Connector Matrix" as part of the documentation for discoverability.
>
> There can be an official catalog of supported connectors, and maybe that
> can be our first "repo" that Abhishek is proposing. But I would make sure
> we are not creating a new monorepo pattern with it.
>
> What do others think?
> Shirshanka
>
>
>
>
>
> On Mon, Mar 22, 2021 at 10:09 PM, Jean-Baptiste Onofre <[email protected]>
> wrote:
>
> Hi Chris,
>
> I agree that connectors are very important. Other Apache projects became
> popular mostly thanks to their set of connectors (I’m thinking about Apache
> Beam, Apache Camel, or Apache Karaf Decanter, for instance). Connectors allow
> more users to "integrate" Gobblin into their ecosystem, so they would grow
> our user community. They will also grow our dev community, as it’s
> probably easier to contribute to a connector than to the Gobblin core.
>
> About the repo vs module, there are two questions IMHO:
> 1. How do we keep the API/code in sync between Gobblin core and the
> connectors?
> 2. Do we plan to have a different release cycle for core and
> connectors (even if it’s always possible to release a module atomically)?
>
> IMHO, if we plan to do a Gobblin release including core + connectors, then
> a module is easier.
>
> Regards
> JB
>
> On 22 March 2021 at 23:44, Chris Li <[email protected]> wrote:
>
> Proposal:
>
> DIL (LinkedIn internal project name) is a generic multi-stage Gobblin
> connector library. The code can be accessed here:
> https://github.com/linkedin/gobblin-connectors
> Its core features and high-level descriptions are shared here:
> https://engineering.linkedin.com/blog/2021/data-integration-library
>
> Per initial discussion with members of Gobblin community, we are here
> proposing a separate sub-repo for this library.
>
> Why:
> Some thoughts/justifications of a sub-repo vs. a module in the main
> Gobblin repo.
>
> 1. Gobblin connectors are an important part of the Gobblin ecosystem, but
> the development of connectors is relatively independent of Gobblin core.
> 2. Connectors are where open source communities can contribute the
> most, and they will grow much faster than Gobblin core.
> 3. The new connector library is a comprehensive package of unique design
> patterns. This is where the data integration diversity challenge will be
> addressed. The importance of this code base grows by the day as more
> integration scenarios become supported.
> 4. The new connector library evolves and replaces many prior Gobblin
> connectors under the “gobblin-modules” module. A separate repo will help
> avoid confusion.
> 5. Separating core and ecosystem modules can help improve isolation and
> reduce the number of defects.
>
> Regards,
> Chris
>
>
