Hi Thanks for the proposal Jim! I will add you as a mentor then start the vote Cheers Olivier
On 16 February 2017 at 02:35, Jim Jagielski <j...@jagunet.com> wrote: > If you need/want another mentor, I volunteer > > > On Feb 14, 2017, at 3:53 PM, Olivier Lamy <ol...@apache.org> wrote: > > > > Hi > > Well I don't see issues as no one discuss the proposal. > > So I will start the official vote tomorrow. > > Cheers > > Olivier > > > > On 6 February 2017 at 14:08, Olivier Lamy <ol...@apache.org> wrote: > > > >> Hello everyone, > >> I would like to submit to you a proposal to bring Gooblin to the Apache > >> Software Foundation. > >> The text of the proposal is included below and available as a draft here > >> in the Wiki: https://wiki.apache.org/incubator/GobblinProposal > >> > >> We will appreciate any feedback and input. > >> > >> Olivier on behalf of the Gobblin community > >> > >> > >> = Apache Gobblin Proposal = > >> == Abstract == > >> Gobblin is a distributed data integration framework that simplifies > common > >> aspects of big data integration such as data ingestion, replication, > >> organization and lifecycle management for both streaming and batch data > >> ecosystems. > >> > >> == Proposal == > >> > >> Gobblin is a universal data integration framework. The framework has > been > >> used to build a variety of big data applications such as ingestion, > >> replication, and data retention. The fundamental constructs provided by > the > >> Gobblin framework are: > >> > >> 1. An expandable set of connectors that allow data to be integrated from > >> a variety of sources and sinks. The range of connectors already > available > >> in Gobblin are quite diverse and are an ever expanding set. To highlight > >> just a few examples, connectors exist for databases (e.g., MySQL, Oracle > >> Teradata, Couchbase etc.), web based technologies (REST APIs, FTP/SFTP > >> servers, Filers), scalable storage (HDFS, S3, Ambry etc,), streaming > data > >> (Kafka, EventHubs etc.), and a variety of proprietary data sources and > >> sinks (e.g.Salesforce, Google Analytics, Google Webmaster etc.). > Similarly, > >> Gobblin has a rich library of converters that allow for conversion of > data > >> from one format to another as data moves across system boundaries (e.g. > >> AVRO in HDFS to JSON in another system). > >> > >> > >> 2. Gobblin has a well defined and customizable state management layer > >> that allows writing stateful applications. These are particularly useful > >> when solving problems like bulk incremental ingest and keeping several > >> clusters replicated in sync. The ability to record work that has been > >> completed and what remains in a scalable manner is critical to writing > such > >> diverse applications successfully. > >> > >> > >> 3. Gobblin is agnostic to the underlying execution engine. It can be > >> tailored to run ontop of a variety of execution frameworks ranging from > >> multiple processes on a single node, to open source execution engines > like > >> MapReduce, Spark or Samza, natively on top of raw containers like Yarn > or > >> Mesos, and the public cloud like Amazon AWS or Microsoft Azure. We are > >> extending Gobblin to run on top of a self managed cluster when security > is > >> vital. This allows different applications that require different > degrees > >> of scalability, latency or security to be customized to for their > specific > >> needs. For example, highly latency sensitive applications can be > executed > >> in a streaming environment while batch based execution might benefit > >> applications where the priority might be geared towards optimal > container > >> utilization. > >> > >> 4.Gobblin comes out of the box with several diagnosability features like > >> Gobblin metrics and error handling. Collectively, these features allow > >> Gobblin to operate at the scale of petabytes of data. To give just one > >> example, the ability to quarantine a few bad records from an isolated > Kafka > >> topic without stopping the entire flow from continued execution is vital > >> when the number of Kafka topics range in the thousands and the > collective > >> data handled is in the petabytes. > >> > >> Gobblin thus provides crisply defined software constructs that can be > used > >> to build a vast array of data integration applications customizable for > >> varied user needs. It has become a preferred technology for data > >> integration use-cases by many organizations worldwide (see a partial > list > >> here). > >> > >> == Background == > >> > >> Over the last decade, data integration has evolved use case by use case > in > >> most companies. For example, at LinkedIn, when Kafka became a > significant > >> part of the data ecosystem, a system called Camus was built to ingest > this > >> data for analytics processing on Hadoop. Similarly, we had custom > pipelines > >> to ingest data from Salesforce, Oracle and myriad other sources. This > >> pattern became the norm rather than the exception and one point, > LinkedIn > >> was running at least fifteen different types of ingestion pipelines. > This > >> fragmentation has several unfortunate implications. Operational costs > scale > >> with the number of pipelines even if the myriad pipelines share a vasty > >> array of common features. Bug fixes and performance optimizations > cannot be > >> shared across the pipelines. A common set of practices around debugging > and > >> deployment does not emerge. Each pipeline operator will continue to > invest > >> in his little silo of the data integration world completely oblivious to > >> the challenges of his fellow operator sitting five tables down. > >> > >> These experiences were the genesis behind the design and implementation > of > >> Gobblin. Gobblin thus started out as a universal data ingestion > framework > >> focussed on extracting, transforming, and synchronizing large volumes of > >> data between different data sources and sinks. Not surprisingly, given > its > >> origins, the initial design of Gobblin placed great emphasis on > >> abstractions that can be leveraged repeatedly. These abstractions have > >> stood the test of time at LinkedIn and we have been able to leverage the > >> constructs well beyond ingest. Gobblin's architecture has allowed us at > >> LinkedIn to use it for a variety of applications ranging from from > optimal > >> format conversion to adhering to compliance policies set by European > >> standards. Finally, as noted earlier, Gobblin can be deployed in a > variety > >> of execution environments: it can be deployed as a library embedded in > >> another application or can be used to execute jobs on a public cloud. A > >> fluid architectural and execution design story has allowed Gobblin to > >> become a truly successful data integration platform. > >> > >> Gobblin has continued to evolve with a variety of utility packages like > >> Gobblin metrics and Gobblin config management. Collectively, these allow > >> organizations utilizing Gobblin to use a system that has been battle > tested > >> at LinkedIn scale. This is something that its consumers have to come to > >> appreciate greatly. > >> > >> == Rationale == > >> > >> Gobblin's entry to the Apache foundation is beneficial to both the > Gobblin > >> and the Apache communities. Gobblin has greatly benefited from its open > >> source roots. Its community and adoption has grown greatly as a result. > >> More importantly, the feedback from the community whether through > >> interactions at meetups or through the mailing list have allowed for a > rich > >> exchange of ideas. In order to grow up the Gobblin community and improve > >> the project, we would like to propose Gobblin to the Apache incubator. > The > >> Gobblin community will greatly benefit from the established development > and > >> consensus processes that have worked well for other projects. The Apache > >> process has served many other open source projects well and we believe > that > >> the Gobblin community will greatly benefit from these practices as well. > >> > >> == Initial Goals == > >> > >> Migrate the existing codebase to Apache > >> Study and Integrate with the Apache development process > >> Ensure all dependencies are compliant with Apache License version 2.0 > >> Incremental development and releases per Apache guidelines > >> Improve the relationship between Gobblin and other Apache projects > >> > >> == Current Status == > >> > >> Gobblin has undergone five major releases (0.5, 0.6, 0.7, 0.8, 0.9) and > >> many minor ones. The latest version, Gobblin 0.9 has just been released > in > >> December, 2016. Gobblin is being used in production by over 20 > >> organizations. Gobblin codebase is currently hosted at github.com, > which > >> will seed the Apache git repository. > >> > >> === Meritocracy === > >> > >> We plan to invest in supporting a meritocracy. We will discuss the > >> requirements in an open forum. Several companies have already expressed > >> interest in this project, and we intend to invite additional developers > to > >> participate. We will encourage and monitor community participation so > that > >> privileges can be extended to those that contribute. > >> > >> === Community === > >> > >> The need for a extensible and flexible data integration platform in the > >> open source is tremendous. Gobblin is currently being used by at least > 20 > >> organizations worldwide (some examples are listed here). By bringing > >> Gobblin into Apache, we believe that the community will grow even > bigger. > >> > >> === Core Developers === > >> > >> Gobblin was started by engineers at LinkedIn, and now has developers > from > >> Google, Facebook, LinkedIn, Cloudera, Nerdwallet, Swisscom, and many > other > >> companies. > >> > >> === Alignment === > >> > >> Gobblin aligns exceedingly well with the Apache ecosystem. Gobblin is > >> built leveraging several existing Apache projects (Apache Helix, Yarn, > >> Zookeeper etc.). As Gobblin matures, we expect to leverage several other > >> Apache projects further. This leverage invariably results in > contributions > >> back to these projects (e.g., a contribution to Helix was made during > the > >> Gobblin Yarn development). Finally, being an integration platform, it > >> serves as a bridge between several Apache projects like Apache Hadoop > and > >> Apache Kafka. This integration is highly desired and their interaction > >> through Gobblin will lead to a virtuous cycle of greater adoption and > newer > >> features in these projects. Thus, we believe that it will be a nice > >> addition to the current set of big data projects under the auspices of > the > >> Apache foundation. > >> > >> == Known Risks == > >> > >> === Orphaned Products === > >> > >> The risk of the Gobblin project being abandoned is minimal. As noted > >> earlier, there are many organizations that have already invested in > Gobblin > >> significantly and are thus incentivized to continue development. Many of > >> these organizations operate critical data ingest, compliance and > retention > >> pipelines built with Gobblin and are thus heavily invested in the > continued > >> success of Gobblin. > >> > >> === Inexperience with Open Source === > >> > >> Gobblin has existed as a healthy open source project for several years. > >> During that time, we have curated an open-source community successfully. > >> Any risks that we foresee are ones associated with scaling our open > source > >> communication and operation process rather than with inherent > inexperience > >> in operating an open source project. > >> > >> === Homogenous Developers === > >> > >> Gobblin’s committers are employed by companies of varying sizes and > >> industry. Committers come from well heeled internet companies like > Google, > >> LinkedIn and Facebook. We also have developers from traditional > enterprise > >> companies like SwissCom. Well funded startups like Nerdwallet are > active in > >> the community of developers. We plan to double our efforts in > cultivating > >> a diverse set of committers for Gobblin. > >> > >> === Reliance on Salaried Developers === > >> > >> It is expected that Gobblin development will occur on both salaried time > >> and on volunteer time, after hours. The majority of initial committers > are > >> paid by their employer to contribute to this project. However, they are > all > >> passionate about the project, and we are confident that the project will > >> continue even if no salaried developers contribute to the project. We > are > >> committed to recruiting additional committers including non-salaried > >> developers. > >> > >> === Relationships with Other Apache Products === > >> > >> As noted earlier, Gobblin leverages several open source projects and > >> contributes back to them. There is also overlap with aspects of other > >> Apache projects that we will discuss briefly here. Apache Nifi, like > >> Gobblin aspires to reduce the operational overhead arising from data > >> heterogeneity. Apache Nifi is structured as a visual flow based approach > >> and provides built-in constructs for buffering data, prioritizing data, > and > >> understanding data lineage as data flows across systems. Apache Nifi has > >> its own dataflow based execution engine with buffering, scheduling and > >> streaming capabilities. Apache Falcon is a Hadoop centric data > governance > >> engine for defining, scheduling, and monitoring data management policies > >> through flow definition typically for data that has been ingested into > >> Hadoop already. Apache Falcon generally delegates data management jobs > to > >> tools that already exist in the Hadoop ecosystem (e.g. Distcp, Sqoop, > Hive > >> etc). Apache Sqoop is primarily geared for bulk ingest especially from > >> databases which is one part of Gobblin’s feature set. Apache Flume > focuses > >> primarily on streaming data movement. Finally, general purpose data > >> processing engines like Apache Flink, Apache Samza, and Apache Spark > focus > >> on generic computation. > >> > >> Gobblin design choices intersect with specific features in all of these > >> systems, however in aggregate, it is a different point in the design > space. > >> It is designed to handle both streaming and batch data. It supports > >> execution through a standalone cluster mode as well as through existing > >> frameworks such as MR, Yarn, Hive, Samza etc allowing users to choose > the > >> deployment model that is optimal for the specific data integration > >> challenge. It provides native optimized implementations for critical > >> integrations such as Kafka, Hadoop - Hadoop copies etc. Gobblin also > >> supports both Hadoop and non-Hadoop data, being able to ingest data into > >> Kafka as well as other key-value stores like Couchbase. Gobblin is also > not > >> just a generic computation framework, it has specific constructs for > data > >> integration patterns such as data quality metrics and policies. > Gobblin’s > >> configuration management system allows it to be fully multi-tenant and > take > >> advantage of grouped policies when required. For batch workloads, > Gobblin > >> has a planning phase that provides for better resource utilization. > >> > >> In summary, there is healthy diversity in the number of systems > >> approaching the interesting and pressing problem of big data > integration. > >> We believe that Gobblin will provide another compelling choice in that > >> design space. > >> > >> === An Excessive Fascination with the Apache Brand === > >> > >> Gobblin is already a healthy and well known open source project. This > >> proposal is not for the purpose of generating publicity. Rather, the > >> primary benefits to joining Apache are already outlined in the Rationale > >> section. > >> > >> == Documentation == > >> > >> The reader will find these websites highly relevant: > >> * Website: http://linkedin.github.io/gobblin/ > >> * Documentation: https://gobblin.readthedocs.io/en/latest/ > >> * Codebase: https://github.com/linkedin/gobblin/ > >> * User group: https://groups.google.com/forum/#!forum/gobblin-users > >> > >> == Source and Intellectual Property Submission Plan == > >> > >> The Gobblin codebase is currently hosted on Github. This is the exact > >> codebase that we would migrate to the Apache foundation.The Gobblin > source > >> code is already licensed under Apache License Version 2.0. Going > forward, > >> we will continue to have all the contributions licensed directly to the > >> Apache foundation through our signed Individual Contributor License > >> Agreements for all the committers on the project. > >> > >> == External Dependencies == > >> > >> To the best of our knowledge, all of Gobblin dependencies are > distributed > >> under Apache compatible licenses. Upon acceptance to the incubator, we > >> would begin a thorough analysis of all transitive dependencies to verify > >> this fact and introduce license checking into the build and release > process > >> (for instance integrating Apache Rat). > >> > >> == Cryptography == > >> > >> We do not expect Gobblin to be a controlled export item due to the use > of > >> encryption. > >> > >> == Required Resources == > >> > >> === Mailing lists === > >> > >> * gobblin-user > >> * gobblin-dev > >> * gobblin-commits > >> * gobblin-private for private PMC discussions (with moderated > >> subscriptions) > >> > >> === Subversion Directory === > >> > >> Git is the preferred source control system: git:// > git.apache.org/gobblin > >> > >> === Issue Tracking === > >> > >> JIRA Gobblin (GOBBLIN) > >> > >> === Other Resources === > >> > >> The existing code already has unit and integration tests, so we would > >> like a Jenkins instance to run them whenever a new patch is submitted. > This > >> can be added after project creation. > >> > >> == Initial Committers == > >> > >> * Abhishek Tiwari <abhishektiwari dot btech at gmail dot com> > >> * Shirshanka Das <shirshanka at apache dot org> > >> * Chavdar Botev <cbotev at gmail dot com> > >> * Sahil Takiar <takiar.sahil at gmail dot com> > >> * Yinan Li <liyinan926 at gmail dot com> > >> * Ziyang Liu <> > >> * Lorand Bendig <lbendig at gmail dot com> > >> * Issac Buenrostro <ibuenros at linkedin dot com> > >> * Hung Tran <hutran at linkedin dot com> > >> * Olivier Lamy <olamy at apache dot org> > >> * Jean-Baptiste Onofré <jbono...@apache.org> > >> > >> == Affiliations == > >> > >> * Abhishek Tiwari - LinkedIn > >> * Shirshanka Das - LinkedIn > >> * Chavdar Botev - Stealth Startup > >> * Sahil Takiar - Cloudera > >> * Yinan Li - Google > >> * Ziyang Liu - Facebook > >> * Lorand Bendig - Swisscom > >> * Issac Buenrostro - LinkedIn > >> * Hung Tran - LinkedIn > >> * Olivier Lamy - Webtide > >> * Jean-Baptiste Onofre - Talend > >> > >> == Sponsors == > >> > >> === Champion === > >> > >> Olivier Lamy < olamy at apache dot org> > >> > >> === Nominated Mentors === > >> > >> * Olivier Lamy <olamy at apache dot org> > >> * Jean-Baptiste Onofre <jbonofre at apache dot org> > >> * ? > >> * ? > >> > >> == Sponsoring Entity == > >> The Apache Incubator > >> > > > > > > > > -- > > Olivier Lamy > > http://twitter.com/olamy | http://linkedin.com/in/olamy > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org > For additional commands, e-mail: general-h...@incubator.apache.org > > -- Olivier Lamy http://twitter.com/olamy | http://linkedin.com/in/olamy