Ah shoot ok :-) I'm used to seeing it next to the committer's names. I guess that works just as well.
Are the mentors all eBay as well? John On Wed, Nov 23, 2016 at 8:04 PM Henry Saputra <henry.sapu...@gmail.com> wrote: > Hi John, > > We have added this comment in the proposal: > > " > The initial committers are employees of eBay Inc. > " > > - Henry > > On Wed, Nov 23, 2016 at 4:50 PM, John D. Ament <johndam...@apache.org> > wrote: > > > Henry, > > > > Can you add initial committer affiliations to the proposal? > > > > John > > > > On Wed, Nov 23, 2016 at 6:30 PM Henry Saputra <henry.sapu...@gmail.com> > > wrote: > > > > > Hi All, > > > > > > As the champion for Griffin, I would like to bring up discussion to > > > bring the project as Apache incubator podling. > > > > > > Here is the direct quote from the abstract: > > > > > > " > > > Griffin is a Data Quality Service platform built on Apache Hadoop and > > > Apache Spark. It provides a framework process for defining data > > > quality model, executing data quality measurement, automating data > > > profiling and validation, as well as a unified data quality > > > visualization across multiple data systems. It tries to address the > > > data quality challenges in big data and streaming context. > > > " > > > > > > Here is the link to the proposal: > > > https://wiki.apache.org/incubator/GriffinProposal > > > > > > I have copied the proposal below for easy access > > > > > > > > > Thanks, > > > > > > - Henry > > > > > > > > > Griffin Proposal > > > > > > Abstract > > > > > > Griffin is a Data Quality Service platform built on Apache Hadoop and > > > Apache Spark. It provides a framework process for defining data > > > quality model, executing data quality measurement, automating data > > > profiling and validation, as well as a unified data quality > > > visualization across multiple data systems. It tries to address the > > > data quality challenges in big data and streaming context. > > > > > > Proposal > > > > > > Griffin is a open source Data Quality solution for distributed data > > > systems at any scale in both streaming or batch data context. When > > > people use open source products (e.g. Apache Hadoop, Apache Spark, > > > Apache Kafka, Apache Storm), they always need a data quality service > > > to build his/her confidence on data quality processed by those > > > platforms. Griffin creates a unified process to define and construct > > > data quality measurement pipeline across multiple data systems to > > > provide: > > > > > > Automatic quality validation of the data > > > Data profiling and anomaly detection > > > Data quality lineage from upstream to downstream data systems. > > > Data quality health monitoring visualization > > > Shared infrastructure resource management > > > > > > Overview of Griffin > > > > > > Griffin has been deployed in production at eBay serving major data > > > systems, it takes a platform approach to provide generic features to > > > solve common data quality validation pain points. Firstly, user can > > > register the data asset which user wants to do data quality check. The > > > data asset can be batch data in RDBMS (e.g.Teradata), Apache Hadoop > > > system or near real-time streaming data from Apache Kafka, Apache > > > Storm and other real time data platforms. Secondly, user can create > > > data quality model to define the data quality rule and metadata. > > > Thirdly, the model or rule will be executed automatically (by the > > > model engine) to get the sample data quality validation results in a > > > few seconds for streaming data. Finally, user can analyze the data > > > quality results through built-in visualization tool to take actions. > > > > > > Griffin includes: > > > > > > Data Quality Model Engine > > > > > > Griffin is model driven solution, user can choose various data quality > > > dimension to execute his/her data quality validation based on selected > > > target data-set or source data-set ( as the golden reference data). It > > > has a corresponding library supporting it in back-end for the > > > following measurement: > > > > > > Accuracy - Does data reflect the real-world objects or a verifiable > > source > > > Completeness - Is all necessary data present > > > Validity - Are all data values within the data domains specified by the > > > business > > > Timeliness - Is the data available at the time needed > > > Anomaly detection - Pre-built algorithm functions for the > > > identification of items, events or observations which do not conform > > > to an expected pattern or other items in a dataset > > > Data Profiling - Apply statistical analysis and assessment of data > > > values within a dataset for consistency, uniqueness and logic. > > > > > > Data Collection Layer > > > > > > We support two kinds of data sources, batch data and real time data. > > > > > > For batch mode, we can collect data source from Apache Hadoop based > > > platform by various data connectors. > > > > > > For real time mode, we can connect with messaging system like Kafka to > > > near real time analysis. > > > > > > Data Process and Storage Layer > > > > > > For batch analysis, our data quality model will compute data quality > > > metrics in our spark cluster based on data source in Apache Hadoop. > > > > > > For near real time analysis, we consume data from messaging system, > > > then our data quality model will compute our real time data quality > > > metrics in our spark cluster. for data storage, we use time series > > > database in our back end to fulfill front end request. > > > > > > Griffin Service > > > > > > We have RESTful web services to accomplish all the functionalities of > > > Griffin, such as register data asset, create data quality model, > > > publish metrics, retrieve metrics, add subscription, etc. So, the > > > developers can develop their own user interface based on these web > > > services. > > > > > > Background > > > > > > At eBay, when people play with big data in Apache Hadoop (or other > > > streaming data), data quality often becomes one big challenge. > > > Different teams have built customized data quality tools to detect and > > > analyze data quality issues within their own domain. We are thinking > > > to take a platform approach to provide shared Infrastructure and > > > generic features to solve common data quality pain points. This would > > > enable us to build trusted data assets. > > > > > > Currently it’s very difficult and costly to do data quality validation > > > when we have big data flow across multi-platforms at eBay (e.g. > > > Oracle, Apache Hadoop, Couchbase, Apache Cassandra, Apache Kafka, > > > MongoDB). Take eBay real time personalization platform as an example. > > > Every day we have to validate data quality status for ~600M records ( > > > imagine we have 150M active users for our website). Data quality often > > > becomes one big challenge both in its streaming and batch pipelines. > > > > > > So we conclude 3 data quality problems at eBay: > > > > > > Lack of end2end unified view of data quality measurement from multiple > > > data sources to target applications, it usually takes a long time to > > > identify and fix poor data quality. > > > How to get data quality measured in streaming mode, we need to have a > > > process and tool to visualize data quality insights through > > > registering dataset which you want to check data quality, creating > > > data quality measurement model, executing the data quality validation > > > job and getting metrics insights for action taking. > > > No Shared platform and API Service, have to apply and manage own > > > hardware and software infrastructure. > > > > > > Rationale > > > > > > The challenge we face at eBay is that our data volume is becoming > > > bigger and bigger, system processes become more complex, while we do > > > not have a unified data quality solution to ensure the trusted data > > > sets which provide confidences on data quality to our data consumers. > > > The key challenges on data quality includes: > > > > > > Existing commercial data quality solution cannot address data quality > > > lineage among systems, cannot scale out to support fast growing data > > > at eBay > > > Existing eBay's domain specific tools take a long time to identify and > > > fix poor data quality when data flowed through multiple systems > > > Business logic becomes complex, requires data quality system much > > flexible. > > > > > > Some data quality issues do have business impact on user experiences, > > > revenue, efficiency & compliance. > > > > > > Communication overhead of data quality metrics, typically in a big > > > organization, which involve different teams. > > > > > > The idea of Griffin is to provide Data Quality validation as a > > > Service, to allow data engineers and data consumers to have: > > > > > > Near real-time understanding of the data quality health of your data > > > pipelines with end-to-end monitoring, all in one place. > > > Profiling, detecting and correlating issues and providing > > > recommendations that drive rapid and focused troubleshooting > > > A centralized data quality model management system including rule, > > > metadata, scheduler etc. > > > Native code generation to run everywhere, including Hadoop, Kafka, > Spark, > > > etc. > > > One set of tools to build data quality pipelines across all eBay data > > > platforms. > > > > > > Current Status > > > > > > Meritocracy > > > > > > Griffin has been deployed in production at eBay and provided the > > > centralized data quality service for several eBay systems ( for > > > example, real time personalization platform, eBay real time ID linking > > > platform, Hadoop datasets, Site speed analytics platform). Our aim is > > > to build a diverse developer and user community following the Apache > > > meritocracy model. We will encourage contributions and participation > > > of all types of work, and ensure that contributors are appropriately > > > recognized. > > > > > > Community > > > > > > Currently the project is being developed at eBay. It's only for eBay > > > internal community. Griffin seeks to develop the developer and user > > > communities during incubation. We believe it will grow substantially > > > by becoming an Apache project. > > > > > > Core Developers > > > > > > Griffin is currently being designed and developed by engineers from > > > eBay Inc. – William Guo, Alex Lv, Shawn Sha, Vincent Zhao, John Liu. > > > All of these core developers have deep expertise in Apache Hadoop and > > > the Hadoop Ecosystem in general. > > > > > > Alignment > > > > > > The ASF is a natural host for Griffin given that it is already the > > > home of Hadoop, Beam, HBase, Hive, Storm, Kafka, Spark and other > > > emerging big data products. Those are requiring data quality solution > > > by nature to ensure the data quality which they processed. When people > > > use open source data technology, the big question to them is that how > > > we can ensure the data quality in it. Griffin leverages lot of Apache > > > open-source products. Griffin was designed to enable real time > > > insights into data quality validation by shared Infrastructure and > > > generic features to solve common data quality pain points. > > > > > > Known Risks > > > > > > Orphaned Products > > > > > > The core developers of Griffin team work full time on this project. > > > There is no risk of Griffin getting orphaned since at least one large > > > company (eBay) is extensively using it in their production Hadoop and > > > Spark clusters for multiple data systems. For example, currently there > > > are 4 data systems at eBay (real time personalization platform, eBay > > > real time ID linking platform, Hadoop, Site speed analytics platform) > > > are leveraging Griffin, with more than ~600M records for data quality > > > status validation every day, 35 data sets being monitored, 50+ data > > > quality models have been created. > > > > > > As Griffin is designed to connect many types of data sources, we are > > > very confident that they will use Griffin as a service for ensuring > > > the data quality in open source data ecosystems. We plan to extend and > > > diversify this community further through Apache. > > > > > > Inexperience with Open Source > > > > > > Griffin's core engineers are all active users and followers of open > > > source projects. They are already committers and contributors to the > > > Griffin Github project. All have been involved with the source code > > > that has been released under an open source license, and several of > > > them also have experience developing code in an open source > > > environment. Though the core set of Developers do not have Apache Open > > > Source experience, there are plans to onboard individuals with Apache > > > open source experience on to the project. > > > > > > Homogenous Developers > > > > > > The core developers are from eBay. Apache Incubation process > > > encourages an open and diverse meritocratic community. Griffin intends > > > to make every possible effort to build a diverse, vibrant and involved > > > community. We are committed to recruiting additional committers from > > > other companies based on their contribution to the project. > > > > > > Reliance on Salaried Developers > > > > > > eBay invested in Griffin as a company-wide data quality service > > > platform and some of its key engineers are working full time on the > > > project. they are all paid by eBay. We look forward to other Apache > > > developers and researchers to contribute to the project. > > > > > > Relationships with Other Apache Products > > > > > > Griffin has a strong relationship and dependency with Apache Hadoop, > > > Apache HBase, Apache Spark, Apache Kafka and Apache Storm, Apache > > > Hive. In addition, since there is a growing need for data quality > > > solution for open source platform (e.g. Hadoop, Kafka, Spark etc), > > > being part of Apache’s Incubation community, could help with a closer > > > collaboration among these four projects and as well as others. > > > > > > Documentation > > > > > > Information about Griffin can be found at https://github.com/eBay/ > > griffin > > > > > > Initial Source > > > > > > Griffin has been under development since early 2016 by a team of > > > engineers at eBay Inc. It is currently hosted on Github.com under an > > > Apache license 2.0 at https://github.com/eBay/griffin . Once in > > > incubation we will be moving the code base to apache git library. > > > > > > External Dependencies > > > > > > Griffin has the following external dependencies. > > > > > > Basic > > > > > > JDK 1.7+ > > > Scala > > > Apache Maven > > > JUnit > > > Log4j > > > Slf4j > > > Apache Commons > > > > > > Hadoop > > > > > > Apache Hadoop > > > Apache HBase > > > Apache Hive > > > > > > DB > > > > > > InfluxData > > > > > > Apache Spark > > > > > > Spark Core Library > > > > > > REST Service > > > > > > Jersey > > > Spring MVC > > > > > > Web frontend > > > > > > AngularJS > > > jQuery > > > Bootstrap > > > RequireJS > > > eCharts > > > Font Awesome > > > > > > Cryptography > > > > > > Currently there's no cryptography in Griffin. > > > > > > Required Resources > > > > > > Mailing List > > > > > > We currently use eBay mail box to communicate, but we'd like to move > > > that to ASF maintained mailing lists. > > > > > > Current mailing list: ebay-griffin-d...@googlegroups.com > > > > > > Proposed ASF maintained lists: > > > > > > priv...@griffin.incubator.apache.org > > > > > > d...@griffin.incubator.apache.org > > > > > > comm...@griffin.incubator.apache.org > > > > > > Subversion Directory > > > > > > Git is the preferred source control system. > > > > > > Issue Tracking > > > > > > JIRA > > > > > > Other Resources > > > > > > The existing code already has unit tests so we will make use of > > > existing Apache continuous testing infrastructure. The resulting load > > > should not be very large. > > > > > > Initial Committers > > > > > > William Go > > > Alex Lv > > > Vincent Zhao > > > Shawn Sha > > > John Liu > > > Liang Shao > > > > > > Affiliations > > > > > > The initial committers are employees of eBay Inc. > > > > > > Sponsors > > > > > > Champion > > > > > > Henry Saputra (hsapu...@apache.org) > > > > > > Nominated Mentors > > > > > > Kasper Sørensen (kasper...@apache.org) > > > > > > Uma Maheswara Rao Gangumalla (umamah...@apache.org) > > > > > > Luciano Resende (luckbr1...@gmail.com) > > > > > > Sponsoring Entity > > > > > > We are requesting the Incubator to sponsor this project. > > > > > > --------------------------------------------------------------------- > > > To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org > > > For additional commands, e-mail: general-h...@incubator.apache.org > > > > > > > > >