Re: [DISCUSS] Proposing Griffin for Apache incubator

John D. Ament Wed, 23 Nov 2016 17:07:14 -0800

Ah shoot ok :-)
I'm used to seeing it next to the committer's names.  I guess that works
just as well.


Are the mentors all eBay as well?

John

On Wed, Nov 23, 2016 at 8:04 PM Henry Saputra <henry.sapu...@gmail.com>
wrote:

> Hi John,
>
> We have added this comment in the proposal:
>
> "
> The initial committers are employees of eBay Inc.
> "
>
> - Henry
>
> On Wed, Nov 23, 2016 at 4:50 PM, John D. Ament <johndam...@apache.org>
> wrote:
>
> > Henry,
> >
> > Can you add initial committer affiliations to the proposal?
> >
> > John
> >
> > On Wed, Nov 23, 2016 at 6:30 PM Henry Saputra <henry.sapu...@gmail.com>
> > wrote:
> >
> > > Hi All,
> > >
> > > > As the champion for Griffin, I would like to open a discussion on
> > > > bringing the project into the Apache Incubator as a podling.
> > >
> > > Here is the direct quote from the abstract:
> > >
> > > "
> > > > Griffin is a Data Quality Service platform built on Apache Hadoop and
> > > > Apache Spark. It provides a framework for defining data quality
> > > > models, executing data quality measurements, automating data
> > > > profiling and validation, and unifying data quality visualization
> > > > across multiple data systems. It aims to address data quality
> > > > challenges in big data and streaming contexts.
> > > "
> > >
> > > Here is the link to the proposal:
> > > https://wiki.apache.org/incubator/GriffinProposal
> > >
> > > I have copied the proposal below for easy access
> > >
> > >
> > > Thanks,
> > >
> > > - Henry
> > >
> > >
> > > Griffin Proposal
> > >
> > > Abstract
> > >
> > > > Griffin is a Data Quality Service platform built on Apache Hadoop and
> > > > Apache Spark. It provides a framework for defining data quality
> > > > models, executing data quality measurements, automating data
> > > > profiling and validation, and unifying data quality visualization
> > > > across multiple data systems. It aims to address data quality
> > > > challenges in big data and streaming contexts.
> > >
> > > Proposal
> > >
> > > > Griffin is an open source Data Quality solution for distributed data
> > > > systems at any scale, in both streaming and batch data contexts. When
> > > > people use open source products (e.g. Apache Hadoop, Apache Spark,
> > > > Apache Kafka, Apache Storm), they need a data quality service to
> > > > build confidence in the quality of the data processed by those
> > > > platforms. Griffin creates a unified process to define and construct
> > > > data quality measurement pipelines across multiple data systems to
> > > > provide:
> > >
> > > Automatic quality validation of the data
> > > Data profiling and anomaly detection
> > > Data quality lineage from upstream to downstream data systems.
> > > Data quality health monitoring visualization
> > > Shared infrastructure resource management
> > >
> > > Overview of Griffin
> > >
> > > > Griffin has been deployed in production at eBay, serving major data
> > > > systems. It takes a platform approach, providing generic features to
> > > > solve common data quality validation pain points. First, users
> > > > register the data assets they want to check for data quality. A data
> > > > asset can be batch data in an RDBMS (e.g. Teradata) or an Apache
> > > > Hadoop system, or near real-time streaming data from Apache Kafka,
> > > > Apache Storm and other real-time data platforms. Second, users create
> > > > a data quality model to define the data quality rules and metadata.
> > > > Third, the model or rule is executed automatically (by the model
> > > > engine), producing sample data quality validation results within a
> > > > few seconds for streaming data. Finally, users analyze the data
> > > > quality results through the built-in visualization tool and take
> > > > action.
> > >
> > > Griffin includes:
> > >
> > > Data Quality Model Engine
> > >
> > > > Griffin is a model-driven solution: users can choose among various
> > > > data quality dimensions to run data quality validation based on a
> > > > selected target data set or source data set (as the golden reference
> > > > data). A corresponding back-end library supports the following
> > > > measurements:
> > >
> > > > Accuracy - Does the data reflect real-world objects or a verifiable
> > > > source?
> > > > Completeness - Is all necessary data present?
> > > > Validity - Are all data values within the data domains specified by
> > > > the business?
> > > > Timeliness - Is the data available at the time needed?
> > > > Anomaly detection - Pre-built algorithm functions for the
> > > > identification of items, events or observations which do not conform
> > > > to an expected pattern or to other items in a dataset.
> > > > Data Profiling - Statistical analysis and assessment of data values
> > > > within a dataset for consistency, uniqueness and logic.
> > >
> > > Data Collection Layer
> > >
> > > > We support two kinds of data sources: batch data and real-time data.
> > > >
> > > > For batch mode, we collect data from Apache Hadoop based platforms
> > > > through various data connectors.
> > > >
> > > > For real-time mode, we connect to messaging systems such as Kafka for
> > > > near real-time analysis.
> > >
> > > Data Process and Storage Layer
> > >
> > > > For batch analysis, our data quality model computes data quality
> > > > metrics in our Spark cluster based on data sources in Apache Hadoop.
> > > >
> > > > For near real-time analysis, we consume data from the messaging
> > > > system, then our data quality model computes real-time data quality
> > > > metrics in our Spark cluster. For data storage, we use a time series
> > > > database in our back end to serve front-end requests.
> > >
> > > Griffin Service
> > >
> > > > RESTful web services expose all of Griffin's functionality, such as
> > > > registering data assets, creating data quality models, publishing
> > > > metrics, retrieving metrics and adding subscriptions. Developers can
> > > > therefore build their own user interfaces on top of these web
> > > > services.
> > >
> > > Background
> > >
> > > > At eBay, when people work with big data in Apache Hadoop (or with
> > > > streaming data), data quality often becomes a major challenge.
> > > > Different teams have built customized data quality tools to detect and
> > > > analyze data quality issues within their own domains. We want to take
> > > > a platform approach, providing shared infrastructure and generic
> > > > features to solve common data quality pain points. This would enable
> > > > us to build trusted data assets.
> > >
> > > > Currently it is very difficult and costly to validate data quality
> > > > when big data flows across multiple platforms at eBay (e.g. Oracle,
> > > > Apache Hadoop, Couchbase, Apache Cassandra, Apache Kafka, MongoDB).
> > > > Take eBay's real-time personalization platform as an example: every
> > > > day we have to validate the data quality status of ~600M records
> > > > (imagine we have 150M active users for our website). Data quality
> > > > often becomes a major challenge in both its streaming and batch
> > > > pipelines.
> > >
> > > > We have identified three data quality problems at eBay:
> > > >
> > > > Lack of an end-to-end unified view of data quality measurement from
> > > > multiple data sources to target applications; it usually takes a long
> > > > time to identify and fix poor data quality.
> > > > How to measure data quality in streaming mode: we need a process and
> > > > tool to visualize data quality insights by registering the dataset to
> > > > be checked, creating a data quality measurement model, executing the
> > > > data quality validation job and getting metric insights for taking
> > > > action.
> > > > No shared platform and API service; each team has to apply for and
> > > > manage its own hardware and software infrastructure.
> > >
> > > Rationale
> > >
> > > > The challenge we face at eBay is that our data volume keeps growing
> > > > and our system processes are becoming more complex, while we lack a
> > > > unified data quality solution that ensures trusted data sets and gives
> > > > our data consumers confidence in data quality. The key data quality
> > > > challenges include:
> > >
> > > > Existing commercial data quality solutions cannot address data quality
> > > > lineage among systems and cannot scale out to support fast-growing
> > > > data at eBay.
> > > > Existing eBay domain-specific tools take a long time to identify and
> > > > fix poor data quality when data flows through multiple systems.
> > > > Business logic is becoming complex, which requires a much more
> > > > flexible data quality system.
> > > >
> > > > Some data quality issues have business impact on user experience,
> > > > revenue, efficiency and compliance.
> > > >
> > > > Communication overhead of data quality metrics, typically in a big
> > > > organization that involves different teams.
> > >
> > > The idea of Griffin is to provide Data Quality validation as a
> > > Service, to allow data engineers and data consumers to have:
> > >
> > > > Near real-time understanding of the data quality health of their data
> > > > pipelines with end-to-end monitoring, all in one place.
> > > > Profiling, detection and correlation of issues, with recommendations
> > > > that drive rapid and focused troubleshooting.
> > > > A centralized data quality model management system including rules,
> > > > metadata, scheduler, etc.
> > > > Native code generation to run everywhere, including Hadoop, Kafka,
> > > > Spark, etc.
> > > > One set of tools to build data quality pipelines across all eBay data
> > > > platforms.
> > >
> > > Current Status
> > >
> > > Meritocracy
> > >
> > > > Griffin has been deployed in production at eBay and provides a
> > > > centralized data quality service for several eBay systems (for
> > > > example, real-time personalization platform, eBay real-time ID linking
> > > > platform, Hadoop datasets, Site speed analytics platform). Our aim is
> > > > to build a diverse developer and user community following the Apache
> > > > meritocracy model. We will encourage contributions and participation
> > > > of all types, and ensure that contributors are appropriately
> > > > recognized.
> > >
> > > Community
> > >
> > > > Currently the project is developed at eBay and used only by the eBay
> > > > internal community. Griffin seeks to grow its developer and user
> > > > communities during incubation. We believe they will grow
> > > > substantially once Griffin becomes an Apache project.
> > >
> > > Core Developers
> > >
> > > Griffin is currently being designed and developed by engineers from
> > > eBay Inc. – William Guo, Alex Lv, Shawn Sha, Vincent Zhao, John Liu.
> > > All of these core developers have deep expertise in Apache Hadoop and
> > > the Hadoop Ecosystem in general.
> > >
> > > Alignment
> > >
> > > > The ASF is a natural host for Griffin given that it is already the
> > > > home of Hadoop, Beam, HBase, Hive, Storm, Kafka, Spark and other
> > > > emerging big data products. By their nature, these products need a
> > > > data quality solution to ensure the quality of the data they process.
> > > > When people use open source data technology, a big question for them
> > > > is how to ensure the quality of the data in it. Griffin leverages
> > > > many Apache open source products. It was designed to enable real-time
> > > > insights into data quality validation through shared infrastructure
> > > > and generic features that solve common data quality pain points.
> > >
> > > Known Risks
> > >
> > > Orphaned Products
> > >
> > > > The core developers of the Griffin team work full time on this
> > > > project. There is no risk of Griffin getting orphaned since at least
> > > > one large company (eBay) is extensively using it in its production
> > > > Hadoop and Spark clusters for multiple data systems. For example, 4
> > > > data systems at eBay (real-time personalization platform, eBay
> > > > real-time ID linking platform, Hadoop, Site speed analytics platform)
> > > > currently leverage Griffin, with ~600M records validated for data
> > > > quality status every day, 35 data sets monitored and 50+ data quality
> > > > models created.
> > >
> > > > As Griffin is designed to connect to many types of data sources, we
> > > > are confident that others will use Griffin as a service for ensuring
> > > > data quality in open source data ecosystems. We plan to extend and
> > > > diversify this community further through Apache.
> > >
> > > Inexperience with Open Source
> > >
> > > > Griffin's core engineers are all active users and followers of open
> > > > source projects. They are already committers and contributors to the
> > > > Griffin GitHub project. All have worked on source code that has been
> > > > released under an open source license, and several of them also have
> > > > experience developing code in an open source environment. Although
> > > > the core developers do not have Apache open source experience, there
> > > > are plans to onboard individuals with Apache open source experience
> > > > onto the project.
> > >
> > > Homogenous Developers
> > >
> > > > The core developers are from eBay. The Apache Incubation process
> > > > encourages an open and diverse meritocratic community. Griffin intends
> > > > to make every possible effort to build a diverse, vibrant and involved
> > > > community. We are committed to recruiting additional committers from
> > > > other companies based on their contributions to the project.
> > >
> > > Reliance on Salaried Developers
> > >
> > > > eBay has invested in Griffin as a company-wide data quality service
> > > > platform, and some of its key engineers work full time on the
> > > > project. They are all paid by eBay. We look forward to contributions
> > > > to the project from other Apache developers and researchers.
> > >
> > > Relationships with Other Apache Products
> > >
> > > > Griffin has a strong relationship with, and dependency on, Apache
> > > > Hadoop, Apache HBase, Apache Spark, Apache Kafka, Apache Storm and
> > > > Apache Hive. In addition, since there is a growing need for data
> > > > quality solutions for open source platforms (e.g. Hadoop, Kafka,
> > > > Spark), being part of Apache's Incubation community could help foster
> > > > closer collaboration among these projects as well as others.
> > >
> > > Documentation
> > >
> > > > Information about Griffin can be found at
> > > > https://github.com/eBay/griffin
> > >
> > > Initial Source
> > >
> > > > Griffin has been under development since early 2016 by a team of
> > > > engineers at eBay Inc. It is currently hosted on GitHub under the
> > > > Apache License 2.0 at https://github.com/eBay/griffin . Once in
> > > > incubation, we will move the code base to the Apache Git repository.
> > >
> > > External Dependencies
> > >
> > > Griffin has the following external dependencies.
> > >
> > > Basic
> > >
> > > JDK 1.7+
> > > Scala
> > > Apache Maven
> > > JUnit
> > > Log4j
> > > Slf4j
> > > Apache Commons
> > >
> > > Hadoop
> > >
> > > Apache Hadoop
> > > Apache HBase
> > > Apache Hive
> > >
> > > DB
> > >
> > > InfluxData
> > >
> > > Apache Spark
> > >
> > > Spark Core Library
> > >
> > > REST Service
> > >
> > > Jersey
> > > Spring MVC
> > >
> > > Web frontend
> > >
> > > AngularJS
> > > jQuery
> > > Bootstrap
> > > RequireJS
> > > eCharts
> > > Font Awesome
> > >
> > > Cryptography
> > >
> > > Currently there's no cryptography in Griffin.
> > >
> > > Required Resources
> > >
> > > Mailing List
> > >
> > > > We currently use eBay email to communicate, but we'd like to move
> > > > to ASF-maintained mailing lists.
> > >
> > > Current mailing list: ebay-griffin-d...@googlegroups.com
> > >
> > > Proposed ASF maintained lists:
> > >
> > > priv...@griffin.incubator.apache.org
> > >
> > > d...@griffin.incubator.apache.org
> > >
> > > comm...@griffin.incubator.apache.org
> > >
> > > Subversion Directory
> > >
> > > Git is the preferred source control system.
> > >
> > > Issue Tracking
> > >
> > > JIRA
> > >
> > > Other Resources
> > >
> > > The existing code already has unit tests so we will make use of
> > > existing Apache continuous testing infrastructure. The resulting load
> > > should not be very large.
> > >
> > > Initial Committers
> > >
> > > > William Guo
> > > Alex Lv
> > > Vincent Zhao
> > > Shawn Sha
> > > John Liu
> > > Liang Shao
> > >
> > > Affiliations
> > >
> > > The initial committers are employees of eBay Inc.
> > >
> > > Sponsors
> > >
> > > Champion
> > >
> > > Henry Saputra (hsapu...@apache.org)
> > >
> > > Nominated Mentors
> > >
> > > Kasper Sørensen (kasper...@apache.org)
> > >
> > > Uma Maheswara Rao Gangumalla (umamah...@apache.org)
> > >
> > > Luciano Resende (luckbr1...@gmail.com)
> > >
> > > Sponsoring Entity
> > >
> > > We are requesting the Incubator to sponsor this project.
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> > > For additional commands, e-mail: general-h...@incubator.apache.org
> > >
> > >
> >
>
