Hi All,

As the champion for Griffin, I would like to open a discussion on
bringing the project into the Apache Incubator as a podling.

Here is the direct quote from the abstract:

"
Griffin is a Data Quality Service platform built on Apache Hadoop and
Apache Spark. It provides a framework process for defining data
quality model, executing data quality measurement, automating data
profiling and validation, as well as a unified data quality
visualization across multiple data systems. It tries to address the
data quality challenges in big data and streaming context.
"

Here is the link to the proposal:
https://wiki.apache.org/incubator/GriffinProposal

I have copied the proposal below for easy access.


Thanks,

- Henry


Griffin Proposal

Abstract

Griffin is a Data Quality Service platform built on Apache Hadoop and
Apache Spark. It provides a framework for defining data quality
models, executing data quality measurements, and automating data
profiling and validation, as well as unified data quality
visualization across multiple data systems. It aims to address the
data quality challenges of big data in both batch and streaming
contexts.

Proposal

Griffin is an open source Data Quality solution for distributed data
systems at any scale, in both streaming and batch contexts. When
people use open source products (e.g. Apache Hadoop, Apache Spark,
Apache Kafka, Apache Storm), they need a data quality service to
build confidence in the data processed by those platforms. Griffin
creates a unified process for defining and constructing data quality
measurement pipelines across multiple data systems to provide:

Automatic quality validation of the data
Data profiling and anomaly detection
Data quality lineage from upstream to downstream data systems
Data quality health monitoring visualization
Shared infrastructure resource management

Overview of Griffin

Griffin has been deployed in production at eBay, serving major data
systems. It takes a platform approach, providing generic features to
solve common data quality validation pain points. First, users
register the data assets they want to check. A data asset can be
batch data in an RDBMS (e.g. Teradata) or an Apache Hadoop system, or
near real-time streaming data from Apache Kafka, Apache Storm and
other real-time data platforms. Second, users create a data quality
model that defines the data quality rules and metadata. Third, the
model engine executes the model automatically; for streaming data,
sample data quality validation results are available within seconds.
Finally, users analyze the data quality results through the built-in
visualization tool and take action.

Griffin includes:

Data Quality Model Engine

Griffin is a model-driven solution: users can choose various data
quality dimensions to execute data quality validation against a
selected target data set, with a source data set serving as the
golden reference. A back-end library supports each of the following
measurements (a minimal sketch of one measurement follows the list):

Accuracy - Does the data reflect real-world objects or a verifiable source?
Completeness - Is all necessary data present?
Validity - Are all data values within the data domains specified by the business?
Timeliness - Is the data available at the time needed?
Anomaly detection - Pre-built algorithm functions for identifying
items, events or observations which do not conform to an expected
pattern or to other items in a dataset
Data Profiling - Statistical analysis and assessment of data values
within a dataset for consistency, uniqueness and logic
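
For illustration, here is a minimal sketch of how an accuracy
measurement could be expressed on Apache Spark. This is not Griffin's
actual engine code; the input paths, record layout and join key are
hypothetical, and a target record simply counts as accurate when the
golden source holds the same value for its key:

    // Accuracy sketch (illustration only, not Griffin's engine code).
    // "source" is the golden reference; "target" is under validation.
    import org.apache.spark.{SparkConf, SparkContext}

    object AccuracySketch {
      // Parse a hypothetical "id,value" CSV line into a key/value pair.
      private def parse(line: String): (String, String) = {
        val Array(id, value) = line.split(",", 2)
        (id, value)
      }

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("dq-accuracy"))

        val source = sc.textFile("hdfs:///dq/source").map(parse)
        val target = sc.textFile("hdfs:///dq/target").map(parse)

        // A target record is accurate when the source has the same value.
        val matched = target.join(source)
          .filter { case (_, (t, s)) => t == s }
          .count()
        val total = target.count()
        val accuracy = if (total == 0) 1.0 else matched.toDouble / total

        println(s"accuracy = $accuracy ($matched of $total records match)")
        sc.stop()
      }
    }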

Data Collection Layer

We support two kinds of data sources: batch data and real-time data.

In batch mode, we collect data from Apache Hadoop based platforms
through various data connectors.

In real-time mode, we connect to messaging systems such as Kafka for
near real-time analysis.
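
As a minimal sketch of that real-time path (illustration only,
assuming Spark Streaming's direct Kafka connector; the broker address
and topic name are hypothetical):

    // Consume a Kafka topic in 30-second micro-batches (illustration only).
    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object KafkaCollectSketch {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(
          new SparkConf().setAppName("dq-collect"), Seconds(30))

        val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
        val stream = KafkaUtils.createDirectStream[
          String, String, StringDecoder, StringDecoder](
          ssc, kafkaParams, Set("dq-input-topic"))

        // Each micro-batch would feed the data quality model;
        // here we only count the records that arrived.
        stream.foreachRDD(rdd => println(s"collected ${rdd.count()} records"))

        ssc.start()
        ssc.awaitTermination()
      }
    }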

Data Process and Storage Layer

For batch analysis, our data quality model computes data quality
metrics in our Spark cluster from the data sources in Apache Hadoop.

For near real-time analysis, we consume data from the messaging
system and compute real-time data quality metrics in our Spark
cluster. For data storage, we use a time series database in the back
end to serve front-end requests.
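
For illustration, here is a minimal sketch of persisting one computed
metric to a time series database. It assumes the influxdb-java
client; the connection details, database, measurement and tag names
are all hypothetical:

    // Write one accuracy data point to InfluxDB (illustration only).
    import java.util.concurrent.TimeUnit
    import org.influxdb.InfluxDBFactory
    import org.influxdb.dto.Point

    object MetricStoreSketch {
      def main(args: Array[String]): Unit = {
        val influx =
          InfluxDBFactory.connect("http://influx-host:8086", "user", "pass")

        // One point: the accuracy value of one data asset at "now".
        val point = Point.measurement("dq_metrics")
          .time(System.currentTimeMillis(), TimeUnit.MILLISECONDS)
          .tag("asset", "user_events")
          .addField("accuracy", 0.993)
          .build()

        influx.write("griffin", "default", point)
      }
    }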

Griffin Service

RESTful web services expose all of Griffin's functionality, such as
registering data assets, creating data quality models, publishing
metrics, retrieving metrics and adding subscriptions. Developers can
therefore build their own user interfaces on top of these web
services.
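
As a sketch of what a client of such a service could look like
(illustration only; the endpoint path and JSON payload are
hypothetical, not Griffin's actual API):

    // POST a hypothetical accuracy model to the Griffin service.
    import java.io.OutputStreamWriter
    import java.net.{HttpURLConnection, URL}
    import scala.io.Source

    object RegisterModelSketch {
      def main(args: Array[String]): Unit = {
        val conn = new URL("http://griffin-host:8080/api/models")
          .openConnection().asInstanceOf[HttpURLConnection]
        conn.setRequestMethod("POST")
        conn.setRequestProperty("Content-Type", "application/json")
        conn.setDoOutput(true)

        // Hypothetical payload: an accuracy model for one data asset.
        val body =
          """{"name":"user_events_accuracy","type":"accuracy","asset":"user_events"}"""
        val out = new OutputStreamWriter(conn.getOutputStream)
        out.write(body)
        out.close()

        println(s"HTTP ${conn.getResponseCode}")
        println(Source.fromInputStream(conn.getInputStream).mkString)
        conn.disconnect()
      }
    }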

Background

At eBay, when people work with big data in Apache Hadoop (or other
streaming platforms), data quality often becomes a major challenge.
Different teams have built customized data quality tools to detect
and analyze data quality issues within their own domains. We want to
take a platform approach that provides shared infrastructure and
generic features to solve common data quality pain points. This would
enable us to build trusted data assets.

Currently it is very difficult and costly to validate data quality
when big data flows across multiple platforms at eBay (e.g. Oracle,
Apache Hadoop, Couchbase, Apache Cassandra, Apache Kafka, MongoDB).
Take eBay's real-time personalization platform as an example: every
day we have to validate the data quality status of ~600M records (we
have roughly 150M active users on our website). Data quality is a
major challenge in both its streaming and batch pipelines.

We have identified three data quality problems at eBay:

Lack of an end-to-end, unified view of data quality measurement from
multiple data sources to target applications; it usually takes a long
time to identify and fix poor data quality.
No way to measure data quality in streaming mode; we need a process
and tooling to surface data quality insights by registering the
dataset to check, creating a data quality measurement model,
executing the data quality validation job and acting on the resulting
metrics.
No shared platform and API service; teams have to apply for and
manage their own hardware and software infrastructure.

Rationale

The challenge we face at eBay is that our data volume keeps growing
and our processing systems keep getting more complex, while we lack a
unified data quality solution to ensure trusted data sets and give
our data consumers confidence in data quality. The key data quality
challenges include:

Existing commercial data quality solutions cannot address data
quality lineage among systems and cannot scale out to support eBay's
fast-growing data.
eBay's existing domain-specific tools take a long time to identify
and fix poor data quality when data flows through multiple systems.
Business logic is becoming more complex, requiring a much more
flexible data quality system.

Some data quality issues have direct business impact on user
experience, revenue, efficiency and compliance.

Communicating data quality metrics carries significant overhead in a
big organization, where many different teams are involved.

The idea of Griffin is to provide Data Quality validation as a
Service, allowing data engineers and data consumers to have:

Near real-time understanding of the data quality health of their data
pipelines, with end-to-end monitoring, all in one place
Profiling, detection and correlation of issues, with recommendations
that drive rapid and focused troubleshooting
A centralized data quality model management system, including rules,
metadata, a scheduler, etc.
Native code generation to run everywhere, including Hadoop, Kafka, Spark, etc.
One set of tools to build data quality pipelines across all eBay data platforms

Current Status

Meritocracy

Griffin has been deployed in production at eBay, providing a
centralized data quality service for several eBay systems (for
example, the real-time personalization platform, the eBay real-time
ID linking platform, Hadoop datasets and the site speed analytics
platform). Our aim is to build a diverse developer and user community
following the Apache meritocracy model. We will encourage
contributions and participation of all types, and ensure that
contributors are appropriately recognized.

Community

The project is currently developed at eBay and serves only eBay's
internal community. Griffin seeks to grow its developer and user
communities during incubation. We believe it will grow substantially
by becoming an Apache project.

Core Developers

Griffin is currently being designed and developed by engineers from
eBay Inc. – William Guo, Alex Lv, Shawn Sha, Vincent Zhao, John Liu.
All of these core developers have deep expertise in Apache Hadoop and
the Hadoop Ecosystem in general.

Alignment

The ASF is a natural host for Griffin given that it is already the
home of Hadoop, Beam, HBase, Hive, Storm, Kafka, Spark and other
emerging big data products, all of which by nature require a data
quality solution to ensure the quality of the data they process. When
people adopt open source data technology, a big question for them is
how to ensure data quality on it. Griffin leverages many Apache
open-source products, and was designed to enable real-time insight
into data quality validation through shared infrastructure and
generic features that solve common data quality pain points.

Known Risks

Orphaned Products

The core developers of the Griffin team work full time on this
project. There is little risk of Griffin becoming orphaned, since at
least one large company (eBay) uses it extensively in its production
Hadoop and Spark clusters across multiple data systems. Currently
four data systems at eBay (the real-time personalization platform,
the eBay real-time ID linking platform, Hadoop and the site speed
analytics platform) leverage Griffin: ~600M records are validated for
data quality status every day, 35 data sets are monitored, and 50+
data quality models have been created.

As Griffin is designed to connect many types of data sources, we are
confident that users will adopt it as a service for ensuring data
quality across open source data ecosystems. We plan to extend and
diversify this community further through Apache.

Inexperience with Open Source

Griffin's core engineers are all active users and followers of open
source projects. They are already committers and contributors to the
Griffin GitHub project. All have been involved with source code
released under an open source license, and several of them also have
experience developing code in an open source environment. Although
the core developers do not have Apache open source experience, we
plan to onboard individuals with Apache open source experience onto
the project.

Homogenous Developers

The core developers are all from eBay. The Apache incubation process
encourages an open and diverse meritocratic community. Griffin intends
to make every possible effort to build a diverse, vibrant and involved
community. We are committed to recruiting additional committers from
other companies based on their contribution to the project.

Reliance on Salaried Developers

eBay has invested in Griffin as a company-wide data quality service
platform, and some of its key engineers work full time on the
project; they are all paid by eBay. We look forward to other Apache
developers and researchers contributing to the project.

Relationships with Other Apache Products

Griffin has strong relationships with, and dependencies on, Apache
Hadoop, Apache HBase, Apache Spark, Apache Kafka, Apache Storm and
Apache Hive. In addition, since there is a growing need for data
quality solutions on open source platforms (e.g. Hadoop, Kafka,
Spark), being part of the Apache Incubator community could foster
closer collaboration among these projects as well as others.

Documentation

Information about Griffin can be found at https://github.com/eBay/griffin

Initial Source

Griffin has been under development since early 2016 by a team of
engineers at eBay Inc. It is currently hosted on GitHub under the
Apache License 2.0 at https://github.com/eBay/griffin . Once in
incubation, we will move the code base to the Apache git
repositories.

External Dependencies

Griffin has the following external dependencies.

Basic

JDK 1.7+
Scala
Apache Maven
JUnit
Log4j
Slf4j
Apache Commons

Hadoop

Apache Hadoop
Apache HBase
Apache Hive

DB

InfluxData

Apache Spark

Spark Core Library

REST Service

Jersey
Spring MVC

Web frontend

AngularJS
jQuery
Bootstrap
RequireJS
eCharts
Font Awesome

Cryptography

Currently there's no cryptography in Griffin.

Required Resources

Mailing List

We currently communicate through an eBay-managed mailing list, but
we'd like to move to ASF maintained mailing lists.

Current mailing list: ebay-griffin-d...@googlegroups.com

Proposed ASF maintained lists:

priv...@griffin.incubator.apache.org

d...@griffin.incubator.apache.org

comm...@griffin.incubator.apache.org

Subversion Directory

Git is the preferred source control system.

Issue Tracking

JIRA

Other Resources

The existing code already has unit tests, so we will make use of the
existing Apache continuous testing infrastructure. The resulting load
should not be very large.

Initial Committers

William Guo
Alex Lv
Vincent Zhao
Shawn Sha
John Liu
Liang Shao

Affiliations

The initial committers are employees of eBay Inc.

Sponsors

Champion

Henry Saputra (hsapu...@apache.org)

Nominated Mentors

Kasper Sørensen (kasper...@apache.org)

Uma Maheswara Rao Gangumalla (umamah...@apache.org)

Luciano Resende (luckbr1...@gmail.com)

Sponsoring Entity

We are requesting the Incubator to sponsor this project.
