Re: [VOTE] Accept Crail into the Apache Incubator

Raphael Bircher Fri, 27 Oct 2017 04:58:34 -0700

+1 (binding)

On .10.2017, at 18:01, Luciano Resende <luckbr1...@gmail.com> wrote:

Of course, my +1

On Thu, Oct 26, 2017 at 12:31 PM, Luciano Resende <luckbr1...@gmail.com>
wrote:
Now that the discussion thread on the Crail proposal has ended, please
vote on accepting Crail into the Apache Incubator.

The ASF voting rules are described at:
   http://www.apache.org/foundation/voting.html

A vote for accepting a new Apache Incubator podling is a majority vote
for which only Incubator PMC member votes are binding.

Votes from other people are also welcome as an indication of people's
enthusiasm (or lack thereof).

Please do not use this VOTE thread for discussions.
If needed, start a new thread instead.

This vote will run for at least 72 hours. Please VOTE as follows
[] +1 Accept Crail into the Apache Incubator
[] +0 Abstain.
[] -1 Do not accept Crail into the Apache Incubator because ...

The proposal below is also on the wiki:
https://wiki.apache.org/incubator/CrailProposal

===

Abstract

Crail is a storage platform for sharing performance critical data in
distributed data processing jobs at very high speed. Crail is built
entirely upon principles of user-level I/O and specifically targets data
center deployments with fast network and storage hardware (e.g., 100Gbps
RDMA, plenty of DRAM, NVMe flash, etc.) as well as new modes of operation
such as resource disaggregation or serverless computing. Crail is written
in Java and integrates seamlessly with the Apache data processing
ecosystem. It can be used as a backbone to accelerate high-level data
operations such as shuffle or broadcast, or as a cache to store hot data
that is queried repeatedly, or as a storage platform for sharing
inter-job data in complex multi-job pipelines, etc.

Proposal
Crail enables Apache data processing frameworks to run efficiently in
next-generation data centers using fast storage and network hardware in
combination with resource (e.g., DRAM, Flash) disaggregation.

Background
Crail started as a research project at the IBM Zurich Research
Laboratory around 2014, aiming to integrate high-speed I/O hardware
effectively into large-scale data processing systems.

Rationale

During the last decade, I/O hardware has undergone rapid performance
improvements, often by orders of magnitude. Modern networking and
storage hardware can deliver 100+ Gbps (10+ GBps) bandwidth with a few
microseconds of access latency. However, despite such progress in raw
I/O performance, effectively leveraging modern hardware in data
processing frameworks remains challenging. In most cases, upgrading to
high-end networking or storage hardware has very little effect on the
performance of analytics workloads. The problem comes from heavily
layered software imposing overheads such as deep call stacks,
unnecessary data copies, thread contention, etc. These problems have
already been addressed at the operating system level with new I/O APIs
such as RDMA verbs, NVMe, etc., allowing applications to bypass software
layers during I/O operations. Distributed data processing frameworks, on
the other hand, are typically implemented on legacy I/O interfaces such
as sockets or block storage. These interfaces have been shown to be
insufficient to deliver the full hardware performance. Yet, to the best
of our knowledge, there are no active and systematic efforts to
integrate these new user-level I/O APIs into Apache software frameworks.
This problem affects all end-users and organizations that use Apache
software. We expect them to see unsatisfactorily small performance gains
when upgrading their networking and storage hardware.
Crail solves this problem by providing an efficient storage platform
built upon user-level I/O, thus bypassing layers such as the JVM and OS
during I/O operations. Moreover, Crail directly leverages the specific
hardware features of RDMA and NVMe to provide a better integration with
high-level data operations in Apache compute frameworks. As a
consequence, Crail enables users to run larger, more complex queries
against ever-increasing amounts of data at a speed largely determined by
the deployed hardware. Crail is a generic solution that integrates well
with the Apache ecosystem, including frameworks like Spark, Hadoop,
Hive, etc.

Initial Goals
The initial goals of moving Crail to the Apache Incubator are to broaden
the community and to foster contributions from developers to leverage
Crail in various data processing frameworks and workloads. Ultimately,
the goal for Crail is to become the de facto standard platform for
storing temporary performance-critical data in distributed data
processing systems.

Current Status
The initial code has been developed at the IBM Zurich Research Center
and has recently been made available on GitHub under the Apache Software
License 2.0. The project currently has explicit support for Spark and
Hadoop. Project documentation is available on the website www.crail.io.
There is also a public forum for discussions related to Crail, available
at https://groups.google.com/forum/#!forum/zrlio-users.

Meritocracy

The current developers are familiar with the meritocratic open source
development process at Apache. Over the last year, the project has
gathered interest on GitHub, and several companies have already
expressed interest in the project. We plan to invest in supporting a
meritocracy by inviting additional developers to participate.

Community
The need for a generic solution to integrate high-performance I/O
hardware in open source is tremendous, so there is potential for a very
large community. We believe that Crail’s extensible architecture and its
alignment with the Apache ecosystem will further encourage community
participation. We expect that over time Crail will attract a large
community.

Alignment

Crail is written in Java and is built for the Apache data processing
ecosystem. The basic storage services of Crail can be used seamlessly
from Spark, Hadoop, and Storm. The enhanced storage services require
dedicated data-processing-specific bindings, which are currently
available only for Spark. We think that moving Crail to the Apache
Incubator will help to extend Crail’s support for different data
processing frameworks.

Known Risks

To date, development has been sponsored by IBM and coordinated mostly by
the core team of researchers at the IBM Zurich Research Center. For
Crail to fully transition to an "Apache Way" governance model, it needs
to start embracing the meritocracy-centric way of growing the community
of contributors.

Orphaned Products

The Crail developers have a long-term interest in the use and
maintenance of the code, and there is also hope that growing a diverse
community around the project will become a guarantee against the project
becoming orphaned. We feel that it is also important to put formal
governance in place both for the project and the contributors as the
project expands. We feel the ASF is the best location for this.

Inexperience with Open Source

Several of the initial committers are experienced open source developers
(Linux Kernel, DPDK, etc.).

Relationships with Other Apache Products

As of now, Crail has been tested with Spark, Hadoop and Hive, but it is
designed to integrate with any of the Apache data processing frameworks.

Homogeneous Developers

The project already has a diverse developer base including contributions
from organizations and public developers.

An Excessive Fascination with the Apache Brand
Crail solves a real need for a generic approach to leverage modern
network and storage hardware effectively in the Apache Hadoop and Spark
ecosystems. Our rationale for developing Crail as an Apache project is
detailed in the Rationale section. We believe that the Apache brand and
community process will help us to engage a larger community and
facilitate closer ties with various Apache data processing projects.

Documentation

Documentation regarding Crail is available at www.crail.io

Initial Source

Initial source is available on GitHub under the Apache License 2.0:

https://github.com/zrlio/crail

External Dependencies

Crail is written in Java and currently supports Apache Hadoop MapReduce
and Apache Spark runtimes. To the best of our knowledge, all
dependencies of Crail are distributed under Apache-compatible licenses.

Required Resources

Mailing lists

priv...@crail.incubator.apache.org
d...@crail.incubator.apache.org
comm...@crail.incubator.apache.org

Git repository

https://git-wip-us.apache.org/repos/asf/incubator-crail.git

Issue Tracking

JIRA (Crail)

Initial Committers

Patrick Stuedi <stu AT ibm DOT zurich DOT com>
Animesh Trivedi <atr AT ibm DOT zurich DOT com>
Jonas Pfefferle <jpf AT ibm DOT zurich DOT com>
Bernard Metzler <bmt AT ibm DOT zurich DOT com>
Michael Kaufmann <kau AT ibm DOT zurich DOT com>
Adrian Schuepbach <dri AT ibm DOT zurich DOT com>
Patrick McArthur <patrick AT patrickmcarthur DOT net>
Ana Klimovic <anakli AT stanford DOT edu>
Yuval Degani <yuvaldeg AT mellanox DOT com>
Vu Pham <vuhuong AT mellanox DOT com>

Affiliations

IBM (Patrick Stuedi, Animesh Trivedi, Jonas Pfefferle, Bernard Metzler,
Michael Kaufmann, Adrian Schuepbach)
University of New Hampshire (Patrick McArthur)
Stanford University (Ana Klimovic)
Mellanox (Yuval Degani, Vu Pham)

Sponsors

Champion

Luciano Resende <lresende AT apache DOT org>

Nominated Mentors

Luciano Resende <lresende AT apache DOT org>

Raphael Bircher <rbircher AT apache DOT org>

Julian Hyde <jhyde AT apache DOT org>

Sponsoring Entity

We would like to propose that the Apache Incubator sponsor this project.


--
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/



--
My introduction https://youtu.be/Ln4vly5sxYU

