Re: [PROPOSAL] Flume for the Apache Incubator

Davanum Srinivas Mon, 30 May 2011 17:11:58 -0700

+1 (binding)

On Mon, May 30, 2011 at 7:18 PM, Yoav Shapira <yo...@apache.org> wrote:
> On Fri, May 27, 2011 at 10:18 AM, Jonathan Hsieh <j...@cloudera.com> wrote:
>> I would like to propose Flume to be an Apache Incubator project.  Flume is a
>> distributed, reliable, and available system for efficiently collecting,
>> aggregating, and moving large amounts of log data to scalable data storage
>> systems such as Apache Hadoop's HDFS.
>>
>> Here's a link to the proposal in the Incubator wiki
>> http://wiki.apache.org/incubator/FlumeProposal
>
> +1, cool stuff.
>
> Yoav
>
>>
>> I've also pasted the initial contents below.
>>
>> Thanks!
>> Jon.
>>
>> = Flume - A Distributed Log Collection System =
>>
>> == Abstract ==
>>
>> Flume is a distributed, reliable, and available system for efficiently
>> collecting, aggregating, and moving large amounts of log data to scalable
>> data storage systems such as Apache Hadoop's HDFS.
>>
>> == Proposal ==
>>
>> Flume is a distributed, reliable, and available system for efficiently
>> collecting, aggregating, and moving large amounts of log data from many
>> different sources to a centralized data store. Its main goal is to deliver
>> data from applications to Hadoop’s HDFS.  It has a simple and flexible
>> architecture for transporting streaming event data via flume nodes to the
>> data store.  It is robust and fault-tolerant with tunable reliability
>> mechanisms that rely upon many failover and recovery mechanisms. The system
>> is centrally configured and allows for intelligent dynamic management. It
>> uses a simple extensible data model that allows for lightweight online
>> analytic applications.  It provides a pluggable mechanism by which new
>> sources, destinations, and analytic functions can be integrated within
>> a Flume pipeline.
>>
>> == Background ==
>>
>> Flume was initially developed by Cloudera to enable reliable and simplified
>> collection of log information from many distributed sources. It was later
>> open-sourced by Cloudera on GitHub as an Apache 2.0 licensed project in June
>> 2010. Since then, Flume has been formally released five times as
>> versions 0.9.0 (June 2010), 0.9.1 (Aug 2010), 0.9.1u1 (Oct 2010), 0.9.2 (Nov
>> 2010), and 0.9.3 (Feb 2011).  These releases are also distributed by
>> Cloudera as source and binaries along with enhancements as part of Cloudera
>> Distribution including Apache Hadoop (CDH).
>>
>> == Rationale ==
>>
>> Collecting log information in a data center in a timely, reliable, and
>> efficient manner is difficult but important: when aggregated and
>> analyzed, log information can yield valuable business insights.  We
>> believe that users and operators need a manageable systematic
>> approach for log collection that simplifies the creation, the monitoring,
>> and the administration of reliable log data pipelines.  Oftentimes today,
>> this collection is attempted by periodically shipping data in batches and by
>> using potentially unreliable and inefficient ad-hoc methods.
>>
>> Log data is typically generated in various systems running within a data
>> center that can range from a few machines to hundreds of machines.  In
>> aggregate, the data acts like a large-volume continuous stream with contents
>> that can have highly varied format and content.  The volume
>> and variety of raw log data make Apache Hadoop's HDFS file system an ideal
>> storage location before the eventual analysis.  Unfortunately, HDFS has
>> limitations with regard to durability as well as scaling limitations when
>> handling a large number of low-bandwidth connections or small files.
>> Similar technical challenges also arise when attempting to write
>> data to other data storage services.
>>
>> Flume addresses these challenges by providing a reliable, scalable,
>> manageable, and extensible solution.  It uses a streaming design for
>> capturing and aggregating log information from varied sources in a
>> distributed environment and has centralized management features for minimal
>> configuration and management overhead.
>>
>> == Initial Goals ==
>>
>> Flume is currently in its first major release with a considerable number of
>> enhancement requests, tasks, and issues recorded towards its future
>> development. The initial goal of this project will be to continue to build
>> community in the spirit of the "Apache Way", and to address the highly
>> requested features and bug-fixes towards the next dot release.
>>
>> Some goals include:
>> * Standing up a sustainable Apache-based community around the Flume codebase.
>> * Implementing core functionality of a usable highly-available Flume master.
>> * Performance, usability, and robustness improvements.
>> * Improving the ability to monitor and diagnose problems as data is
>> transported.
>> * Providing a centralized place for contributed connectors and related
>> projects.
>>
>> = Current Status =
>>
>> == Meritocracy ==
>>
>> Flume was initially developed by Jonathan Hsieh in July 2009 along with the
>> development team at Cloudera. Developers external to Cloudera provided
>> feedback, suggested features and fixes, and implemented extensions of Flume.
>> The Cloudera engineering team has since maintained the project with Jonathan
>> Hsieh, Henry Robinson, and Patrick Hunt dedicated to its improvement.
>> Contributors to Flume and its connectors include developers from different
>> companies and different parts of the world.
>>
>> == Community ==
>>
>> Flume is currently used by a number of organizations all over the world.
>> Flume has an active and growing user and developer community with active
>> participation in [user|
>> https://groups.google.com/a/cloudera.org/group/flume-user/topics] and
>> [developer|https://groups.google.com/a/cloudera.org/group/flume-dev/topics]
>> mailing lists.  The users and developers also communicate via IRC on #flume
>> at irc.freenode.net.
>>
>> Since open sourcing the project, there have been over 15 different people
>> from diverse organizations who have contributed code. During this period,
>> the project team has hosted open, in-person, quarterly meetups to discuss
>> new features, new designs, and new use-case stories.
>>
>> == Core Developers ==
>>
>> The core developers for the Flume project are:
>>  * Andrew Bayer: Andrew has a lot of expertise with build tools,
>> specifically Jenkins continuous integration and Maven.
>>  * Jonathan Hsieh: Jonathan designed and implemented much of the original
>> code.
>>  * Patrick Hunt: Patrick has improved the web interfaces of Flume components
>> and contributed several build quality improvements.
>>  * Bruce Mitchener: Bruce has improved the internal logging infrastructure
>> as well as edited significant portions of the Flume manual.
>>  * Henry Robinson: Henry has implemented much of the ZooKeeper integration,
>> plugin mechanisms, as well as several Flume features and bug fixes.
>>  * Eric Sammer: Eric has implemented the Maven build, as well as several
>> Flume features and bug fixes.
>>
>> All core developers of the Flume project have contributed towards Hadoop or
>> related Apache projects and are very familiar with Apache principles and
>> philosophy for community-driven software development.
>>
>> == Alignment ==
>>
>> Flume complements Hadoop Map-Reduce, Pig, Hive, and HBase by providing a
>> robust mechanism to allow log data integration from external systems for
>> effective analysis.  Its design enables efficient integration of newly
>> ingested data into Hive's data warehouse.
>>
>> Flume's architecture is open and easily extensible.  This has encouraged
>> many users to contribute plugins that integrate with other projects.  For
>> example, several users have contributed connectors to message queuing and
>> bus services, to several open source data stores, to incremental search
>> indexes, and to stream analysis engines.
>>
>> = Known Risks =
>>
>> == Orphaned Products ==
>>
>> Flume is already deployed in production at multiple companies and they are
>> actively participating in feature requests and user-led discussions. Flume
>> is getting traction with developers and thus the risks of it being orphaned
>> are minimal.
>>
>> == Inexperience with Open Source ==
>>
>> All code developed for Flume has been open sourced by Cloudera under the
>> Apache 2.0 license.  All committers of the Flume project are intimately familiar with
>> the Apache model for open-source development and are experienced with
>> working with new contributors.
>>
>> == Homogeneous Developers ==
>>
>> The initial set of committers is from a small set of organizations.
>> However, we expect that once approved for incubation, the project will
>> attract new contributors from diverse organizations and will thus grow
>> organically. The participation of developers from several different
>> organizations in the mailing lists is a strong indication of this.
>>
>> == Reliance on Salaried Developers ==
>>
>> It is expected that Flume will be developed on salaried and volunteer time,
>> although all of the initial developers will work on it mainly on salaried
>> time.
>>
>> == Relationships with Other Apache Products ==
>>
>> Flume depends upon other Apache Projects: Apache Hadoop, Apache Log4J,
>> Apache ZooKeeper, Apache Thrift, Apache Avro, and multiple Apache Commons
>> components. Its build depends upon Apache Ant and Apache Maven.
>>
>> Flume users have created connectors that interact with several other Apache
>> projects including Apache HBase and Apache Cassandra.
>>
>> Flume's functionality has some indirect or direct overlap with the
>> functionality of Apache Chukwa but has several significant architectural
>> differences.  Both systems can be used to collect log data to write to
>> HDFS.  However, Chukwa's primary goals are the analytic and monitoring
>> aspects of a Hadoop cluster.  Instead of focusing on analytics, Flume
>> focuses primarily upon data transport and integration with a wide set of
>> data sources and data destinations.  Architecturally, Chukwa components are
>> individually and statically configured.  Chukwa also depends upon Hadoop
>> MapReduce for its core functionality.  In contrast, Flume's components are
>> dynamically and centrally configured and do not depend directly upon
>> Hadoop MapReduce.  Furthermore, Flume provides a more general model for
>> handling data and enables integration with projects such as Apache Hive,
>> data stores such as Apache HBase, Apache Cassandra and Voldemort, and
>> several Apache Lucene-related projects.
>>
>> == An Excessive Fascination with the Apache Brand ==
>>
>> We would like Flume to become an Apache project to further foster a healthy
>> community of contributors and consumers around the project.  Since Flume
>> directly interacts with many Apache Hadoop-related projects and solves an
>> important problem for many Hadoop users, residing in the Apache Software
>> Foundation will increase interaction with the larger community.
>>
>> = Documentation =
>>
>>  * All Flume documentation (User Guide, Developer Guide, Cookbook, and
>> Windows Guide) is maintained within Flume sources and can be built directly.
>>  * Cloudera provides documentation specific to its distribution of Flume at:
>> http://archive.cloudera.com/cdh/3/flume/
>>  * Flume wiki at GitHub: https://github.com/cloudera/flume/wiki
>>  * Flume jira at Cloudera: https://issues.cloudera.org/browse/flume
>>
>> = Initial Source =
>>
>>  * https://github.com/cloudera/flume/tree/
>>
>> == Source and Intellectual Property Submission Plan ==
>>
>>  * The initial source is already licensed under the Apache License, Version
>> 2.0. https://github.com/cloudera/flume/blob/master/LICENSE
>>
>> == External Dependencies ==
>>
>> The required external dependencies are all under the Apache License or
>> compatible licenses. The components with non-Apache licenses are:
>>
>>  * org.arabidopsis.ahocorasick : BSD-style
>>
>> Non-Apache build tools that are used by Flume are as follows:
>>
>>  * AsciiDoc: GNU GPLv2
>>  * FindBugs: GNU LGPL
>>  * Cobertura: GNU GPLv2
>>  * PMD : BSD-style
>>
>> == Cryptography ==
>>
>> Flume uses standard APIs and tools for SSH and SSL communication where
>> necessary.
>>
>> = Required Resources =
>>
>> == Mailing lists ==
>>
>>  * flume-private (with moderated subscriptions)
>>  * flume-dev
>>  * flume-commits
>>  * flume-user
>>
>> == Subversion Directory ==
>>
>> https://svn.apache.org/repos/asf/incubator/flume
>>
>> == Issue Tracking ==
>>
>> JIRA Flume (FLUME)
>>
>> == Other Resources ==
>>
>> The existing code already has unit and integration tests so we would like a
>> Hudson instance to run them whenever a new patch is submitted. This can be
>> added after project creation.
>>
>> = Initial Committers =
>>
>>  * Andrew Bayer (abayer at cloudera dot com)
>>  * Jonathan Hsieh (jon at cloudera dot com)
>>  * Aaron Kimball (akimball83 at gmail dot com)
>>  * Bruce Mitchener (bruce.mitchener at gmail dot com)
>>  * Arvind Prabhakar (arvind at cloudera dot com)
>>  * Ahmed Radwan (ahmed at cloudera dot com)
>>  * Henry Robinson (henry at cloudera dot com)
>>  * Eric Sammer (esammer at cloudera dot com)
>>
>> = Affiliations =
>>
>>  * Andrew Bayer, Cloudera
>>  * Jonathan Hsieh, Cloudera
>>  * Aaron Kimball, Odiago
>>  * Bruce Mitchener, Independent
>>  * Arvind Prabhakar, Cloudera
>>  * Ahmed Radwan, Cloudera
>>  * Henry Robinson, Cloudera
>>  * Eric Sammer, Cloudera
>>
>>
>> = Sponsors =
>>
>> == Champion ==
>>
>>  * Nigel Daley
>>
>> == Nominated Mentors ==
>>
>>  * Tom White
>>  * Nigel Daley
>>
>> == Sponsoring Entity ==
>>
>>  * Apache Incubator PMC
>>
>>
>> --
>> // Jonathan Hsieh (shay)
>> // Software Engineer, Cloudera
>> // j...@cloudera.com
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> For additional commands, e-mail: general-h...@incubator.apache.org
>
>




-- 
Davanum Srinivas :: http://davanum.wordpress.com

