Re: [PROPOSAL] Flume for the Apache Incubator

Jonathan Hsieh Sat, 28 May 2011 00:20:23 -0700

Ralph,

Definitely!  Please add yourself to the mentors section of the wiki page.


Thanks.
Jon.

On Fri, May 27, 2011 at 6:07 PM, Ralph Goers <ralph.go...@dslextreme.com>wrote:

> A hearty +1 from me.  Do you need another mentor?
>
> Ralph
>
> On May 27, 2011, at 7:18 AM, Jonathan Hsieh wrote:
>
> > Howdy!
> >
> > I would like to propose Flume to be an Apache Incubator project.  Flume
> is a
> > distributed, reliable, and available system for efficiently collecting,
> > aggregating, and moving large amounts of log data to scalable data
> storage
> > systems such as Apache Hadoop's HDFS.
> >
> > Here's a link to the proposal in the Incubator wiki
> > http://wiki.apache.org/incubator/FlumeProposal
> >
> > I've also pasted the initial contents below.
> >
> > Thanks!
> > Jon.
> >
> > = Flume - A Distributed Log Collection System =
> >
> > == Abstract ==
> >
> > Flume is a distributed, reliable, and available system for efficiently
> > collecting, aggregating, and moving large amounts of log data to scalable
> > data storage systems such as Apache Hadoop's HDFS.
> >
> > == Proposal ==
> >
> > Flume is a distributed, reliable, and available system for efficiently
> > collecting, aggregating, and moving large amounts of log data from many
> > different sources to a centralized data store. Its main goal is to
> deliver
> > data from applications to Hadoop’s HDFS.  It has a simple and flexible
> > architecture for transporting streaming event data via flume nodes to the
> > data store.  It is robust and fault-tolerant with tunable reliability
> > mechanisms that rely upon many failover and recovery mechanisms. The
> system
> > is centrally configured and allows for intelligent dynamic management. It
> > uses a simple extensible data model that allows for lightweight online
> > analytic applications.  It provides a pluggable mechanism by which new
> > sources, destinations, and analytic functions which can be integrated
> within
> > a Flume pipeline.
> >
> > == Background ==
> >
> > Flume was initially developed by Cloudera to enable reliable and
> simplified
> > collection of log information from many distributed sources. It was later
> > open-sourced by Cloudera on GitHub as an Apache 2.0 licensed project in
> June
> > 2010. During this time Flume has been formally released five times as
> > versions 0.9.0 (June 2010), 0.9.1 (Aug 2010), 0.9.1u1 (Oct 2010), 0.9.2
> (Nov
> > 2010), and 0.9.3 (Feb 2011).  These releases are also distributed by
> > Cloudera as source and binaries along with enhancements as part of
> Cloudera
> > Distribution including Apache Hadoop (CDH).
> >
> > == Rationale ==
> >
> > Collecting log information in a data center in a timely, reliable, and
> > efficient manner is a difficult challenge but important because when
> > aggregated and analyzed, log information can yield valuable business
> > insights.   We believe that users and operators need a manageable
> systematic
> > approach for log collection that simplifies the creation, the monitoring,
> > and the administration of reliable log data pipelines.  Oftentimes today,
> > this collection is attempted by periodically shipping data in batches and
> by
> > using potentially unreliable and inefficient ad-hoc methods.
> >
> > Log data is typically generated in various systems running within a data
> > center that can range from a few machines to hundreds of machines.  In
> > aggregate, the data acts like a large-volume continuous stream with
> contents
> > that can have highly-varied format and highly-varied content.  The volume
> > and variety of raw log data makes Apache Hadoop's HDFS file system an
> ideal
> > storage location before the eventual analysis.  Unfortunately, HDFS has
> > limitations with regards to durability as well as scaling limitations
> when
> > handling a large number of low-bandwidth connections or small files.
> > Similar technical challenges are also suffered when attempting to write
> > data to other data storage services.
> >
> > Flume addresses these challenges by providing a reliable, scalable,
> > manageable, and extensible solution.  It uses a streaming design for
> > capturing and aggregating log information from varied sources in a
> > distributed environment and has centralized management features for
> minimal
> > configuration and management overhead.
> >
> > == Initial Goals ==
> >
> > Flume is currently in its first major release with a considerable number
> of
> > enhancement requests, tasks, and issues recorded towards its future
> > development. The initial goal of this project will be to continue to
> build
> > community in the spirit of the "Apache Way", and to address the highly
> > requested features and bug-fixes towards the next dot release.
> >
> > Some goals include:
> > * To stand up a sustaining Apache-based community around the Flume
> codebase.
> > * Implementing core functionality of a usable highly-available Flume
> master.
> > * Performance, usability, and robustness improvements.
> > * Improving the ability to monitor and diagnose problems as data is
> > transported.
> > * Providing a centralized place for contributed connectors and related
> > projects.
> >
> > = Current Status =
> >
> > == Meritocracy ==
> >
> > Flume was initially developed by Jonathan Hsieh in July 2009 along with
> > development team at Cloudera. Developers external to Cloudera provided
> > feedback, suggested features and fixes and implemented extensions of
> Flume.
> > Cloudera engineering team has since maintained the project with Jonathan
> > Hsieh, Henry Robinson, and Patrick Hunt dedicated towards its
> improvement.
> > Contributors to Flume and its connectors include developers from
> different
> > companies and different parts of the world.
> >
> > == Community ==
> >
> > Flume is currently used by a number of organizations all over the world.
> > Flume has an active and growing user and developer community with active
> > participation in [user|
> > https://groups.google.com/a/cloudera.org/group/flume-user/topics] and
> > [developer|
> https://groups.google.com/a/cloudera.org/group/flume-dev/topics]
> > mailing lists.  The users and developers also communicate via IRC on
> #flume
> > at irc.freenode.net.
> >
> > Since open sourcing the project, there have been over 15 different people
> > from diverse organizations who have contributed code. During this period,
> > the project team has hosted open, in-person, quarterly meetups to discuss
> > new features, new designs, and new use-case stories.
> >
> > == Core Developers ==
> >
> > The core developers for Flume project are:
> > * Andrew Bayer: Andrew has a lot of expertise with build tools,
> > specifically Jenkins continuous integration and Maven.
> > * Jonathan Hsieh: Jonathan designed and implemented much of the original
> > code.
> > * Patrick Hunt: Patrick has improved the web interfaces of Flume
> components
> > and contributed several build quality  improvements.
> > * Bruce Mitchener: Bruce has improved the internal logging infrastructure
> > as well as edited significant portions of the Flume manual.
> > * Henry Robinson: Henry has implemented much of the ZooKeeper
> integration,
> > plugin mechanisms, as well as several Flume features and bug fixes.
> > * Eric Sammer: Eric has implemented the Maven build, as well as several
> > Flume features and bug fixes.
> >
> > All core developers of the Flume project have contributed towards Hadoop
> or
> > related Apache projects and are very familiar with Apache principals and
> > philosophy for community driven software development.
> >
> > == Alignment ==
> >
> > Flume complements Hadoop Map-Reduce, Pig, Hive, HBase by providing a
> robust
> > mechanism to allow log data integration from external systems for
> effective
> > analysis.  Its design enable efficient integration of newly ingested data
> to
> > Hive's data warehouse.
> >
> > Flume's architecture is open and easily extensible.  This has encouraged
> > many users to contribute integrate plugins to other projects.  For
> example,
> > several users have contributed connectors to message queuing and bus
> > services, to several open source data stores, to incremental search
> indexes,
> > and to a stream analysis engines.
> >
> > = Known Risks =
> >
> > == Orphaned Products ==
> >
> > Flume is already deployed in production at multiple companies and they
> are
> > actively participating in feature requests and user led discussions.
> Flume
> > is getting traction with developers and thus the risks of it being
> orphaned
> > are minimal.
> >
> > == Inexperience with Open Source ==
> >
> > All code developed for Flume has is open sourced by Cloudera under Apache
> > 2.0 license.  All committers of Flume project are intimately familiar
> with
> > the Apache model for open-source development and are experienced with
> > working with new contributors.
> >
> > == Homogeneous Developers ==
> >
> > The initial set of committers is from a reduced set of organizations.
> > However, we expect that once approved for incubation, the project will
> > attract new contributors from diverse organizations and will thus grow
> > organically. The participation of developers from several different
> > organizations in the mailing list is a strong indication for this
> assertion.
> >
> > == Reliance on Salaried Developers ==
> >
> > It is expected that Flume will be developed on salaried and volunteer
> time,
> > although all of the initial developers will work on it mainly on salaried
> > time.
> >
> > == Relationships with Other Apache Products ==
> >
> > Flume depends upon other Apache Projects: Apache Hadoop, Apache Log4J,
> > Apache ZooKeeper, Apache Thrift, Apache Avro, multiple Apache Commons
> > components. Its build depends upon Apache Ant and Apache Maven.
> >
> > Flume users have created connectors that interact with several other
> Apache
> > projects including Apache HBase and Apache Cassandra.
> >
> > Flume's functionality has some indirect or direct overlap with the
> > functionality of Apache Chukwa but has several significant architectural
> > diffferences.  Both systems can be used to collect log data to write to
> > hdfs.  However, Chukwa's primary goals are the analytic and monitoring
> > aspects of a Hadoop cluster.  Instead of focusing on analytics, Flume
> > focuses primarily upon data transport and integration with a wide set of
> > data sources and data destinations.   Architecturally, Chukwa components
> are
> > individually and statically configured.  It also depends upon Hadoop
> > MapReduce for its core functionality.  In contrast, Flume's components
> are
> > dynamically and centrally configured and does not depend directly upon
> > Hadoop MapReduce.  Furthermore, Flume provides a more general model for
> > handling data and enables integration with projects such as Apache Hive,
> > data stores such as Apache HBase, Apache Cassandra and Voldemort, and
> > several Apache Lucene-related projects.
> >
> > == An Excessive Fascination with the Apache Brand ==
> >
> > We would like Flume to become an Apache project to further foster a
> healthy
> > community of contributors and consumers around the project.  Since Flume
> > directly interacts with many Apache Hadoop-related projects by solves an
> > important problem of many Hadoop users, residing in the the Apache
> Software
> > Foundation will increase interaction with the larger community.
> >
> > = Documentation =
> >
> > * All Flume documentation (User Guide, Developer Guide, Cookbook, and
> > Windows Guide) is maintained within Flume sources and can be built
> directly.
> > * Cloudera provides documentation specific to its distribution of Flume
> at:
> > http://archive.cloudera.com/cdh/3/flume/
> > * Flume wiki at GitHub: https://github.com/cloudera/flume/wiki
> > * Flume jira at Cloudera: https://issues.cloudera.org/browse/flume
> >
> > = Initial Source =
> >
> > * https://github.com/cloudera/flume/tree/
> >
> > == Source and Intellectual Property Submission Plan ==
> >
> > * The initial source is already licensed under the Apache License,
> Version
> > 2.0. https://github.com/cloudera/flume/blob/master/LICENSE
> >
> > == External Dependencies ==
> >
> > The required external dependencies are all Apache License or compatible
> > licenses. Following components with non-Apache licenses are enumerated:
> >
> > * org.arabidopsis.ahocorasick : BSD-style
> >
> > Non-Apache build tools that are used by Flume are as follows:
> >
> > * AsciiDoc: GNU GPLv2
> > * FindBugs: GNU LGPL
> > * Cobertura: GNU GPLv2
> > * PMD : BSD-style
> >
> > == Cryptography ==
> >
> > Flume uses standard APIs and tools for SSH and SSL communication where
> > necessary.
> >
> > = Required  Resources =
> >
> > == Mailing lists ==
> >
> > * flume-private (with moderated subscriptions)
> > * flume-dev
> > * flume-commits
> > * flume-user
> >
> > == Subversion Directory ==
> >
> > https://svn.apache.org/repos/asf/incubator/flume
> >
> > == Issue Tracking ==
> >
> > JIRA Flume (FLUME)
> >
> > == Other Resources ==
> >
> > The existing code already has unit and integration tests so we would like
> a
> > Hudson instance to run them whenever a new patch is submitted. This can
> be
> > added after project creation.
> >
> > = Initial Committers =
> >
> > * Andrew Bayer (abayer at cloudera dot com)
> > * Jonathan Hsieh (jon at cloudera dot com)
> > * Aaron Kimball (akimball83 at gmail dot com)
> > * Bruce Mitchener (bruce.mitchener at gmail dot com)
> > * Arvind Prabhakar (arvind at cloudera dot com)
> > * Ahmed Radwan (ahmed at cloudera dot com)
> > * Henry Robinson (henry at cloudera dot com)
> > * Eric Sammer (esammer at cloudera dot com)
> >
> > = Affiliations =
> >
> > * Andrew Bayer, Cloudera
> > * Jonathan Hsieh, Cloudera
> > * Aaron Kimball, Odiago
> > * Bruce Mitchener, Independent
> > * Arvind Prabhakar, Cloudera
> > * Ahmed Radwan, Cloudera
> > * Henry Robinson, Cloudera
> > * Eric Sammer, Cloudera
> >
> >
> > = Sponsors =
> >
> > == Champion ==
> >
> > * Nigel Daley
> >
> > == Nominated Mentors ==
> >
> > * Tom White
> > * Nigel Daley
> >
> > == Sponsoring Entity ==
> >
> > * Apache Incubator PMC
> >
> >
> > --
> > // Jonathan Hsieh (shay)
> > // Software Engineer, Cloudera
> > // j...@cloudera.com
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> For additional commands, e-mail: general-h...@incubator.apache.org
>
>


-- 
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// j...@cloudera.com

Re: [PROPOSAL] Flume for the Apache Incubator

Reply via email to