Re: [VOTE] Accept CarbonData into the Apache Incubator

Sandeep Deshmukh Tue, 31 May 2016 22:53:53 -0700

+1 (non-binding)

Regards,
Sandeep


On Mon, May 30, 2016 at 7:04 PM, lidong <lid...@apache.org> wrote:

> +1 (non-binding)
>
>
> Thanks,
> Dong
> ---
> Apache Kylin - http://kylin.apache.org
> Kyligence Inc. - http://kyligence.io
>
>
> Original Message
> Sender: Jean-Baptiste Onofré <j...@nanthrax.net>
> Recipient: general <gene...@incubator.apache.org>
> Date: Monday, May 30, 2016 14:07
> Subject: Re: [VOTE] Accept CarbonData into the Apache Incubator
>
>
> My own +1 (binding) ;)
>
> Regards
> JB
>
> On 05/25/2016 10:24 PM, Jean-Baptiste Onofré wrote:
> Hi all,
>
> following the discussion thread, I'm now calling a vote to accept
> CarbonData into the Incubator.
>
> [ ] +1 Accept CarbonData into the Apache Incubator
> [ ] +0 Abstain
> [ ] -1 Do not accept CarbonData into the Apache Incubator, because ...
>
> This vote is open for 72 hours.
>
> The proposal follows, you can also access the wiki page:
> https://wiki.apache.org/incubator/CarbonDataProposal
>
> Thanks !
>
> Regards
> JB
>
> = Apache CarbonData =
>
> == Abstract ==
>
> Apache CarbonData is a new Apache Hadoop native file format for faster
> interactive queries. It uses advanced columnar storage, index, compression
> and encoding techniques to improve computing efficiency, which in turn
> helps speed up queries by an order of magnitude over petabytes of data.
>
> CarbonData github address: https://github.com/HuaweiBigData/carbondata
>
> == Background ==
>
> Huawei is an ICT solution provider. We are committed to enhancing
> customer experiences for telecom carriers, enterprises, and consumers on
> big data. In order to satisfy the following customer requirements, we
> created a new Hadoop native file format:
>
> * Support interactive OLAP-style queries over big data in seconds.
> * Support fast queries on individual records, which require touching all
>   fields.
> * Fast data loading speed and support for incremental load in a period of
>   minutes.
> * Support HDFS so that customers can leverage existing Hadoop clusters.
> * Support time-based data retention.
>
> Based on these requirements, we investigated existing file formats in
> the Hadoop ecosystem, but we could not find a suitable solution that
> satisfies all requirements at the same time, so we started designing
> CarbonData.
>
> == Rationale ==
>
> CarbonData contains multiple modules, which are classified into two
> categories:
>
> 1. CarbonData File Format: contains the core implementation of the file
>    format, such as columnar storage, index, dictionary, encoding and
>    compression, API for reading/writing, etc.
> 2. CarbonData integration with big data processing frameworks such as
>    Apache Spark, Apache Hive, etc. Apache Beam is also planned to abstract
>    the execution runtime.
>
> === CarbonData File Format ===
>
> The CarbonData file format is a columnar store in HDFS. It has many
> features that a modern columnar format has, such as splittable files,
> compression schema, complex data types, etc. CarbonData also has the
> following unique features:
>
> ==== Indexing ====
>
> In order to support fast interactive queries, CarbonData leverages
> indexing technology to reduce I/O scans. CarbonData files store data
> along with the index: the index is not stored separately, but the
> CarbonData file itself contains the index. In the current implementation,
> CarbonData supports 3 types of indexing:
>
> 1. Multi-dimensional Key (B+ Tree index)
>    The data blocks are written in sequence to the disk, and within each
>    data block each column block is written in sequence. Finally, the
>    metadata block for the file is written with information about the byte
>    positions of each block in the file, the Min-Max statistics index and
>    the start and end MDK of each data block. Since the entire data in the
>    file is in sorted order, the start and end MDK of each data block can
>    be used to construct a B+Tree, and the file can be logically
>    represented as a B+Tree with the data blocks as leaf nodes (on disk)
>    and the remaining non-leaf nodes in memory.
> 2. Inverted index
>    The inverted index is widely used in search engines. This index helps
>    the processing/query engine to do filtering inside one HDFS block.
>    Furthermore, query acceleration for count-distinct-like operations is
>    made possible by combining bitmap and inverted index at query time.
> 3. MinMax index
>    For all columns, a minmax index is created so that the processing/query
>    engine can skip scans that are not required.
>
> ==== Global Dictionary ====
>
> Besides I/O reduction, CarbonData accelerates computation by using a
> global dictionary, which enables processing/query engines to perform all
> processing on encoded data without having to convert the data (Late
> Materialization). We have observed dramatic performance improvement for
> OLAP analytic scenarios where the table contains many columns of string
> data type. The data is converted back to user-readable form just before
> the processing/query engine returns results to the user.
>
> ==== Column Group ====
>
> Sometimes users want to perform processing/query on multiple columns in
> one table, for example, performing a scan for an individual record in a
> troubleshooting scenario. In this case, row format is more efficient
> than columnar format since all columns will be touched by the workload.
> To accelerate this, CarbonData supports storing a group of columns in row
> format, so data in a column group is stored together, enabling fast
> retrieval.
>
> ==== Optimized for multiple use cases ====
>
> CarbonData indices and dictionary are highly configurable. To make storage
> optimized for different use cases, users can configure what to index, so
> they can decide and tune the format before loading data into CarbonData.
>
> For example:
>
> || Use Case || Supporting Features ||
> || Interactive OLAP query || Columnar format, Multi-dimensional Key (B+ Tree index), Minmax index, Inverted index ||
> || High throughput scan || Global dictionary, Minmax index ||
> || Low latency point query || Multi-dimensional Key (B+ Tree index), Partitioning ||
> || Individual record query || Column group, Global dictionary ||
>
> === BigData Processing Framework Integration ===
>
> * CarbonData provides InputFormat/OutputFormat interfaces for
>   reading/writing data from CarbonData files, and at the same time
>   provides an abstract API for processing data stored in CarbonData format
>   with data processing frameworks.
> * CarbonData provides deep integration with Apache Spark, including
>   predicate push down, column pruning, aggregation push down, etc., so
>   users can use Spark SQL to connect to and query CarbonData.
> * CarbonData can integrate with various big data query/processing
>   frameworks in the Hadoop ecosystem, such as Apache Spark, Apache Hive,
>   etc.
>
> Example:
> https://github.com/HuaweiBigData/carbondata/blob/master/examples/src/main/scala/org/carbondata/examples/CarbonExample.scala
>
> == Initial Goals ==
>
> Our initial goals are to bring CarbonData into the ASF, transition
> internal engineering processes into the open, and foster a collaborative
> development model according to the "Apache Way".
>
> == Current Status ==
>
> CarbonData is production ready and already provides a large set of
> features. The current license is already Apache 2.0.
>
> == Meritocracy ==
>
> We intend to radically expand the initial developer and user community
> by running the project in accordance with the "Apache Way". Users and
> new contributors will be treated with respect and welcomed. By
> participating in the community and providing quality patches/support that
> move the project forward, they will earn merit. They also will be
> encouraged to provide non-code contributions (documentation, events,
> community management, etc.) and will gain merit for doing so. Those with
> a proven support and quality track record will be encouraged to become
> committers.
>
> == Community ==
>
> If CarbonData is accepted for incubation, the primary initial goal is to
> build a large community. We really trust that CarbonData will become a
> key project for big data column-like platforms, and so, we bet on a
> large community of users and developers.
>
> == Known Risks ==
>
> Development has been sponsored mostly by one company. For the project
> to fully transition to the Apache Way governance model, development must
> shift towards the meritocracy-centric model of growing a community of
> contributors balanced with the needs for extreme stability and core
> implementation coherency.
>
> == Orphaned products ==
>
> Huawei is fully committed to CarbonData. Moreover, Huawei has a vested
> interest in making CarbonData succeed by driving its close integration
> with sister ASF projects. We expect this to further reduce the risk of
> orphaning the product.
>
> == Inexperience with Open Source ==
>
> Huawei has been developing and using open source software for a long
> time. Additionally, several ASF veterans agreed to mentor the project and
> are listed in this proposal. The project will rely on their guidance and
> collective wisdom to quickly transition the entire team of initial
> committers towards practicing the Apache Way.
>
> == Reliance on Salaried Developers ==
>
> Most of the contributors are paid to work in the big data space. While
> they might wander from their current employers, they are unlikely to
> venture far from their core expertise and thus will continue to be
> engaged with the project regardless of their current employers.
>
> == An Excessive Fascination with the Apache Brand ==
>
> While we intend to leverage the Apache ‘branding’ when talking to other
> projects as testament of our project’s ‘neutrality’, we have no plans
> for making use of the Apache brand in press releases nor posting
> billboards advertising acceptance of CarbonData into the Apache Incubator.
>
> == Initial Source ==
>
> https://github.com/HuaweiBigData/carbondata.git
>
> == External Dependencies ==
>
> All external dependencies are licensed under an Apache 2.0 license or
> Apache-compatible license. As we grow the CarbonData community we will
> configure our build process to require and validate that all
> contributions and dependencies are licensed under the Apache 2.0 license
> or are under an Apache-compatible license.
>
> * Apache Spark
> * Apache Hadoop
> * Apache Maven
> * Apache Commons
> * Apache Log4j
> * Apache Thrift
> * Apache Zookeeper
> * Scala
> * Snappy
> * Kettle (Pentaho)
> * Eigenbase
> * Fastutil
> * GSON
> * Jmockit
> * Junit
>
> == Required Resources ==
>
> === Mailing lists ===
>
> * priv...@carbondata.incubator.apache.org (moderated subscriptions)
> * comm...@carbondata.incubator.apache.org
> * d...@carbondata.incubator.apache.org
> * iss...@carbondata.incubator.apache.org
>
> === Git Repository ===
>
> * https://git-wip-us.apache.org/repos/asf/incubator-carbondata.git
>
> === Issue Tracking ===
>
> * JIRA Project CarbonData (CarbonData)
>
> === Initial Committers ===
>
> * Liang Chenliang
> * Jean-Baptiste Onofré
> * Henry Saputra
> * Uma Maheswara Rao G
> * Jenny MA
> * Jacky Likun
> * Vimal Das Kammath
> * Jarray Qiuheng
>
> === Affiliations ===
>
> * Huawei: Liang Chenliang
> * Talend: Jean-Baptiste Onofré
> * Ebay: Henry Saputra
> * Intel: Uma Maheswara Rao G
>
> === Sponsors ===
>
> === Champion ===
>
> * Jean-Baptiste Onofré - Apache Member
>
> === Mentors ===
>
> * Henry Saputra (eBay)
> * Jean-Baptiste Onofré (Talend)
> * Uma Maheswara Rao G (Intel)
>
> === Sponsoring Entity ===
>
> The Apache Incubator
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> For additional commands, e-mail: general-h...@incubator.apache.org
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
