+1 (non-binding) Regards, Sandeep
On Mon, May 30, 2016 at 7:04 PM, lidong <lid...@apache.org> wrote:
> +1 (non-binding)
>
> Thanks,
> Dong
> ---
> Apache Kylin - http://kylin.apache.org
> Kyligence Inc. - http://kyligence.io
>
> Original Message
> Sender: Jean-Baptiste Onofré j...@nanthrax.net
> Recipient: general gene...@incubator.apache.org
> Date: Monday, May 30, 2016 14:07
> Subject: Re: [VOTE] Accept CarbonData into the Apache Incubator
>
> My own +1 (binding) ;)
>
> Regards
> JB
>
> On 05/25/2016 10:24 PM, Jean-Baptiste Onofré wrote:
>
> Hi all,
>
> following the discussion thread, I'm now calling a vote to accept
> CarbonData into the Incubator.
>
> [ ] +1 Accept CarbonData into the Apache Incubator
> [ ] +0 Abstain
> [ ] -1 Do not accept CarbonData into the Apache Incubator, because ...
>
> This vote is open for 72 hours.
>
> The proposal follows; you can also access the wiki page:
> https://wiki.apache.org/incubator/CarbonDataProposal
>
> Thanks !
>
> Regards
> JB
>
> = Apache CarbonData =
>
> == Abstract ==
>
> Apache CarbonData is a new Apache Hadoop native file format for faster
> interactive queries. It uses advanced columnar storage, index,
> compression and encoding techniques to improve computing efficiency,
> which in turn helps speed up queries by an order of magnitude over
> petabytes of data.
>
> CarbonData GitHub address: https://github.com/HuaweiBigData/carbondata
>
> == Background ==
>
> Huawei is an ICT solution provider committed to enhancing customer
> experiences for telecom carriers, enterprises, and consumers on big
> data. To satisfy the following customer requirements, we created a new
> Hadoop native file format:
>
> * Support interactive OLAP-style queries over big data in seconds.
> * Support fast queries on individual records, which require touching all fields.
> * Support fast data loading and incremental loads at intervals of minutes.
> * Support HDFS so that customers can leverage existing Hadoop clusters.
> * Support time-based data retention.
>
> Based on these requirements, we investigated existing file formats in
> the Hadoop ecosystem, but could not find a solution that satisfies all
> of the requirements at the same time, so we started designing
> CarbonData.
>
> == Rationale ==
>
> CarbonData contains multiple modules, which are classified into two
> categories:
>
> 1. CarbonData File Format: contains the core implementation of the file
> format, such as columnar storage, index, dictionary, encoding +
> compression, and the API for reading/writing.
> 2. CarbonData integration with big data processing frameworks such as
> Apache Spark and Apache Hive. Apache Beam is also planned to abstract
> the execution runtime.
>
> === CarbonData File Format ===
>
> The CarbonData file format is a columnar store in HDFS. It has many
> features of a modern columnar format, such as being splittable,
> compression schema, and complex data types. CarbonData also has the
> following unique features:
>
> ==== Indexing ====
>
> In order to support fast interactive queries, CarbonData leverages
> indexing technology to reduce I/O scans. CarbonData files store data
> along with the index: the index is not stored separately; the
> CarbonData file itself contains it. In the current implementation,
> CarbonData supports 3 types of indexing:
>
> 1. Multi-dimensional Key (B+ Tree index)
> The data blocks are written in sequence to the disk, and within each
> data block each column block is written in sequence. Finally, the
> metadata block for the file is written with information about the byte
> positions of each block in the file, the Min-Max statistics index, and
> the start and end MDK of each data block. Since the entire data in the
> file is in sorted order, the start and end MDK of each data block can be
> used to construct a B+Tree, and the file can be logically represented as
> a B+Tree with the data blocks as leaf nodes (on disk) and the remaining
> non-leaf nodes in memory.
>
> 2. Inverted index
> Inverted indexes are widely used in search engines.
> This index helps the processing/query engine filter inside one HDFS
> block. Furthermore, count-distinct-like operations can be accelerated
> by combining bitmap and inverted index at query time.
>
> 3. MinMax index
> A MinMax index is created for all columns, so that the processing/query
> engine can skip scans that are not required.
>
> ==== Global Dictionary ====
>
> Besides I/O reduction, CarbonData accelerates computation by using a
> global dictionary, which enables processing/query engines to perform all
> processing on encoded data without having to convert the data (Late
> Materialization). We have observed dramatic performance improvements in
> OLAP analytic scenarios where the table contains many columns of string
> data type. The data is converted back to user-readable form just before
> the processing/query engine returns results to the user.
>
> ==== Column Group ====
>
> Sometimes users want to run processing/queries on multiple columns of
> one table, for example scanning for an individual record in a
> troubleshooting scenario. In this case, row format is more efficient
> than columnar format, since the workload touches all columns. To
> accelerate this, CarbonData supports storing a group of columns in row
> format, so that the data in a column group is stored together and can be
> retrieved quickly.
>
> ==== Optimized for multiple use cases ====
>
> CarbonData indices and dictionaries are highly configurable. To optimize
> storage for different use cases, users can configure what to index, so
> they can decide on and tune the format before loading data into
> CarbonData.
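To make the MinMax index and global dictionary ideas above a bit more concrete, here is a minimal Python sketch. It is only an illustration under my own assumptions: `Block`, `build_dictionary` and `count_equal` are hypothetical names and are not part of the CarbonData code base or its file format. It shows values being replaced by integer dictionary codes, and a filter that works on those codes and skips any block whose min/max range cannot contain the code being searched for.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Block:
    """Hypothetical data block holding dictionary codes of one column."""
    codes: List[int]

    @property
    def min_code(self) -> int:
        return min(self.codes)

    @property
    def max_code(self) -> int:
        return max(self.codes)


def build_dictionary(values: List[str]) -> Dict[str, int]:
    """Assign each distinct value an integer code, in sorted order."""
    return {v: i for i, v in enumerate(sorted(set(values)))}


def count_equal(blocks: List[Block], code: int) -> Tuple[int, int]:
    """Return (matching rows, blocks scanned) for an equality filter."""
    matches = 0
    scanned = 0
    for block in blocks:
        if code < block.min_code or code > block.max_code:
            continue  # min/max range excludes the code: skip the scan
        scanned += 1
        matches += sum(1 for c in block.codes if c == code)
    return matches, scanned


cities = ["Paris", "Berlin", "Paris", "Shenzhen", "Shenzhen", "Tokyo"]
dictionary = build_dictionary(cities)
# Sort the encoded column so that each block covers a narrow code range.
encoded = sorted(dictionary[c] for c in cities)
blocks = [Block(encoded[i:i + 2]) for i in range(0, len(encoded), 2)]
# The filter compares integer codes only; strings are never decoded here.
matches, scanned = count_equal(blocks, dictionary["Shenzhen"])
print(matches, scanned)
```

With this toy data the first of the three blocks is skipped without being read, which is the effect the MinMax index is meant to have on real scans.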
> For example:
>
> || Use Case || Supporting Features ||
> || Interactive OLAP query || Columnar format, Multi-dimensional Key (B+ Tree index), Minmax index, Inverted index ||
> || High throughput scan || Global dictionary, Minmax index ||
> || Low latency point query || Multi-dimensional Key (B+ Tree index), Partitioning ||
> || Individual record query || Column group, Global dictionary ||
>
> === BigData Processing Framework Integration ===
>
> * CarbonData provides InputFormat/OutputFormat interfaces for
> reading/writing data from CarbonData files, and at the same time
> provides an abstract API for processing data stored in CarbonData format
> with data processing frameworks.
> * CarbonData provides deep integration with Apache Spark, including
> predicate push down, column pruning, aggregation push down etc., so
> users can use Spark SQL to connect to and query CarbonData.
> * CarbonData can integrate with various big data query/processing
> frameworks in the Hadoop ecosystem, such as Apache Spark and Apache
> Hive.
>
> Example:
> https://github.com/HuaweiBigData/carbondata/blob/master/examples/src/main/scala/org/carbondata/examples/CarbonExample.scala
>
> == Initial Goals ==
>
> Our initial goals are to bring CarbonData into the ASF, transition
> internal engineering processes into the open, and foster a collaborative
> development model according to the "Apache Way".
>
> == Current Status ==
>
> CarbonData is production ready and already provides a large set of
> features. The current license is already Apache 2.0.
>
> == Meritocracy ==
>
> We intend to radically expand the initial developer and user community
> by running the project in accordance with the "Apache Way". Users and
> new contributors will be treated with respect and welcomed. By
> participating in the community and providing quality patches/support
> that move the project forward, they will earn merit. They will also be
> encouraged to provide non-code contributions (documentation, events,
> community management, etc.)
> and will gain merit for doing so. Those with a proven support and
> quality track record will be encouraged to become committers.
>
> == Community ==
>
> If CarbonData is accepted for incubation, the primary initial goal is to
> build a large community. We firmly believe that CarbonData will become a
> key project for big data column-like platforms, and so we are betting on
> a large community of users and developers.
>
> == Known Risks ==
>
> Development has been sponsored mostly by one company. For the project to
> fully transition to the Apache Way governance model, development must
> shift towards the meritocracy-centric model of growing a community of
> contributors, balanced with the need for extreme stability and core
> implementation coherency.
>
> == Orphaned products ==
>
> Huawei is fully committed to CarbonData. Moreover, Huawei has a vested
> interest in making CarbonData succeed by driving its close integration
> with sister ASF projects. We expect this to further reduce the risk of
> orphaning the product.
>
> == Inexperience with Open Source ==
>
> Huawei has been developing and using open source software for a long
> time. Additionally, several ASF veterans have agreed to mentor the
> project and are listed in this proposal. The project will rely on their
> guidance and collective wisdom to quickly transition the entire team of
> initial committers towards practicing the Apache Way.
>
> == Reliance on Salaried Developers ==
>
> Most of the contributors are paid to work in the big data space. While
> they might wander from their current employers, they are unlikely to
> venture far from their core expertise and thus will continue to be
> engaged with the project regardless of their current employers.
>
> == An Excessive Fascination with the Apache Brand ==
>
> While we intend to leverage the Apache ‘branding’ when talking to other
> projects as a testament to our project’s ‘neutrality’, we have no plans
> to make use of the Apache brand in press releases or to post billboards
> advertising the acceptance of CarbonData into the Apache Incubator.
>
> == Initial Source ==
>
> https://github.com/HuaweiBigData/carbondata.git
>
> == External Dependencies ==
>
> All external dependencies are licensed under an Apache 2.0 license or an
> Apache-compatible license. As we grow the CarbonData community we will
> configure our build process to require and validate that all
> contributions and dependencies are licensed under the Apache 2.0 license
> or an Apache-compatible license.
>
> * Apache Spark
> * Apache Hadoop
> * Apache Maven
> * Apache Commons
> * Apache Log4j
> * Apache Thrift
> * Apache Zookeeper
> * Scala
> * Snappy
> * Kettle (Pentaho)
> * Eigenbase
> * Fastutil
> * GSON
> * Jmockit
> * Junit
>
> == Required Resources ==
>
> === Mailing lists ===
>
> * priv...@carbondata.incubator.apache.org (moderated subscriptions)
> * comm...@carbondata.incubator.apache.org
> * d...@carbondata.incubator.apache.org
> * iss...@carbondata.incubator.apache.org
>
> === Git Repository ===
>
> * https://git-wip-us.apache.org/repos/asf/incubator-carbondata.git
>
> === Issue Tracking ===
>
> * JIRA Project CarbonData (CarbonData)
>
> === Initial Committers ===
>
> * Liang Chenliang
> * Jean-Baptiste Onofré
> * Henry Saputra
> * Uma Maheswara Rao G
> * Jenny MA
> * Jacky Likun
> * Vimal Das Kammath
> * Jarray Qiuheng
>
> === Affiliations ===
>
> * Huawei: Liang Chenliang
> * Talend: Jean-Baptiste Onofré
> * eBay: Henry Saputra
> * Intel: Uma Maheswara Rao G
>
> === Sponsors ===
>
> === Champion ===
>
> * Jean-Baptiste Onofré - Apache Member
>
> === Mentors ===
>
> * Henry Saputra (eBay)
> * Jean-Baptiste Onofré (Talend)
> * Uma Maheswara Rao G (Intel)
>
> === Sponsoring Entity ===
>
> The Apache Incubator
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> general-unsubscr...@incubator.apache.org
> For additional commands, e-mail: general-h...@incubator.apache.org
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com