On Thu, Jun 7, 2018 at 11:55 PM, Li,De(BDG) <l...@baidu.com> wrote:

> Hi, Jim
>
> Thank you for your response.
> Actually, we started Palo several years ago, and at that time we developed the storage engine based on Mesa technology.
> Meanwhile we found Impala is a very good MPP SQL query engine, so we integrated them together.
>
From what I can tell of the Palo source, it's not so much an integration as a copied-and-modified codebase, right? i.e. Palo does not use Impala as a dependency, but rather shares a lot of code from the Impala project that has since diverged.

>
> With this integration, the goal of Palo is to implement a single, full-featured, MySQL protocol compatible data warehouse.
>

That sounds pretty similar to the goals of the Impala project. Impala isn't MySQL-compatible at the moment, but that seems more like a particular feature that could be added rather than a distinct identity of the project. Otherwise, Impala's goal is to be a full-featured data warehouse engine as well.

Generally Apache has no rules against multiple projects fulfilling similar goals or use cases, even when those projects might compete. However, I think it would be relatively unusual to incubate a project that appears to be derived from a fork of an existing project, at least without first considering whether the additional feature set could be contributed back to the existing community.

-Todd

> On 2018/6/8, 1:55 PM, "Jim Apple" <jbap...@apache.org> wrote:
>
> >Hello! As a contributor to Impala, I’d be interested in hearing thoughts from the Palo community about integration between Impala and Palo.
> >
> >For instance, are there any apparent design goals of Impala that the Palo community thinks are fundamentally incompatible with Palo?
> >
> >Thanks,
> >Jim
> >
> >On 2018/06/08 04:45:32, "Li,De(BDG)" <l...@baidu.com> wrote:
> >> Hi all,
> >>
> >> I am Reed, a developer working with the team on Palo (an MPP-based interactive SQL data warehouse).
> >> https://github.com/baidu/palo/wiki/Palo-Overview
> >>
> >> We propose to contribute Palo as an Apache Incubator project, and we are still looking for a possible Champion if anyone would like to volunteer. Thanks a lot.
> >>
> >> Best Regards,
> >> Reed
> >>
> >> ===================
> >> The draft of the proposal is below:
> >>
> >> #Apache Palo
> >>
> >> ##Abstract
> >>
> >> Palo is an MPP-based interactive SQL data warehouse for reporting and analysis.
> >>
> >> ##Proposal
> >>
> >> We propose to contribute the Palo codebase and associated artifacts (e.g. documentation, web-site content etc.) to the Apache Software Foundation with the intent of forming a productive, meritocratic and open community around Palo’s continued development, according to the ‘Apache Way’.
> >>
> >> Baidu owns several trademarks regarding Palo, and proposes to transfer ownership of those trademarks in full to the ASF.
> >>
> >> ###Overview of Palo
> >>
> >> Palo’s implementation consists of two daemons: Frontend (FE) and Backend (BE).
> >>
> >> **Frontend daemon** consists of a query coordinator and a catalog manager. The query coordinator is responsible for receiving users’ SQL queries, compiling queries and managing query execution. The catalog manager is responsible for managing metadata such as databases, tables, partitions and replicas. Several frontend daemons can be deployed to provide fault tolerance and load balancing.
> >>
> >> **Backend daemon** stores the data and executes the query fragments. Many backend daemons can also be deployed to provide scalability and fault tolerance.
> >>
> >> A typical Palo cluster is generally composed of several frontend daemons and dozens to hundreds of backend daemons.
> >>
> >> Users can use MySQL client tools to connect to any frontend daemon to submit SQL queries. The Frontend receives the query and compiles it into query plans executable by the Backend. The Frontend then sends the query plan fragments to the Backend, which builds a query execution DAG. Data is fetched and pipelined into the DAG. The final result is sent to the client via the Frontend.
> >> The distribution of query fragment execution takes minimizing data movement and maximizing scan locality as its main goals.
> >>
> >> ##Background
> >>
> >> At Baidu, prior to Palo, different tools were deployed to meet diverse requirements in many ways. When a use case required the simultaneous availability of capabilities that could not all be provided by a single tool, users were forced to build hybrid architectures that stitch multiple tools together, but we believe that they shouldn’t need to accept such inherent complexity. A storage system built to provide great performance across a broad range of workloads provides a more elegant solution to the problems that hybrid architectures aim to solve. Palo is that solution.
> >>
> >> Palo is designed to be a simple, single, tightly coupled system that does not depend on other systems. Palo provides high-concurrency, low-latency point query performance, and also provides high-throughput queries for ad-hoc analysis. Palo provides bulk-batch data loading, and also provides near real-time mini-batch data loading. Palo also provides high availability, reliability, fault tolerance, and scalability.
> >>
> >> ##Rationale
> >>
> >> Palo mainly integrates the technology of Google Mesa and Apache Impala.
> >>
> >> Mesa is a highly scalable analytic data storage system that stores critical measurement data related to Google's Internet advertising business. Mesa is designed to satisfy a complex and challenging set of users’ and systems’ requirements, including near real-time data ingestion and query ability, as well as high availability, reliability, fault tolerance, and scalability for large data and query volumes.
> >>
> >> Impala is a modern, open-source MPP SQL engine architected from the ground up for the Hadoop data processing environment.
> >> At present, by virtue of its superior performance and rich functionality, Impala is comparable to many commercial MPP database query engines. Mesa can satisfy many of our storage requirements; however, Mesa itself does not provide a SQL query engine. Impala is a very good MPP SQL query engine, but it lacks a perfect distributed storage engine. So in the end we chose the combination of these two technologies.
> >>
> >> Learning from Mesa’s data model, we developed a distributed storage engine. Unlike Mesa, this storage engine does not rely on any distributed file system. We then deeply integrated this storage engine with the Impala query engine. Query compilation, query execution coordination and catalog management of the storage engine are integrated into the frontend daemon; query execution and data storage are integrated into the backend daemon. With this integration, we implemented a single, full-featured, high-performance, state-of-the-art MPP database while maintaining simplicity.
> >>
> >> ##Current Status
> >>
> >> Palo is an open source project on GitHub (https://github.com/baidu/palo).
> >>
> >> ###Meritocracy
> >>
> >> Palo has been deployed in production at Baidu and serves more than 200 lines of business. It has demonstrated great performance benefits and has proved to be a better way to do reporting and analysis on big data. Still, we look forward to growing a rich user and developer community.
> >>
> >> ###Community
> >>
> >> Palo seeks to develop developer and user communities during incubation.
> >>
> >> ###Core Developers
> >>
> >> * Ruyue Ma (https://github.com/maruyue, maru...@baidu.com)
> >> * Chun Zhao (https://github.com/imay, buaa.zh...@gmail.com)
> >> * Mingyu Chen (https://github.com/morningman, chenmin...@baidu.com)
> >> * De Li (https://github.com/lide-reed, mailtol...@sina.com)
> >> * Hao Chen (https://github.com/chenhao7253886, chenha...@baidu.com)
> >> * Chaoyong Li (https://github.com/cyongli, lichaoy...@baidu.com)
> >> * Bin Lin (https://github.com/lingbin, lingbi...@gmail.com)
> >>
> >> ###Alignment
> >>
> >> Palo is related to several other Apache projects:
> >>
> >> * Palo can also read data stored in Apache Hadoop clusters powered by the HDFS filesystem.
> >> * Palo is closely integrated with Impala, which is also being proposed to the Incubator.
> >> * Palo uses Apache Thrift as its RPC and serialization framework of choice.
> >>
> >> ##Known Risks
> >>
> >> ###Orphaned Products
> >>
> >> The core developers of the Palo team plan to work full time on this project. There is very little risk of Palo getting orphaned since at least one large company (Baidu) is extensively using it in production. For example, currently there are more than 200 use cases using Palo in production. Furthermore, since Palo was open sourced at the beginning of October 2017, it has received more than 660 stars and been forked nearly 170 times. We plan to extend and diversify this community further through Apache.
> >>
> >> ###Inexperience with Open Source
> >>
> >> The core developers are all active users and followers of open source. They are already committers and contributors to the Palo Github project.
> >> All have been involved with the source code that has been released under an open source license, and several of them also have experience developing code in an open source environment. Though the core set of developers do not have Apache open source experience, there are plans to onboard individuals with Apache open source experience on to the project.
> >>
> >> ###Homogenous Developers
> >>
> >> Most of the core developers are from Baidu, but after Palo was open sourced, Palo received a lot of bug fixes and enhancements from other developers not working at Baidu.
> >>
> >> ###Reliance on Salaried Developers
> >>
> >> Baidu invested in Palo as the OLAP solution and some of its key engineers are working full time on the project. In addition, since there is a growing Big Data need for scalable OLAP solutions, we look forward to other Apache developers and researchers contributing to the project. Also key to addressing the risk associated with relying on salaried developers from a single entity is to increase the diversity of the contributors and actively lobby for domain experts in the BI space to contribute. Apache Palo intends to do this.
> >>
> >> ###An Excessive Fascination with the Apache Brand
> >>
> >> Palo is proposing to enter incubation at Apache in order to help efforts to diversify the committer base, not so much to capitalize on the Apache brand. The Palo project is in production use already inside Baidu, but is not expected to be a Baidu product for external customers. As such, the Palo project is not seeking to use the Apache brand as a marketing tool.
> >>
> >> ##Documentation
> >>
> >> Information about Palo can be found at https://github.com/baidu/palo.
> >> The following links provide more information about Palo in open source:
> >>
> >> * Palo wiki site: https://github.com/baidu/palo/wiki
> >> * Codebase at Github: https://github.com/baidu/palo
> >> * Issue Tracking: https://github.com/baidu/palo/issues
> >> * Overview: https://github.com/baidu/palo/wiki/Palo-Overview
> >> * FAQ: https://github.com/baidu/palo/wiki/Palo-FAQ
> >>
> >> ##Initial Source
> >>
> >> Palo has been under development since 2017 by a team of engineers at Baidu Inc. It is currently hosted on Github.com under an Apache license at https://github.com/baidu/palo.
> >>
> >> ##External Dependencies
> >>
> >> Palo has the following external dependencies.
> >>
> >> * Google gflags (BSD)
> >> * Google glog (BSD)
> >> * Apache Thrift (Apache Software License v2.0)
> >> * Apache Commons (Apache Software License v2.0)
> >> * Boost (Boost Software License)
> >> * OpenLDAP (OpenLDAP Software License)
> >> * rapidjson (Tencent)
> >> * Google RE2 (BSD-style)
> >> * lz4 (BSD)
> >> * snappy (BSD)
> >> * cyrus-sasl (CMU License)
> >> * Twitter Bootstrap (Apache Software License v2.0)
> >> * d3 (BSD)
> >> * LLVM (BSD-like)
> >>
> >> Build and test dependencies:
> >>
> >> * ant (Apache Software License v2.0)
> >> * Apache Maven (Apache Software License v2.0)
> >> * cmake (BSD)
> >> * clang (BSD)
> >> * Google gtest (Apache Software License v2.0)
> >>
> >> ##Required Resources
> >>
> >> ###Mailing List
> >>
> >> There are currently no mailing lists. The usual mailing lists are expected to be set up when entering incubation:
> >>
> >> priv...@palo.incubator.apache.org
> >> d...@palo.incubator.apache.org
> >> comm...@palo.incubator.apache.org
> >>
> >> ###Subversion Directory
> >>
> >> Upon entering incubation: https://github.com/baidu/palo.
> >> After incubation, we want to move the existing repo from https://github.com/baidu/palo to Apache infrastructure.
> >>
> >> ###Issue Tracking
> >>
> >> Palo currently uses GitHub to track issues. We would like to continue to do so while we discuss migration possibilities with the ASF Infra committee.
> >>
> >> ###Other Resources
> >>
> >> The existing code already has unit tests, so we will make use of existing Apache continuous testing infrastructure. The resulting load should not be very large.
> >>
> >> ##Initial Committers
> >>
> >> * Ruyue Ma (https://github.com/maruyue, maru...@baidu.com)
> >> * Chun Zhao (https://github.com/imay, buaa.zh...@gmail.com)
> >> * Mingyu Chen (https://github.com/morningman, chenmin...@baidu.com)
> >> * De Li (https://github.com/lide-reed, mailtol...@sina.com)
> >> * Hao Chen (https://github.com/chenhao7253886, chenha...@baidu.com)
> >> * Chaoyong Li (https://github.com/cyongli, lichaoy...@baidu.com)
> >> * Bin Lin (https://github.com/lingbin, lingbi...@gmail.com)
> >>
> >> ##Affiliations
> >>
> >> The initial committers are employees of Baidu Inc. The nominated mentors are employees of TODO.
> >>
> >> ##Sponsors
> >>
> >> ###Champion
> >>
> >> TODO
> >>
> >> ###Nominated Mentors
> >>
> >> * Sijie Guo, guosi...@gmail.com
> >> * Luke Han, luke...@apache.org
> >> * Zheng Shao, zs...@apache.org
> >>
> >> ###Sponsoring Entity
> >>
> >> We are requesting the Incubator to sponsor this project.
> >>
> >
> >---------------------------------------------------------------------
> >To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> >For additional commands, e-mail: general-h...@incubator.apache.org

--
Todd Lipcon
Software Engineer, Cloudera