Re: [VOTE] Accept Wayang into the Apache Incubator

Jean-Baptiste Onofre Fri, 11 Dec 2020 08:51:31 -0800

+1 (binding)

Regards
JB


> On 11 Dec 2020, at 17:33, Christofer Dutz <christofer.d...@c-ware.de> wrote:
> 
> Hi all,
> 
> Following up on the [DISCUSS] thread on Wayang 
> (https://lists.apache.org/thread.html/r5fc03ae014f44c7c31a509a6db4ac07faedb2e1c6245cd917b744826%40%3Cgeneral.incubator.apache.org%3E),
> I would like to call a VOTE to accept Wayang (aka Rheem) into the Apache 
> Incubator.
> 
> Please cast your vote:
> 
>  [ ] +1, bring Wayang into the Incubator
>  [ ] +0, I don't care either way
>  [ ] -1, do not bring Wayang into the Incubator, because...
> 
> The vote will be open for at least 72 hours. Only votes from the Incubator 
> PMC are binding, but votes from everyone are welcome.
> 
> Chris
> 
> -----
> 
> Wayang Proposal 
> (https://cwiki.apache.org/confluence/display/INCUBATOR/WayangProposal)
> 
> == Abstract ==
> 
> Wayang is a cross-platform data processing system that aims at decoupling the 
> business logic of data analytics applications from concrete data processing 
> platforms, such as Apache Flink or Apache Spark. Hence, it tames the 
> complexity that arises from the "Cambrian explosion" of novel data processing 
> platforms that we currently witness.
> 
> Note that the Wayang project is the Rheem project; we have renamed it 
> because of trademark issues.
> 
> You can find the project web page at: https://rheem-ecosystem.github.io/
> 
> == Proposal ==
> 
> Wayang is a cross-platform system that provides an abstraction over data 
> processing platforms to free users from the burdens of (i) performing tedious 
> and costly data migration and integration tasks to run their applications, 
> and (ii) choosing the right data processing platforms for their applications. 
> To achieve this, Wayang: (1) provides an abstraction on top of existing data 
> processing platforms that allows users to specify their data analytics tasks 
> in the form of a DAG of operators; (2) comes with a cross-platform optimizer 
> for automating the selection of suitable/efficient platforms; and (3) takes 
> care of executing the optimized plan, including communication across 
> platforms. In summary, Wayang has the following salient features:
> 
> - Flexible Data Model - It considers a flexible and simple data model based 
> on data quanta. A data quantum is an atomic processing unit in the system 
> that can represent a large spectrum of data formats, such as data points for 
> a machine learning application, tuples for a database application, or RDF 
> triples. Hence, Wayang is able to express a wide range of data analytics 
> tasks.
> - Platform independence - It provides a simple interface (currently Java and 
> Scala) that is inspired by established programming models, such as that of 
> Apache Spark and Apache Flink. Users represent their data analytic tasks as a 
> DAG (Wayang plan), where vertices correspond to Wayang operators and edges 
> represent data flows (data quanta flowing) among these operators. A Wayang 
> operator defines a particular kind of data transformation over an input data 
> quantum, ranging from basic functionality (e.g., transformations, filters, 
> joins) to complex, extensible tasks (e.g., PageRank).
> - Cross-platform execution - Besides running a data analytic task on any data 
> processing platform, it also comes with an optimizer that can decide to 
> execute a single data analytic task using multiple data processing platforms. 
> This allows for exploiting the capabilities of different data processing 
> platforms to perform complex data analytic tasks more efficiently.
> - Self-tuning UDF-based cost model - Its optimizer uses a cost model fully 
> based on UDFs. This not only enables Wayang to learn the cost functions of 
> newly added data processing platforms, but also allows developers to tune the 
> optimizer at will.
> - Extensibility - It treats data processing platforms as plugins to allow 
> users (developers) to easily incorporate new data processing platforms into 
> the system. This is achieved by exposing the functionalities of data 
> processing platforms as operators (execution operators). The same approach is 
> followed at the Wayang interface, where users can also extend Wayang 
> capabilities, i.e., the operators, easily.
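
The features above can be illustrated with a minimal sketch. This is not Wayang's actual API. It assumes a plan simplified from a DAG to a linear chain of operators, one user-supplied cost function per platform (a stand-in for the UDF-based cost model), and a fixed penalty for moving data between platforms. All names (`Operator`, `COST_UDFS`, `MOVE_COST`, `choose_platforms`) and all numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operator:
    name: str
    cardinality: int  # estimated number of input data quanta


# Hypothetical cost UDFs: platform name -> function(operator) -> estimated cost.
COST_UDFS = {
    "java": lambda op: op.cardinality,               # no startup cost, slow per quantum
    "spark": lambda op: 500 + op.cardinality // 10,  # startup overhead, fast per quantum
}

MOVE_COST = 200  # hypothetical cost of moving data between two platforms


def choose_platforms(plan):
    """Return (total_cost, platform per operator) for a linear plan.

    Dynamic programming over the chain: best[p] holds the cheapest way to
    run all operators seen so far with the last one placed on platform p.
    """
    platforms = sorted(COST_UDFS)
    best = {p: (COST_UDFS[p](plan[0]), [p]) for p in platforms}
    for op in plan[1:]:
        step = {}
        for p in platforms:
            candidates = []
            for q in platforms:
                prev_cost, prev_assignment = best[q]
                move = 0 if q == p else MOVE_COST
                candidates.append(
                    (prev_cost + move + COST_UDFS[p](op), prev_assignment + [p])
                )
            step[p] = min(candidates, key=lambda c: c[0])
        best = step
    return min(best.values(), key=lambda c: c[0])


plan = [Operator("read", 100), Operator("join", 100_000), Operator("collect", 10)]
total_cost, assignment = choose_platforms(plan)
print(assignment, total_cost)  # ['java', 'spark', 'java'] 11010
```

With these made-up numbers, the small `read` and `collect` steps stay on plain Java while the large `join` is sent to Spark, so a single task is split across platforms even after paying the data movement penalty twice.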
> 
> We plan to work on the stability of all these features as well as on 
> extending Wayang with more advanced features. Furthermore, Wayang currently 
> supports Apache Spark, Standalone Java, GraphChi, and relational databases 
> (via JDBC). We plan to incorporate more data processing platforms, such as 
> Apache Flink and Apache Hive.
> 
> === Background ===
> 
> Many organizations and companies collect or produce a large variety of data 
> and apply data analytics over it. This is because insights from data rapidly 
> allow them to make better decisions. Thus, the pursuit of efficient and 
> scalable data analytics as well as the one-size-does-not-fit-all philosophy 
> has given rise to a plethora of data processing platforms. Examples of these 
> specialized processing platforms range from DBMSs to MapReduce-like platforms.
> 
> However, today's data analytics are moving beyond the limits of a single data 
> processing platform. More and more applications need to perform complex data 
> analytics over several data processing platforms. For example, (i) IBM 
> reported that North York hospital needs to process 50 diverse datasets, which 
> are on a dozen different internal systems, (ii) oil & gas companies stated 
> they need to process large amounts of data they produce every day, e.g., a 
> single oil company can produce more than 1.5TB of diverse (structured and 
> unstructured) data per day, (iii) Fortune magazine stated that airlines need 
> to analyze large datasets, which are produced by different departments, are 
> of different data formats, and reside on multiple data sources, to produce 
> global reports for decision makers, and (iv) Hewlett Packard has claimed 
> that, according to its customer portfolio, business intelligence typically 
> requires a single analytics pipeline using different processing platforms at 
> different parts of the pipeline. These are just a few examples of emerging 
> applications that require a diversity of data processing platforms.
> 
> Today, developers have to deal with this myriad of data processing platforms. 
> That is, they have to choose the right data processing platform for their 
> applications (or data analytic tasks) and familiarize themselves with the 
> intricacies of the different platforms to achieve high efficiency and 
> scalability. Several systems have also appeared with the goal of helping 
> users to easily glue several platforms together, such as Apache Drill, 
> PrestoDB, and Luigi. Nevertheless, all these systems still require 
> considerable expertise from users to decide which data processing platforms 
> to use for the data analytic task at hand. As a consequence, great 
> engineering effort is required to unify the data from various sources, to 
> combine the processing capabilities of different platforms, and to maintain 
> those applications, so as to unleash the full potential of the data. In the 
> worst case, such applications are not built in the first place, as doing so 
> seems too daunting an endeavor.
> 
> === Rationale ===
> 
> It is evident that there is an urgent need to release developers from the 
> burden of knowing all the intricacies of choosing and gluing together data 
> processing platforms for supporting their applications (data analytic tasks). 
> Developers should be able to focus only on the logic of their applications. 
> Surprisingly, there is no open source system trying to satisfy this need. 
> Wayang aims at filling this gap. It does so by providing both a common 
> interface over data processing platforms and an optimizer to execute data 
> analytic tasks on the right data processing platform(s) seamlessly. As 
> Apache is the place where most of the important big data systems are, we 
> consider Apache the right place for Wayang.
> 
> === Current Status ===
> 
> The current version of Wayang (v0.5.0) was initially co-developed by staff, 
> students, and interns at the Qatar Computing Research Institute (QCRI) and 
> the Hasso-Plattner Institute (HPI). The project was initiated at and 
> sponsored by QCRI in 2015 with the goal of freeing data scientists and 
> developers from the intricacies of data processing platforms to support their 
> analytic tasks. The first open source release of Wayang was made only a 
> year and a half later, on June 13th, 2016, under the Apache Software 
> License 2.0. Since then we have made several releases; the latest was 
> on January 23rd, 2019.
> 
> ** Meritocracy **
> 
> All current Wayang developers are familiar with the development process at 
> Apache and are currently trying to follow this meritocratic process as much 
> as possible. For example, Wayang already follows a committer principle where any 
> pull request is analyzed by at least one Wayang core developer. This was one 
> of the reasons for choosing Apache for Wayang as we all want to encourage and 
> keep this style of development for Wayang.
> 
> ** Community **
> 
> Wayang started as a pure research project, but it quickly started developing 
> into a community. People from HPI quickly joined our efforts almost from the 
> very beginning to make this project a reality. Recently, the Berlin Institute 
> of Technology (TU Berlin) and the Pontifical Catholic University of 
> Valparaiso (PUCV) in Chile have also joined our efforts for developing 
> Wayang. A company, called Scalytics, has been created around Wayang. 
> Currently, we are intensively seeking to further develop both developer and 
> user communities. To keep broadening the community, we plan to also exploit 
> our ongoing academic collaborations with multiple universities in Berlin and 
> companies that we collaborate with. For instance, Wayang is already being 
> utilized for accessing multiple data sources in the context of a large data 
> analytics project led by TU Berlin and Huawei. We also believe that Wayang's 
> extensible architecture (i.e., adding new operators and platforms) will 
> further encourage community participation. During incubation we plan to have 
> Wayang adopted by at least one company and will explicitly seek more 
> industrial participation.
> 
> ** Core Developers **
> 
> The initial developers of the project are diverse: they come from four 
> different institutions (TU Berlin, Scalytics, PUCV, and HBKU). We will work 
> aggressively to grow the community during the incubation by recruiting more 
> developers from other institutions.
> 
> ** Alignment **
> 
> We believe Apache is the most natural home for taking Wayang to the next 
> level. Apache is currently hosting the most important big data systems. 
> Hadoop, Spark, Flink, HBase, Hive, Tez, Reef, Storm, Drill, and Ignite are 
> just some examples of these technologies. Wayang fills a significant gap 
> that exists in the big data open source world: it provides a common 
> abstraction for all these platforms and decides on which platforms to run a 
> single data analytic task. Wayang is now being developed following the Apache-style 
> development model. Also, it is well-aligned with the Apache principle of 
> building a community to impact the big data community.
> 
> === Known Risks ===
> 
> ** Orphaned Products **
> 
> Currently, Wayang is the core technology behind Scalytics Inc. As a result, 
> a team of two engineers is working full time on this project. 
> Recently, three more developers have joined our efforts in building Wayang. 
> Thus, the risk of Wayang becoming orphaned is very low. Moreover, 
> people outside Scalytics (from TU Berlin and HBKU) have also joined the 
> project, which makes the risk of abandoning the project even lower. The PUCV 
> in Chile is also beginning to contribute to the code base and to develop a 
> declarative query language on top of Wayang. The project is constantly being 
> monitored by email and frequent Skype meetings as well as by weekly meetings 
> with Scalytics people. Additionally, at the end of each year, we meet to 
> discuss the status of the project as well as to plan the most important 
> aspects we should work on during the year after.
> 
> ** Inexperience with Open Source **
> 
> Wayang quickly started being developed in open source under the Apache 
> Software License 2.0. The source code is available on GitHub. Also, a few of 
> the initial committers have contributed to other open source projects: Hadoop 
> and Flume.
> 
> ** Homogeneous Developers **
> 
> The initial committers are already geographically distributed among Chile, 
> Germany, and Qatar. During incubation, one of our main goals is to increase 
> the heterogeneity of the current community and we will work hard to achieve 
> it.
> 
> ** Reliance on salaried developers **
> 
> Wayang is already being developed by a mix of full-time and volunteer 
> contributors. Only 2 of the initial committers are working full time on this 
> project (Scalytics). So, we are confident that the project will not decrease 
> its development pace. Furthermore, we are committed to recruiting additional 
> committers to significantly increase the development pace of the project.
> 
> ** Relationships with other Apache products **
> 
> Wayang is somewhat related to Apache Spark, as its development interface is 
> inspired by Spark. In contrast to Spark, Wayang is not a data processing 
> platform, but a mediator between user applications and data processing 
> platforms. In this sense, Wayang is similar to the Apache Drill and Apache 
> Beam projects. However, Wayang significantly differs from Apache Drill in two 
> main aspects. First, Apache Drill provides only a common interface to query 
> multiple data stores, and hence users have to specify in their query the 
> data to fetch. Apache Drill then translates the query for the processing 
> platforms where the data is stored, e.g., into a MongoDB query representation. 
> In contrast, in Wayang, users only specify the data path and Wayang decides 
> which data processing platforms are best (performance-wise) for processing 
> such data. Second, the query interface in Apache Drill is SQL, whereas Wayang 
> uses an interface based on operators forming DAGs. On this latter point, we 
> are currently developing a Pig Latin-like query language for Wayang. In 
> addition, in contrast to Apache Beam, Wayang not only allows users to use 
> multiple data processing platforms at the same time, but also provides an 
> optimizer to choose the most efficient platform for the task at hand. In 
> Apache Beam, users have to specify an appropriate runner (platform).
> 
> Given these similarities with the two Apache projects mentioned above, we are 
> looking forward to collaborating with those communities. We would also love 
> to collaborate with other Apache communities.
> 
> ** An excessive fascination with the Apache Brand **
> 
> Wayang solves a real problem that users and developers currently have to deal 
> with at a high cost: monetary cost, high design and development effort, and 
> considerable time. Therefore, we believe that Wayang can be successful in 
> building a large community around it. We are convinced that the Apache brand 
> and community process will significantly help us in building such a community 
> and in establishing the project in the long term. We simply believe that the 
> ASF is the right home for Wayang to achieve this.
> 
> === Documentation ===
> 
> Further details, documentation, and publications related to Wayang can be 
> found at https://docs.rheem.io/rheem/
> 
> === Initial Source ===
> 
> The current source code of Wayang resides on GitHub:
> https://github.com/rheem-ecosystem/rheem
> 
> === External Dependencies ===
> 
> Wayang depends on the following Apache projects:
> 
> * Maven
> * HDFS
> * Hadoop
> * Spark
> 
> Wayang depends on the following other open source projects organized by 
> license:
> 
> org.json.json: Json (http://json.org/license.html) 
> SnakeYAML: Apache 2.0
> Java Unified Expression Language API (Juel): Apache 2.0
> ProfileDB Instrumentation: Apache 2.0
> Gson: Apache 2.0
> Hadoop: Apache 2.0
> Scala: Apache 2.0
> Antlr 4: BSD
> Jackson: Apache 2.0
> Junit 5: EPL 2.0
> Mockito: MIT
> Assertj: Apache 2.0
> logback-classic: EPL 1.0 LGPL 2.1
> slf4j: MIT
> GNU Trove: LGPL 2.1
> graphchi: Apache 2.0
> SQLite JDBC: Apache 2.0
> PostgreSQL: BSD 2-clause
> jcommander: Apache 2.0
> Koloboke Collections API: Apache 2.0
> Snappy Java: Apache 2.0
> Apache Spark: Apache 2.0
> HyperSQL Database: BSD Modified (http://hsqldb.org/web/hsqlLicense.html) 
> Apache Giraph: Apache 2.0
> Apache Flink: Apache 2.0
> Apache Commons IO: Apache 2.0
> Apache Commons Lang: Apache 2.0
> Apache Maven: Apache 2.0
> 
> === Cryptography ===
> 
> (not applicable)
> 
> === Required Resources ===
> 
> ** Mailing Lists **
> 
> * mailto:priv...@wayang.incubator.apache.org
> * mailto:d...@wayang.incubator.apache.org
> * mailto:comm...@wayang.incubator.apache.org
> 
> ** Git repositories **
> 
> git://git.apache.org/repos/asf/incubator/wayang
> 
> ** Issue tracking **
> 
> https://issues.apache.org/jira/browse/RHEEM
> 
> === Initial Committers ===
> 
> The following list gives the planned initial committers (in alphabetical 
> order):
> 
> * Bertty Contreras-Rojas <bertty@scalytics.io>
> * Rodrigo Pardo-Meza <rodrigo@scalytics.io>
> * Alexander Alten-Lorenz <alo@scalytics.io>
> * Zoi Kaoudi <zoi.kaoudi@tu-berlin.de>
> * Haralampos Gavriilidis <gavriilidis@tu-berlin.de>
> * Jorge-Arnulfo Quiane-Ruiz <jorge.quiane@tu-berlin.de>
> * Anis Troudi <atroudi@hbku.edu.qa>
> * Wenceslao Palma-Muñoz <wenceslao.palma@pucv.cl>
> 
> ** Affiliations **
> 
> * Scalytics Inc.
> ** Bertty Contreras-Rojas
> ** Rodrigo Pardo-Meza
> ** Alexander Alten-Lorenz
> * Berlin Institute of Technology (TU Berlin)
> ** Zoi Kaoudi
> ** Haralampos Gavriilidis
> ** Jorge-Arnulfo Quiane-Ruiz
> * Hamad Bin Khalifa University (HBKU)
> ** Anis Troudi
> * Pontifical Catholic University of Valparaiso, Chile (PUCV)
> ** Wenceslao Palma-Muñoz
> 
> === Sponsors ===
> 
> ** Champion **
> 
> * Christofer Dutz (christofer.dutz at c-ware dot de)
> 
> ** Mentors **
> 
> * (cdutz) Christofer Dutz
> * (larsgeorge) Lars George
> * (berndf) Fondermann
> * (jbonofre) Jean-Baptiste Onofré
> 
> ** Sponsoring Entity **
> 
> The Apache Incubator
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> For additional commands, e-mail: general-h...@incubator.apache.org
> 