The Tez incubator proposal seems to have a lot in common with the work on https://issues.apache.org/jira/browse/OOZIE-1178
> It is useful to have a workflow application master, which will be capable
> of running a DAG of jobs. The workflow client submits a DAG request to the
> AM and then the AM will manage the life cycle of this application in terms
> of requesting the needed resources from the RM, and starting, monitoring
> and retrying the application's individual tasks.
>
> Compared to running Oozie with the current MapReduce Application Master,
> these are some of the advantages:
>
> - Less number of consumed resources, since only one application master
>   will be spawned for the whole workflow.
> - Reuse of resources, since the same resources can be used by multiple
>   consecutive jobs in the workflow (no need to request/wait for resources
>   for every individual job from the central RM).
> - More optimization opportunities in terms of collective resource
>   requests.
> - Optimization opportunities in terms of rewriting and composing jobs
>   in the workflow (e.g. pushing down Mappers).
> - This Application Master can be reused/extended by higher systems
>   like Pig and hive to provide an optimized way of running their workflows.

So, is this the 'yapp' proposal that was discussed on that thread?

~avik

On Mon, Feb 18, 2013 at 9:40 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:

> This seems like a reasonable project (basically it is the long fabled
> map-reduce-reduce or MCR* in google terminology).
>
> But it is *very* heavy with Hortonworks developers. By my count, the
> proportion is over half from HW with only token representation from other
> companies:
>
> 13 Hortonworks
> 4 Yahoo
> 3 Facebook
> 2 Microsoft
> 1 Cloudera
>
> Shouldn't this be a bit broader to start with? Or is that an incubation
> task?
>
> On Mon, Feb 18, 2013 at 9:29 PM, Arun C Murthy <a...@hortonworks.com> wrote:
>
> > Folks,
> >
> > I'd like to propose adding Tez to the Apache Incubator:
> > http://wiki.apache.org/incubator/TezProposal
> >
> > Essentially, it's the next step to improve projects in the Apache Hadoop
> > ecosystem such as Apache Hive, Apache Pig, Cascading (ASL2, but not ASF
> > project) by providing a more complex DAG of 'tasks' in a single
> > application to process data, thereby providing significant advantages
> > for them.
> >
> > During the time I've spent working on MapReduce, I've forever heard
> > complaints from Pig/Hive folks about the fact that MapReduce provides a
> > very constrained task graph which results in an excessive number of
> > MapReduce jobs... *smile*. It's very exciting to take this next step, and
> > I would be thrilled to have it happen in the ASF - as you can see in the
> > proposal this effort has broad support from members of MapReduce, Hive &
> > Pig communities, many of whom are eager to participate and have already
> > contributed their efforts during the initial prototype.
> >
> > I welcome your feedback/discussion and look forward to it!
> >
> > thanks,
> > Arun
> > (proposed Champion)
> >
> > ----
> >
> > = Tez =
> >
> > == Abstract ==
> > Tez is an effort to develop a generic application framework which can be
> > used to process arbitrarily complex data-processing tasks and also a
> > re-usable set of data-processing primitives which can be used by other
> > projects.
> >
> > == Proposal ==
> > Tez is a proposal to develop a generic application which can be used to
> > process complex data-processing task DAGs and runs natively on Apache
> > Hadoop YARN. YARN is a generic resource-management system on which
> > currently applications like MapReduce already exist. MapReduce is a
> > specific, and constrained, DAG - which is not optimal for several
> > frameworks like Apache Hive and Apache Pig.
> > Furthermore, we propose to develop a re-usable set of libraries of
> > data-processing primitives such as sorting, merging, data-shuffling,
> > intermediate data management etc. which are necessary for Tez and which
> > we envision can be used directly by other projects.
> >
> > == Background ==
> > Apache Hadoop MapReduce has emerged as the assembly-language on which
> > other frameworks like Apache Pig and Apache Hive have been built.
> > However, it has been well accepted that MapReduce produces very
> > constrained task DAGs for each job which results in Apache Pig and Apache
> > Hive requiring multiple MapReduce jobs for several queries. By providing
> > a more expressive DAG of tasks for a job, Tez attempts to provide
> > significantly enhanced data-processing capabilities for projects like
> > Apache Pig, Apache Hive, Cascading etc.
> >
> > == Rationale ==
> > There is an important gap that Tez fills in the Apache Hadoop ecosystem
> > by allowing for more expressive task DAGs for data-processing
> > applications such as Apache Pig, Apache Hive, Cascading etc.
> >
> > With the emergence of Apache Hadoop YARN, there is a strong need for a
> > common DAG application which can then be shared by Apache Pig, Apache
> > Hive, Cascading etc.
> >
> > == Initial Goals ==
> > The initial goals for this project are to specify the detailed
> > requirements and architecture, and then develop the initial
> > implementation including the DAG ApplicationMaster to run natively inside
> > Apache Hadoop YARN.
> >
> > == Current Status ==
> > Significant work has been completed to identify the initial requirements
> > and define the overall system architecture. There is a patch available in
> > the internal Hortonworks git repository which can act as the initial
> > seed.
> >
> > === Meritocracy ===
> > We plan to invest in supporting a meritocracy. We will discuss the
> > requirements in an open forum.
> > Several companies have already expressed interest in this project, and
> > we intend to invite additional developers to participate. We will
> > encourage and monitor community participation so that privileges can be
> > extended to those that contribute.
> >
> > === Community ===
> > The need for a generic DAG application for data processing in the open
> > source is tremendous, so there is a potential for a very large community.
> > We believe that Tez's extensible architecture will further encourage
> > community participation. Also, related Apache projects (e.g., Pig, Hive)
> > have very large and active communities, and we expect that over time Tez
> > will also attract a large community.
> >
> > === Core Developers ===
> > The developers on the initial committers list include people very
> > experienced in the Apache Hadoop ecosystem:
> >
> > * Alan Gates <gates at apache dot org>
> > * Arun C Murthy <acmurthy at apache dot org>
> > * Ashutosh Chauhan <hashutosh at apache dot org>
> > * Bikas Saha <bikas at apache dot org>
> > * Chris Douglas <cdouglas at apache dot org>
> > * Daryn Sharp <daryn at apache dot org>
> > * Devaraj Das <ddas at apache dot org>
> > * Gopal Vijayaraghavan <gopal at hortonworks dot com>
> > * Gunther Hagleitner <ghagleitner at hortonworks dot com>
> > * Hitesh Shah <hitesh at apache dot org>
> > * Jason Lowe <jlowe at apache dot org>
> > * Jean Xu <jeanxu at facebook dot com>
> > * Jitendra Pandey <jitendra at apache dot org>
> > * Kevin Wilfong <kevinwilfong at apache dot org>
> > * Mike Liddell <mike dot lidell at microsoft dot com>
> > * Namit Jain <namit at apache dot org>
> > * Owen O'Malley <omalley at apache dot org>
> > * Robert Evans <bobby at apache dot org>
> > * Siddharth Seth <sseth at apache dot org>
> > * Tom White <tomwhite at apache dot org>
> > * Thomas Graves <tgraves at apache dot org>
> > * Vikram Dixit <vikram at apache dot org>
> > * Vinod Kumar Vavilapalli <vinodkv at apache dot org>
> >
> > We realize that though we have significant employer diversity already,
> > additional diversity is always better, and we will work aggressively to
> > recruit developers from additional companies.
> >
> > === Alignment ===
> > The initial committers strongly believe that a standard task DAG
> > application on Apache Hadoop YARN will gain broader adoption as an open
> > source, community driven project, where the community can contribute not
> > only to the core components, but also to a growing collection of
> > applications which will be based on top of Tez. Our hope is that the
> > Apache Hive, Apache Pig, Cascading and other communities will find
> > tremendous value in Tez and will adopt it en masse.
> >
> > == Known Risks ==
> >
> > === Orphaned Products ===
> > The contributors are leading users and vendors in the Apache Hadoop
> > ecosystem, with significant open source experience, so the risk of being
> > orphaned is relatively low. The project could be at risk if vendors
> > decided to change their strategies in the market. In such an event, the
> > current committers plan to continue working on the project on their own
> > time, though the progress will likely be slower. We plan to mitigate this
> > risk by recruiting additional committers.
> >
> > === Inexperience with Open Source ===
> > The initial committers include veteran Apache members (Committers, PMC
> > members and Apache Members) and other developers who have varying degrees
> > of experience with open source projects. All have been involved with
> > source code that has been released under an open source license, and
> > several also have experience developing code with an open source
> > development process.
> >
> > === Homogenous Developers ===
> > The initial committers are employed by a number of companies, including
> > Cloudera, Facebook, Hortonworks, Microsoft and Yahoo.
> > We are committed to recruiting additional committers from other
> > companies based on their contributions to the project even though we do
> > have significant diversity already.
> >
> > === Reliance on Salaried Developers ===
> > It is expected that Tez development will occur on both salaried time and
> > on volunteer time, after hours. The majority of initial committers are
> > paid by their employer to contribute to this project. However, they are
> > all passionate about the project, and we are confident that the project
> > will continue even if no salaried developers contribute to the project.
> > We are committed to recruiting additional committers including
> > non-salaried developers.
> >
> > === Relationships with Other Apache Products ===
> > As mentioned in the Alignment section, Tez is closely integrated with
> > Hadoop, Hive and Pig in numerous ways. We look forward to collaborating
> > with those communities, as well as other Apache communities.
> >
> > === An Excessive Fascination with the Apache Brand ===
> > Tez solves a real need for generic task DAG management in the Apache
> > Hadoop ecosystem, something which has been addressed in a very ad hoc
> > manner so far by multiple Apache projects. Our rationale for developing
> > Tez as an Apache project is detailed in the Rationale section. We believe
> > that the Apache brand and community process will help us attract more
> > contributors to this project, and help establish ubiquitous APIs.
> >
> > == Documentation ==
> > http://wiki.apache.org/incubator/TezProposal
> >
> > == Initial Source ==
> > Available as a patch.
> >
> > == Cryptography ==
> > Tez will eventually support encryption on the wire. This is not one of
> > the initial goals, and we do not expect Tez to be a controlled export
> > item due to the use of encryption.
> >
> > == Required Resources ==
> >
> > === Mailing List ===
> > * tez-private
> > * tez-dev
> > * tez-user
> >
> > === Subversion Directory ===
> > Git is the preferred source control system: git://git.apache.org/tez
> >
> > === Issue Tracking ===
> >
> > JIRA Tez (TEZ)
> >
> > == Initial Committers ==
> > * Alan Gates <gates at apache dot org>
> > * Arun C Murthy <acmurthy at apache dot org>
> > * Ashutosh Chauhan <hashutosh at apache dot org>
> > * Bikas Saha <bikas at apache dot org>
> > * Chris Douglas <cdouglas at apache dot org>
> > * Daryn Sharp <daryn at apache dot org>
> > * Devaraj Das <ddas at apache dot org>
> > * Gopal Vijayaraghavan <gopal at hortonworks dot com>
> > * Gunther Hagleitner <ghagleitner at hortonworks dot com>
> > * Hitesh Shah <hitesh at apache dot org>
> > * Jason Lowe <jlowe at apache dot org>
> > * Jean Xu <jeanxu at facebook dot com>
> > * Jitendra Pandey <jitendra at apache dot org>
> > * Kevin Wilfong <kevinwilfong at apache dot org>
> > * Mike Liddell <mike dot lidell at microsoft dot com>
> > * Namit Jain <namit at apache dot org>
> > * Owen O'Malley <omalley at apache dot org>
> > * Robert Evans <bobby at apache dot org>
> > * Siddharth Seth <sseth at apache dot org>
> > * Tom White <tomwhite at apache dot org>
> > * Thomas Graves <tgraves at apache dot org>
> > * Vikram Dixit <vikram at apache dot org>
> > * Vinod Kumar Vavilapalli <vinodkv at apache dot org>
> >
> > == Affiliations ==
> > The initial committers are employees of Cloudera, Facebook, Hortonworks,
> > Microsoft and Yahoo Inc.
> >
> > * Alan Gates - Hortonworks
> > * Arun C Murthy - Hortonworks
> > * Ashutosh Chauhan - Hortonworks
> > * Bikas Saha - Hortonworks
> > * Chris Douglas - Microsoft
> > * Daryn Sharp - Yahoo
> > * Devaraj Das - Hortonworks
> > * Gopal Vijayaraghavan - Hortonworks
> > * Gunther Hagleitner - Hortonworks
> > * Hitesh Shah - Hortonworks
> > * Jason Lowe - Yahoo
> > * Jean Xu - Facebook
> > * Jitendra Pandey - Hortonworks
> > * Kevin Wilfong - Facebook
> > * Mike Liddell - Microsoft
> > * Namit Jain - Facebook
> > * Owen O'Malley - Hortonworks
> > * Robert Evans - Yahoo
> > * Siddharth Seth - Hortonworks
> > * Tom White - Cloudera
> > * Thomas Graves - Yahoo
> > * Vikram Dixit - Hortonworks
> > * Vinod Kumar Vavilapalli - Hortonworks
> >
> > The nominated mentors are employees of Hortonworks, NASA JPL and
> > Microsoft.
> >
> > * Alan Gates - Hortonworks
> > * Arun C Murthy - Hortonworks
> > * Chris Douglas - Microsoft
> > * Chris Mattmann - NASA JPL
> > * Owen O'Malley - Hortonworks
> >
> > == Sponsors ==
> >
> > === Champion ===
> > Arun C Murthy <acmurthy at apache dot org>
> >
> > === Nominated Mentors ===
> > * Alan Gates <gates at apache dot org> - Architect at Hortonworks.
> >   Committer for Pig.
> > * Arun C Murthy <acmurthy at apache dot org> - Architect at Hortonworks.
> >   Committer for Hadoop.
> > * Chris Douglas <cdouglas at apache dot org> - Sr. Research Engineer at
> >   Microsoft. Committer for Hadoop.
> > * Chris Mattmann <mattmann at apache dot org> - Sr. Computer Scientist,
> >   NASA JPL. Committer for Nutch, OODT and Tika.
> > * Owen O'Malley <omalley at apache dot org> - Architect at Hortonworks.
> >   Committer for Hadoop, Ambari.
> >
> > === Sponsoring Entity ===
> > Incubator