[VOTE] Release Apache log4php 2.1.0 (RC2)
Dear all,

It is my pleasure to announce the second release candidate for Apache log4php 2.1.0. Since we are short of PMC members at the log4php project, I would appreciate it if other PMC members would join in so that we may pass this vote.

Fixes compared to RC1:
* included build.xml in source packages
* fixed version number in title of API docs

Significant changes in this release include:
* a new logging level: trace
* a new appender: MongoDB (thanks to Vladimir Gorej)
* a plethora of bug fixes and other code improvements
* most of the site docs have been rewritten to make log4php more accessible to users

Apache log4php 2.1.0 RC2 is available for review here:
* http://people.apache.org/builds/logging/log4php/2.1.0/RC2/

The tag for this release is available here:
* http://svn.apache.org/viewvc/logging/log4php/tags/apache-log4php-2.1.0/

The site for this version can be found at:
* http://people.apache.org/~ihabunek/apache-log4php-2.1.0-RC2/site/

According to the process, please vote with:
[ ] +1 Yes, go ahead and release the artifacts
[ ] -1 No, because...

Best regards,
Ivan

- To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
For additional commands, e-mail: general-h...@incubator.apache.org
Re: [PROPOSAL] Oozie for the Apache Incubator
You might want to reconsider the name. In English (British English, at least) "ooze" is an unpleasant thing, often related to a body wound or a stagnant river. The formal definition is not so bad [1], but in common (UK) usage it's unpleasant.

Ross

[1] http://dictionary.reference.com/browse/ooze

On 29 June 2011 03:07, arv...@cloudera.com wrote:

+1 (non-binding). Thanks, Arvind

On Fri, Jun 24, 2011 at 12:46 PM, Mohammad Islam misla...@yahoo.com wrote:

Hi,

I would like to propose Oozie to be an Apache Incubator project. Oozie is a server-based workflow scheduling and coordination system to manage data processing jobs for Apache Hadoop. Here's a link to the proposal in the Incubator wiki: http://wiki.apache.org/incubator/OozieProposal I've also pasted the initial contents below.

Regards,
Mohammad Islam

Start of Oozie Proposal

Abstract

Oozie is a server-based workflow scheduling and coordination system to manage data processing jobs for Apache Hadoop™.

Proposal

Oozie is an extensible, scalable and reliable system to define, manage, schedule, and execute complex Hadoop workloads via web services. More specifically, this includes:
* an XML-based declarative framework to specify a job or a complex workflow of dependent jobs;
* support for different types of jobs, such as Hadoop Map-Reduce, Pipes, Streaming, Pig, Hive and custom Java applications;
* workflow scheduling based on frequency and/or data availability;
* monitoring capability, with automatic retry and failure handling of jobs;
* an extensible and pluggable architecture to allow arbitrary grid programming paradigms;
* authentication, authorization, and capacity-aware load throttling to allow multi-tenant software as a service.

Background

Most data processing applications require multiple jobs to achieve their goals, with inherent dependencies among the jobs. A dependency could be sequential, where one job can only start after another job has finished.
Or it could be conditional, where the execution of a job depends on the return value or status of another job. In other cases, parallel execution of multiple jobs may be permitted – or desired – to exploit the massive pool of compute nodes provided by Hadoop.

These job dependencies are often expressed as a Directed Acyclic Graph (DAG), also called a workflow. A node in the workflow is typically a job (a computation on the grid) or another type of action, such as an e-mail notification. Computations can be expressed in map/reduce, Pig, Hive or any other programming paradigm available on the grid. Edges of the graph represent transitions from one node to the next as the execution of a workflow proceeds.

Describing a workflow in a declarative way has the advantage of decoupling job dependencies and execution control from application logic. Furthermore, the workflow is modularized into jobs that can be reused within the same workflow or across different workflows. Execution of the workflow is then driven by a runtime system without understanding the application logic of the jobs. This runtime system specializes in reliable and predictable execution: it can retry actions that have failed or invoke a cleanup action after termination of the workflow; it can monitor progress, success, or failure of a workflow, and send appropriate alerts to an administrator. The application developer is relieved from implementing these generic procedures.

Furthermore, some applications or workflows need to run at periodic intervals or when dependent data is available. For example, a workflow could be executed every day as soon as output data from the previous 24 instances of another, hourly workflow is available. The workflow coordinator provides such scheduling features, along with prioritization, load balancing and throttling, to optimize utilization of resources in the cluster. This makes it easier to maintain, control, and coordinate complex data applications.
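The runtime behavior described above — execute a DAG of actions in dependency order, retrying failures and halting the workflow when an action gives up — can be sketched in a few lines. This is a toy illustration with hypothetical names, not Oozie's actual model or API; it uses Python's standard-library `graphlib` for the topological ordering:

```python
# Toy sketch of declarative DAG workflow execution with per-action retry.
# Hypothetical structure for illustration only -- not Oozie's actual API.
from graphlib import TopologicalSorter

def run_workflow(actions, deps, max_retries=2):
    """actions: name -> callable; deps: name -> set of prerequisite names."""
    order = TopologicalSorter(deps).static_order()  # raises on cycles
    status = {}
    for name in order:
        for _attempt in range(max_retries + 1):
            try:
                actions[name]()
                status[name] = "OK"
                break
            except Exception:
                status[name] = "FAILED"
        if status[name] == "FAILED":
            # A real system would also run a cleanup action and alert an admin.
            break
    return status

# Example workflow: ingest -> {clean, stats} -> report
log = []
actions = {
    "ingest": lambda: log.append("ingest"),
    "clean":  lambda: log.append("clean"),
    "stats":  lambda: log.append("stats"),
    "report": lambda: log.append("report"),
}
deps = {"ingest": set(), "clean": {"ingest"}, "stats": {"ingest"},
        "report": {"clean", "stats"}}
result = run_workflow(actions, deps)
print(result)
```

Note how the application logic (the lambdas) is fully decoupled from the execution control, which is the point the proposal makes about declarative workflows.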
Nearly three years ago, a team of Yahoo! developers addressed these critical requirements for Hadoop-based data processing systems by developing a new workflow management and scheduling system called Oozie. While it was initially developed as a Yahoo!-internal project, it was designed and implemented with the intention of open-sourcing it. Oozie was released as a GitHub project in early 2010. Oozie is used in production within Yahoo!, and since it has been open-sourced it has been gaining adoption among external developers.

Rationale

Commonly, applications that run on Hadoop require multiple Hadoop jobs in order to obtain the desired results. Furthermore, these Hadoop jobs are commonly a combination of Java map-reduce jobs, Streaming map-reduce jobs, Pipes map-reduce jobs, Pig jobs, Hive jobs, HDFS operations, Java programs and shell scripts. Because of
Re: [VOTE] Kafka to join the Incubator
+1 (binding) Tommaso

2011/6/28 Jun Rao jun...@gmail.com:

Hi all,

Since the discussion on the thread of the Kafka incubator proposal is winding down, I'd like to call a vote. At the end of this mail, I've put a copy of the current proposal. Here is a link to the document in the wiki: http://wiki.apache.org/incubator/KafkaProposal And here is a link to the discussion thread: http://www.mail-archive.com/general@incubator.apache.org/msg29594.html

Please cast your votes:
[ ] +1 Accept Kafka for incubation
[ ] +0 Indifferent to Kafka incubation
[ ] -1 Reject Kafka for incubation

This vote will close 72 hours from now.

Thanks,
Jun

== Abstract ==

Kafka is a distributed publish-subscribe system for processing large amounts of streaming data.

== Proposal ==

Kafka provides an extremely high-throughput distributed publish/subscribe messaging system. Additionally, it supports relatively long-term persistence of messages to support a wide variety of consumers, partitioning of the message stream across servers and consumers, and functionality for loading data into Apache Hadoop for offline, batch processing.

== Background ==

Kafka was developed at LinkedIn to process the large amounts of events generated by that company's website and to provide a common repository for many types of consumers to access and process those events. Kafka has been used in production at LinkedIn scale to handle dozens of types of events, including page views, searches and social network activity. Kafka clusters at LinkedIn currently process more than two billion events per day.

Kafka fills the gap between messaging systems such as Apache ActiveMQ, which provide low-latency message delivery but don't focus on throughput, and log processing systems such as Scribe and Flume, which do not provide adequate latency for our diverse set of consumers. Kafka can also be inserted into traditional log-processing systems, acting as an intermediate step before further processing.
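The partitioned, offset-based consumption model the proposal describes can be sketched in miniature. This is a toy in-memory model with hypothetical names, not Kafka's implementation (which is a distributed Scala system): producers append to per-partition logs, and each consumer pulls batches from an offset that the consumer itself tracks.

```python
# Toy in-memory partitioned commit log illustrating the publish/subscribe
# model described in the proposal. Hypothetical names, illustration only.
class TopicLog:
    def __init__(self, num_partitions=2):
        self.partitions = [[] for _ in range(num_partitions)]

    def publish(self, key, message):
        """Route by key hash so one key always lands in one partition."""
        p = hash(key) % len(self.partitions)
        self.partitions[p].append(message)
        return p, len(self.partitions[p]) - 1   # (partition, offset)

    def fetch(self, partition, offset, max_msgs=10):
        """Consumers pull from an offset they track themselves."""
        batch = self.partitions[partition][offset:offset + max_msgs]
        return batch, offset + len(batch)

topic = TopicLog(num_partitions=2)
for i in range(5):
    topic.publish(key=i % 2, message=f"event-{i}")

# A consumer replays partition 0 from the beginning, then resumes at its offset.
batch, next_offset = topic.fetch(partition=0, offset=0)
print(batch, next_offset)
```

Because consumers own their offsets and the broker keeps messages for a retention window rather than deleting them on delivery, many independent consumers (including batch loads into Hadoop) can read the same stream at their own pace.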
Kafka focuses relentlessly on performance and throughput by not introspecting into message content and by not indexing messages on the broker. We also achieve high performance by depending on Java's sendFile/transferTo capabilities to minimize intermediate buffer copies, and by relying on the OS's page cache to efficiently serve message contents to consumers. Kafka is also designed to be scalable, and it depends on Apache ZooKeeper for coordination amongst its producers, brokers and consumers.

Kafka is written in Scala. It was developed internally at LinkedIn to meet our particular use cases, but it will be useful to many organizations facing a similar need to reliably process large amounts of streaming data. Therefore, we would like to share it with the ASF and begin developing a community of developers and users within Apache.

== Rationale ==

Many organizations can benefit from a reliable stream processing system such as Kafka. While our use case of processing events from a very large website like LinkedIn has driven the design of Kafka, its uses are varied and we expect many new use cases to emerge. Kafka provides a natural bridge between near-real-time event processing and offline batch processing and will appeal to many users.

== Current Status ==

=== Meritocracy ===

Our intent with this incubator proposal is to start building a diverse developer community around Kafka, following the Apache meritocracy model. Since Kafka was open sourced, we have solicited contributions via the website and through presentations given to user groups and technical audiences. We have had positive responses to these and have received several contributions, including clients for other languages. We plan to continue this support for new contributors and to work with those who contribute significantly to the project to make them committers.

=== Community ===

Kafka is currently being developed by engineers within LinkedIn and used in production at that company.
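The zero-copy path mentioned above (Java's `FileChannel.transferTo`, backed by the kernel's `sendfile` system call) can be sketched with Python's `os.sendfile`, which exposes the same mechanism. This is a Linux-specific illustration of the technique, not Kafka code: file pages are copied by the kernel straight toward the socket, never passing through a user-space buffer.

```python
# Sketch of kernel-level zero-copy transfer, the mechanism behind the
# sendFile/transferTo capability the proposal refers to. Linux-specific
# illustration, not Kafka code.
import os
import socket
import tempfile

# Write some "log" data to a file standing in for a broker's message log.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"event-1\nevent-2\n")
    path = f.name

server, client = socket.socketpair()  # stand-in for a broker->consumer link

with open(path, "rb") as log_file:
    size = os.fstat(log_file.fileno()).st_size
    # The kernel moves file pages to the socket buffer directly:
    # no read() into user space, no intermediate copy.
    sent = os.sendfile(server.fileno(), log_file.fileno(), 0, size)

data = client.recv(1024)
print(sent, data)

os.remove(path)
server.close()
client.close()
```

Combined with the page cache holding recently written log segments, this is why serving many consumers the same recent messages is cheap: the data typically never leaves kernel memory.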
Additionally, we have active users at, or have received contributions from, a diverse set of companies, including MediaSift, SocialTwist, Clearspring and Urban Airship. Recent public presentations of Kafka and its goals garnered much interest from potential contributors. We hope to extend our contributor base significantly and invite all those who are interested in building high-throughput distributed systems to participate. We have begun receiving contributions from outside of LinkedIn, including clients for several languages, among them Ruby, PHP, Clojure, .NET and Python. To further this goal, we use GitHub's issue tracking and branching facilities, as well as maintaining a public mailing list via Google Groups.

=== Core Developers ===

Kafka is currently being developed by four engineers at LinkedIn: Neha Narkhede, Jun Rao, Jakob Homan and Jay Kreps. Jun has experience within Apache as a Cassandra committer and PMC member. Neha has been an active
Re: [VOTE] Kafka to join the Incubator
+1 (Binding)

By the way, it is great to see another Scala-based project coming to Apache.

Dick
VP Apache ESME

On Wed, Jun 29, 2011 at 12:35 PM, Tommaso Teofili tommaso.teof...@gmail.com wrote:

+1 (binding) Tommaso

2011/6/28 Jun Rao jun...@gmail.com:

Hi all, Since the discussion on the thread of the Kafka incubator proposal is winding down, I'd like to call a vote.
Re: [PROPOSAL] Deft for incubation
On Mon, Jun 27, 2011 at 12:37 PM, Niklas Gustavsson nik...@protocol7.com wrote:

As you will note, the list of mentors is in need of some volunteers, so if you find this interesting, feel free to sign up. Needless to say, the same of course goes for committers.

We're still in need of more mentors to sign up. Anyone willing?

/niklas
Re: [PROPOSAL] Oozie for the Apache Incubator
On Wed, Jun 29, 2011 at 11:22:39AM +0100, Ross Gardler wrote:

You might want to reconsider the name. In English (British English at least) ooze is an unpleasant thing often related to a body wound or a stagnant river. The formal definition is not so bad [1], but in common (UK) usage it's unpleasant.

And I thought at first that it was a reference to the Uzi, a submachine gun. It's apparently the Burmese term for an elephant handler: http://en.wikipedia.org/wiki/Mahout "In Burma, the profession is called oozie; in Thailand kwan-chang; and in Vietnam quản tượng."

We had a good laugh about all this in the #lucy_dev IRC channel a couple of days ago. One of the participants (who free-associated "Oozie" with "sucking chest wound") suggested that Hadoop projects might consider referencing stuffed animals rather than elephants. :)

Marvin Humphrey
Re: [PROPOSAL] Deft for incubation
Hi Niklas... You can sign me in.

On Wed, Jun 29, 2011 at 2:02 PM, Niklas Gustavsson nik...@protocol7.com wrote:

On Mon, Jun 27, 2011 at 12:37 PM, Niklas Gustavsson nik...@protocol7.com wrote:

As you will note, the list of mentors is in need of some volunteers, so if you find this interesting, feel free to sign up. Needless to say, the same of course goes for committers.

We're still in need of more mentors to sign up. Anyone willing?

/niklas

--
Thanks
- Mohammad Nour
Author of (WebSphere Application Server Community Edition 2.0 User Guide) http://www.redbooks.ibm.com/abstracts/sg247585.html
- LinkedIn: http://www.linkedin.com/in/mnour
- Blog: http://tadabborat.blogspot.com

"Life is like riding a bicycle. To keep your balance you must keep moving" - Albert Einstein
"Writing clean code is what you must do in order to call yourself a professional. There is no reasonable excuse for doing anything less than your best." - Clean Code: A Handbook of Agile Software Craftsmanship
"Stay hungry, stay foolish." - Steve Jobs
Re: [PROPOSAL] Oozie for the Apache Incubator
And I was thinking of Ray Ozzie, the former Microsoft CTO. "Elephant handler" is perhaps apt.

-Riob

On Wed, Jun 29, 2011 at 9:32 AM, Marvin Humphrey mar...@rectangular.com wrote:

It's apparently the Burmese term for an elephant handler: http://en.wikipedia.org/wiki/Mahout
[VOTE] Oozie to join the Incubator
Hi All,

The discussion about the Oozie proposal is settling down. Therefore I would like to initiate a vote to accept Oozie as an Apache Incubator project. The latest proposal can be found in the wiki: http://wiki.apache.org/incubator/OozieProposal

The related discussion thread is at: http://www.mail-archive.com/general@incubator.apache.org/msg29633.html

Please cast your votes:
[ ] +1 Accept Oozie for incubation
[ ] +0 Indifferent to Oozie incubation
[ ] -1 Reject Oozie for incubation

This vote will close 72 hours from now.

Regards,
Mohammad
Re: [VOTE] Oozie to join the Incubator
+1 (binding). Good luck guys!

Cheers,
Chris

Sent from my iPad

On Jun 29, 2011, at 12:10 PM, Mohammad Islam misla...@yahoo.com wrote:

Hi All, The discussion about the Oozie proposal is settling down. Therefore I would like to initiate a vote to accept Oozie as an Apache Incubator project.
Re: [VOTE] Oozie to join the Incubator
Hi Mohammad, I am interested to contribute to this project, since any one did not vote yet, can I add my name to the Initial Committers? Thanks, Suresh On Jun 29, 2011, at 3:10 PM, Mohammad Islam wrote: Hi All, The discussion about Oozie proposal is settling down. Therefore I would like to initiate a vote to accept Oozie as an Apache Incubator project. The latest proposal is pasted at the end and it could be found in the wiki as well: http://wiki.apache.org/incubator/OozieProposal The related discussion thread is at: http://www.mail-archive.com/general@incubator.apache.org/msg29633.html Please cast your votes: [ ] +1 Accept Oozie for incubation [ ] +0 Indifferent to Oozie incubation [ ] -1 Reject Oozie for incubation This vote will close 72 hours from now. Regards, Mohammad Abstract Oozie is a server-based workflow scheduling and coordination system to manage data processing jobs for Apache HadoopTM. Proposal Oozie is an extensible, scalable and reliable system to define, manage, schedule, and execute complex Hadoop workloads via web services. More specifically, this includes: * XML-based declarative framework to specify a job or a complex workflow of dependent jobs. * Support different types of job such as Hadoop Map-Reduce, Pipe, Streaming, Pig, Hive and custom java applications. * Workflow scheduling based on frequency and/or data availability. * Monitoring capability, automatic retry and failure handing of jobs. * Extensible and pluggable architecture to allow arbitrary grid programming paradigms. * Authentication, authorization, and capacity-aware load throttling to allow multi-tenant software as a service. Background Most data processing applications require multiple jobs to achieve their goals, with inherent dependencies among the jobs. A dependency could be sequential, where one job can only start after another job has finished. Or it could be conditional, where the execution of a job depends on the return value or status of another job. 
In other cases, parallel execution of multiple jobs may be permitted – or desired – to exploit the massive pool of compute nodes provided by Hadoop. These job dependencies are often expressed as a Directed Acyclic Graph, also called a workflow. A node in the workflow is typically a job (a computation on the grid) or another type of action such as an eMail notification. Computations can be expressed in map/reduce, Pig, Hive or any other programming paradigm available on the grid. Edges of the graph represent transitions from one node to the next, as the execution of a workflow proceeds. Describing a workflow in a declarative way has the advantage of decoupling job dependencies and execution control from application logic. Furthermore, the workflow is modularized into jobs that can be reused within the same workflow or across different workflows. Execution of the workflow is then driven by a runtime system without understanding the application logic of the jobs. This runtime system specializes in reliable and predictable execution: It can retry actions that have failed or invoke a cleanup action after termination of the workflow; it can monitor progress, success, or failure of a workflow, and send appropriate alerts to an administrator. The application developer is relieved from implementing these generic procedures. Furthermore, some applications or workflows need to run in periodic intervals or when dependent data is available. For example, a workflow could be executed every day as soon as output data from the previous 24 instances of another, hourly workflow is available. The workflow coordinator provides such scheduling features, along with prioritization, load balancing and throttling to optimize utilization of resources in the cluster. This makes it easier to maintain, control, and coordinate complex data applications. Nearly three years ago, a team of Yahoo! 
developers addressed these critical requirements for Hadoop-based data processing systems by developing a new workflow management and scheduling system called Oozie. While it was initially developed as a Yahoo!-internal project, it was designed and implemented with the intention of open-sourcing it. Oozie was released as a GitHub project in early 2010. Oozie is used in production within Yahoo!, and since it was open-sourced it has been gaining adoption among external developers.

Rationale

Commonly, applications that run on Hadoop require multiple Hadoop jobs in order to obtain the desired results. Furthermore, these Hadoop jobs are commonly a combination of Java map-reduce jobs, Streaming map-reduce jobs, Pipes map-reduce jobs, Pig jobs, Hive jobs, HDFS operations, Java programs and shell scripts. Because of this, developers find themselves writing ad-hoc glue programs to combine these Hadoop jobs. These ad-hoc programs are difficult to schedule, manage, monitor and recover.
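The dependency model the proposal describes (sequential and conditional dependencies forming a DAG, with automatic retry of failed actions) can be sketched in a few lines. This is not Oozie's actual API or implementation; the job names, the `retries` parameter, and the trivial actions are purely illustrative.

```python
# Minimal sketch of DAG-ordered job execution with bounded retries.
# NOT Oozie's API: job names, retries, and actions are illustrative only.
from graphlib import TopologicalSorter  # Python 3.9+

def run_workflow(deps, actions, retries=2):
    """Run each job only after all of its dependencies have completed.

    deps    -- dict: job name -> set of job names it depends on
    actions -- dict: job name -> zero-argument callable (the "computation")
    Returns job names in completion order.
    """
    completed = []
    # static_order() yields a topological order of the DAG, so every
    # job is started only after its dependencies have succeeded.
    for job in TopologicalSorter(deps).static_order():
        for attempt in range(retries + 1):
            try:
                actions[job]()
                break
            except Exception:
                if attempt == retries:
                    raise  # retries exhausted: fail the whole workflow
        completed.append(job)
    return completed

# A sequential dependency chain: ingest -> transform -> load.
order = run_workflow(
    {"ingest": set(), "transform": {"ingest"}, "load": {"transform"}},
    {"ingest": lambda: None, "transform": lambda: None, "load": lambda: None},
)
print(order)  # ['ingest', 'transform', 'load']
```

A real coordinator layers the rest of the proposal's feature list on top of this skeleton: declarative XML workflow definitions, time- and data-availability triggers, monitoring, and multi-tenant throttling.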
Re: [VOTE] Oozie to join the Incubator
+1 (binding) Sent from my iPad

On Jun 29, 2011, at 12:10 PM, Mohammad Islam misla...@yahoo.com wrote: Hi All, The discussion about the Oozie proposal is settling down. Therefore I would like to initiate a vote to accept Oozie as an Apache Incubator project. The proposal can be found in the wiki: http://wiki.apache.org/incubator/OozieProposal The related discussion thread is at: http://www.mail-archive.com/general@incubator.apache.org/msg29633.html Please cast your votes: [ ] +1 Accept Oozie for incubation [ ] +0 Indifferent to Oozie incubation [ ] -1 Reject Oozie for incubation This vote will close 72 hours from now. Regards, Mohammad
Bluesky calls for a new mentor!
Hi, all, The Bluesky project is calling for a new mentor to guide us through the release of the 4th version. BlueSky is an e-learning solution designed to help solve the disparity in availability of qualified education between well-developed cities and poorer regions of China (e.g., the countryside of Western China). BlueSky is already deployed to 12 + 5 primary/high schools with positive reviews. BlueSky was originally created by Qinghua Zheng and Jun Liu in September 2005. Development is done at the XJTU-IBM Open Technology and Application Joint Development Center, where more than 20 developers are involved, and the project entered incubation on 2008-01-12. BlueSky consists of two subsystems, RealClass and MersMp, both of which contain a set of flexible, extensible applications such as the Distance Collaboration System, the collaboration player, the collaboration recording tool, the Resources Sharing and Management Platform, and mobile-terminal access, designed by engineers and educators with years of experience in the problem domain, as well as a framework that makes it possible to create new applications alongside existing ones. The project has now evolved into its 4th version, with more flexible applications and better stability. We developed the core applications of the BlueSky system with Qt to support both Windows and Linux. More importantly, the new version integrates its applications, which means a user can record a video during an interactive session and then play it back on demand (VOD). A third advancement is that the 4th version supports the Android platform, so we can further advance BlueSky in the mobile domain. We propose to move future development of BlueSky to the Apache Software Foundation in order to build a broader user and developer community. We hope to encourage contributions and use of BlueSky by other developing countries with similar education needs. The new version of BlueSky is now ready to release. We really need an enthusiastic and responsive mentor to help us complete a full cycle of a successful release, and we hope that, with the help of developers around the world, the system can become more powerful and useful in the education area. We appreciate your attention to the Bluesky project. Best regards.
Re: [VOTE] Release Apache log4php 2.1.0 (RC2)
On Tue, Jun 28, 2011 at 5:35 PM, Ivan Habunek ivan.habu...@gmail.com wrote: According to the process, please vote with: [ X ] +1 Yes go ahead and release the artifacts. I'm not a PHP expert, but I want to help out, so I played with it a bit. Didn't see anything wrong. Not exactly exhaustive testing, but looks fine. Yoav
Re: Bluesky calls for a new mentor!
Sorry, I'm going to pass on this. During the entire time you guys have been at the ASF you have not managed to develop *any* project governance that could charitably be described as open. You are supposed to be doing your development work in the ASF subversion repository, using ASF mailing lists, as peers. Looking at the (limited) commit history, there is a total imbalance between the number of people associated with the development work (20+) and the number of people with Apache accounts here (2). I don't see any attempts to rectify this other than to say there are cultural issues at play and that from now on you'll be using the mailing list to discuss *results*. Frankly, this particular issue isn't something that can be written off so easily. What we really need you to discuss are *plans*: how you will implement them, who will implement them, and how you will collaborate in the codebase as peers. That is what open development is all about, and the main reason why your mentors are looking to shut down the project at this point.

- Original Message From: Chen Liu liuchen0...@gmail.com To: general@incubator.apache.org Sent: Wed, June 29, 2011 9:12:37 PM Subject: Bluesky calls for a new mentor!
RE: Bluesky calls for a new mentor!
Joe Schaefer wrote: Chen Liu wrote: We propose to move future development of BlueSky to the Apache Software Foundation in order to build a broader user and developer community. You are supposed to be doing your development work in the ASF subversion repository, using ASF mailing lists, as peers. Chen, as Joe points out, these are what BlueSky should have been doing for the past three (3) years, and yet we still hear a proposal for the future. Looking at the (limited) commit history, there is a total imbalance between the number of people associated with the development work (20+) and the number of people with Apache accounts here (2). Again, as Joe points out, ALL of BlueSky development should have been done via the ASF infrastructure, not periodically synchronized. We are a development community, not a remote archive. What we really need you to discuss are *plans*, how you will implement them, who will implement them, and how you will collaborate in the codebase as peers. Joe, again, has this on the money. The BlueSky project must immediately make significant strides to rectify these issues. Now, not later. We should see: 1) All current code in the ASF repository. 2) All development via ASF accounts (get the rest of the people signed up). 3) Development discussion on the mailing list. 4) All licensing issues cleaned up. --- Noel
Re: Bluesky calls for a new mentor!
Hi, Noel: 2011/6/30 Noel J. Bergman n...@devtech.com: Looking at the (limited) commit history, there is a total imbalance between the number of people associated with the development work (20+) and the number of people with Apache accounts here (2).

I can explain that. Most of the developers of the BlueSky project are students. As you all know, students come when they join the school and go when they graduate, so the active developers number around 10. We used to have 5 committers, but now only 2 are active.

We should see: 1) All current code in the ASF repository. 2) All development via ASF accounts (get the rest of the people signed up). 3) Development discussion on the mailing list. 4) All licensing issues cleaned up.

Based on what you've listed, I will forward your suggestions to the Bluesky dev list, and I hope we can make a quick response after discussion. Appreciate your help. regards, Kevin -- Bowen Ma a.k.a Samuel Kevin @ Bluesky Dev Team, XJTU Shaanxi Province Key Lab. of Satellite and Terrestrial Network Tech http://incubator.apache.org/bluesky/
Re: Bluesky calls for a new mentor!
We really appreciate both of your suggestions on Bluesky's development at the ASF. We expect to finish the code release work for the 4th version by the end of July, and to do this work effectively we really need an experienced mentor to guide our release. In addition, there are about 8-9 developers doing the release work, some of whom are new to releasing code at the ASF owing to the change of team members. We think it is important to discuss how we can obtain this guidance quickly. By the way, we have been continuing to develop the Bluesky project: after the release of the 4th version, we would like to discuss the development of the 5th version, including IPTV functionality. We would appreciate suggestions from ASF collaborators about mobile terminals and IPTV.

2011/6/30 Noel J. Bergman n...@devtech.com: The BlueSky project must immediately make significant strides to rectify these issues. Now, not later. We should see: 1) All current code in the ASF repository. 2) All development via ASF accounts (get the rest of the people signed up). 3) Development discussion on the mailing list. 4) All licensing issues cleaned up. --- Noel
Re: Bluesky calls for a new mentor!
Sorry, but the explanation below makes things sound even worse. Apache projects are not here to give students a place to do school work. What you have described is not a community. If the project cannot build a community of people who are interested in the project for more than a school term then it doesn't belong here. Ralph

On Jun 29, 2011, at 8:12 PM, SamuelKevin wrote: Most of the developers of BlueSky project are students. As you all know, students come when they join in school and go after they graduate. So the active developers are around 10. Like we used to have 5 committers, but now we only have 2 committers in active.
Re: Bluesky calls for a new mentor!
On Wed, Jun 29, 2011 at 8:18 PM, Chen Liu liuchen0...@gmail.com wrote: We really appreciate both of your suggestions on Bluesky's development in ASF. We think that we are supposed to finish code release work of the 4th version in the end of July. In addition, there are about 8-9 developers to do the release work.

Who are the current students/developers that worked on the so-called 4th version of the code? The interesting part is that the last JIRA issue was created in June 2010 (BLUESKY-10), which raises the question of who is actually committing the other students' code to the repository. -- Luciano Resende http://people.apache.org/~lresende http://twitter.com/lresende1975 http://lresende.blogspot.com/
Re: Bluesky calls for a new mentor!
Hi, Luciano: Currently, we have 4 committers, and 2 of them are inactive. The newest source code will be committed by me after the students grant authorization. regards, Kevin

2011/6/30 Luciano Resende luckbr1...@gmail.com: Who are the current students/developers that worked on the so-called 4th version of the code? The interesting part is that the last JIRA issue was created in June 2010 (BLUESKY-10), which raises the question of who is actually committing the other students' code to the repository.

-- Bowen Ma a.k.a Samuel Kevin @ Bluesky Dev Team, XJTU Shaanxi Province Key Lab. of Satellite and Terrestrial Network Tech http://incubator.apache.org/bluesky/
Re: [VOTE] Oozie to join the Incubator
+1 (non-binding)

On Thu, Jun 30, 2011 at 12:40 AM, Mohammad Islam misla...@yahoo.com wrote: Hi All, The discussion about the Oozie proposal is settling down. Therefore I would like to initiate a vote to accept Oozie as an Apache Incubator project. The proposal can be found in the wiki: http://wiki.apache.org/incubator/OozieProposal The related discussion thread is at: http://www.mail-archive.com/general@incubator.apache.org/msg29633.html Please cast your votes: [ ] +1 Accept Oozie for incubation [ ] +0 Indifferent to Oozie incubation [ ] -1 Reject Oozie for incubation This vote will close 72 hours from now. Regards, Mohammad
Re: [VOTE] Oozie to join the Incubator
+1 (non-binding) On Wed, Jun 29, 2011 at 10:18 PM, Ashish paliwalash...@gmail.com wrote: +1 (non-binding) On Thu, Jun 30, 2011 at 12:40 AM, Mohammad Islam misla...@yahoo.com wrote: Hi All, The discussion about Oozie proposal is settling down. Therefore I would like to initiate a vote to accept Oozie as an Apache Incubator project. The latest proposal is pasted at the end and it could be found in the wiki as well: http://wiki.apache.org/incubator/OozieProposal The related discussion thread is at: http://www.mail-archive.com/general@incubator.apache.org/msg29633.html Please cast your votes: [ ] +1 Accept Oozie for incubation [ ] +0 Indifferent to Oozie incubation [ ] -1 Reject Oozie for incubation This vote will close 72 hours from now. Regards, Mohammad
Re: [VOTE] Oozie to join the Incubator
+1 (non-binding) Thanks, Arvind On Wed, Jun 29, 2011 at 12:10 PM, Mohammad Islam misla...@yahoo.com wrote:
Re: [VOTE] Oozie to join the Incubator
Cool project, +1 On Thu, Jun 30, 2011 at 2:23 PM, Arvind Prabhakar arv...@apache.org wrote: +1 (non-binding) Thanks, Arvind On Wed, Jun 29, 2011 at 12:10 PM, Mohammad Islam misla...@yahoo.com wrote:
Re: [VOTE] Oozie to join the Incubator
+1 (non-binding) On Wed, Jun 29, 2011 at 12:10 PM, Mohammad Islam misla...@yahoo.com wrote: