[VOTE] Release Apache log4php 2.1.0 (RC2)

2011-06-29 Thread Ivan Habunek
Dear all,

It is my pleasure to announce the second release candidate for Apache
log4php 2.1.0.

Since we are short of PMC members at the log4php project, I would
appreciate it if members of other PMCs would join in so that we may pass
this vote.

Fixes compared to RC1:
 * included build.xml in source packages
 * fixed version number in title of api docs

Significant changes in this release include:
 * a new logging level: trace
 * a new appender: MongoDB (thanks to Vladimir Gorej)
 * a plethora of bugfixes and other code improvements
 * most of the site docs have been rewritten to make log4php more
accessible to users

Apache log4php 2.1.0 RC2 is available for review here:
 * http://people.apache.org/builds/logging/log4php/2.1.0/RC2/

The tag for this release is available here:
 * http://svn.apache.org/viewvc/logging/log4php/tags/apache-log4php-2.1.0/

The site for this version can be found at:
 * http://people.apache.org/~ihabunek/apache-log4php-2.1.0-RC2/site/

According to the process, please vote with:
[ ] +1 Yes go ahead and release the artifacts
[ ] -1 No, because...

Best regards,
Ivan

-
To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
For additional commands, e-mail: general-h...@incubator.apache.org



Re: [PROPOSAL] Oozie for the Apache Incubator

2011-06-29 Thread Ross Gardler
You might want to reconsider the name.

In English (British English at least) ooze is an unpleasant thing
often related to a body wound or a stagnant river. The formal
definition is not so bad [1], but in common (UK) usage it's
unpleasant.

Ross

[1] http://dictionary.reference.com/browse/ooze

On 29 June 2011 03:07, arv...@cloudera.com wrote:
 +1 (non-binding).

 Thanks,
 Arvind

 On Fri, Jun 24, 2011 at 12:46 PM, Mohammad Islam misla...@yahoo.com wrote:
 Hi,

 I would like to propose Oozie to be an Apache Incubator project.
 Oozie is a server-based workflow scheduling and coordination system to manage
 data processing jobs for Apache Hadoop.


 Here's a link to the proposal in the Incubator wiki
 http://wiki.apache.org/incubator/OozieProposal


 I've also pasted the initial contents below.

 Regards,

 Mohammad Islam


 Start of Oozie Proposal

 Abstract
 Oozie is a server-based workflow scheduling and coordination system to manage
 data processing jobs for Apache HadoopTM.

 Proposal
 Oozie is an extensible, scalable and reliable system to define, manage,
 schedule, and execute complex Hadoop workloads via web services. More
 specifically, this includes:

        * An XML-based declarative framework to specify a job or a complex
          workflow of dependent jobs.
        * Support for different types of jobs such as Hadoop Map-Reduce,
          Pipes, Streaming, Pig, Hive and custom Java applications.
        * Workflow scheduling based on frequency and/or data availability.
        * Monitoring capability, automatic retry and failure handling of jobs.
        * An extensible and pluggable architecture to allow arbitrary grid
          programming paradigms.
        * Authentication, authorization, and capacity-aware load throttling
          to allow multi-tenant software as a service.
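
 As a concrete sketch of the XML-based declarative framework listed above,
 a minimal Oozie workflow definition might look like the following. The
 element names follow the Oozie workflow schema of that era, but the action
 configuration, property names and HDFS paths are invented for illustration:

```xml
<workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.2">
    <start to="mr-node"/>
    <action name="mr-node">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <!-- hypothetical input/output locations -->
                <property>
                    <name>mapred.input.dir</name>
                    <value>/user/demo/input</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>/user/demo/output</value>
                </property>
            </configuration>
        </map-reduce>
        <!-- conditional transitions: success and failure follow different edges -->
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Job failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

 The workflow is a DAG: each action node declares its outgoing edges
 (`ok`/`error`), so the runtime, not the application, drives execution.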

 Background
 Most data processing applications require multiple jobs to achieve their
 goals, with inherent dependencies among the jobs. A dependency could be
 sequential, where one job can only start after another job has finished.
 Or it could be conditional, where the execution of a job depends on the
 return value or status of another job. In other cases, parallel execution
 of multiple jobs may be permitted – or desired – to exploit the massive
 pool of compute nodes provided by Hadoop.

 These job dependencies are often expressed as a Directed Acyclic Graph,
 also called a workflow. A node in the workflow is typically a job (a
 computation on the grid) or another type of action such as an e-mail
 notification. Computations can be expressed in map/reduce, Pig, Hive or
 any other programming paradigm available on the grid. Edges of the graph
 represent transitions from one node to the next, as the execution of a
 workflow proceeds.

 Describing a workflow in a declarative way has the advantage of decoupling
 job dependencies and execution control from application logic. Furthermore,
 the workflow is modularized into jobs that can be reused within the same
 workflow or across different workflows. Execution of the workflow is then
 driven by a runtime system without understanding the application logic of
 the jobs. This runtime system specializes in reliable and predictable
 execution: it can retry actions that have failed or invoke a cleanup action
 after termination of the workflow; it can monitor progress, success, or
 failure of a workflow, and send appropriate alerts to an administrator. The
 application developer is relieved from implementing these generic
 procedures.

 Furthermore, some applications or workflows need to run at periodic
 intervals or when dependent data is available. For example, a workflow
 could be executed every day as soon as output data from the previous 24
 instances of another, hourly workflow is available. The workflow
 coordinator provides such scheduling features, along with prioritization,
 load balancing and throttling to optimize utilization of resources in the
 cluster. This makes it easier to maintain, control, and coordinate complex
 data applications.
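
 The daily-on-24-hourly-instances example above might be declared roughly
 as follows. This is a hypothetical sketch in the style of the Oozie
 coordinator schema; the app name, dataset name, dates and HDFS paths are
 all invented for illustration:

```xml
<coordinator-app name="daily-agg" frequency="${coord:days(1)}"
                 start="2011-07-01T00:00Z" end="2011-12-31T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.1">
    <datasets>
        <!-- an hourly dataset produced by another workflow -->
        <dataset name="hourly-logs" frequency="${coord:hours(1)}"
                 initial-instance="2011-07-01T00:00Z" timezone="UTC">
            <uri-template>hdfs:///logs/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
        </dataset>
    </datasets>
    <input-events>
        <!-- the daily action fires only once the last 24 hourly instances exist -->
        <data-in name="input" dataset="hourly-logs">
            <start-instance>${coord:current(-23)}</start-instance>
            <end-instance>${coord:current(0)}</end-instance>
        </data-in>
    </input-events>
    <action>
        <workflow>
            <app-path>hdfs:///apps/daily-agg</app-path>
        </workflow>
    </action>
</coordinator-app>
```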

 Nearly three years ago, a team of Yahoo! developers addressed these
 critical requirements for Hadoop-based data processing systems by
 developing a new workflow management and scheduling system called Oozie.
 While it was initially developed as a Yahoo!-internal project, it was
 designed and implemented with the intention of open-sourcing it. Oozie was
 released as a GitHub project in early 2010. Oozie is used in production
 within Yahoo! and, since it was open-sourced, it has been gaining adoption
 among external developers.

 Rationale
 Commonly, applications that run on Hadoop require multiple Hadoop jobs in
 order to obtain the desired results. Furthermore, these Hadoop jobs are
 commonly a combination of Java map-reduce jobs, Streaming map-reduce jobs,
 Pipes map-reduce jobs, Pig jobs, Hive jobs, HDFS operations, Java programs
 and shell scripts.

 Because  of 

Re: [VOTE] Kafka to join the Incubator

2011-06-29 Thread Tommaso Teofili
+1 (binding)
Tommaso

2011/6/28 Jun Rao jun...@gmail.com

 Hi all,


 Since the discussion on the thread of the Kafka incubator proposal is
 winding down, I'd like to call a vote.

 At the end of this mail, I've put a copy of the current proposal.  Here is
 a link to the document in the wiki:
 http://wiki.apache.org/incubator/KafkaProposal

 And here is a link to the discussion thread:
 http://www.mail-archive.com/general@incubator.apache.org/msg29594.html

 Please cast your votes:

 [  ] +1 Accept Kafka for incubation
 [  ] +0 Indifferent to Kafka incubation
 [  ]  -1 Reject Kafka for incubation

 This vote will close 72 hours from now.

 Thanks,

 Jun

 == Abstract ==
 Kafka is a distributed publish-subscribe system for processing large
 amounts of streaming data.

 == Proposal ==
 Kafka provides an extremely high throughput distributed publish/subscribe
 messaging system. Additionally, it supports relatively long term
 persistence of messages to support a wide variety of consumers,
 partitioning of the message stream across servers and consumers, and
 functionality for loading data into Apache Hadoop for offline, batch
 processing.
 == Background ==
 Kafka was developed at LinkedIn to process the large amounts of events
 generated by that company's website and to provide a common repository for
 many types of consumers to access and process those events. Kafka has been
 used in production at LinkedIn scale to handle dozens of types of events
 including page views, searches and social network activity. Kafka clusters
 at LinkedIn currently process more than two billion events per day.

 Kafka fills the gap between messaging systems such as Apache ActiveMQ,
 which provide low latency message delivery but don't focus on throughput,
 and log processing systems such as Scribe and Flume, which do not provide
 adequate latency for our diverse set of consumers. Kafka can also be
 inserted into traditional log-processing systems, acting as an
 intermediate step before further processing. Kafka focuses relentlessly on
 performance and throughput by not introspecting into message content, nor
 indexing messages on the broker. We also achieve high performance by
 depending on Java's sendFile/transferTo capabilities to minimize
 intermediate buffer copies and by relying on the OS's pagecache to
 efficiently serve up message contents to consumers. Kafka is also designed
 to be scalable and it depends on Apache ZooKeeper for coordination amongst
 its producers, brokers and consumers.
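
 The sendFile/transferTo technique mentioned above can be illustrated with
 a small, self-contained Java sketch (this is not Kafka code, just a
 demonstration of the mechanism): FileChannel.transferTo hands the copy to
 the kernel (sendfile(2) on Linux), so bytes move from the page cache to
 the destination channel without passing through a user-space buffer.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ZeroCopyDemo {
    // Copy src to dst via FileChannel.transferTo. The kernel streams the
    // bytes directly; no intermediate byte[] is allocated in user space.
    static long transfer(Path src, Path dst) throws IOException {
        try (FileChannel in = FileChannel.open(src, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(dst,
                     StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                     StandardOpenOption.TRUNCATE_EXISTING)) {
            long pos = 0, size = in.size();
            // transferTo may move fewer bytes than requested, so loop
            while (pos < size) {
                pos += in.transferTo(pos, size - pos, out);
            }
            return pos;
        }
    }

    public static void main(String[] args) throws IOException {
        Path src = Files.createTempFile("kafka-demo", ".log");
        Path dst = Files.createTempFile("kafka-demo", ".copy");
        Files.write(src, "message-1\nmessage-2\n".getBytes());
        System.out.println(transfer(src, dst) + " bytes transferred");
    }
}
```

 In a broker the destination would be a socket channel rather than a file,
 which is where the copy savings matter most.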

 Kafka is written in Scala. It was developed internally at LinkedIn to meet
 our particular use cases, but it will be useful to many organizations
 facing a similar need to reliably process large amounts of streaming data.
 Therefore, we would like to share it with the ASF and begin developing a
 community of developers and users within Apache.

 == Rationale ==
 Many organizations can benefit from a reliable stream processing system
 such as Kafka. While our use case of processing events from a very large
 website like LinkedIn has driven the design of Kafka, its uses are varied
 and we expect many new use cases to emerge. Kafka provides a natural
 bridge between near real-time event processing and offline batch
 processing and will appeal to many users.

 == Current Status ==
 === Meritocracy ===
 Our intent with this incubator proposal is to start building a diverse
 developer community around Kafka following the Apache meritocracy model.
 Since Kafka was open sourced we have solicited contributions via the
 website and presentations given to user groups and technical audiences.
 We have had positive responses to these and have received several
 contributions and clients for other languages. We plan to continue this
 support for new contributors and work with those who contribute
 significantly to the project to make them committers.

 === Community ===
 Kafka is currently being developed by engineers within LinkedIn and used
 in production at that company. Additionally, we have active users at, or
 have received contributions from, a diverse set of companies including
 MediaSift, SocialTwist, Clearspring and Urban Airship. Recent public
 presentations of Kafka and its goals garnered much interest from potential
 contributors. We hope to extend our contributor base significantly and
 invite all those who are interested in building high-throughput
 distributed systems to participate. We have begun receiving contributions
 from outside of LinkedIn, including clients for several languages
 including Ruby, PHP, Clojure, .NET and Python.

 To further this goal, we use GitHub issue tracking and branching
 facilities, as well as maintaining a public mailing list via Google
 Groups.

 === Core Developers ===
 Kafka is currently being developed by four engineers at LinkedIn: Neha
 Narkhede, Jun Rao, Jakob Homan and Jay Kreps. Jun has experience within
 Apache as a Cassandra committer and PMC member. Neha has been an active
 

Re: [VOTE] Kafka to join the Incubator

2011-06-29 Thread Richard Hirsch
+1 (Binding)

By the way, it is great to see another Scala-based project coming to Apache.

Dick

VP Apache ESME

On Wed, Jun 29, 2011 at 12:35 PM, Tommaso Teofili
tommaso.teof...@gmail.com wrote:
 +1 (binding)
 Tommaso

 2011/6/28 Jun Rao jun...@gmail.com

 Hi all,


 Since the discussion on the thread of the Kafka incubator proposal is
 winding down, I'd like to call a vote.

 At the end of this mail, I've put a copy of the current proposal.  Here is
 a link to the document in the wiki:
 http://wiki.apache.org/incubator/KafkaProposal

 And here is a link to the discussion thread:
 http://www.mail-archive.com/general@incubator.apache.org/msg29594.html

 Please cast your votes:

 [  ] +1 Accept Kafka for incubation
 [  ] +0 Indifferent to Kafka incubation
 [  ]  -1 Reject Kafka for incubation

 This vote will close 72 hours from now.

 Thanks,

 Jun


Re: [PROPOSAL] Deft for incubation

2011-06-29 Thread Niklas Gustavsson
On Mon, Jun 27, 2011 at 12:37 PM, Niklas Gustavsson
nik...@protocol7.com wrote:
 As you will note, the list of mentors is in need of some volunteers,
 so if you find this interesting, feel free to sign up. Needless to
 say, the same of course goes for committers.

We're still in need of more mentors to sign up. Anyone willing?

/niklas




Re: [PROPOSAL] Oozie for the Apache Incubator

2011-06-29 Thread Marvin Humphrey
On Wed, Jun 29, 2011 at 11:22:39AM +0100, Ross Gardler wrote:
 You might want to reconsider the name.
 
 In English (British English at least) ooze is an unpleasant thing
 often related to a body wound or a stagnant river. The formal
 definition is not so bad [1], but in common (UK) usage it's
 unpleasant.

And I thought at first that it was a reference to the Uzi, a submachine gun.

It's apparently the Burmese term for an elephant handler.

http://en.wikipedia.org/wiki/Mahout

In Burma, the profession is called oozie; in Thailand kwan-chang; and in
Vietnam quản tượng.

We had a good laugh about all this in the #lucy_dev IRC channel a couple days
ago.  One of the participants (who free-associated Oozie with sucking chest
wound) suggested that Hadoop projects might consider referencing stuffed
animals rather than elephants.  :)

Marvin Humphrey





Re: [PROPOSAL] Deft for incubation

2011-06-29 Thread Mohammad Nour El-Din
Hi Niklas...

   You can sign me up.

On Wed, Jun 29, 2011 at 2:02 PM, Niklas Gustavsson nik...@protocol7.com wrote:
 On Mon, Jun 27, 2011 at 12:37 PM, Niklas Gustavsson
 nik...@protocol7.com wrote:
 As you will note, the list of mentors is in need of some volunteers,
 so if you find this interesting, feel free to sign up. Needless to
 say, the same of course goes for committers.

 We're still in need of more mentors to sign up. Anyone willing?

 /niklas






-- 
Thanks
- Mohammad Nour
  Author of (WebSphere Application Server Community Edition 2.0 User Guide)
  http://www.redbooks.ibm.com/abstracts/sg247585.html
- LinkedIn: http://www.linkedin.com/in/mnour
- Blog: http://tadabborat.blogspot.com

Life is like riding a bicycle. To keep your balance you must keep moving
- Albert Einstein

Writing clean code is what you must do in order to call yourself a
professional. There is no reasonable excuse for doing anything less
than your best.
- Clean Code: A Handbook of Agile Software Craftsmanship

Stay hungry, stay foolish.
- Steve Jobs




Re: [PROPOSAL] Oozie for the Apache Incubator

2011-06-29 Thread Rob Weir
And I was thinking of Ray Ozzie, the former Microsoft CTO.
Elephant handler is perhaps apt.

-Rob


On Wed, Jun 29, 2011 at 9:32 AM, Marvin Humphrey mar...@rectangular.com wrote:
 On Wed, Jun 29, 2011 at 11:22:39AM +0100, Ross Gardler wrote:
 You might want to reconsider the name.

 In English (British English at least) ooze is an unpleasant thing
 often related to a body wound or a stagnant river. The formal
 definition is not so bad [1], but in common (UK) usage it's
 unpleasant.

 And I thought at first that it was a reference to the Uzi, a submachine gun.

 It's apparently the Burmese term for an elephant handler.

    http://en.wikipedia.org/wiki/Mahout

    In Burma, the profession is called oozie; in Thailand kwan-chang; and in
    Vietnam quản tượng.

 We had a good laugh about all this in the #lucy_dev IRC channel a couple days
 ago.  One of the participants (who free-associated Oozie with sucking chest
 wound) suggested that Hadoop projects might consider referencing stuffed
 animals rather than elephants.  :)

 Marvin Humphrey








[VOTE] Oozie to join the Incubator

2011-06-29 Thread Mohammad Islam
Hi All,

The discussion about the Oozie proposal is settling down. Therefore I would
like to initiate a vote to accept Oozie as an Apache Incubator project.

The latest proposal is pasted at the end; it can also be found in the wiki:

http://wiki.apache.org/incubator/OozieProposal


The related discussion thread is at:
http://www.mail-archive.com/general@incubator.apache.org/msg29633.html


Please cast your votes:

[  ] +1 Accept Oozie for incubation
[  ] +0 Indifferent to Oozie incubation
[  ] -1 Reject Oozie for incubation

This vote will close 72 hours from now.

Regards,
Mohammad


Abstract
Oozie is a server-based workflow scheduling and coordination system to manage 
data processing jobs for Apache HadoopTM. 

Proposal
Oozie is an extensible, scalable and reliable system to define, manage,
schedule, and execute complex Hadoop workloads via web services. More
specifically, this includes:

* An XML-based declarative framework to specify a job or a complex
  workflow of dependent jobs.
* Support for different types of jobs such as Hadoop Map-Reduce, Pipes,
  Streaming, Pig, Hive and custom Java applications.
* Workflow scheduling based on frequency and/or data availability.
* Monitoring capability, automatic retry and failure handling of jobs.
* An extensible and pluggable architecture to allow arbitrary grid
  programming paradigms.
* Authentication, authorization, and capacity-aware load throttling to
  allow multi-tenant software as a service.

Background
Most data processing applications require multiple jobs to achieve their
goals, with inherent dependencies among the jobs. A dependency could be
sequential, where one job can only start after another job has finished.
Or it could be conditional, where the execution of a job depends on the
return value or status of another job. In other cases, parallel execution
of multiple jobs may be permitted – or desired – to exploit the massive
pool of compute nodes provided by Hadoop.

These job dependencies are often expressed as a Directed Acyclic Graph,
also called a workflow. A node in the workflow is typically a job (a
computation on the grid) or another type of action such as an e-mail
notification. Computations can be expressed in map/reduce, Pig, Hive or
any other programming paradigm available on the grid. Edges of the graph
represent transitions from one node to the next, as the execution of a
workflow proceeds.

Describing a workflow in a declarative way has the advantage of decoupling
job dependencies and execution control from application logic. Furthermore,
the workflow is modularized into jobs that can be reused within the same
workflow or across different workflows. Execution of the workflow is then
driven by a runtime system without understanding the application logic of
the jobs. This runtime system specializes in reliable and predictable
execution: it can retry actions that have failed or invoke a cleanup action
after termination of the workflow; it can monitor progress, success, or
failure of a workflow, and send appropriate alerts to an administrator. The
application developer is relieved from implementing these generic
procedures.

Furthermore, some applications or workflows need to run at periodic
intervals or when dependent data is available. For example, a workflow
could be executed every day as soon as output data from the previous 24
instances of another, hourly workflow is available. The workflow
coordinator provides such scheduling features, along with prioritization,
load balancing and throttling to optimize utilization of resources in the
cluster. This makes it easier to maintain, control, and coordinate complex
data applications.

Nearly three years ago, a team of Yahoo! developers addressed these
critical requirements for Hadoop-based data processing systems by
developing a new workflow management and scheduling system called Oozie.
While it was initially developed as a Yahoo!-internal project, it was
designed and implemented with the intention of open-sourcing it. Oozie was
released as a GitHub project in early 2010. Oozie is used in production
within Yahoo! and, since it was open-sourced, it has been gaining adoption
among external developers.

Rationale
Commonly, applications that run on Hadoop require multiple Hadoop jobs in
order to obtain the desired results. Furthermore, these Hadoop jobs are
commonly a combination of Java map-reduce jobs, Streaming map-reduce jobs,
Pipes map-reduce jobs, Pig jobs, Hive jobs, HDFS operations, Java programs
and shell scripts.

Because of this, developers find themselves writing ad-hoc glue programs
to combine these Hadoop jobs. These ad-hoc programs are difficult to
schedule, manage, monitor and recover.

Workflow management and scheduling is an essential feature for large-scale
data processing applications. Such applications could

Re: [VOTE] Oozie to join the Incubator

2011-06-29 Thread Mattmann, Chris A (388J)
+1 (binding). Good luck guys!

Cheers,
Chris 

Sent from my iPad

On Jun 29, 2011, at 12:10 PM, Mohammad Islam misla...@yahoo.com wrote:

 Hi All,
 
 The discussion about Oozie proposal is settling down. Therefore I would like
 to initiate a vote to accept Oozie as an Apache Incubator project.
 
 The latest proposal is pasted at the end and it could be found in the wiki as
 well:
 
 http://wiki.apache.org/incubator/OozieProposal
 
 
 The related discussion thread is at:
 http://www.mail-archive.com/general@incubator.apache.org/msg29633.html
 
 
 Please cast your votes:
 
 [  ] +1 Accept Oozie for incubation
 [  ] +0 Indifferent to Oozie incubation
 [  ] -1 Reject Oozie for incubation
 
 This vote will close 72 hours  from now.
 
 Regards,
 Mohammad
 
 

Re: [VOTE] Oozie to join the Incubator

2011-06-29 Thread Suresh Marru
Hi Mohammad,

I am interested in contributing to this project. Since no one has voted
yet, may I add my name to the Initial Committers?

Thanks,
Suresh

On Jun 29, 2011, at 3:10 PM, Mohammad Islam wrote:

 Hi All,
 
 The discussion about Oozie proposal is settling down. Therefore I would like
 to initiate a vote to accept Oozie as an Apache Incubator project.
 
 The latest proposal is pasted at the end and it could be found in the wiki as 
 well:
 
 http://wiki.apache.org/incubator/OozieProposal
 
 
 The related discussion thread is at:
 http://www.mail-archive.com/general@incubator.apache.org/msg29633.html
 
 
 Please cast your votes:
 
 [  ] +1 Accept Oozie for incubation
 [  ] +0 Indifferent to Oozie incubation
 [  ] -1 Reject Oozie for incubation
 
 This vote will close 72 hours  from now.
 
 Regards,
 Mohammad
 
 
 Abstract
 Oozie is a server-based workflow scheduling and coordination system to manage
 data processing jobs for Apache Hadoop™.
 
 Proposal
 Oozie is an extensible, scalable and reliable system to define, manage,
 schedule, and execute complex Hadoop workloads via web services. More
 specifically, this includes:
 
   * An XML-based declarative framework to specify a job or a complex workflow
 of dependent jobs.
   * Support for different types of jobs such as Hadoop Map-Reduce, Pipes,
 Streaming, Pig, Hive and custom Java applications.
   * Workflow scheduling based on frequency and/or data availability.
   * Monitoring capability, automatic retry and failure handling of jobs.
   * An extensible and pluggable architecture to allow arbitrary grid
 programming paradigms.
   * Authentication, authorization, and capacity-aware load throttling to allow
 multi-tenant software as a service.
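
 As a rough sketch of what such an XML workflow definition looks like (based on
 the Oozie workflow schema of that era; the application name, paths, and
 properties below are illustrative assumptions, not part of the proposal), a
 minimal workflow chains actions with explicit success/failure transitions:

```xml
<!-- Hypothetical minimal workflow: one map-reduce action with
     explicit ok/error transitions. Names and paths are illustrative. -->
<workflow-app name="example-wf" xmlns="uri:oozie:workflow:0.2">
  <start to="mr-node"/>
  <action name="mr-node">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <name>mapred.input.dir</name>
          <value>/user/example/input</value>
        </property>
        <property>
          <name>mapred.output.dir</name>
          <value>/user/example/output</value>
        </property>
      </configuration>
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Map-reduce action failed</message>
  </kill>
  <end name="end"/>
</workflow-app>
```

 The declarative form keeps the dependency graph (start → action → end, with an
 error path to a kill node) separate from the application logic inside each
 action, which is the decoupling the Background section below argues for.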
 
 Background
 Most data processing applications require multiple jobs to achieve their
 goals, with inherent dependencies among the jobs. A dependency could be
 sequential, where one job can only start after another job has finished. Or it
 could be conditional, where the execution of a job depends on the return value
 or status of another job. In other cases, parallel execution of multiple jobs
 may be permitted – or desired – to exploit the massive pool of compute nodes
 provided by Hadoop.
 
 These job dependencies are often expressed as a Directed Acyclic Graph, also
 called a workflow. A node in the workflow is typically a job (a computation on
 the grid) or another type of action such as an email notification.
 Computations can be expressed in map/reduce, Pig, Hive or any other
 programming paradigm available on the grid. Edges of the graph represent
 transitions from one node to the next, as the execution of a workflow
 proceeds.
 
 Describing a workflow in a declarative way has the advantage of decoupling job
 dependencies and execution control from application logic. Furthermore, the
 workflow is modularized into jobs that can be reused within the same workflow
 or across different workflows. Execution of the workflow is then driven by a
 runtime system without understanding the application logic of the jobs. This
 runtime system specializes in reliable and predictable execution: it can retry
 actions that have failed or invoke a cleanup action after termination of the
 workflow; it can monitor progress, success, or failure of a workflow, and send
 appropriate alerts to an administrator. The application developer is relieved
 from implementing these generic procedures.
 
 Furthermore, some applications or workflows need to run at periodic intervals
 or when dependent data is available. For example, a workflow could be executed
 every day as soon as output data from the previous 24 instances of another,
 hourly workflow is available. The workflow coordinator provides such
 scheduling features, along with prioritization, load balancing and throttling
 to optimize utilization of resources in the cluster. This makes it easier to
 maintain, control, and coordinate complex data applications.
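
 The daily-job-over-24-hourly-instances example above can be sketched as a
 coordinator definition (again an illustration based on the early coordinator
 schema; the dataset name, dates, and HDFS paths are assumptions, not taken
 from the proposal):

```xml
<!-- Hypothetical coordinator: runs a workflow once per day, once the
     previous 24 instances of an hourly dataset are available.
     All names, dates, and HDFS paths are illustrative. -->
<coordinator-app name="daily-aggregation" frequency="${coord:days(1)}"
                 start="2011-07-01T00:00Z" end="2011-12-31T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.1">
  <datasets>
    <dataset name="hourly-logs" frequency="${coord:hours(1)}"
             initial-instance="2011-06-30T00:00Z" timezone="UTC">
      <uri-template>hdfs://namenode/logs/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="input" dataset="hourly-logs">
      <!-- the 24 most recent hourly instances -->
      <start-instance>${coord:current(-23)}</start-instance>
      <end-instance>${coord:current(0)}</end-instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>hdfs://namenode/apps/daily-aggregation</app-path>
    </workflow>
  </action>
</coordinator-app>
```

 Each materialized action waits until all 24 hourly input instances exist
 before the target workflow is submitted, which is the data-availability
 trigger described above.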
 
 Nearly three years ago, a team of Yahoo! developers addressed these critical
 requirements for Hadoop-based data processing systems by developing a new
 workflow management and scheduling system called Oozie. While it was initially
 developed as a Yahoo!-internal project, it was designed and implemented with
 the intention of open-sourcing. Oozie was released as a GitHub project in
 early 2010. Oozie is used in production within Yahoo! and, since it has been
 open-sourced, it has been gaining adoption with external developers.
 
 Rationale
 Commonly, applications that run on Hadoop require multiple Hadoop jobs in
 order to obtain the desired results. Furthermore, these Hadoop jobs are
 commonly a combination of Java map-reduce jobs, Streaming map-reduce jobs,
 Pipes map-reduce jobs, Pig jobs, Hive jobs, HDFS operations, Java programs and
 

Re: [VOTE] Oozie to join the Incubator

2011-06-29 Thread Nigel Daley
+1 (binding)

Sent from my iPad

On Jun 29, 2011, at 12:10 PM, Mohammad Islam misla...@yahoo.com wrote:


Bluesky calls for a new mentor!!!

2011-06-29 Thread Chen Liu
Hi all,

The Bluesky project is calling for a new mentor to guide us through the
release work for the 4th version.

BlueSky is an e-learning solution designed to help address the disparity in
availability of qualified education between well-developed cities and poorer
regions of China (e.g., the countryside of Western China). BlueSky is already
deployed to 12 + 5 primary/high schools with positive reviews. BlueSky was
originally created by Qinghua Zheng and Jun Liu in September 2005. BlueSky
development is done at the XJTU-IBM Open Technology and Application Joint
Development Center, where more than 20 developers are involved. It entered
incubation on 2008-01-12.

BlueSky consists of two subsystems -- the RealClass system and the MersMp
system -- both of which contain a set of flexible, extensible applications
such as the Distance Collaboration System, Collaboration Player, Collaboration
Recording Tool, Resources Sharing and Management Platform, and Access of
Mobile Terminal, designed by engineers and educators with years of experience
in the problem domain, as well as a framework that makes it possible to create
new applications that coexist with others.

Currently, the Bluesky project has evolved into its 4th version, with more
flexible applications and better stability. We have developed the core
applications of the BlueSky system with Qt to support both Windows and Linux.
More importantly, the new version features integrated applications, which
means that a user can record a video during the interactive process and then
play it back in VOD mode. The third advancement is that the 4th version
supports the Android platform, so we can further advance BlueSky in the mobile
domain.

We propose to move future development of BlueSky to the Apache Software
Foundation in order to build a broader user and developer community. We hope
to encourage contributions to and use of BlueSky by other developing countries
with similar education needs. Currently, the new version of the BlueSky system
is nearly ready to release. We really need an enthusiastic and responsive
mentor to help us complete a full cycle of a successful release. We hope that,
with the help of developers from all around the world, the system can become
more powerful and functional in the education area.

We appreciate your attention to the Bluesky project.

Best regards.




Re: [VOTE] Release Apache log4php 2.1.0 (RC2)

2011-06-29 Thread Yoav Shapira
On Tue, Jun 28, 2011 at 5:35 PM, Ivan Habunek ivan.habu...@gmail.com wrote:
 Dear all,

 It is my pleasure to announce the second release candidate for Apache
 log4php 2.1.0.

 Since we are short of PMC members at the log4php project, I would
 appreciate if other PMCs would join in so that we may pass this vote.

 Fixes compared to RC1:
  * included build.xml in source packages
  * fixed version number in title of api docs

 Significant changes in this release include:
  * a new logging level: trace
  * a new appender: MongoDB (thanks to Vladimir Gorej)
  * a plethora of bugfixes and other code improvements
  * most of the site docs have been rewritten to make log4php more
 accessible to users

 Apache log4php 2.1.0 RC2 is available for review here:
  * http://people.apache.org/builds/logging/log4php/2.1.0/RC2/

 The tag for this release is available here:
  * http://svn.apache.org/viewvc/logging/log4php/tags/apache-log4php-2.1.0/

 The site for this version can be found at:
  * http://people.apache.org/~ihabunek/apache-log4php-2.1.0-RC2/site/

 According to the process, please vote with:
 [ X ] +1 Yes go ahead and release the artifacts

I'm not a PHP expert, but I want to help out, so I played with it a
bit.  Didn't see anything wrong.  Not exactly exhaustive testing, but
looks fine.

Yoav

-
To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
For additional commands, e-mail: general-h...@incubator.apache.org



Re: Bluesky calls for a new mentor!

2011-06-29 Thread Joe Schaefer
Sorry, I'm going to pass on this.  During the entire time you guys have
been at the ASF you have not managed to develop *any* project governance
that could charitably be described as open.  You are supposed to be doing
your development work in the ASF subversion repository, using ASF mailing
lists, as peers.  Looking at the (limited) commit history, there is a total
imbalance between the number of people associated with the development work (20+)
and the number of people with Apache accounts here (2).  I don't see any attempts
to rectify this other than to say there are cultural issues at play and that
from now on you'll be using the mailing list to discuss *results*.  Frankly,
this particular issue isn't something that can be written off so easily.

What we really need you to discuss are *plans*, how you will implement them,
who will implement them, and how you will collaborate in the codebase as peers.

That is what open development is all about, and the main reason why your mentors
are looking to shut down the project at this point.





RE: Bluesky calls for a new mentor!

2011-06-29 Thread Noel J. Bergman
Joe Schaefer wrote:
 Chen Liu wrote:
  We propose to move future development of BlueSky to the Apache Software
  Foundation in order to build a broader user and developer community.

 You are supposed to be doing your development work in the ASF subversion
 repository, using ASF mailing lists, as peers.

Chen, as Joe points out, these are what BlueSky should have been doing for
the past three (3) years, and yet we still hear a proposal for the future.

 Looking at the (limited) commit history, there is a total imbalance
between
 the number of people associated with the development work (20+) and the
 number of people with Apache accounts here (2).

Again, as Joe points out, ALL BlueSky development should have been done via
the ASF infrastructure, not periodically synchronized.  We are a development
community, not a remote archive.

 What we really need you to discuss are *plans*, how you will implement
them,
 who will implement them, and how you will collaborate in the codebase as
peers.

Joe, again, is right on the money.  The BlueSky project must immediately
make significant strides to rectify these issues.  Now, not later.

We should see:

  1) All current code in the ASF repository.
  2) All development via ASF accounts (get the rest of the people signed up).
  3) Development discussion on the mailing list.
  4) All licensing issues cleaned up.

--- Noel






Re: Bluesky calls for a new mentor!

2011-06-29 Thread SamuelKevin
Hi, Noel:

2011/6/30 Noel J. Bergman n...@devtech.com

 Joe Schaefer wrote:
  Chen Liu wrote:
   We propose to move future development of BlueSky to the Apache Software
   Foundation in order to build a broader user and developer community.

  You are supposed to be doing your development work in the ASF subversion
  repository, using ASF mailing lists, as peers.

 Chen, as Joe points out, these are what BlueSky should have been doing for
 the past three (3) years, and yet we still hear a proposal for the future.

  Looking at the (limited) commit history, there is a total imbalance
 between
  the number of people associated with the development work (20+) and the
  number of people with Apache accounts here (2).

 I guess I can explain that. Most of the developers of the BlueSky project are
students. As you all know, students come when they join the school and go
after they graduate, so the active developers number around 10. We used to
have 5 committers, but now only 2 committers remain active.

 Again, as Joe points out, ALL BlueSky development should have been done via
 the ASF infrastructure, not periodically synchronized.  We are a
 development
 community, not a remote archive.

  What we really need you to discuss are *plans*, how you will implement
 them,
  who will implement them, and how you will collaborate in the codebase as
 peers.

 Joe, again, has this on the money.  The BlueSky project must immediately
 make significant strides to rectify these issues.  Now, not later.

 We should see:

  1) All current code in the ASF repository.
  2) All development via ASF accounts (get the rest of the people signed
 up).
  3) Development discussion on the mailing list. 
  4) All licensing issues cleaned up.

 Following what you've listed, I will forward your suggestions to the Bluesky
dev list, and I hope we can respond quickly after discussion. We appreciate
your help.
regards,
Kevin

--- Noel







-- 
Bowen Ma a.k.a Samuel Kevin @ Bluesky Dev TeamXJTU
Shaanxi Province Key Lab. of Satellite and Terrestrial Network Tech
http://incubator.apache.org/bluesky/


Re: Bluesky calls for a new mentor!

2011-06-29 Thread Chen Liu
We really appreciate both of your suggestions on Bluesky's development in the
ASF.

We expect to finish the code release work for the 4th version by the end of
July. For this work to be effective, we really need an experienced mentor to
guide our release. In addition, there are about 8-9 developers doing the
release work, some of whom are new to releasing code at the ASF owing to the
change of team members. We consider it important to discuss how we can quickly
get this guidance.

By the way, we have continued developing the Bluesky project. After the
release work for the 4th version, we would like to discuss the development of
the 5th version, including IPTV functionality and so on. We would appreciate
suggestions from collaborators in the ASF about mobile terminals and IPTV.


2011/6/30 Noel J. Bergman n...@devtech.com





Re: Bluesky calls for a new mentor!

2011-06-29 Thread Ralph Goers
Sorry, but the explanation below makes things sound even worse. Apache projects 
are not here to give students a place to do school work. What you have 
described is not a community.  If the project cannot build a community of 
people who are interested in the project for more than a school term then it 
doesn't belong here.

Ralph

On Jun 29, 2011, at 8:12 PM, SamuelKevin wrote:

 Hi, Noel:
 
 2011/6/30 Noel J. Bergman n...@devtech.com
 
 Joe Schaefer wrote:
 Chen Liu wrote:
 We propose to move future development of BlueSky to the Apache Software
 Foundation in order to build a broader user and developer community.
 
 You are supposed to be doing your development work in the ASF subversion
 repository, using ASF mailing lists, as peers.
 
 Chen, as Joe points out, these are what BlueSky should have been doing for
 the past three (3) years, and yet we still here a proposal for the future.
 
 Looking at the (limited) commit history, there is a total imbalance
 between
 the number of people associated with the development work (20+) and the
 number of people with Apache accounts here (2).
 
 I guess i can explain that. Most of the developers of BlueSky project are
 students. As you all know, students come  when they join in school and go
 after they graduate. So the active developers are around 10. Like we used to
 have 5 committers, but now we only have 2 committers in active.
 





Re: Bluesky calls for a new mentor!

2011-06-29 Thread Luciano Resende
On Wed, Jun 29, 2011 at 8:18 PM, Chen Liu liuchen0...@gmail.com wrote:
 We really appreciate both of your suggestions on Bluesky's development in
 ASF.

 We think that we are supposed to finish code release work of the 4th version
 in the end of July.Due to the effective work we really need a experienced
 mentor to guide our release. In addition, there are about 8-9 developers to
 do the release work, some of whom are green hands to release code in ASF
 owing to change of team members.We consider that it is important if there is
 a disscussion about how to fast acquire this guide instructions.

 By the way, we have been developing the bluesky project. After the release
 work of 4th version. We would like to discuss the deployment of the 5th
 version including IPTV function and so on. We appreciate collaboraters in
 ASF to give us some suggestions about mobile terminal and IPTV.


Who are the current students/developers who worked on the so-called
4th version of the code? The interesting part is that the last JIRA issue
was created in June 2010 (BLUESKY-10), which raises the question of
who is actually committing the other students' code to the repository.

-- 
Luciano Resende
http://people.apache.org/~lresende
http://twitter.com/lresende1975
http://lresende.blogspot.com/




Re: Bluesky calls for a new mentor!

2011-06-29 Thread SamuelKevin
Hi, Luciano:
 Currently, we have 4 committers, and 2 of them are inactive. The newest
source code will be committed by me after the students grant permission.
regards,
Kevin

2011/6/30 Luciano Resende luckbr1...@gmail.com




-- 
Bowen Ma a.k.a Samuel Kevin @ Bluesky Dev TeamXJTU
Shaanxi Province Key Lab. of Satellite and Terrestrial Network Tech
http://incubator.apache.org/bluesky/


Re: [VOTE] Oozie to join the Incubator

2011-06-29 Thread Ashish
+1 (non-binding)

On Thu, Jun 30, 2011 at 12:40 AM, Mohammad Islam misla...@yahoo.com wrote:

 Hi All,

 The discussion about the Oozie proposal is settling down. Therefore I
 would like to initiate a vote to accept Oozie as an Apache Incubator
 project.

 The latest proposal is pasted at the end, and it can also be found in
 the wiki:

 http://wiki.apache.org/incubator/OozieProposal


 The related discussion thread is at:
 http://www.mail-archive.com/general@incubator.apache.org/msg29633.html


 Please cast your votes:

 [  ] +1 Accept Oozie for incubation
 [  ] +0 Indifferent to Oozie incubation
 [  ] -1 Reject Oozie for incubation

 This vote will close 72 hours from now.

 Regards,
 Mohammad


 Abstract
 Oozie is a server-based workflow scheduling and coordination system to
 manage data processing jobs for Apache Hadoop(TM).

 Proposal
 Oozie is an extensible, scalable and reliable system to define, manage,
 schedule, and execute complex Hadoop workloads via web services. More
 specifically, this includes:

     * An XML-based declarative framework to specify a job or a complex
       workflow of dependent jobs.
     * Support for different types of jobs such as Hadoop Map-Reduce,
       Pipes, Streaming, Pig, Hive and custom Java applications.
     * Workflow scheduling based on frequency and/or data availability.
     * Monitoring capability, automatic retry and failure handling of jobs.
     * Extensible and pluggable architecture to allow arbitrary grid
       programming paradigms.
     * Authentication, authorization, and capacity-aware load throttling
       to allow multi-tenant software as a service.

 Background
 Most data processing applications require multiple jobs to achieve their
 goals, with inherent dependencies among the jobs. A dependency could be
 sequential, where one job can only start after another job has finished.
 Or it could be conditional, where the execution of a job depends on the
 return value or status of another job. In other cases, parallel execution
 of multiple jobs may be permitted – or desired – to exploit the massive
 pool of compute nodes provided by Hadoop.

 These job dependencies are often expressed as a Directed Acyclic Graph,
 also called a workflow. A node in the workflow is typically a job (a
 computation on the grid) or another type of action such as an email
 notification. Computations can be expressed in map/reduce, Pig, Hive or
 any other programming paradigm available on the grid. Edges of the graph
 represent transitions from one node to the next, as the execution of a
 workflow proceeds.
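
 The DAG execution model described above can be made concrete with a small
 sketch (illustrative only, not Oozie's implementation): a topological
 ordering over workflow nodes, so each job runs only after all of its
 upstream dependencies have finished. Node names are hypothetical.

```python
from collections import deque

def topological_order(nodes, edges):
    """Return an execution order for workflow nodes given DAG edges.

    nodes: node names; edges: (upstream, downstream) pairs meaning the
    downstream node may only start after the upstream node finishes.
    """
    indegree = {n: 0 for n in nodes}
    children = {n: [] for n in nodes}
    for up, down in edges:
        children[up].append(down)
        indegree[down] += 1
    # Start with nodes that have no dependencies.
    ready = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    if len(order) != len(nodes):
        raise ValueError("cycle detected: not a valid workflow DAG")
    return order

# Hypothetical workflow: ingest feeds two transforms, which feed a report.
nodes = ["ingest", "pig-transform", "hive-transform", "report"]
edges = [("ingest", "pig-transform"), ("ingest", "hive-transform"),
         ("pig-transform", "report"), ("hive-transform", "report")]
print(topological_order(nodes, edges))
# → ['ingest', 'pig-transform', 'hive-transform', 'report']
```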

 Describing a workflow in a declarative way has the advantage of
 decoupling job dependencies and execution control from application logic.
 Furthermore, the workflow is modularized into jobs that can be reused
 within the same workflow or across different workflows. Execution of the
 workflow is then driven by a runtime system without understanding the
 application logic of the jobs. This runtime system specializes in
 reliable and predictable execution: it can retry actions that have failed
 or invoke a cleanup action after termination of the workflow; it can
 monitor progress, success, or failure of a workflow, and send appropriate
 alerts to an administrator. The application developer is relieved from
 implementing these generic procedures.
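
 The generic retry procedure described above can be sketched as follows
 (an illustrative helper, not Oozie's actual retry policy; names and
 defaults are assumptions):

```python
import time

def run_with_retry(action, max_retries=3, delay_seconds=0.1):
    """Run one workflow action, retrying on failure so the application
    developer does not have to implement this generic procedure."""
    attempt = 0
    while True:
        try:
            return action()
        except Exception:
            attempt += 1
            if attempt > max_retries:
                raise  # exhausted retries; a real runtime would alert an admin
            time.sleep(delay_seconds)  # brief back-off before retrying

# Example: a flaky action that fails twice before succeeding.
calls = {"count": 0}
def flaky_action():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("transient failure")
    return "done"

result = run_with_retry(flaky_action, delay_seconds=0.01)
print(result)  # → done
```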

 Furthermore, some applications or workflows need to run at periodic
 intervals or when dependent data is available. For example, a workflow
 could be executed every day as soon as output data from the previous 24
 instances of another, hourly workflow is available. The workflow
 coordinator provides such scheduling features, along with prioritization,
 load balancing and throttling to optimize utilization of resources in
 the cluster. This makes it easier to maintain, control, and coordinate
 complex data applications.
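
 The coordinator's trigger rule, frequency combined with data
 availability, can be sketched as follows (illustrative only; function and
 parameter names are assumptions, not Oozie's coordinator semantics):

```python
from datetime import datetime, timedelta

def should_trigger(last_run, now, frequency, inputs_available):
    """A workflow instance is due when the configured frequency has
    elapsed since the last run AND all dependent input data is present."""
    return (now - last_run) >= frequency and inputs_available

# Hourly workflow: due only once its input datasets have landed.
last = datetime(2011, 6, 29, 10, 0)
now = datetime(2011, 6, 29, 11, 5)
print(should_trigger(last, now, timedelta(hours=1), inputs_available=True))   # → True
print(should_trigger(last, now, timedelta(hours=1), inputs_available=False))  # → False
```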

 Nearly three years ago, a team of Yahoo! developers addressed these
 critical requirements for Hadoop-based data processing systems by
 developing a new workflow management and scheduling system called Oozie.
 While it was initially developed as a Yahoo!-internal project, it was
 designed and implemented with the intention of open-sourcing it. Oozie
 was released as a GitHub project in early 2010. Oozie is used in
 production within Yahoo!, and since being open-sourced it has been
 gaining adoption with external developers.

 Rationale
 Commonly, applications that run on Hadoop require multiple Hadoop jobs
 in order to obtain the desired results. Furthermore, these Hadoop jobs
 are commonly a combination of Java map-reduce jobs, Streaming map-reduce
 jobs, Pipes map-reduce jobs, Pig jobs, Hive jobs, HDFS operations, Java
 programs and shell scripts.

 Because of this, developers find themselves writing ad-hoc glue programs
 to combine these Hadoop jobs. These ad-hoc programs are difficult to
 schedule, manage, monitor and recover.

 

Re: [VOTE] Oozie to join the Incubator

2011-06-29 Thread Alejandro Abdelnur
+1 (non-binding)

On Wed, Jun 29, 2011 at 10:18 PM, Ashish paliwalash...@gmail.com wrote:

 +1 (non-binding)

 On Thu, Jun 30, 2011 at 12:40 AM, Mohammad Islam misla...@yahoo.com
 wrote:


Re: [VOTE] Oozie to join the Incubator

2011-06-29 Thread Arvind Prabhakar
+1 (non-binding)

Thanks,
Arvind

On Wed, Jun 29, 2011 at 12:10 PM, Mohammad Islam misla...@yahoo.com wrote:

Re: [VOTE] Oozie to join the Incubator

2011-06-29 Thread Edward J. Yoon
Cool project, +1

On Thu, Jun 30, 2011 at 2:23 PM, Arvind Prabhakar arv...@apache.org wrote:
 +1 (non-binding)

 Thanks,
 Arvind

 On Wed, Jun 29, 2011 at 12:10 PM, Mohammad Islam misla...@yahoo.com wrote:

Re: [VOTE] Oozie to join the Incubator

2011-06-29 Thread Roman Shaposhnik
+1 (non-binding)

On Wed, Jun 29, 2011 at 12:10 PM, Mohammad Islam misla...@yahoo.com wrote: