[PROPOSAL] Sqoop Project

2011-05-27 Thread arv...@cloudera.com
Greetings All,

We would like to propose the Sqoop Project for inclusion in the ASF Incubator
as a new podling. Sqoop is a tool designed for efficiently transferring bulk
data between Apache Hadoop and structured datastores such as relational
databases. The complete proposal can be found at:

http://wiki.apache.org/incubator/SqoopProposal

The initial contents of this proposal are also pasted below for convenience.

Thanks and Regards,
Arvind Prabhakar

= Sqoop - A Data Transfer Tool for Hadoop =

== Abstract ==

Sqoop is a tool designed for efficiently transferring bulk data between
Apache Hadoop and structured datastores such as relational databases. You
can use Sqoop to import data from external structured datastores into the
Hadoop Distributed File System or related systems like Hive and HBase.
Conversely, Sqoop can be used to extract data from Hadoop and export it to
external structured datastores such as relational databases and enterprise
data warehouses.

== Proposal ==

Hadoop and related systems operate on large volumes of data. Typically this
data originates outside the Hadoop infrastructure and must be provisioned
for consumption by Hadoop and related systems for analysis and processing.
Sqoop allows fast provisioning of data into Hadoop and related systems by
providing a bulk import and export mechanism that enables consumers to
effectively use Hadoop for data analysis and processing.

== Background ==

Sqoop was initially developed by Cloudera to enable the import and export of
data between various databases and the Hadoop Distributed File System (HDFS).
It was provided as a patch to the Hadoop project via the issue
[[https://issues.apache.org/jira/browse/HADOOP-5815|HADOOP-5815]] and was
maintained as a contrib module to Hadoop from May 2009 to April 2010. In
April 2010, Sqoop was removed from Hadoop contrib via
[[https://issues.apache.org/jira/browse/MAPREDUCE-1644|MAPREDUCE-1644]] and
was made available by Cloudera on [[http://github.com/cloudera/sqoop|GitHub]].


Since then, Sqoop has been maintained by Cloudera as an open source project
on GitHub. All code available in Sqoop is open source and made publicly
available under the Apache 2 license. During this time Sqoop has been
formally released three times as versions 1.0, 1.1 and 1.2.

== Rationale ==

Hadoop is often used to process data that originates in, or is later served
by, structured data stores such as relational databases, spreadsheets or
enterprise data warehouses. Unfortunately, current methods of transferring
data are inefficient and ad hoc, often consisting of manual steps specific
to the external system. These steps are necessary to help provision this
data for consumption by Map-Reduce jobs, or by systems that build on top of
Hadoop such as Hive and Pig. The transfer of this data can take a substantial
amount of time depending upon its size. A transfer approach that is optimal
for one particular datastore will typically not work as well with another
datastore due to inherent architectural differences between datastore
implementations. Sqoop addresses this problem by connecting Hadoop with
external systems via pluggable connectors. Specialized connectors are
developed for optimal performance of data transfer between Hadoop and target
systems.

Analyzed and processed data from Hadoop and related systems may also need
to be provisioned outside of Hadoop for consumption by business
applications. Sqoop allows the export of data from Hadoop to external
systems to facilitate its use in other systems. This too, like the import
scenario, is implemented via specialized connectors that are built for the
purpose of optimal integration between Hadoop and external systems.

Connectors can be built for systems that Sqoop does not yet integrate with,
so such systems can be easily incorporated into Sqoop. Connectors allow
Sqoop to interface with external systems of different types, ensuring that
newer systems can integrate with Hadoop with relative ease and in a
consistent manner.
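
To make the idea of pluggable connectors concrete, here is a minimal sketch in
Python. It is only an illustration of the general pattern, not Sqoop's actual
API: the names ConnectorRegistry, GenericConnector and MySQLConnector, and the
choice to select a connector by connect-string prefix, are all assumptions
made for this example.

```python
class ConnectorRegistry:
    """Maps a connect-string prefix to a connector factory (hypothetical design)."""

    def __init__(self, fallback):
        self._fallback = fallback   # used when no specialized connector matches
        self._factories = {}        # prefix -> factory callable

    def register(self, prefix, factory):
        """Plug in a connector for connect strings starting with `prefix`."""
        self._factories[prefix] = factory

    def lookup(self, connect_string):
        """Return a connector instance for the given connect string."""
        # Prefer the longest matching prefix so the most specialized connector wins.
        matches = [p for p in self._factories if connect_string.startswith(p)]
        if not matches:
            return self._fallback()
        return self._factories[max(matches, key=len)]()


class GenericConnector:
    """Hypothetical catch-all connector for any datastore."""
    name = "generic"


class MySQLConnector:
    """Hypothetical connector tuned for one particular datastore."""
    name = "mysql-specialized"


registry = ConnectorRegistry(fallback=GenericConnector)
registry.register("jdbc:mysql:", MySQLConnector)
```

With this arrangement, a new external system is supported by registering one
more factory, while systems without a specialized connector still fall back
to the generic one.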

Besides allowing integration with other external systems, Sqoop provides
tight integration with systems that build on top of Hadoop such as Hive and
HBase, thus providing data integration between Hadoop-based systems and
external systems in a single step.

== Initial Goals ==

Sqoop is currently in its first major release with a considerable number of
enhancement requests, tasks, and issues logged towards its future
development. The initial goal of this project will be to address the highly
requested features and bug-fixes towards its next dot release. The key
features of interest are the following:
 * Support for bulk import into Apache HBase.
 * Allow the user to supply a password in a permission-protected file.
 * Support for a pluggable query to help Sqoop identify the metadata
associated with the source or target table definitions.
 * Allow the user to specify custom split semantics for efficient
parallelization of import jobs.
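
As a rough illustration of what split semantics mean for a parallel import,
the sketch below divides a numeric key range into contiguous slices, one per
parallel task. The function name split_range and its parameters are
hypothetical and are not part of Sqoop; a real implementation would also need
to handle non-numeric keys and skewed data.

```python
def split_range(lo, hi, num_splits):
    """Divide the inclusive integer range [lo, hi] into contiguous slices.

    Returns a list of (start, end) pairs, at most num_splits long, whose
    sizes differ by at most one. Each pair could drive one import task.
    """
    if num_splits < 1:
        raise ValueError("num_splits must be at least 1")
    if hi < lo:
        return []  # empty range: nothing to import

    total = hi - lo + 1
    num_splits = min(num_splits, total)   # never create empty slices
    base, extra = divmod(total, num_splits)

    splits = []
    start = lo
    for i in range(num_splits):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first slices
        end = start + size - 1
        splits.append((start, end))
        start = end + 1
    return splits
```

For example, split_range(1, 10, 3) yields three slices covering keys 1-4, 5-7
and 8-10, so each task reads a disjoint part of the table.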

= Current Stat

Re: [PROPOSAL] Sqoop Project

2011-06-01 Thread arv...@cloudera.com
Hi Mohammad,

Thanks for your offer to step in as a mentor. I have added you to the list
of nominated mentors on the proposal.

Thanks and Regards,
Arvind Prabhakar

On Wed, Jun 1, 2011 at 7:19 AM, Mohammad Nour El-Din <
nour.moham...@gmail.com> wrote:

> +1 on the proposal
>
> Sounds like a very good project. Also, I am interested in being a
> mentor if you would like to have another one and if that's possible :).
>
> On Tue, May 31, 2011 at 3:13 PM, Phillip Rhodes
>  wrote:
> > On Fri, May 27, 2011 at 2:40 PM, arv...@cloudera.com <
> arv...@cloudera.com>wrote:
> >
> >> Greetings All,
> >>
> >> We would like to propose Sqoop Project for inclusion in ASF Incubator as
> a
> >> new podling. Sqoop is a tool designed for efficiently transferring bulk
> >> data
> >> between Apache Hadoop and structured datastores such as relational
> >> databases. The complete proposal can be found at:
> >>
> >> http://wiki.apache.org/incubator/SqoopProposal
> >>
> >>
> > +1
> >
>
>
>
> --
> Thanks
> - Mohammad Nour
>   Author of (WebSphere Application Server Community Edition 2.0 User Guide)
>   http://www.redbooks.ibm.com/abstracts/sg247585.html
> - LinkedIn: http://www.linkedin.com/in/mnour
> - Blog: http://tadabborat.blogspot.com
> 
> "Life is like riding a bicycle. To keep your balance you must keep moving"
> - Albert Einstein
>
> "Writing clean code is what you must do in order to call yourself a
> professional. There is no reasonable excuse for doing anything less
> than your best."
> - Clean Code: A Handbook of Agile Software Craftsmanship
>
> "Stay hungry, stay foolish."
> - Steve Jobs
>
> -
> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> For additional commands, e-mail: general-h...@incubator.apache.org
>
>


Re: [PROPOSAL] Sqoop Project

2011-06-03 Thread arv...@cloudera.com
Hi All,

It has been a week since the proposal was submitted. Responses to the
proposal so far have been very encouraging. Thank you all for your support.
We would like to include more mentors on the proposal, so we would greatly
appreciate it if you could volunteer as one.

Unless there is any active discussion regarding this proposal in the early
part of next week, I would like to call this to vote on Tuesday, June 14th.

Thanks and Regards,
Arvind Prabhakar



On Wed, Jun 1, 2011 at 9:49 AM, Mohammad Nour El-Din <
nour.moham...@gmail.com> wrote:

> Thanks a lot Arvind
>


Re: [PROPOSAL] Sqoop Project

2011-06-06 Thread arv...@cloudera.com
Hi Olivier,

Thank you for volunteering to be a mentor for the project. I have added you
to the list of nominated mentors on the proposal.

Thanks and Regards,
Arvind Prabhakar

On Sun, Jun 5, 2011 at 6:30 AM, Olivier Lamy  wrote:

> Hello,
>
> If you need more mentors, I can help.
>
> --
> Olivier Lamy
> http://twitter.com/olamy | http://www.linkedin.com/in/olamy
>


Re: [PROPOSAL] Sqoop Project

2011-06-06 Thread arv...@cloudera.com
Hello All,

I had a typo in my previous mail where I stated "Unless there is any active
discussion regarding this proposal in the early part of next week, I would
like to call this to vote on Tuesday, June 14th".

The date mentioned there was off by a week and was not what I intended. I
will instead be calling a vote on this proposal tomorrow, June 7th.

My apologies for any confusion this has caused.

Thanks and Regards,
Arvind Prabhakar


[VOTE] Accept Sqoop for Incubation

2011-06-07 Thread arv...@cloudera.com
As there have been no active discussions on the [PROPOSAL] thread for a few
days now, I would like to initiate the vote to accept Sqoop as an
Apache Incubator project. The proposal discussion thread and full text
of the proposal can be found at the following locations:

Discussion Thread:
http://www.mail-archive.com/general@incubator.apache.org/msg27726.html
Proposal: http://wiki.apache.org/incubator/SqoopProposal

Please cast your votes:

[  ] +1 Accept Sqoop for incubation
[  ] +0 Indifferent to Sqoop incubation
[  ]  -1 Reject Sqoop for incubation

This vote will close 72 hours from now.

Thanks and Regards,
Arvind Prabhakar




Re: [VOTE] Accept Sqoop for Incubation

2011-06-08 Thread arv...@cloudera.com
Welcome Paul! We are delighted to have you on the team!

Thanks,
Arvind Prabhakar

On Wed, Jun 8, 2011 at 8:01 PM, Zimdars, Paul A (3880-Affiliate)
 wrote:
> Hey Guys,
>
> Thanks this project looks awesome. I know you've already started VOTE'ing but 
> I'm super interested to join up. We've used Sqoop in a couple of my projects 
> here at NASA and I am interested in contributing. I've added myself to the 
> wiki as a committer and look forward to working with you guys (if you are OK 
> with it).
>
> Paul Z



Re: [VOTE] Accept Sqoop for Incubation

2011-06-11 Thread arv...@cloudera.com
This VOTE is now closed. I will be sending out the results in a
separate mail soon.

Thanks and Regards,
Arvind Prabhakar

On Thu, Jun 9, 2011 at 2:37 AM, Michael McCandless
 wrote:
> +1
>
> Mike McCandless
>
> http://blog.mikemccandless.com



[VOTE] [RESULT] Accept Sqoop for Incubation

2011-06-11 Thread arv...@cloudera.com
With 19 +1 votes (11 binding), no -1 votes, and no 0 votes, the vote passes.

Binding votes

  Chris Mattmann
  Sanjiva Weerawarana
  Ralph Goers
  Julien Vermillard
  Mark Struberg
  Tommaso Teofili
  Leo Simons
  Christian Grobmeier
  Niall Pemberton
  Patrick Hunt
  Tom White

Non-binding votes

  Ioannis Canellos
  Nigel Daley
  Edward J. Yoon
  Olivier Lamy
  Steve Loughran
  Phillip Rhodes
  Eric Sammer
  Michael McCandless


The binding votes were counted based on the Incubator PMC membership
list located at:
http://people.apache.org/committers-by-project.html#incubator-pmc

Thanks to everyone who voted.




Re: [VOTE] Accept Bigtop for incubation

2011-06-17 Thread arv...@cloudera.com
+1 (non-binding)

On Fri, Jun 17, 2011 at 10:15 AM, Tom White  wrote:
> As there are no active discussions on the proposal thread, I would
> like to initiate a vote to accept Bigtop as an Apache Incubator
> project.
>
> The proposal is available at
>
> http://wiki.apache.org/incubator/BigtopProposal?action=recall&rev=13
>
> I've also put a copy of the proposal at the end of this email.
>
> The discussion thread is available at
>
> http://mail-archives.apache.org/mod_mbox/incubator-general/201106.mbox/%3cbanlktimriyvs5g5maklqvinauz9h6s5...@mail.gmail.com%3E
>
> Please cast your votes:
>
> [  ] +1 Accept Bigtop for incubation
> [  ] +0 Indifferent to Bigtop incubation
> [  ] -1 Reject Bigtop for incubation
>
> This vote will close 72 hours from now.
>
> Thanks,
> Tom
>
> = Bigtop - Apache Hadoop Ecosystem Packaging and Test =
>
> == Abstract ==
>
> Bigtop - a project for the development of packaging and tests of the
> Hadoop ecosystem.
>
> == Proposal ==
>
> The primary goal of Bigtop is to build a community around the
> packaging and interoperability testing of Hadoop-related projects.
> This includes testing at various levels (packaging, platform, runtime,
> upgrade, etc...) developed by a community with a focus on the system
> as a whole, rather than individual projects.
>
> Build, packaging and integration test code that depends upon official
> releases of the Apache Hadoop-related projects (HDFS, MapReduce,
> HBase, Hive, Pig, ZooKeeper, etc...) will be developed and released by
> this project. As bugs and other issues are found we expect these to be
> fixed upstream.
>
> == Background ==
>
> The initial packaging and test code for Bigtop was developed by
> Cloudera to package projects from the Apache Hadoop ecosystem and
> provide a consistent, inter-operable framework.
>
> == Rationale ==
>
> Hadoop defines itself as:
>
> {{{
> The Apache Hadoop project develops open-source software for reliable,
> scalable, distributed computing. Hadoop includes these subprojects:
>
> * Hadoop Common: The common utilities that support the other Hadoop 
> subprojects.
> * HDFS: A distributed file system that provides high throughput access
> to application data.
> * MapReduce: A software framework for distributed processing of large
> data sets on compute clusters.
> }}}
>
> There are also several other Hadoop-related projects at Apache.  Some
> TLP examples include HBase, Hive, Mahout, ZooKeeper, and Pig.  There
> are also several new projects in the Incubator such as HCatalog, Hama
> and Sqoop.
>
> From a packaging and deployment perspective, the current
> loosely-coupled nature of the project has limitations:
>  1. Insufficient building against trunk versions of dependent projects
> (in the style of Apache Gump).
>  1. Insufficient testing against the trunk versions of dependent projects.
>  1. No consistent packaging for the Linux servers which provide the
> main Hadoop datacenter platform.
>  1. No functional testing against multi-machine clusters as part of
> the regular automated build process. This is due to a lack of a
> physical or virtual Hadoop cluster for testing, and not enough test
> suites designed to run against a live cluster with known datasets.
>
> The intent of this project is to build a community where the projects
> are brought together, packaged, and tested for interoperability.
>
> Projects such as Apache Whirr (incubating), which deploy and use a
> collection of Hadoop-related projects, would benefit from the
> interoperability testing done by Bigtop, rather than picking and
> testing project combinations themselves.
>
> == Initial Goals ==
>
> Much of the code for Bigtop has been released by Cloudera under the
> Apache 2.0 license for over two years.
>
> Some current goals include:
>  * create a set of packages for the Hadoop ecosystem, over a wide
> range of platforms
>  * interoperability test these projects
>  * document project sets that are known to work well together
>
> Bigtop’s release artifact would consist of a single tarball of
> packaging and test code that, when built, would produce source and
> binary Linux packages for the upstream projects.
>
> = Current Status =
>
> == Meritocracy ==
>
> Bigtop was originally developed and released as an open source
> packaging infrastructure, CDH, by Cloudera.
>
> == Community ==
>
> The community is primarily the original developers at Cloudera,
> however a number of contributions to the packaging specifications have
> been accepted from outside contributors. Growing a diverse community
> is the main reason to bring Bigtop to the Apache Incubator.
>
> == Core Developers ==
>
> The core developers for Bigtop project are:
>  * Andrew Bayer has extensive expertise with build tools, specifically
> Jenkins continuous integration and Maven.
>  * Peter Linnell has contributed to the RPM packaging.
>  * Bruno Mahé has overseen much of the development of the RPM and
> Debian packaging system.
>  * Roman Shaposhnik and Konstantin Boudnik designed an

Re: [PROPOSAL] Oozie for the Apache Incubator

2011-06-28 Thread arv...@cloudera.com
+1 (non-binding).

Thanks,
Arvind Prabhakar

On Fri, Jun 24, 2011 at 12:46 PM, Mohammad Islam  wrote:
> Hi,
>
> I would like to propose Oozie to be an Apache Incubator project.
> Oozie is a server-based workflow scheduling and coordination system to manage
> data processing jobs for Apache Hadoop.
>
>
> Here's a link to the proposal in the Incubator wiki
> http://wiki.apache.org/incubator/OozieProposal
>
>
> I've also pasted the initial contents below.
>
> Regards,
>
> Mohammad Islam
>
>
> Start of Oozie Proposal
>
> Abstract
> Oozie is a server-based workflow scheduling and coordination system to manage
> data processing jobs for Apache Hadoop™.
>
> Proposal
> Oozie is an extensible, scalable and reliable system to define, manage,
> schedule, and execute complex Hadoop workloads via web services. More
> specifically, this includes:
>
>        * XML-based declarative framework to specify a job or a complex
>          workflow of dependent jobs.
>        * Support for different types of jobs such as Hadoop Map-Reduce, Pipes,
>          Streaming, Pig, Hive and custom Java applications.
>        * Workflow scheduling based on frequency and/or data availability.
>        * Monitoring capability, automatic retry and failure handling of jobs.
>        * Extensible and pluggable architecture to allow arbitrary grid
>          programming paradigms.
>        * Authentication, authorization, and capacity-aware load throttling to
>          allow multi-tenant software as a service.
>
> Background
> Most data  processing applications require multiple jobs to achieve their 
> goals,
> with inherent dependencies among the jobs. A dependency could be  sequential,
> where one job can only start after another job has finished.  Or it could be
> conditional, where the execution of a job depends on the  return value or 
> status
> of another job. In other cases, parallel  execution of multiple jobs may be
> permitted – or desired – to exploit  the massive pool of compute nodes 
> provided
> by Hadoop.
>
> These  job dependencies are often expressed as a Directed Acyclic Graph, also
> called a workflow. A node in the workflow is typically a job (a  computation 
> on
> the grid) or another type of action such as an eMail  notification. 
> Computations
> can be expressed in map/reduce, Pig, Hive or  any other programming paradigm
> available on the grid. Edges of the graph  represent transitions from one node
> to the next, as the execution of a  workflow proceeds.
>
> Describing  a workflow in a declarative way has the advantage of decoupling 
> job
> dependencies and execution control from application logic. Furthermore,  the
> workflow is modularized into jobs that can be reused within the same  workflow
> or across different workflows. Execution of the workflow is  then driven by a
> runtime system without understanding the application  logic of the jobs. This
> runtime system specializes in reliable and  predictable execution: It can 
> retry
> actions that have failed or invoke a  cleanup action after termination of the
> workflow; it can monitor  progress, success, or failure of a workflow, and 
> send
> appropriate alerts  to an administrator. The application developer is relieved
> from  implementing these generic procedures.
>
> Furthermore, some applications or workflows need to run at periodic intervals
> or when dependent data is available. For example, a workflow could be
> executed every day as soon as output data from the previous 24 instances of
> another, hourly workflow is available. The workflow coordinator provides such
> scheduling features, along with prioritization, load balancing and
> throttling to optimize utilization of resources in the cluster. This makes
> it easier to maintain, control, and coordinate complex data applications.
>
> Nearly three years ago, a team of Yahoo! developers addressed these critical
> requirements for Hadoop-based data processing systems by developing a new
> workflow management and scheduling system called Oozie. While it was
> initially developed as a Yahoo!-internal project, it was designed and
> implemented with the intention of open-sourcing. Oozie was released as a
> GitHub project in early 2010. Oozie is used in production within Yahoo!, and
> since it was open-sourced it has been gaining adoption among external
> developers.
>
> Rationale
> Commonly, applications that run on Hadoop require multiple Hadoop jobs in
> order to obtain the desired results. Furthermore, these Hadoop jobs are
> commonly a combination of Java map-reduce jobs, Streaming map-reduce jobs,
> Pipes map-reduce jobs, Pig jobs, Hive jobs, HDFS operations, Java programs
> and shell scripts.
>
> Because of this, developers find themselves writing ad-hoc glue programs to
> combine these Hadoop jobs. These ad-hoc programs are difficult to schedule,
> manage, monitor and recover.
>
> Workflow management and scheduling is an essential feature for large-scale
> data processing app

Re: [PROPOSAL] Oozie for the Apache Incubator

2011-06-28 Thread arv...@cloudera.com
+1 (non-binding).

Thanks,
Arvind

On Fri, Jun 24, 2011 at 12:46 PM, Mohammad Islam wrote:
> Hi,
>
> I would like to propose Oozie to be an Apache Incubator project.
> Oozie is a server-based workflow scheduling and coordination system to manage
> data processing jobs for Apache Hadoop.
>
>
> Here's a link to the proposal in the Incubator wiki
> http://wiki.apache.org/incubator/OozieProposal
>
>
> I've also pasted the initial contents below.
>
> Regards,
>
> Mohammad Islam
>
>
> Start of Oozie Proposal
>
> Abstract
> Oozie is a server-based workflow scheduling and coordination system to manage
> data processing jobs for Apache Hadoop™.
>
> Proposal
> Oozie is an extensible, scalable and reliable system to define, manage,
> schedule, and execute complex Hadoop workloads via web services. More
> specifically, this includes:
>
>        * XML-based declarative framework to specify a job or a complex
>          workflow of dependent jobs.
>        * Support for different types of jobs such as Hadoop Map-Reduce, Pipes,
>          Streaming, Pig, Hive and custom Java applications.
>        * Workflow scheduling based on frequency and/or data availability.
>        * Monitoring capability, automatic retry and failure handling of jobs.
>        * Extensible and pluggable architecture to allow arbitrary grid
>          programming paradigms.
>        * Authentication, authorization, and capacity-aware load throttling to
>          allow multi-tenant software as a service.
>
> Background
> Most data processing applications require multiple jobs to achieve their
> goals, with inherent dependencies among the jobs. A dependency could be
> sequential, where one job can only start after another job has finished. Or
> it could be conditional, where the execution of a job depends on the return
> value or status of another job. In other cases, parallel execution of
> multiple jobs may be permitted – or desired – to exploit the massive pool of
> compute nodes provided by Hadoop.
>
> These job dependencies are often expressed as a Directed Acyclic Graph, also
> called a workflow. A node in the workflow is typically a job (a computation
> on the grid) or another type of action such as an email notification.
> Computations can be expressed in map/reduce, Pig, Hive or any other
> programming paradigm available on the grid. Edges of the graph represent
> transitions from one node to the next, as the execution of a workflow
> proceeds.
>
> Describing a workflow in a declarative way has the advantage of decoupling
> job dependencies and execution control from application logic. Furthermore,
> the workflow is modularized into jobs that can be reused within the same
> workflow or across different workflows. Execution of the workflow is then
> driven by a runtime system that needs no knowledge of the application logic
> of the jobs. This runtime system specializes in reliable and predictable
> execution: it can retry actions that have failed or invoke a cleanup action
> after termination of the workflow; it can monitor progress, success, or
> failure of a workflow, and send appropriate alerts to an administrator. The
> application developer is relieved from implementing these generic
> procedures.
>
> Furthermore, some applications or workflows need to run at periodic intervals
> or when dependent data is available. For example, a workflow could be
> executed every day as soon as output data from the previous 24 instances of
> another, hourly workflow is available. The workflow coordinator provides such
> scheduling features, along with prioritization, load balancing and
> throttling to optimize utilization of resources in the cluster. This makes
> it easier to maintain, control, and coordinate complex data applications.
>
> Nearly three years ago, a team of Yahoo! developers addressed these critical
> requirements for Hadoop-based data processing systems by developing a new
> workflow management and scheduling system called Oozie. While it was
> initially developed as a Yahoo!-internal project, it was designed and
> implemented with the intention of open-sourcing. Oozie was released as a
> GitHub project in early 2010. Oozie is used in production within Yahoo!, and
> since it was open-sourced it has been gaining adoption among external
> developers.
>
> Rationale
> Commonly, applications that run on Hadoop require multiple Hadoop jobs in
> order to obtain the desired results. Furthermore, these Hadoop jobs are
> commonly a combination of Java map-reduce jobs, Streaming map-reduce jobs,
> Pipes map-reduce jobs, Pig jobs, Hive jobs, HDFS operations, Java programs
> and shell scripts.
>
> Because of this, developers find themselves writing ad-hoc glue programs to
> combine these Hadoop jobs. These ad-hoc programs are difficult to schedule,
> manage, monitor and recover.
>
> Workflow management and scheduling is an essential feature for large-scale
> data processing applications.