Re: Update on hadoop-0.23

2011-09-26 Thread Roman Shaposhnik
Hi Arun!

Great news! Hopefully you wouldn't mind answering some of the questions below...

On Mon, Sep 26, 2011 at 2:07 PM, Arun C Murthy  wrote:
> NextGen MapReduce (aka MRv2, aka YARN) is coming along great:
> # We are happy to report we've done extensive scale testing to confirm 
> stability
>  - Sort/GridMixv3 etc. at ~350nodes
>  - Scale testing with simulated clusters of ~1500 nodes
> # Functional tests for all of MapReduce functionality
> # Pig  (0.9 & 0.9.1) working with NextGen MapReduce

Is there a *released* version of Pig that compiles cleanly against .23
snapshots?
Same question for Hive.

> We are about to finish performance certification for both HDFS & MapReduce in 
> the next
> couple of weeks too, after which we start integration tests with HBase, Hive, 
> Oozie etc.

I'm curious -- what are these integration tests? Can I take a look at
them? It would be really nice if we could leverage those via Bigtop
infrastructure. Currently we have a certain # of integration tests in
Bigtop that we're running against a fully deployed stack, but it would
be quite nice to have extra coverage.

> Given where we are I'm confident we can have a strong hadoop-0.23.0 release
> by late October. The current plan is to deploy to alpha clusters in November. 
> Citius, Altius, Fortius! :)

Could you, please, elaborate on what will be part of that deployment?
Which versions of Pig, Hive, HBase, Oozie and Mahout are you targeting?

Thanks,
Roman.


Re: Update on hadoop-0.23

2011-09-26 Thread Arun C Murthy
Roman, 

In general, we'll need to make changes upstream: 
# I believe someone got HBase working. 
# We made changes to Pig - rather we got help from the Pig team, particularly 
Daniel.

So, we plan to work through the rest of the stack - Hive, Oozie etc. very soon 
and we'll depend on updated releases from the individual projects.

Arun

On Sep 26, 2011, at 3:15 PM, Roman Shaposhnik wrote:

> Hi Arun!
> 
> Great news! Hopefully you wouldn't mind answering some of the questions 
> below...
> 
> On Mon, Sep 26, 2011 at 2:07 PM, Arun C Murthy  wrote:
>> NextGen MapReduce (aka MRv2, aka YARN) is coming along great:
>> # We are happy to report we've done extensive scale testing to confirm 
>> stability
>>  - Sort/GridMixv3 etc. at ~350nodes
>>  - Scale testing with simulated clusters of ~1500 nodes
>> # Functional tests for all of MapReduce functionality
>> # Pig  (0.9 & 0.9.1) working with NextGen MapReduce
> 
> Is there a *released* version of Pig that compiles cleanly against .23
> snapshots?
> Same question for Hive.
> 
>> We are about to finish performance certification for both HDFS & MapReduce 
>> in the next
>> couple of weeks too, after which we start integration tests with HBase, 
>> Hive, Oozie etc.
> 
> I'm curious -- what are these integration tests? Can I take a look at
> them? It would be really nice if we could leverage those via Bigtop
> infrastructure. Currently we have a certain # of integration tests in
> Bigtop that we're running against a fully deployed stack, but it would
> be quite nice to have extra coverage.
> 
>> Given where we are I'm confident we can have a strong hadoop-0.23.0 release
>> by late October. The current plan is to deploy to alpha clusters in 
>> November. Citius, Altius, Fortius! :)
> 
> Could you, please, elaborate on what will be part of that deployment?
> Which versions
> of Pig, Hive, HBase, Oozie and Mahout are you targeting?
> 
> Thanks,
> Roman.



Re: Update on hadoop-0.23

2011-09-26 Thread Arun C Murthy

On Sep 26, 2011, at 11:20 PM, Arun C Murthy wrote:

> Roman, 
> 
> In general, we'll need to make changes upstream: 
> # I believe someone got HBase working. 
> # We made changes to Pig - rather we got help from the Pig team, particularly 
> Daniel.
> 
> So, we plan to work through the rest of the stack - Hive, Oozie etc. very 
> soon and we'll depend on updated releases from the individual projects.
> 

To clarify, the changes to Pig were mainly due to its usage of the Context 
Objects APIs, which changed in hadoop-0.21/hadoop-0.22.

Also, we expect some pieces of the stack to change if they rely on 
undocumented/hidden features in MR.

We are absolutely committed to ensuring end-user MR applications have full 
compatibility - to this end we long ago marked the old APIs as stable 
& supported, i.e. un-deprecated them.

Arun

> Arun
> 
> On Sep 26, 2011, at 3:15 PM, Roman Shaposhnik wrote:
> 
>> Hi Arun!
>> 
>> Great news! Hopefully you wouldn't mind answering some of the questions 
>> below...
>> 
>> On Mon, Sep 26, 2011 at 2:07 PM, Arun C Murthy  wrote:
>>> NextGen MapReduce (aka MRv2, aka YARN) is coming along great:
>>> # We are happy to report we've done extensive scale testing to confirm 
>>> stability
>>> - Sort/GridMixv3 etc. at ~350nodes
>>> - Scale testing with simulated clusters of ~1500 nodes
>>> # Functional tests for all of MapReduce functionality
>>> # Pig  (0.9 & 0.9.1) working with NextGen MapReduce
>> 
>> Is there a *released* version of Pig that compiles cleanly against .23
>> snapshots?
>> Same question for Hive.
>> 
>>> We are about to finish performance certification for both HDFS & MapReduce 
>>> in the next
>>> couple of weeks too, after which we start integration tests with HBase, 
>>> Hive, Oozie etc.
>> 
>> I'm curious -- what are these integration tests? Can I take a look at
>> them? It would be really nice if we could leverage those via Bigtop
>> infrastructure. Currently we have a certain # of integration tests in
>> Bigtop that we're running against a fully deployed stack, but it would
>> be quite nice to have extra coverage.
>> 
>>> Given where we are I'm confident we can have a strong hadoop-0.23.0 release
>>> by late October. The current plan is to deploy to alpha clusters in 
>>> November. Citius, Altius, Fortius! :)
>> 
>> Could you, please, elaborate on what will be part of that deployment?
>> Which versions
>> of Pig, Hive, HBase, Oozie and Mahout are you targeting?
>> 
>> Thanks,
>> Roman.
> 



Re: Update on hadoop-0.23

2011-09-27 Thread Doug Cutting
On 09/26/2011 02:07 PM, Arun C Murthy wrote:
> We are about to finish performance certification for both HDFS &
> MapReduce in the next couple of weeks too, after which we start
> integration tests with HBase, Hive, Oozie etc.

Who's 'we' here?  I haven't seen this happening on the list, so I assume
here you mean you and others working privately?

This is great to hear, though.  Can you provide any more details?

BTW, has anyone else been benchmarking the 0.23 branch yet?  Can they
talk about their experiences?

It's great to see 0.23 take shape.  Thanks for your efforts here, Arun.

Doug


Re: Update on hadoop-0.23

2011-09-27 Thread Roman Shaposhnik
Hi Arun!

Thanks for the quick reply!

I'm sorry if I had too many questions in my original email, but I can't find
an answer to my "integration tests" question. Could you, please, share
a URL with us where I can find out more about them?

On Mon, Sep 26, 2011 at 11:20 PM, Arun C Murthy  wrote:
> # We made changes to Pig - rather we got help from the Pig team, particularly 
> Daniel.
>
> So, we plan to work through the rest of the stack - Hive, Oozie etc. very 
> soon and we'll
> depend on updated releases from the individual projects.

Do we have any kind of commitment from downstream projects as far as those
updates are concerned? Are they targeting these changes as part of a point
(patch) release of an already released version (like Pig 0.9.X for example)
or will it be part of a brand new major release?

Thanks,
Roman.


Re: Update on hadoop-0.23

2011-09-27 Thread Todd Lipcon
Hi all,

Just an update from the HBase side: I've run some cluster tests on
HDFS 0.23 (as of about a month ago) and it generally works well.
Performance for some workloads is ~2x due to HDFS-941, and can be
improved a bit more if I finish HDFS-2080 in time. I did not do
extensive failure testing (to stress the new append/sync code) but I
do plan to do that in the coming months.

HBase trunk can compile against 0.23 by using -Dhadoop23 on the maven
build. Currently some 15 or so tests are failing - the following HBase
JIRA tracks those issues:
https://issues.apache.org/jira/browse/HBASE-4254

(these may be indicative of HDFS side bugs)

Any help there from the community would be appreciated!

-Todd

On Tue, Sep 27, 2011 at 12:24 PM, Roman Shaposhnik  wrote:
> Hi Arun!
>
> Thanks for the quick reply!
>
> I'm sorry if I had too many questions in my original email, but I can't find
> an answer to my "integration tests" question. Could you, please, share
> a URL with us where I can find out more about them?
>
> On Mon, Sep 26, 2011 at 11:20 PM, Arun C Murthy  wrote:
>> # We made changes to Pig - rather we got help from the Pig team, 
>> particularly Daniel.
>>
>> So, we plan to work through the rest of the stack - Hive, Oozie etc. very 
>> soon and we'll
>> depend on updated releases from the individual projects.
>
> Do we have any kind of commitment from downstream projects as far as those
> updates are concerned? Are they targeting these changes as part of a point
> (patch) release of an already released version (like Pig 0.9.X for example)
> or will it be part of a brand new major release?
>
> Thanks,
> Roman.
>



-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Update on hadoop-0.23

2011-09-28 Thread Arun C Murthy
Roman,

On Sep 27, 2011, at 12:24 PM, Roman Shaposhnik wrote:

> I'm sorry if I had too many questions in my original email, but I can't find
> an answer to my "integration tests" question. Could you, please, share
> a URL with us where I can find out more about them?

As you know, me & my team at HW along with folks at Y do a lot of manual 
testing along with tests like GridMix/PigMix etc.

The basic idea is to test all features of HDFS, MapReduce, Streaming, Pipes 
etc. Similarly for Pig, Hive, Oozie. I'm sure none of this is news to you.

Similarly we test for performance for all aspects of MapReduce.

> On Mon, Sep 26, 2011 at 11:20 PM, Arun C Murthy  wrote:
>> # We made changes to Pig - rather we got help from the Pig team, 
>> particularly Daniel.
>> 
>> So, we plan to work through the rest of the stack - Hive, Oozie etc. very 
>> soon and we'll
>> depend on updated releases from the individual projects.
> 
> Do we have any kind of commitment from downstream projects as far as those
> updates are concerned?

The easiest way to get 'commitment' is to test and provide patches as necessary:
# mapreduce-dev@ helped pig-dev@ to do it for Pig
# Todd has done it for HBase, hdfs-dev@ will further help as necessary
# mapreduce-dev@ will soon help out for Hive, Oozie 

... etc.

> Are they targeting these changes as part of a point (patch)
> release of an already released version (like Pig 0.9.X for example) or
> will it be part of a brand new major release?


I don't know about specific releases for all projects: 
# Pig will work with 0.9.1 or 0.9.2 and beyond.
# HBase trunk works (refer to Todd's msg) - the actual release depends on the 
HBase community.

Historically, we make a release of Hadoop Core and then work through related 
projects to make necessary changes - mostly minor, sometimes more. Thus, having 
an early release is important.

Arun



Re: Update on hadoop-0.23

2011-09-29 Thread Jeff Hammerbacher
>
> As you know, me & my team at HW along with folks at Y do a lot of manual
> testing along with tests like GridMix/PigMix etc.
>
> The basic idea is to test all features of HDFS, MapReduce, Streaming, Pipes
> etc. Similarly for Pig, Hive, Oozie. I'm sure none of this is news to you.
>
> Similarly we test for performance for all aspects of MapReduce.


Why not make the code for these tests available to the community via
Bigtop?


Re: Update on hadoop-0.23

2011-09-29 Thread Eric Baldeschwieler
Hi Jeff,

This seems like a great opportunity for you to add some value.  I'd welcome 
that.  It seems rude to me to beat up the folks who have been driving the 
majority of the work on a project to do more.  In general I don't think it's 
good open source etiquette to ask others to contribute their time to address 
your concerns.  This is a community of volunteers after all.

If you're wondering why I am asserting that arun and company have done the 
majority of the work on 23, check out the last graph on this post, or look at 
the commit logs.

http://www.hortonworks.com/the-yahoo-effect/

If you are interested in doing some work, I'd suggest starting by wiring in the 
work on gridmix and pigmix that members of our MapReduce and Pig teams have 
contributed.  That's several man-years worth of testing contributed to the 
community.  These are the centerpieces of our testing.  Go wild.

Follow your passion,

E14



On Sep 29, 2011, at 11:00 AM, Jeff Hammerbacher wrote:

>> 
>> As you know, me & my team at HW along with folks at Y do a lot of manual
>> testing along with tests like GridMix/PigMix etc.
>> 
>> The basic idea is to test all features of HDFS, MapReduce, Streaming, Pipes
>> etc. Similarly for Pig, Hive, Oozie. I'm sure none of this is news to you.
>> 
>> Similarly we test for performance for all aspects of MapReduce.
> 
> 
> Why not make the code for these tests available to the community via
> Bigtop?



Re: Update on hadoop-0.23

2011-09-29 Thread Doug Cutting
On 09/29/2011 12:35 PM, Eric Baldeschwieler wrote:
> If you're wondering why I am asserting that arun and company have
> done the majority of the work on 23, check out the last graph on this
> post, or look at the commit logs.

The ASF discourages the use of Java's @author tag in large part because
it tends to mark code as the territory of particular contributors.  We
want ASF codebases to be the responsibility of an entire community.

Claiming that one party has contributed more than all others together
seems to me to be a similar claim of ownership and a demand for credit.
Folks should contribute to the ASF because they want the contributions
of others to join their contributions, not so they can gain credit.

Also, I'd be concerned for the health of a project if one group was
really doing nearly all of the contribution.

> http://www.hortonworks.com/the-yahoo-effect/

Hmm.  Lines of code are not proportional to effort.  The stacked
cumulative histogram makes slopes steeper for those who happen to be on
top.  And the codebase did not start from zero in 2006.

Here are some other reports for 0.23 that make contribution look pretty
diverse and healthy.

http://s.apache.org/tI
http://s.apache.org/9mp
http://s.apache.org/6E2

Statistics are fun, aren't they!

Doug


Re: Update on hadoop-0.23

2011-09-29 Thread Eric Baldeschwieler
Hi Doug, Jeff, Roman

Let me rephrase my point.  I'd like to request that folks take bigtop project 
discussions onto the bigtop lists and don't greet status reports on 
general@hadoop with insinuations that folks who are working really hard on this 
project should be contributing different things to another project or are 
somehow misbehaving by testing on their own infrastructure with their own 
users.  Any kind of testing is a gift to the community and adds value.  You are 
all welcome to contribute too.  If you find issues, then file JIRAs and work on 
the appropriate project lists.  I believe that observing these points of 
etiquette will help this project continue to prosper.

I agree with you that the Hadoop project is healthy.  I'll leave the stats 
discussion to folks who want to dig through the data.  I'm happy to go through 
the details with you offline.

Thanks,

E14

On Sep 29, 2011, at 2:38 PM, Doug Cutting wrote:

> On 09/29/2011 12:35 PM, Eric Baldeschwieler wrote:
>> If you're wondering why I am asserting that arun and company have
>> done the majority of the work on 23, check out the last graph on this
>> post, or look at the commit logs.
> 
> The ASF discourages the use of Java's @author tag in large part because
> it tends to mark code as the territory of particular contributors.  We
> want ASF codebases to be the responsibility of an entire community.
> 
> Claiming that one party has contributed more than all others together
> seems to me to be a similar claim of ownership and a demand for credit.
> Folks should contribute to the ASF because they want the contributions
> of others to join their contributions, not so they can gain credit.
> 
> Also, I'd be concerned for the health of a project if one group was
> really doing nearly all of the contribution.
> 
>> http://www.hortonworks.com/the-yahoo-effect/
> 
> Hmm.  Lines of code are not proportional to effort.  The stacked
> cumulative histogram makes slopes steeper for those who happen to be on
> top.  And the codebase did not start from zero in 2006.
> 
> Here are some other reports for 0.23 that make contribution look pretty
> diverse and healthy.
> 
> http://s.apache.org/tI
> http://s.apache.org/9mp
> http://s.apache.org/6E2
> 
> Statistics are fun, aren't they!
> 
> Doug



Re: Update on hadoop-0.23

2011-09-30 Thread Konstantin Shvachko
On Thu, Sep 29, 2011 at 10:27 PM, Eric Baldeschwieler
 wrote:
> Hi Doug, Jeff, Roman
>
> I'd like to request that folks take bigtop project discussions onto
> the bigtop lists and don't greet status reports on general@hadoop

I am personally very interested in the results of testing of 0.22 with
BigTop, or other tools, or without any tools.
So I'd like to ask (rather than request) good people to continue
posting your findings on the general@hadoop list.

Eric, thank you for your continuous contributions to Apache Hadoop.

I also think that general@hadoop is the right place to discuss
inter-project issues like making HBase, Pig, Hive,
working on Hadoop 0.22 and 0.23. Where else?

Thanks,
--Konstantin


Re: Update on hadoop-0.23

2011-09-30 Thread Steve Loughran

On 29/09/2011 22:38, Doug Cutting wrote:
> The ASF discourages the use of Java's @author tag in large part because
> it tends to mark code as the territory of particular contributors.  We
> want ASF codebases to be the responsibility of an entire community.

There's another reason which is if you have your name next to some code, 
you get emails asking about it for the rest of your life. Anonymity 
offers deniability


Re: Update on hadoop-0.23

2011-09-30 Thread Steve Loughran

On 30/09/2011 06:27, Eric Baldeschwieler wrote:
> Hi Doug, Jeff, Roman
>
> Let me rephrase my point.  I'd like to request that folks take bigtop project 
> discussions onto the bigtop lists and don't greet status reports on 
> general@hadoop with insinuations that folks who are working really hard on this 
> project should be contributing different things to another project or are 
> somehow misbehaving by testing on their own infrastructure with their own 
> users.  Any kind of testing is a gift to the community and adds value.  You are 
> all welcome to contribute too.  If you find issues, then file JIRAs and work on 
> the appropriate project lists.  I believe that observing these points of 
> etiquette will help this project continue to prosper.


Bigtop is an attempt to have a coherent test & release process, with 
full stack testing, release artifacts tested on a set of platforms, and 
a codebase that has matured out of cloudera. I don't care about origin, 
all I want is consistent releases of compatible artifacts -and the 
testing to back up the claims of compatibility. The artifacts should be 
those things people install -RPMs, debs- ideally the tests should start 
on small clusters, then scale up to production size before release.


There are things happening in the hadoop core that mimic some of the 
features here -RPMs- but appear to be lacking the full stack functional 
testing which is a goal of bigtop.




> I agree with you that the Hadoop project is healthy.


How do you define health in this context?

1. There is a 0.20.20x branch that is the one people use in production 
-the stable one. The API is behind the 0.21+ feature set, and so is less 
convenient to code against. It picks up features as well as fixes, which 
I find troublesome. You don't see new features going into RHEL5.x, 
Ubuntu LTS releases. Yes, I know users like those features, but it could 
be due to a slow release of new versions that they trust to work and 
preserve data. It's healthy, but the backport of features creates inertia.


2. There is the 0.23 branch that everyone -especially Arun- is working 
on, which is really promising, though some of the features (federation, 
YARN) are going to be fairly traumatic in rollout. That doesn't mean 
they aren't good, only that switching to them will have surprises.


3. There's 0.22 which is going to combine the API of 0.21 with the fixes 
of 0.20.20x *and* will be the last release of the MR1.0 engine. For that 
last reason, I think there's value in pushing it out, though it's going 
to take time, and there's a risk of it adding another branch to be 
maintained for an indeterminate period.


4. There are the third party "compatible" projects, CDH, MapR, EMC HD, 
Amazon Elastic MR, which are all declaring compatibility with 0.20.x; no 
stated plans when/how to move to 0.23+


I would say Hadoop is incredibly successful -it's generating lots of 
interest, is being used by big companies, it has almost singlehandedly 
revitalised server-side Java dev, it is the foundation for an OSS 
version of the MS Azure stack. But for that latter goal to be achieved 
-it's what I want- we need to move forward on releases where the entire 
stack is consistent, releases that people want to use.


For that consistency, I'd like bigtop to be a subject people can talk 
about here, just like MRUnit, which will be needed now that 0.23+ removes 
the MiniMRCluster feature.


-steve


Re: Update on hadoop-0.23

2011-09-30 Thread Milind.Bhandarkar

>3. There's 0.22 which is going to combine the API of 0.21 with the fixes
>of 0.20.20x *and* will be the last release of the MR1.0 engine. For that
>last reason, I think there's value in pushing it out, though it's going
>to take time, and there's a risk of it adding another branch to be
>maintained for an indeterminate period.

+1

- milind

---
Milind Bhandarkar
Greenplum Labs, EMC
(Disclaimer: Opinions expressed in this email are those of the author, and
do not necessarily represent the views of any organization, past or
present, the author might be affiliated with.)



Re: Update on hadoop-0.23

2011-09-30 Thread Andrew Purtell
This time it seems easy to split the difference here.

- Sufficient interest in Bigtop so announcements and discussions can/should go 
to general@.*

- There is no need to (and a request not to) inject exhortations to participate 
in Bigtop into random other topics on general@, such as status reports by 
another project or group. Simply create new threads to discuss Bigtop matters.

* - Seems to me a community effort to qualify an integrated stack top to bottom 
is a good thing, but I question doing this for 0.22, which nobody is going to 
use much, or so I hear.

Best regards,


    - Andy


Problems worthy of attack prove their worth by hitting back. - Piet Hein (via 
Tom White)


>
>From: Konstantin Shvachko 
>To: general@hadoop.apache.org
>Sent: Friday, September 30, 2011 2:23 AM
>Subject: Re: Update on hadoop-0.23
>
>On Thu, Sep 29, 2011 at 10:27 PM, Eric Baldeschwieler
> wrote:
>> Hi Doug, Jeff, Roman
>>
>> I'd like to request that folks take bigtop project discussions onto
>> the bigtop lists and don't greet status reports on general@hadoop
>
>I am personally very interested in the results of testing of 0.22 with
>BigTop, or other tools, or without any tools.
>So I'd like to ask (rather than request) good people to continue
>posting your findings on the general@hadoop list.
>
>Eric, thank you for your continuous contributions to Apache Hadoop.
>
>I also think that general@hadoop is the right place to discuss
>inter-project issues like making HBase, Pig, Hive,
>working on Hadoop 0.22 and 0.23. Where else?
>
>Thanks,
>--Konstantin
>
>
>

Re: Update on hadoop-0.23

2011-09-30 Thread Matt Foley
>> Sufficient interest in Bigtop so announcements and discussions can/should
>> go to general@.*

Why wouldn't this work like other projects, and Bigtop discussions go to the
Bigtop mailing lists?  (Announcements, of course, do belong on general.)
People interested in Bigtop discussions sign up for the Bigtop mailing
lists.

I'm going to go do that right now, since I am. :-)

--Matt


On Fri, Sep 30, 2011 at 9:34 AM, Andrew Purtell  wrote:

> This time it seems easy to split the difference here.
>
> - Sufficient interest in Bigtop so announcements and discussions can/should
> go to general@.*
>
> - There is no need to (and a request not to) inject exhortations to
> participate in Bigtop into random other topics on general@, such as status
> reports by another project or group. Simply create new threads to discuss
> Bigtop matters.
>
> * - Seems to me a community effort to qualify an integrated stack top to
> bottom is a good thing, but I question doing this for 0.22, which nobody is
> going to use much, or so I hear.
>
> Best regards,
>
>
> - Andy
>
>
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)
>
>
> >
> >From: Konstantin Shvachko 
> >To: general@hadoop.apache.org
> >Sent: Friday, September 30, 2011 2:23 AM
> >Subject: Re: Update on hadoop-0.23
> >
> >On Thu, Sep 29, 2011 at 10:27 PM, Eric Baldeschwieler
> > wrote:
> >> Hi Doug, Jeff, Roman
> >>
> >> I'd like to request that folks take bigtop project discussions onto
> >> the bigtop lists and don't greet status reports on general@hadoop
> >
> >I am personally very interested in the results of testing of 0.22 with
> >BigTop, or other tools, or without any tools.
> >So I'd like to ask (rather than request) good people to continue
> >posting your findings on the general@hadoop list.
> >
> >Eric, thank you for your continuous contributions to Apache Hadoop.
> >
> >I also think that general@hadoop is the right place to discuss
> >inter-project issues like making HBase, Pig, Hive,
> >working on Hadoop 0.22 and 0.23. Where else?
> >
> >Thanks,
> >--Konstantin
> >
> >
> >
>


Re: Update on hadoop-0.23

2011-09-30 Thread Doug Cutting
On 09/30/2011 03:17 AM, Steve Loughran wrote:
> 4. There are the third party "compatible" projects, CDH, MapR, EMC HD,
> Amazon Elastic MR, which are all declaring compatibility with 0.20.x; no
> stated plans when/how to move to 0.23+

CDH4 will include 0.23 (hopefully without any patches).  I expect to see
an alpha release of CDH4 late this year and a production-ready release
early next year.

Doug


Re: Update on hadoop-0.23

2011-09-30 Thread Arun C Murthy

On Sep 30, 2011, at 3:17 AM, Steve Loughran wrote:
> 
> 3. There's 0.22 which is going to combine the API of 0.21 with the fixes of 
> 0.20.20x *and* will be the last release of the MR1.0 engine. For that last 
> reason, I think there's value in pushing it out, though it's going to take 
> time, and there's a risk of it adding another branch to be maintained for an 
> indeterminate period.

I'm all for people working on what they are passionate about, so this isn't to 
say one shouldn't spend time on 0.22.

But, for clarity's sake, as I've done multiple times on both the list and in 
person to Konstantin etc., I'll point out (again) that 0.22 will need multiple 
man-years of development to achieve parity with 0.20.2xx just in terms of 
bug-fixes and performance. Then there is security, multi-tenancy etc. which 
regress significantly vis-a-vis 0.20.2xx. Then there is scaling etc.

0.23 is already past all of these hurdles and very close to meeting, if not 
beating 0.20.2xx in performance. It already beats 0.20.2xx in lots of 
dimensions (improved shuffle with zero-copy etc.).

So, unless folks plan to invest this gargantuan time, please do not say that 
0.22 has fixes from 0.20.2xx. That's all I ask. Thus, 0.20.2xx may well be the 
last _viable_ release of MR1 engine.

Arun



Re: Update on hadoop-0.23

2011-09-30 Thread Roman Shaposhnik
Hi Arun!

On Fri, Sep 30, 2011 at 10:54 AM, Arun C Murthy  wrote:
> I'm all for people working on what they are passionate about, so this isn't 
> to say one shouldn't spend time on 0.22.
>
> But, for clarity's sake, as I've done multiple times on both the list and in 
> person to Konstantin etc.,
> I'll point out (again) that 0.22 will need multiple man-years of development 
> to achieve parity with
> 0.20.2xx just in terms of bug-fixes and performance.

I apologize if my level of institutional knowledge of these things is
lacking, but do you have any benchmarking results between 0.22 and
0.20.2xx? The reason I'm asking is twofold -- I really would like to see
objective numbers qualifying the viability of 0.22 from the performance
standpoint, but more importantly I would really like to include the
benchmarking code in Bigtop.

In terms of bugs -- same question. Is there any publicly available
list of, at least, the critical ones that make 0.22 not viable from
your point of view?

Thanks,
Roman.


Re: Update on hadoop-0.23

2011-09-30 Thread Todd Lipcon
On Fri, Sep 30, 2011 at 11:44 AM, Roman Shaposhnik  wrote:
> I apologize if my level of institutional knowledge of these things is
> lacking, but do you have any benchmarking results between 0.22 and
> 0.20.2xx? The reason I'm asking is twofold -- I really would like to see
> objective numbers qualifying the viability of 0.22 from the performance
> standpoint, but more importantly I would really like to include the
> benchmarking code in Bigtop.

0.22 currently suffers from MAPREDUCE-2266, which, last time I
benchmarked it, caused a significant slowdown. iirc a terasort ran
something like twice as slow on my test cluster due to this bug.
0.23/MR2 doesn't suffer from this bug.

-Todd
-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Update on hadoop-0.23

2011-09-30 Thread Arun C Murthy

On Sep 30, 2011, at 1:13 PM, Todd Lipcon wrote:

> On Fri, Sep 30, 2011 at 11:44 AM, Roman Shaposhnik  wrote:
>> I apologize if my level of institutional knowledge of these things is
>> lacking, but do you have any benchmarking results between 0.22 and
>> 0.20.2xx? The reason I'm asking is twofold -- I really would like to see
>> objective numbers qualifying the viability of 0.22 from the performance
>> standpoint, but more importantly I would really like to include the
>> benchmarking code in Bigtop.
> 
> 0.22 currently suffers from MAPREDUCE-2266, which, last time I
> benchmarked it, caused a significant slowdown. iirc a terasort ran
> something like twice as slow on my test cluster due to this bug.
> 0.23/MR2 doesn't suffer from this bug.
> 

I don't really know where to start. CHANGES.txt in branch-0.20-security has the 
full list.

If I remember right, long ago (late 2009) we benchmarked .21 with gridmix and 
saw a >30% regression prior to abandoning .21.

Since then 0.20.2xx has had innumerable improvements to JobTracker, TaskTracker 
etc. etc. 
# JobTracker itself is almost thrice as fast as it used to be in 2009.
# The scheduler is significantly better on locality (>2x) and throughput.
# TaskTracker has had innumerable fixes for dist.cache, task launch, shutdown 
(MR-2266 and lots of other similar fixes). 
# The MR runtime has fixes for latency on innumerable fronts.

Other regressions:
# Security
# Support for multi-tenant clusters.
# Tonnes of operability fixes (jobhistory, task logs i.e. MR-1100) for running 
MR clusters.

The one redeeming aspect for .22 is the shuffle based on the work we did for 
winning Terasort/Petasort in 2009 but 0.23 has even more work there with 
zero-copy with netty (yaay! no more jetty! Thanks to @cdouglas).

> In terms of bugs -- same question. Is there any publicly available
> list of, at least, the critical
> ones that make 0.22 not viable from your point of view?

We marked a lot of them as blockers on .22 and they were discarded by the 
release master(s). branch-0.20-security/CHANGES.txt is the full list. I really 
can't spend time enumerating over 4000 commits and > 2000 (?) jiras to that 
branch at this point.

In my opinion, as someone who has helped develop/run/support very large 
installs and done this for over 5 1/2 years, a major release with regression on 
features (security, multi-tenancy) and scalability, performance etc. is 
distinctly _unviable_.



Again, none of this is meant to say you should invest time on fixing them or 
releasing 0.22 as it stands - just, please, don't label it in a manner which 
helps build unreasonable expectations among users about its viability & 
usability.

thanks,
Arun



Re: Update on hadoop-0.23

2011-09-30 Thread Konstantin Boudnik
BTW, Roman, I have a recollection of a Hadoop performance suite for iTest (aka
BigTop now) which I put together during the initial development phase of
the framework.

I don't see it in BigTop's source tree - has this work ever been committed to
open source along with iTest? That way, if any benchmarking tests for 0.22
(or 0.20.2xx) get opened up, we can put them into the same container,
conditional on their current copyright holder letting them see the light of
day.

With regards,
  Cos

Disclaimer: apologies for the seemingly off-topic discussion, but this is about
Hadoop performance testing, so bringing up BigTop in this frame of reference
looks completely justifiable.

On Fri, Sep 30, 2011 at 11:44AM, Roman Shaposhnik wrote:
> Hi Arun!
> 
> On Fri, Sep 30, 2011 at 10:54 AM, Arun C Murthy  wrote:
> > I'm all for people working on what they are passionate about, so this isn't 
> > to say one shouldn't spend time on 0.22.
> >
> > But, for clarity's sake, as I've done multiple times on both the list and 
> > in person to Konstantin etc.,
> > I'll point out (again) that 0.22 will need multiple man-years of 
> > development to achieve parity with
> > 0.20.2xx just in terms of bug-fixes and performance.
> 
> I apologize if my level of institutional knowledge of these things is
> lacking, but do you have any
> benchmarking results between 0.22 and 0.20.2xx? The reason I'm asking
> is twofold -- I really
> would like to see an objective numbers qualifying the viability of
> 0.22 from the performance stand point,
> but more importantly I would really like to include the benchmarking
> code into Bigtop.
> 
> In terms of bugs -- same question. Is there any publicly available
> list of, at least, the critical
> ones that make 0.22 not viable from your point of view?
> 
> Thanks,
> Roman.


Re: Update on hadoop-0.23

2011-10-01 Thread Konstantin Shvachko
On Fri, Sep 30, 2011 at 9:34 AM, Andrew Purtell  wrote:
> * - Seems to me a community effort to qualify an integrated stack top to 
> bottom is a good thing, but I question doing this for 0.22, which nobody is 
> going to use much, or so I hear.
>

Andrew, what I hear here in the East Bay is that my 500 users will.
I also hear that if 0.22 was available people would use it now.

Thanks,
Konstantin

>>
>>From: Konstantin Shvachko 
>>To: general@hadoop.apache.org
>>Sent: Friday, September 30, 2011 2:23 AM
>>Subject: Re: Update on hadoop-0.23
>>
>>On Thu, Sep 29, 2011 at 10:27 PM, Eric Baldeschwieler
>> wrote:
>>> Hi Doug, Jeff, Roman
>>>
>>> I'd like to request that folks take bigtop project discussions onto
>>> the bigtop lists and don't greet status reports on general@hadoop
>>
>>I am personally very interested in the results of testing of 0.22 with
>>BigTop, or other tools, or without any tools.
>>So I'd like to ask (rather than request) good people to continue
>>posting your findings on the general@hadoop list.
>>
>>Eric, thank you for your continuous contributions to Apache Hadoop.
>>
>>I also think that general@hadoop is the right place to discuss
>>inter-project issues like making HBase, Pig, Hive,
>>working on Hadoop 0.22 and 0.23. Where else?
>>
>>Thanks,
>>--Konstantin
>>
>>
>>


Re: Update on hadoop-0.23

2011-10-01 Thread Konstantin Shvachko
I am very glad that the development and testing of 0.23 is going so well.
I see a lot of commits and hundreds of changes going in literally every day.
It is great to see the new technology building!

On the criticism of the 0.22 release:
Arun has a top-down view and I agree a lot of progress has been
achieved with the framework.
My bottom-up view is that you first need a reliable storage layer. If
the file system loses blocks or, worse, messes up the image and/or
journals, the performance of the framework is your second problem. I
have said that before. Based on my experience it takes time to
stabilize a file system. Has anybody seen one that has been stabilized in
less than 2 years?
I do not see the 0.22 release as a wasted effort. And if the progress
with it contributes to the 0.23 rush I am twice as happy.

Thanks,
--Konstantin

On Fri, Sep 30, 2011 at 3:00 PM, Arun C Murthy  wrote:
>
> On Sep 30, 2011, at 1:13 PM, Todd Lipcon wrote:
>
>> On Fri, Sep 30, 2011 at 11:44 AM, Roman Shaposhnik  wrote:
>>> I apologize if my level of institutional knowledge of these things is
>>> lacking, but do you have any
>>> benchmarking results between 0.22 and 0.20.2xx? The reason I'm asking
>>> is twofold -- I really
>>> would like to see an objective numbers qualifying the viability of
>>> 0.22 from the performance stand point,
>>> but more importantly I would really like to include the benchmarking
>>> code into Bigtop.
>>
>> 0.22 currently suffers from MAPREDUCE-2266, which, last time I
>> benchmarked it, caused a significant slowdown. iirc a terasort ran
>> something like twice as slow on my test cluster due to this bug.
>> 0.23/MR2 doesn't suffer from this bug.
>>
>
> I don't really know where to start. CHANGES.txt in branch-0.20-security has 
> the full list.
>
> If I remember right, long ago (late 2009)  we benchmarked .21 with gridmix 
> and saw >30% prior to abandoning .21.
>
> Since then 0.20.2xx has had innumerable improvements to JobTracker, 
> TaskTracker etc. etc.
> # JobTracker itself is almost thrice as fast as it used to be in 2009.
> # The scheduler is significantly better (>2x locality) and throughput.
> # TaskTracker has had innumerable fixes for dist.cache, task launch, shutdown 
> (MR-2266 and lots of other similar fixes).
> # The MR runtime has fixes for latency on innumerable fronts.
>
> Other regressions:
> # Security
> # Support for multi-tenant clusters.
> # Tonnes of operability fixes (jobhistory, task logs i.e. MR-1100) for 
> running MR clusters.
>
> The one redeeming aspect for .22 is the shuffle based on the work we did for 
> winning Terasort/Petasort in 2009 but 0.23 has even more work there with 
> zero-copy with netty (yaay! no more jetty! Thanks to @cdouglas).
>
>> In terms of bugs -- same question. Is there any publicly available
>> list of, at least, the critical
>> ones that make 0.22 not viable from your point of view?
>
> We marked a lot of them as blockers on .22 and they were discarded by the 
> release master(s). branch-0.20-security/CHANGES.txt is the full list. I 
> really can't spend time enumerating over 4000 commits and > 2000 (?) jiras to 
> that branch at this point.
>
> In my opinion, as someone who has helped develop/run/support very large 
> installs and done this for over 5 1/2 years, a major release with regression 
> on features (security, multi-tenancy) and scalability, performance etc. is 
> distinctly _unviable_.
>
> 
>
> Again, none of this is meant to say you should invest time on fixing them or 
> releasing 0.22 as it stands - just, please, don't label it in a manner which 
> helps build unreasonable expectations among users about it's viability & 
> usability.
>
> thanks,
> Arun
>
>


Re: Update on hadoop-0.23

2011-10-03 Thread Eric Baldeschwieler
Thanks Andy,

I think this is a clear summary of what would be a good outcome.  

I would suggest that detailed bigtop discussions should go to bigtop, but 
status updates are undoubtedly interesting to this audience.

But I do request that folks not "inject exhortations to participate in Bigtop 
into random other topics on general@, such as status reports by another project 
or group".

E14

On Sep 30, 2011, at 9:34 AM, Andrew Purtell wrote:

> This time it seems easy to split the difference here.
> 
> - Sufficient interest in Bigtop so announcements and discussions can/should 
> go to general@.*
> 
> - There is no need to (and a request not to) inject exhortations to 
> participate in Bigtop into random other topics on general@, such as status 
> reports by another project or group. Simply create new threads to discuss 
> Bigtop matters.
> 
> * - Seems to me a community effort to qualify an integrated stack top to 
> bottom is a good thing, but I question doing this for 0.22, which nobody is 
> going to use much, or so I hear.
> 
> Best regards,
> 
> 
> - Andy
> 
> 
> Problems worthy of attack prove their worth by hitting back. - Piet Hein (via 
> Tom White)
> 
> 
>> 
>> From: Konstantin Shvachko 
>> To: general@hadoop.apache.org
>> Sent: Friday, September 30, 2011 2:23 AM
>> Subject: Re: Update on hadoop-0.23
>> 
>> On Thu, Sep 29, 2011 at 10:27 PM, Eric Baldeschwieler
>>  wrote:
>>> Hi Doug, Jeff, Roman
>>> 
>>> I'd like to request that folks take bigtop project discussions onto
>>> the bigtop lists and don't greet status reports on general@hadoop
>> 
>> I am personally very interested in the results of testing of 0.22 with
>> BigTop, or other tools, or without any tools.
>> So I'd like to ask (rather than request) good people to continue
>> posting your findings on the general@hadoop list.
>> 
>> Eric, thank you for your continuous contributions to Apache Hadoop.
>> 
>> I also think that general@hadoop is the right place to discuss
>> inter-project issues like making HBase, Pig, Hive,
>> working on Hadoop 0.22 and 0.23. Where else?
>> 
>> Thanks,
>> --Konstantin
>> 
>> 



Re: Update on hadoop-0.23

2011-10-17 Thread Ted Yu
On behalf of Harsh w.r.t. HBASE-4510 HDFS-1620 related changes downstream
(for compiling with HDFS 0.23+)

I need to open up some HDFS jiras and change the way HBase uses safemode
determinism in the meanwhile.

Cheers

On Mon, Oct 17, 2011 at 10:17 AM, Arun C Murthy  wrote:

> Folks,
>
>  Quick note - the dev community continues to scramble to get things wrapped
> up on hadoop-0.23.
>
>  We are down to ~30 blockers and I hope to see them resolved over the next
> two weeks!
>
>  Also, I feel Alejandro and Tom can finish up the remaining mavenization
> bits by then too - as I see it, it's very close... thanks guys!
>
>  Once done, I plan to call a vote on a hadoop-0.23.0 which we can start
> deploying (and further stabilizing) right-away.
>
>  My hope is that hadoop-0.23.0 is a strong alpha which we can then beat
> into shape after, the idea is to ship soon so we get folks to play with it
> and help downstream projects to integrate for e.g. Pig already works, and I
> know Todd is working on getting HBase to play well too.
>
> thanks,
> Arun
>
>


Re: Update on hadoop-0.23

2011-10-18 Thread Steve Loughran

On 17/10/11 18:17, Arun C Murthy wrote:

> Folks,
>
>  Quick note - the dev community continues to scramble to get things wrapped
> up on hadoop-0.23.
>
>  We are down to ~30 blockers and I hope to see them resolved over the next
> two weeks!
>
>  Also, I feel Alejandro and Tom can finish up the remaining mavenization
> bits by then too - as I see it, it's very close... thanks guys!
>
>  Once done, I plan to call a vote on a hadoop-0.23.0 which we can start
> deploying (and further stabilizing) right-away.
>
>  My hope is that hadoop-0.23.0 is a strong alpha which we can then beat
> into shape after, the idea is to ship soon so we get folks to play with it
> and help downstream projects to integrate for e.g. Pig already works, and I
> know Todd is working on getting HBase to play well too.



This is good, but I can see enough changes that we will need broad 
testing to be confident there is no regression.


-I propose that a "pre-alpha" is done ASAP, to test the release process 
and to give people playing with YARN and the MR engine, or writing tools, 
something more stable than SNAPSHOT to play with, then maybe a 
fast 2-4 cycle of alpha releases for a bit.


-I can add the JIRA release numbers if you give me a list.


-Where do you think the troublespots for deployment and regressions will be?

  -Anything that uses MiniMRCluster is going to go, and the migration 
strategy needs to be on the wiki (I can help there once I know what to do)
  -HBase, Hama, bigtop, MRUnit should all be pulled into the release 
process as part of the regression tests
  -It'd be good for people doing in-cluster tests to document cluster 
size, network config etc. so we can identify what works & what doesn't, 
though as that relies on people discussing their cluster details it may be 
a bit patchy.
  -HDFS migration; there really needs to be a way to test FS upgrades 
from various Hadoop versions, including Cloudera's - upgrades with 
entries in the edit log to replay





Re: Update on hadoop-0.23

2011-10-18 Thread Steve Loughran

On 17/10/11 18:17, Arun C Murthy wrote:


One more thing: are the ProtocolBuffers needed for all installations, or 
is that a compile-time requirement? If the binaries are going to be 
required, there's going to have to be one built for the various 
platforms, and source.deb/RPM files to build themselves on Linux. I'd 
rather avoid all that work


Re: Update on hadoop-0.23

2011-10-18 Thread Todd Lipcon
On Tue, Oct 18, 2011 at 4:36 AM, Steve Loughran  wrote:
>
> One more thing: are the ProtocolBuffers needed for all installations, or is
> that a compile-time requirement? If the binaries are going to be required,
> there's going to have to be one built for the various platforms, and
> source.deb/RPM files to build themselves on Linux. I'd rather avoid all that
> work

The protobuf java jar is required at runtime. protoc (native) is only
required at compile time.

-Todd
-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Update on hadoop-0.23

2011-10-18 Thread Harsh J
HBase trunk compiles with 0.23 now, after Todd's work on it -- I've
updated https://issues.apache.org/jira/browse/HBASE-4510 with further
details.

On Tue, Oct 18, 2011 at 1:57 AM, Ted Yu  wrote:
> On behalf of Harsh w.r.t. HBASE-4510 HDFS-1620 related changes downstream
> (For compiling with HDFS
> 0.23+)
>
> I need to open up some HDFS jiras and change the way HBase uses safemode
> determinism in the meanwhile.
>
> Cheers
>
> On Mon, Oct 17, 2011 at 10:17 AM, Arun C Murthy  wrote:
>
>> Folks,
>>
>>  Quick note - the dev community continues to scramble to get things wrapped
>> up on hadoop-0.23.
>>
>>  We are down to ~30 blockers and I hope to see them resolved over the next
>> two weeks!
>>
>>  Also, I feel Alejandro and Tom can finish up the remaining mavenization
>> bits by then too - as I see it, it's very close... thanks guys!
>>
>>  Once done, I plan to call a vote on a hadoop-0.23.0 which we can start
>> deploying (and further stabilizing) right-away.
>>
>>  My hope is that hadoop-0.23.0 is a strong alpha which we can then beat
>> into shape after, the idea is to ship soon so we get folks to play with it
>> and help downstream projects to integrate for e.g. Pig already works, and I
>> know Todd is working on getting HBase to play well too.
>>
>> thanks,
>> Arun
>>
>>
>



-- 
Harsh J


Re: Update on hadoop-0.23

2011-10-19 Thread Steve Loughran

On 19/10/11 00:40, Todd Lipcon wrote:

> On Tue, Oct 18, 2011 at 4:36 AM, Steve Loughran  wrote:
>>
>> One more thing: are the ProtocolBuffers needed for all installations, or is
>> that a compile-time requirement? If the binaries are going to be required,
>> there's going to have to be one built for the various platforms, and
>> source.deb/RPM files to build themselves on Linux. I'd rather avoid all that
>> work
>
> The protobuf java jar is required at runtime. protoc (native) is only
> required at compile time.



OK, I've added notes on this in the wiki; please review and correct 
where I have fundamental misunderstandings.


http://wiki.apache.org/hadoop/ProtocolBuffers