For closure, this vote fails due to a couple of binding -1 votes.

Nige

On Feb 18, 2011, at 4:46 AM, Eric Baldeschwieler wrote:

> Hi Bernd,
> 
> Apache Hadoop is about scale. Most clusters will always be small, but Hadoop 
> is going mainstream precisely because it scales to huge data and cluster 
> sizes. 
> 
> There are lots of systems that work well on 10-node clusters. People select
> Hadoop because they are confident that as their business / problem grows,
> Hadoop can grow with it.
> 
> ---
> E14 - via iPhone
> 
> On Feb 17, 2011, at 7:25 AM, "Bernd Fondermann" 
> <bernd.fonderm...@googlemail.com> wrote:
> 
>> On Thu, Feb 17, 2011 at 14:58, Ian Holsman <had...@holsman.net> wrote:
>>> Hi Bernd.
>>> 
>>> On Feb 17, 2011, at 7:43 AM, Bernd Fondermann wrote:
>>>> 
>>>> We have the very unfortunate situation here at Hadoop where Apache
>>>> Hadoop is not the primary and foremost place of Hadoop development.
>>>> Instead, code is developed internally at Yahoo and then contributed in
>>>> (smaller or larger) chunks to Hadoop.
>>> 
>>> This has been the situation in the past,
>>> but as you can see in the last month, this has changed.
>>> 
>>> Yahoo! has publicly committed to moving their development into the main code
>>> base, and you can see they have started doing this with the 20.100 branch
>>> and their recent commits to trunk.
>>> Combine this with Nige taking on the 0.22 release branch (and shepherding
>>> it into a stable release), and I think we are addressing your concerns.
>>> 
>>> They have also started bringing the discussions back to the list; see the
>>> recent Jobtracker-nextgen discussion Arun has re-started in
>>> MAPREDUCE-279.
>>> 
>>> I'm not saying it's perfect, but I think the major players understand there 
>>> is an issue, and they are *ALL* moving in the right direction.
>> 
>> I would enthusiastically like to see your optimism verified.
>> Maybe I'm misreading the statements issued publicly, but I don't think
>> that this is fully understood. I agree, though, that it's a move in
>> the right direction.
>> 
>>>> This is open source development upside down.
>>>> It is not OK for people to diff ASF svn against their internal code
>>>> and provide the diff as a patch without first reviewing the IP of every
>>>> line of code changed.
>>>> For larger chunks I'd suggest even going through the Incubator IP clearance
>>>> process.
>>>> Only then will we force committers to primarily work here in the open
>>>> and return to what I'd consider a healthy project.
>>>> 
>>>> To be honest: Hadoop is in the process of falling apart.
>>>> Contrib code gets moved out of Apache instead of being maintained here.
>>>> Discussions are seldom consensus-driven.
>>>> Release branches stagnate.
>>> 
>>> True. Releases do take a long time. This is mainly due to it being
>>> extremely hard to test and verify that a release is stable.
>>> It's not enough to just run the thing on 4 machines; you need at least 50
>>> to test some of the major problems. This requires serious $ from
>>> someone to verify a release.
>> 
>> It has been proposed on the list before, IIRC. Don't know how to get
>> there, but the project seriously needs access to a cluster of this
>> size.
>> 
>>>> Downstream projects like HBase don't get proper support.
>>>> Production setups are built from 3rd-party distributions.
>>>> Development is not happening here, but elsewhere behind corporate doors.
>>>> Discussions about future developments are started on corporate blogs (
>>>> http://developer.yahoo.com/blogs/hadoop/posts/2011/02/mapreduce-nextgen/
>>>> ) instead of on the proper mailing list.
>>>> Hurdles for committing are way too high.
>>>> On the bright side, new committers and PMC members are being added; this
>>>> is an improvement.
>>>> 
>>>> I'd suggest moving away from relying on large code dumps from
>>>> corporations, and moving back to the ASF-proven "individual committer
>>>> commits on trunk" model, where more committers can get involved.
>>>> If that means not supporting high-end cluster sizes for some months,
>>>> well, so be it.
>>> 
>>>> Average committers cannot run - e.g. test - on high-end
>>>> cluster sizes. If that means they cannot participate, then
>>>> the open source project had better concentrate on small and medium-sized
>>>> clusters instead.
>>> 
>>> 
>>> Well, that's one approach, but there are several companies out there who
>>> rely on Apache's Hadoop to power their large clusters, so I'd hate to see
>>> Hadoop become something that only runs well on
>>> 10 nodes, as I don't think that will help anyone either.
>> 
>> But looking only at high-end scale doesn't help either.
>> 
>> Let's face the fact that Hadoop is now moving from the early-adopter phase
>> into a much broader market. I predict that small to medium-sized
>> clusters will be the majority of Hadoop deployments in a few months'
>> time. 4000, or even 500, machines is the high-end range. If the open
>> source project Hadoop cannot support those users adequately (without
>> becoming defunct), the committership might be better off focusing on
>> the low-end and medium-sized users.
>> 
>> I'm not suggesting we turn away from the handful (?) of high-end
>> users. They certainly provide most valuable input. But also, *they*
>> obviously have the resources, in terms of larger clusters and
>> developers, to deal with their specific setups. Obviously, they don't
>> need to rely on the open source project to make releases. In fact,
>> they *do* work on their own Hadoop derivatives.
>> All the other users, the hundreds of boring small-cluster users, don't
>> have that choice. They *depend* on the open source releases.
>> 
>> Hadoop is an Apache project, meant to provide HDFS and MR free of charge to
>> the general public. Not just to me, and not just to one or two big
>> companies either.
>> Focus on all the users.
>> 
>> Bernd
