Keith Chen is out of office for Spring Festival holiday

2011-02-01 Thread Qi BJ Chen

I will be out of the office starting  2011-02-02 and will not return until
2011-02-09.


Re: [ANNOUNCEMENT] Yahoo focusing on Apache Hadoop, discontinuing "The Yahoo Distribution of Hadoop"

2011-02-01 Thread Todd Papaioannou
Yes. We have been and continue to be firm believers in Apache and the value of 
Open Source software, as you can see from our track record to date of 
contributing heavily to Hadoop and donating Pig, ZooKeeper, Avro, etc. We are 
excited about their potential and we hope others will find them useful too.

ToddP

On 1/31/11 7:44 PM, "Jeff Hammerbacher" <ham...@cloudera.com> wrote:

Excellent news! Will you also make Howl, Oozie, and Yarn Apache projects as
well?

On Mon, Jan 31, 2011 at 7:27 PM, Eric Baldeschwieler
<eri...@yahoo-inc.com> wrote:

Hi Folks,

I'm pleased to announce that after some reflection, Yahoo! has decided to
discontinue "The Yahoo Distribution of Hadoop" and focus on Apache
Hadoop.  We plan to remove all references to a Yahoo distribution from our
website (developer.yahoo.com/hadoop), close our github repo (
yahoo.github.com/hadoop-common) and focus on working more closely with the
Apache community.  Our intent is to return to helping Apache produce binary
releases of Apache Hadoop that are so bullet proof that Yahoo and other
production Hadoop users can run them unpatched on their clusters.

Until Hadoop 0.20, Yahoo committers worked as release masters to produce
binary Apache Hadoop releases that the entire community used on their
clusters.  As the community grew, we experimented with using the
"Yahoo! Distribution of Hadoop" as the vehicle to share our work.
  Unfortunately, Apache is no longer the obvious place to go for Hadoop
releases.  The Yahoo! team wants to return to a world where anyone can
download and directly use releases of Hadoop from Apache.  We want to
contribute to the stabilization and testing of those releases.  We also want
to share our regular program of sustaining engineering that backports minor
feature enhancements into new dot releases on a regular basis, so that the
world sees regular improvements coming from Apache every few months, not
years.

Recently the Apache Hadoop community has been very turbulent.  Over the
last few months we have been developing Hadoop enhancements in our internal
git repository while doing a complete review of our options. Our commitment
to open sourcing our work was never in doubt (see http://yhoo.it/e8p3Dd),
but the future of the "Yahoo distribution of Hadoop" was far from clear.
  We've concluded that focusing on Apache Hadoop is the way forward.  We
believe that more focus on communicating our goals to the Apache Hadoop
community, and more willingness to compromise on how we get to those goals,
will help us get back to making Hadoop even better.

Unfortunately, we now have to sort out how to contribute several
person-years worth of work to Apache to let us unwind the Yahoo! git
repositories.  We currently run two lines of Hadoop development, our
sustaining program (hadoop-0.20-sustaining) and hadoop-future.
  Hadoop-0.20-sustaining is the stable version of Hadoop we currently run on
Yahoo's 40,000 nodes.  It contains a series of fixes and enhancements that
are all backwards compatible with our "Hadoop 0.20 with security".  It is
our most stable and high performance release of Hadoop ever.  We've expended
a lot of energy finding and fixing bugs in it this year. We have initiated
the process of contributing this work to Apache in the branch:
hadoop/common/branches/branch-0.20-security.  We've proposed calling this
the 20.100 release.  Once folks have had a chance to try this out and we've
had a chance to respond to their feedback, we plan to create 20.100 release
candidates and ask the community to vote on making them Apache releases.

Hadoop-future is our new feature branch.  We are working on a set of new
features for Hadoop to improve its availability, scalability and
interoperability to make Hadoop more usable in mission critical deployments.
You're going to see another burst of email activity from us as we work to
get hadoop-future patches socialized, reviewed and checked in.  These bulk
checkins are exceptional.  They are the result of us striving to be more
transparent.  Once we've merged our hadoop-future and hadoop-0.20-sustaining
work back into Apache, folks can expect us to return to our regular
development cadence.  Looking forward, we plan to socialize our roadmaps
regularly, actively synchronize our work with other active Hadoop
contributors and develop our code collaboratively, directly in Apache.

In summary, our decision to discontinue the "Yahoo! Distribution of Hadoop"
is a commitment to working more effectively with the Apache Hadoop
community.  Our goal is to make Apache Hadoop THE open source platform for
big data.

Thanks,

E14

--

PS Here is a draft list of key features in hadoop-future:

* HDFS-1052 - Federation, the ability to support much more storage per
Hadoop cluster.

* HADOOP-6728 - A new metrics framework

* MAPREDUCE-1220 - Optimizations for small jobs

---
PPS This is cross-posted on our blog: http://yhoo.it/i9Ww8W



Re: [DISCUSS] Move common, hdfs, mapreduce contrib components to apache-extras.org or elsewhere

2011-02-01 Thread Todd Lipcon
On Tue, Feb 1, 2011 at 9:37 AM, Tom White  wrote:

>
> HBase moved all its contrib components out of the main tree a few
> months back - can anyone comment on how that worked out?
>
>
Sure. For each contrib:

ec2: no longer exists as an HBase contrib; it has been integrated into Whirr
and much improved. Whirr has made several releases in the time that HBase has
made one. The Whirr contributors know way more about cloud deployment than the
HBase contributors (except where they happen to overlap). Strong net
positive.

mdc_replication: pulled into core, since it's developed by core committers
and also needs a fair amount of tight integration with core components.

stargate: pulled into core - it was only in contrib as a sort of staging
ground - it's really an improved/new version of the "rest" interface we
already had in core.

transactional: moved to github - this has languished a bit on github because
only one person was actively maintaining it. However, it had already been
"languishing" as part of contrib - even though it compiled, it never really
worked very well in HBase trunk. So, moving it to a place where it continues
to languish has just made it more obvious what was already true - that it
isn't a well-supported component (yet). Recently it's been taken back up by
its author - if it develops a large user base it can move quickly and
evolve without waiting on our releases. Net: probably a wash.

So, overall, I'd say it was a good decision, though we never had the same
number of contribs that Hadoop seems to have sprouted.

-Todd


>
> On Tue, Feb 1, 2011 at 1:02 AM, Allen Wittenauer
>  wrote:
> >
> > On Jan 31, 2011, at 3:23 PM, Todd Lipcon wrote:
> >
> >> On Sun, Jan 30, 2011 at 11:19 PM, Owen O'Malley 
> wrote:
> >>
> >>>
> >>> Also note that pushing code out of Hadoop has a high cost. There are at
> >>> least 3 forks of the hadoop-gpl-compression code. That creates a lot of
> >>> confusion for the users. A lot of users never go to the work to figure
> out
> >>> which fork and branch of hadoop-gpl-compression work with the version
> of
> >>> Hadoop they installed.
> >>>
> >>>
> >> Indeed it creates confusion, but in my opinion it has been very
> successful
> >> modulo that confusion.
> >
> >I'm not sure how the above works with what you wrote below:
> >
> >> In particular, Kevin and I (who each have a repo on github but basically
> >> co-maintain a branch) have done about 8 bugfix releases of LZO in the
> last
> >> year. The ability to take a bug and turn it around into a release within
> a
> >> few days has been very beneficial to the users. If it were part of core
> >> Hadoop, people would be forced to live with these blocker bugs for
> months at
> >> a time between dot releases.
> >
> >So is the expectation that users would have to follow bread crumbs
> to the github dumping ground, then try to figure out which repo is the
> 'better' choice for their usage?   Using LZO as an example, it appears we
> have a choice of Kevin's, yours, or the master without even taking into
> consideration any tags. That sounds like a recipe for disaster that's even
> worse than what we have today.
> >
> >
> >> IMO the more we can take non-core components and move them to separate
> >> release timelines, the better. Yes, it is harder for users, but it also
> is
> >> easier for them when they hit a bug - they don't have to wait months for
> a
> >> wholesale upgrade which might contain hundreds of other changes to core
> >> components.
> >
> >I'd agree except for one thing:  even when users do provide
> patches to contrib components we ignore them.  How long have those patches
> for HOD been sitting there in the patch queue?  So of course they wait
> months/years--because we seemingly ignore anything that isn't important to
> us.  Unfortunately, that covers a large chunk of contrib. :(
> >
> >
> >
>



-- 
Todd Lipcon
Software Engineer, Cloudera


Re: [DISCUSS] Move common, hdfs, mapreduce contrib components to apache-extras.org or elsewhere

2011-02-01 Thread Todd Lipcon
On Tue, Feb 1, 2011 at 1:02 AM, Allen Wittenauer
wrote:

>
>
> So is the expectation that users would have to follow bread crumbs
> to the github dumping ground, then try to figure out which repo is the
> 'better' choice for their usage?   Using LZO as an example, it appears we
> have a choice of Kevin's, yours, or the master without even taking into
> consideration any tags. That sounds like a recipe for disaster that's even
> worse than what we have today.
>
>
Kevin's and mine are currently identical
(0e7005136e4160ed4cc157c4ddd7f4f1c6e11ffa)

Not sure who "the master" is -- maybe you're referring to the Google Code
repo? The reason we started working on github over a year ago is that the
bugs we reported (and provided diffs for) in the Google Code project were
ignored. For example:
http://code.google.com/p/hadoop-gpl-compression/issues/detail?id=17

In fact this repo hasn't been updated since Sep '09:
http://code.google.com/p/hadoop-gpl-compression/source/list

Github provided an excellent place to collaborate on the project, make
progress, fix bugs, and provide a better product for the users.

As for "dumping ground," I don't quite follow your point - we develop in the
open, accept pull requests from users, and code review each other's changes.
Since October every commit has either been contributed by or fixes a bug
reported by a user completely outside of the organizations where Kevin and I
work.

I agree that it's a bit of "breadcrumb following" to find the repo, though.
We do at least have a link on the wiki:
http://wiki.apache.org/hadoop/UsingLzoCompression which points to Kevin's
repo.

Perhaps the best solution here is to add a page to the official Hadoop site
(not just the wiki) with links to actively maintained contrib projects?


>
> > IMO the more we can take non-core components and move them to separate
> > release timelines, the better. Yes, it is harder for users, but it also
> is
> > easier for them when they hit a bug - they don't have to wait months for
> a
> > wholesale upgrade which might contain hundreds of other changes to core
> > components.
>
> I'd agree except for one thing:  even when users do provide patches
> to contrib components we ignore them.  How long have those patches for HOD
> been sitting there in the patch queue?  So of course they wait
> months/years--because we seemingly ignore anything that isn't important to
> us.  Unfortunately, that covers a large chunk of contrib. :(
>

True - we ignore them because the core contributors generally have little
clue about the contrib components, so don't feel qualified to review. I'll
happily admit that I've never run failmon, index, dynamic-scheduler,
eclipse-plugin, data_join, mumak, or vertica contribs. Wouldn't you rather
these components lived on github so the people who wrote them could update
them as they wished without having to wait on committers who have little to
no clue about how to evaluate the changes?

-Todd
-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Hadoop-common-trunk-Commit is failing since 01/19/2011

2011-02-01 Thread Konstantin Shvachko
Giri,
Thanks a lot for fixing this.
I see it is working now.
--Konstantin

On Tue, Feb 1, 2011 at 11:27 AM, Giridharan Kesavan
wrote:

> Konstantin,
>
> trunk/artifacts gets populated when the jar and tar ant targets are
> successful.
>
> The main reason for the build failure so far is the build abort time
> configuration. It was set to 30mins.
> I have increased the build abort time and the builds are going on fine
>
> https://hudson.apache.org/hudson/view/G-L/view/Hadoop/job/Hadoop-Common-trunk-Commit
>
>
> Thanks,
> Giri
>
> On Feb 1, 2011, at 12:40 AM, Konstantin Shvachko wrote:
>
> > Giri,
> >
> > Looking at configuration of Hadoop-Common-trunk-Commit/
> > There seems to be errors in the Post-build Actions.
> > It is complaining that
> > 'trunk' exists but not 'trunk/artifacts/...'
> > Is it possible that this misconfiguration is the reason of failures?
> >
> > --Konstantin
> >
> >
> > On Mon, Jan 31, 2011 at 4:40 PM, Giridharan Kesavan
> > wrote:
> >
> >> Konstantin,
> >>
> >> I think I need to restart the slave which is running the commit build.
> For
> >> now I have published the common artifact manually from commandline.
> >>
> >> Thanks,
> >> Giri
> >>
> >> On Jan 31, 2011, at 4:27 PM, Konstantin Shvachko wrote:
> >>
> >>> Giri
> >>> looks like the last run you started failed the same way as previous
> ones.
> >>> Any thoughts on what's going on?
> >>> Thanks,
> >>> --Konstantin
> >>>
> >>> On Mon, Jan 31, 2011 at 3:33 PM, Giridharan Kesavan
> >>> wrote:
> >>>
>  ant mvn-deploy would publish snapshot artifact to the apache maven
>  repository as long as you have the right credentials in
> ~/.m2/settings.xml.
> 
>  For settings.xml template pls look at
>  http://wiki.apache.org/hadoop/HowToRelease
> 
>  I'm pushing the latest common artifacts now.
> 
>  -Giri
> 
> 
> 
>  On Jan 31, 2011, at 3:11 PM, Jakob Homan wrote:
> 
> > By manually installing a new core jar into the cache, I can compile
> > trunk.  Looks like we just need to kick a new Core into maven.  Are
> > there instructions somewhere for committers to do this?  I know Nigel
> > and Owen know how, but I don't know if the knowledge is diffused past
> > them.
> > -Jakob
> >
> >
> > On Mon, Jan 31, 2011 at 1:57 PM, Konstantin Shvachko
> >  wrote:
> >> Current trunks for HDFS and MapReduce are not compiling at the
> moment.
>  Try to
> >> build trunk.
> >> This is because the changes to the common API introduced by
>  HADOOP-6904
> >> have not been promoted to the HDFS and MR trunks.
> >> HDFS-1335 and MAPREDUCE-2263 depend on these changes.
> >>
> >> Common is not promoted to HDFS and MR because
> >> Hadoop-Common-trunk-Commit
> >> build is broken. See here.
> >>
> 
> >>
> https://hudson.apache.org/hudson/view/G-L/view/Hadoop/job/Hadoop-Common-trunk-Commit/
> >>
> >> As I see the last successful build was on 01/19, which integrated
> >> HADOOP-6864.
> >> I think this is when JNI changes were introduced, which cannot be
>  digested
> >> by Hudson since then.
> >>
> >> Anybody with gcc active could you please verify if the problem is
> >> caused
>  by
> >> HADOOP-6864.
> >>
> >> Thanks,
> >> --Konstantin
> >>
> >> On Mon, Jan 31, 2011 at 1:36 PM, Ted Dunning  >
>  wrote:
> >>
> >>> There has been a problem with more than one build failing (Mahout is
> >> the
>  one
> >>> that I saw first) due to a change in maven version which meant that
> >> the
> >>> clover license isn't being found properly.  At least, that is the
> >> tale
>  I
> >>> heard from infra.
> >>>
> >>> On Mon, Jan 31, 2011 at 1:31 PM, Eli Collins 
> >> wrote:
> >>>
>  Hey Konstantin,
> 
>  The only build breakage I saw from HADOOP-6904 is MAPREDUCE-2290,
>  which was fixed.  Trees from trunk are compiling against each
> other
>  for me (eg each installed to a local maven repo), perhaps the
> >> upstream
>  maven repo hasn't been updated with the latest bits yet.
> 
>  Thanks,
>  Eli
> 
>  On Mon, Jan 31, 2011 at 12:14 PM, Konstantin Shvachko
>   wrote:
> > Sending this to general to attract urgent attention.
> > Both HDFS and MapReduce are not compiling since
> > HADOOP-6904 and its hdfs and MP counterparts were committed.
> > The problem is not with this patch as described below, but I
> think
> >>> those
> > commits should be reversed if Common integration build cannot be
> > restored promptly.
> >
> > Thanks,
> > --Konstantin
> >
> >
> > On Fri, Jan 28, 2011 at 5:53 PM, Konstantin Shvachko
> > wrote:
> >
> >> I see Hadoop-common-trunk-Commit is failing and not sending any
> >>> emails.
> >> It times out on native 

Re: Hadoop-common-trunk-Commit is failing since 01/19/2011

2011-02-01 Thread Giridharan Kesavan
Konstantin,

trunk/artifacts gets populated when the jar and tar ant targets are
successful.

The main reason for the build failure so far is the build abort time 
configuration. It was set to 30mins.
I have increased the build abort time and the builds are going on fine 
https://hudson.apache.org/hudson/view/G-L/view/Hadoop/job/Hadoop-Common-trunk-Commit


Thanks,
Giri

On Feb 1, 2011, at 12:40 AM, Konstantin Shvachko wrote:

> Giri,
> 
> Looking at configuration of Hadoop-Common-trunk-Commit/
> There seems to be errors in the Post-build Actions.
> It is complaining that
> 'trunk' exists but not 'trunk/artifacts/...'
> Is it possible that this misconfiguration is the reason of failures?
> 
> --Konstantin
> 
> 
> On Mon, Jan 31, 2011 at 4:40 PM, Giridharan Kesavan
> wrote:
> 
>> Konstantin,
>> 
>> I think I need to restart the slave which is running the commit build. For
>> now I have published the common artifact manually from commandline.
>> 
>> Thanks,
>> Giri
>> 
>> On Jan 31, 2011, at 4:27 PM, Konstantin Shvachko wrote:
>> 
>>> Giri
>>> looks like the last run you started failed the same way as previous ones.
>>> Any thoughts on what's going on?
>>> Thanks,
>>> --Konstantin
>>> 
>>> On Mon, Jan 31, 2011 at 3:33 PM, Giridharan Kesavan
>>> wrote:
>>> 
 ant mvn-deploy would publish snapshot artifact to the apache maven
 repository as long as you have the right credentials in ~/.m2/settings.xml.
 
 For settings.xml template pls look at
 http://wiki.apache.org/hadoop/HowToRelease
 
 I'm pushing the latest common artifacts now.
 
 -Giri
 
 
 
 On Jan 31, 2011, at 3:11 PM, Jakob Homan wrote:
 
> By manually installing a new core jar into the cache, I can compile
> trunk.  Looks like we just need to kick a new Core into maven.  Are
> there instructions somewhere for committers to do this?  I know Nigel
> and Owen know how, but I don't know if the knowledge is diffused past
> them.
> -Jakob
> 
> 
> On Mon, Jan 31, 2011 at 1:57 PM, Konstantin Shvachko
>  wrote:
>> Current trunks for HDFS and MapReduce are not compiling at the moment.
 Try to
>> build trunk.
>> This is because the changes to the common API introduced by
 HADOOP-6904
>> have not been promoted to the HDFS and MR trunks.
>> HDFS-1335 and MAPREDUCE-2263 depend on these changes.
>> 
>> Common is not promoted to HDFS and MR because
>> Hadoop-Common-trunk-Commit
>> build is broken. See here.
>> 
 
>> https://hudson.apache.org/hudson/view/G-L/view/Hadoop/job/Hadoop-Common-trunk-Commit/
>> 
>> As I see the last successful build was on 01/19, which integrated
>> HADOOP-6864.
>> I think this is when JNI changes were introduced, which cannot be
 digested
>> by Hudson since then.
>> 
>> Anybody with gcc active could you please verify if the problem is
>> caused
 by
>> HADOOP-6864.
>> 
>> Thanks,
>> --Konstantin
>> 
>> On Mon, Jan 31, 2011 at 1:36 PM, Ted Dunning 
 wrote:
>> 
>>> There has been a problem with more than one build failing (Mahout is
>> the
 one
>>> that I saw first) due to a change in maven version which meant that
>> the
>>> clover license isn't being found properly.  At least, that is the
>> tale
 I
>>> heard from infra.
>>> 
>>> On Mon, Jan 31, 2011 at 1:31 PM, Eli Collins 
>> wrote:
>>> 
 Hey Konstantin,
 
 The only build breakage I saw from HADOOP-6904 is MAPREDUCE-2290,
 which was fixed.  Trees from trunk are compiling against each other
 for me (eg each installed to a local maven repo), perhaps the
>> upstream
 maven repo hasn't been updated with the latest bits yet.
 
 Thanks,
 Eli
 
 On Mon, Jan 31, 2011 at 12:14 PM, Konstantin Shvachko
  wrote:
> Sending this to general to attract urgent attention.
> Both HDFS and MapReduce are not compiling since
> HADOOP-6904 and its hdfs and MP counterparts were committed.
> The problem is not with this patch as described below, but I think
>>> those
> commits should be reversed if Common integration build cannot be
> restored promptly.
> 
> Thanks,
> --Konstantin
> 
> 
> On Fri, Jan 28, 2011 at 5:53 PM, Konstantin Shvachko
> wrote:
> 
>> I see Hadoop-common-trunk-Commit is failing and not sending any
>>> emails.
>> It times out on native compilation and aborts.
>> Therefore changes are not integrated, and now it has led to hdfs and
 mapreduce
>> both not compiling.
>> Can somebody please take a look at this.
>> The last few lines of the build are below.
>> 
>> Thanks
>> --Konstantin
>> 
>>   [javah] [Loaded
 
>>> 
 
>> /grid/0/hudson/hudson-sla

Re: Defining Compatibility

2011-02-01 Thread Tom White
FWIW the FileSystemContractBaseTest class and the FileContext*BaseTest
classes (and their concrete subclasses) are probably the closest thing
we have to compatibility tests for FileSystem and FileContext
implementations in Hadoop.
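
As a rough illustration of how an implementation hooks into those tests, a
concrete subclass only has to supply an initialized FileSystem; the base class
then runs the generic contract checks (mkdirs, rename, overwrite behaviour,
and so on). A minimal sketch, assuming the JUnit 3-style API of this era and
using the local filesystem purely as a stand-in for the implementation under
test (the test class name is made up; the pattern mirrors the existing
TestLocalFileSystemContract subclass):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.FileSystemContractBaseTest;

  public class TestMyFileSystemContract extends FileSystemContractBaseTest {
    @Override
    protected void setUp() throws Exception {
      // Point the inherited 'fs' field at the FileSystem being tested.
      fs = FileSystem.getLocal(new Configuration());
    }
  }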

Tom

On Mon, Jan 31, 2011 at 7:59 AM, Steve Loughran  wrote:
> On 31/01/11 14:32, Chris Douglas wrote:
>>
>> Steve-
>>
>> It's hard to answer without more concrete criteria. Is this a
>> trademark question affecting the marketing of a product? A
>> cross-compatibility taxonomy for users? The minimum criteria to
>> publish a paper/release a product without eye-rolling? The particular
>> compatibility claims made by a system will be nuanced and specific; a
>> runtime that executes MapReduce jobs as they would run in Hadoop can
>> simply make that claim, whether it uses parts of MapReduce, HDFS, or
>> neither.
>
> No, I'm thinking more about what large-scale tests need to be run
> against the codebase before you can say "it works", and then how to show
> that after some changes it still works.
>
>>
>> For the various distributions "Powered by Apache Hadoop," one would
>> assume that compatibility will vary depending on the featureset and
>> the audience. A distribution that runs MapReduce applications
>> as-written for Apache Hadoop may be incompatible with a user's
>> deployed metrics/monitoring system. Some random script to scrape the
>> UI may not work. The product may only scale to 20 nodes. Whether these
>> are "compatible with Apache Hadoop" is awkward to answer generally,
>> unless we want to define the semantics of that phrase by policy.
>>
>> To put it bluntly, why would we bother to define such a policy? One
>> could assert that a fully-compatible system would implement all the
>> public/stable APIs as defined in HADOOP-5073, but who would that help?
>> And though interoperability is certainly relevant to systems built on
>> top of Hadoop, is there a reason the Apache project needs to be
>> involved in defining the standards for compatibility among them?
>
> Agreed, I'm just thinking about naming and definitions. Even with the
> stable/unstable internal/external split, there's still the question as to
> what the semantics of operations are, both explicit (this operation does X)
> and implicit (and it takes less than Y seconds to do it). It's those
> implicit things that always catch you out (indeed, they are the argument
> points in things like Java and Java EE compatibility test kits)
>
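
For reference, the "public/stable APIs as defined in HADOOP-5073" mentioned in
the quoted discussion are marked with classification annotations. A minimal,
hedged sketch (the class itself is made up for illustration); note that the
annotations capture the explicit contract, not the implicit semantics such as
latency discussed above:

  import org.apache.hadoop.classification.InterfaceAudience;
  import org.apache.hadoop.classification.InterfaceStability;

  @InterfaceAudience.Public
  @InterfaceStability.Stable
  public class ExampleClientApi {
    // Compatibility policy covers the signature and documented behaviour,
    // not how long the call takes.
    public void doSomething() { }
  }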


Re: [DISCUSS] Move common, hdfs, mapreduce contrib components to apache-extras.org or elsewhere

2011-02-01 Thread Tom White
+1 for the reasons already cited: independent release cycles,
testing/build problems, lack of maintenance, etc. I think we should
strongly discourage new contrib components in favour of Apache Extras
or github, remove inactive contrib components, and also allow
maintainers to move components out if they volunteer to.

HBase moved all its contrib components out of the main tree a few
months back - can anyone comment on how that worked out?

I agree that we should move streaming (MAPREDUCE-602) and the
schedulers to the main codebase. With work like MAPREDUCE-1478 we can
put these components into a library tree so that the libraries can
depend on core, but core doesn't depend on the libraries.

Milind: Record IO is in Common (in the main tree, not a contrib
component), and was deprecated in 0.21.0. We could remove it in a
future release.

Cheers,
Tom

On Tue, Feb 1, 2011 at 1:02 AM, Allen Wittenauer
 wrote:
>
> On Jan 31, 2011, at 3:23 PM, Todd Lipcon wrote:
>
>> On Sun, Jan 30, 2011 at 11:19 PM, Owen O'Malley  wrote:
>>
>>>
>>> Also note that pushing code out of Hadoop has a high cost. There are at
>>> least 3 forks of the hadoop-gpl-compression code. That creates a lot of
>>> confusion for the users. A lot of users never go to the work to figure out
>>> which fork and branch of hadoop-gpl-compression work with the version of
>>> Hadoop they installed.
>>>
>>>
>> Indeed it creates confusion, but in my opinion it has been very successful
>> modulo that confusion.
>
>        I'm not sure how the above works with what you wrote below:
>
>> In particular, Kevin and I (who each have a repo on github but basically
>> co-maintain a branch) have done about 8 bugfix releases of LZO in the last
>> year. The ability to take a bug and turn it around into a release within a
>> few days has been very beneficial to the users. If it were part of core
>> Hadoop, people would be forced to live with these blocker bugs for months at
>> a time between dot releases.
>
>        So is the expectation that users would have to follow bread crumbs to 
> the github dumping ground, then try to figure out which repo is the 'better' 
> choice for their usage?   Using LZO as an example, it appears we have a 
> choice of Kevin's, yours, or the master without even taking into 
> consideration any tags. That sounds like a recipe for disaster that's even 
> worse than what we have today.
>
>
>> IMO the more we can take non-core components and move them to separate
>> release timelines, the better. Yes, it is harder for users, but it also is
>> easier for them when they hit a bug - they don't have to wait months for a
>> wholesale upgrade which might contain hundreds of other changes to core
>> components.
>
>        I'd agree except for one thing:  even when users do provide patches to 
> contrib components we ignore them.  How long have those patches for HOD been 
> sitting there in the patch queue?  So of course they wait 
> months/years--because we seemingly ignore anything that isn't important to 
> us.  Unfortunately, that covers a large chunk of contrib. :(
>
>
>


Re: [ANNOUNCEMENT] Yahoo focusing on Apache Hadoop, discontinuing "The Yahoo Distribution of Hadoop"

2011-02-01 Thread Andrew Purtell
> From: Alan Gates 
>
> We will be proposing Howl as an Incubator project soon.

That would be excellent.

Best regards,

- Andy

Problems worthy of attack prove their worth by hitting back.
  - Piet Hein (via Tom White)



  


[ANN] Plasma MapReduce, PlasmaFS, version 0.3

2011-02-01 Thread Gerd Stolpmann
Hi,

This is about the release of Plasma-0.3, an alternate and independent
implementation of map/reduce with its own dfs. This might also be
interesting for Hadoop users and developers, because this project
incorporates a number of new ideas. So far, Plasma works on smaller
clusters and shows good signs of being scalable. HA support is still
very incomplete.

--

Plasma consists of two parts (for now), namely Plasma MapReduce, a
map/reduce compute framework, and PlasmaFS, the underlying distributed
filesystem.

Major changes in version 0.3 :

  * Optimized blocklist representation (extent-based)
  * Improved block allocator to minimize disk seeks
  * Allocating datanode access tickets in advance
  * Sophisticated RAM management
  * The command-line utility "plasma" supports wildcards

Of course, there are also numerous bug fixes and performance
improvements.

Plasma MapReduce is a distributed implementation of the map/reduce
algorithm scheme written in OCaml. PlasmaFS is the underlying
distributed filesystem, also written in OCaml. The PlasmaFS approach in
particular has numerous differences compared to HDFS:

  * Data blocks are preallocated, and PlasmaFS takes care of block
placement
  * Blocklists are extent-based
  * Metadata is stored in a PostgreSQL db
  * 2-phase commit is used to distribute the metadata db
  * the full set of file access functions is supported, including
random writes
  * file accesses can be transaction-based
  * shared memory can be used for speeding up the data path to
locally stored data blocks
  * we _think_ it is not possible to corrupt the namenode by
accident or by crashes
  * PlasmaFS volumes can be directly mounted via NFS
  * PlasmaFS uses ONCRPC as protocol and not home-grown protocols
(and one of the next releases will add security via GSS-API)
  * We got rid of multi-threading

User programs need not be written in OCaml, as Plasma also
supports a streaming mode.
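
By "streaming mode" the announcement presumably means, as in Hadoop Streaming,
that mappers and reducers can be external programs reading records on stdin
and writing key/value pairs on stdout; Plasma's exact protocol is not shown
here. A generic, hypothetical sketch of such a mapper (plain Java, no
Plasma-specific API):

  import java.io.BufferedReader;
  import java.io.IOException;
  import java.io.InputStreamReader;

  // Reads lines from stdin and emits tab-separated <word, 1> pairs on stdout.
  public class StreamingWordCountMapper {
    public static void main(String[] args) throws IOException {
      BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
      String line;
      while ((line = in.readLine()) != null) {
        for (String word : line.split("\\s+")) {
          if (!word.isEmpty()) {
            System.out.println(word + "\t1");
          }
        }
      }
    }
  }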

Both pieces of software are bundled together in one download. The
project page with further links is

http://projects.camlcity.org/projects/plasma.html

There is now also a homepage at

http://plasma.camlcity.org

This is an early alpha release (0.3). A lot of things work already, and
you can already run distributed map/reduce jobs. However, it is in no
way complete.

Plasma is installable via GODI for OCaml 3.12.

For discussions on specifics of Plasma there is a separate mailing list:

https://godirepo.camlcity.org/mailman/listinfo/plasma-list

Gerd
-- 

Gerd Stolpmann, Bad Nauheimer Str.3, 64289 Darmstadt,Germany 
g...@gerd-stolpmann.de  http://www.gerd-stolpmann.de
Phone: +49-6151-153855  Fax: +49-6151-997714





Re: [ANNOUNCEMENT] Yahoo focusing on Apache Hadoop, discontinuing "The Yahoo Distribution of Hadoop"

2011-02-01 Thread Alan Gates

We will be proposing Howl as an Incubator project soon.

Alan.

On Jan 31, 2011, at 7:44 PM, Jeff Hammerbacher wrote:

Excellent news! Will you also make Howl, Oozie, and Yarn Apache projects as well?





Re: [ANNOUNCEMENT] Yahoo focusing on Apache Hadoop, discontinuing "The Yahoo Distribution of Hadoop"

2011-02-01 Thread Ian Holsman
Congratulations Eric.
this is fantastic news.
On Jan 31, 2011, at 10:27 PM, Eric Baldeschwieler wrote:

> Hi Folks,
> 
> I'm pleased to announce that after some reflection, Yahoo! has decided to 
> discontinue "The Yahoo Distribution of Hadoop" and focus on Apache 
> Hadoop.  We plan to remove all references to a Yahoo distribution from our 
> website (developer.yahoo.com/hadoop), close our github repo 
> (yahoo.github.com/hadoop-common) and focus on working more closely with the 
> Apache community.  Our intent is to return to helping Apache produce binary 
> releases of Apache Hadoop that are so bullet proof that Yahoo and other 
> production Hadoop users can run them unpatched on their clusters.
> 
> Until Hadoop 0.20, Yahoo committers worked as release masters to produce 
> binary Apache Hadoop releases that the entire community used on their 
> clusters.  As the community grew, we experimented with using the "Yahoo! 
> Distribution of Hadoop" as the vehicle to share our work.  Unfortunately, 
> Apache is no longer the obvious place to go for Hadoop releases.  The Yahoo! 
> team wants to return to a world where anyone can download and directly use 
> releases of Hadoop from Apache.  We want to contribute to the stabilization 
> and testing of those releases.  We also want to share our regular program of 
> sustaining engineering that backports minor feature enhancements into new dot 
> releases on a regular basis, so that the world sees regular improvements 
> coming from Apache every few months, not years.
> 
> Recently the Apache Hadoop community has been very turbulent.  Over the last 
> few months we have been developing Hadoop enhancements in our internal git 
> repository while doing a complete review of our options. Our commitment to 
> open sourcing our work was never in doubt (see http://yhoo.it/e8p3Dd), but 
> the future of the "Yahoo distribution of Hadoop" was far from clear.  We've 
> concluded that focusing on Apache Hadoop is the way forward.  We believe that 
> more focus on communicating our goals to the Apache Hadoop community, and 
> more willingness to compromise on how we get to those goals, will help us get 
> back to making Hadoop even better.
> 
> Unfortunately, we now have to sort out how to contribute several person-years 
> worth of work to Apache to let us unwind the Yahoo! git repositories.  We 
> currently run two lines of Hadoop development, our sustaining program 
> (hadoop-0.20-sustaining) and hadoop-future.  Hadoop-0.20-sustaining is the 
> stable version of Hadoop we currently run on Yahoo's 40,000 nodes.  It 
> contains a series of fixes and enhancements that are all backwards compatible 
> with our "Hadoop 0.20 with security".  It is our most stable and high 
> performance release of Hadoop ever.  We've expended a lot of energy finding 
> and fixing bugs in it this year. We have initiated the process of 
> contributing this work to Apache in the branch: 
> hadoop/common/branches/branch-0.20-security.  We've proposed calling this the 
> 20.100 release.  Once folks have had a chance to try this out and we've had a 
> chance to respond to their feedback, we plan to create 20.100 release 
> candidates and ask the community to vote on making them Apache releases. 
> 
> Hadoop-future is our new feature branch.  We are working on a set of new 
> features for Hadoop to improve its availability, scalability and 
> interoperability to make Hadoop more usable in mission critical deployments. 
> You're going to see another burst of email activity from us as we work to get 
> hadoop-future patches socialized, reviewed and checked in.  These bulk 
> checkins are exceptional.  They are the result of us striving to be more 
> transparent.  Once we've merged our hadoop-future and hadoop-0.20-sustaining 
> work back into Apache, folks can expect us to return to our regular 
> development cadence.  Looking forward, we plan to socialize our roadmaps 
> regularly, actively synchronize our work with other active Hadoop 
> contributors and develop our code collaboratively, directly in Apache.
> 
> In summary, our decision to discontinue the "Yahoo! Distribution of Hadoop" 
> is a commitment to working more effectively with the Apache Hadoop community. 
>  Our goal is to make Apache Hadoop THE open source platform for big data.
> 
> Thanks,
> 
> E14
> 
> --
> 
> PS Here is a draft list of key features in hadoop-future:
> 
> * HDFS-1052 - Federation, the ability to support much more storage per Hadoop 
> cluster.
> 
> * HADOOP-6728 - A new metrics framework
> 
> * MAPREDUCE-1220 - Optimizations for small jobs
> 
> ---
> PPS This is cross-posted on our blog: http://yhoo.it/i9Ww8W



Job opportunities - 2 roles with Hadoop

2011-02-01 Thread Magdalena Moll-Musiał
Dear All,

I am looking for experienced candidates to join Nokia in Berlin.

Please follow the links to read the descriptions of the roles:

http://www.mm-consulting.com.pl/senior-software-engineer---machine-learning.html

http://www.mm-consulting.com.pl/analitics.html

Your CVs are more than welcome.

Please provide me with your CV in English and suggest when I could
call you to discuss the opportunity (a 15-minute conversation).

In case of questions, please let me know.

Best regards,

Magda

__

mmConsulting Magdalena Moll-Musiał

recruitment partner

+48 516 340 127

off...@mm-consulting.com.pl

www.mm-consulting.com.pl



Re: [DISCUSS] Move common, hdfs, mapreduce contrib components to apache-extras.org or elsewhere

2011-02-01 Thread Allen Wittenauer

On Jan 31, 2011, at 3:23 PM, Todd Lipcon wrote:

> On Sun, Jan 30, 2011 at 11:19 PM, Owen O'Malley  wrote:
> 
>> 
>> Also note that pushing code out of Hadoop has a high cost. There are at
>> least 3 forks of the hadoop-gpl-compression code. That creates a lot of
>> confusion for the users. A lot of users never go to the work to figure out
>> which fork and branch of hadoop-gpl-compression work with the version of
>> Hadoop they installed.
>> 
>> 
> Indeed it creates confusion, but in my opinion it has been very successful
> modulo that confusion.

I'm not sure how the above works with what you wrote below:

> In particular, Kevin and I (who each have a repo on github but basically
> co-maintain a branch) have done about 8 bugfix releases of LZO in the last
> year. The ability to take a bug and turn it around into a release within a
> few days has been very beneficial to the users. If it were part of core
> Hadoop, people would be forced to live with these blocker bugs for months at
> a time between dot releases.

So is the expectation that users would have to follow bread crumbs to 
the github dumping ground, then try to figure out which repo is the 'better' 
choice for their usage?   Using LZO as an example, it appears we have a choice 
of kevin's, your's, or the master without even taking into consideration any 
tags. That sounds like a recipe for disaster that's even worse than what we 
have today.


> IMO the more we can take non-core components and move them to separate
> release timelines, the better. Yes, it is harder for users, but it also is
> easier for them when they hit a bug - they don't have to wait months for a
> wholesale upgrade which might contain hundreds of other changes to core
> components.

I'd agree except for one thing:  even when users do provide patches to 
contrib components we ignore them.  How long have those patches for HOD been 
sitting there in the patch queue?  So of course they wait months/years--because 
we seemingly ignore anything that isn't important to us.  Unfortunately, that 
covers a large chunk of contrib. :(




Re: Hadoop-common-trunk-Commit is failing since 01/19/2011

2011-02-01 Thread Konstantin Shvachko
Giri,

Looking at configuration of Hadoop-Common-trunk-Commit/
There seems to be errors in the Post-build Actions.
It is complaining that
'trunk' exists but not 'trunk/artifacts/...'
Is it possible that this misconfiguration is the reason of failures?

--Konstantin


On Mon, Jan 31, 2011 at 4:40 PM, Giridharan Kesavan
wrote:

> Konstantin,
>
> I think I need to restart the slave which is running the commit build. For
> now I have published the common artifact manually from commandline.
>
> Thanks,
> Giri
>
> On Jan 31, 2011, at 4:27 PM, Konstantin Shvachko wrote:
>
> > Giri
> > looks like the last run you started failed the same way as previous ones.
> > Any thoughts on what's going on?
> > Thanks,
> > --Konstantin
> >
> > On Mon, Jan 31, 2011 at 3:33 PM, Giridharan Kesavan
> > wrote:
> >
> >> ant mvn-deploy would publish snapshot artifact to the apache maven
> >> repository as long as you have the right credentials in ~/.m2/settings.xml.
> >>
> >> For settings.xml template pls look at
> >> http://wiki.apache.org/hadoop/HowToRelease
> >>
> >> I'm pushing the latest common artifacts now.
> >>
> >> -Giri
> >>
> >>
> >>
> >> On Jan 31, 2011, at 3:11 PM, Jakob Homan wrote:
> >>
> >>> By manually installing a new core jar into the cache, I can compile
> >>> trunk.  Looks like we just need to kick a new Core into maven.  Are
> >>> there instructions somewhere for committers to do this?  I know Nigel
> >>> and Owen know how, but I don't know if the knowledge is diffused past
> >>> them.
> >>> -Jakob
> >>>
> >>>
> >>> On Mon, Jan 31, 2011 at 1:57 PM, Konstantin Shvachko
> >>>  wrote:
>  Current trunks for HDFS and MapReduce are not compiling at the moment.
> >> Try to
>  build trunk.
>  This is because the changes to the common API introduced by
> >> HADOOP-6904
>  have not been promoted to the HDFS and MR trunks.
>  HDFS-1335 and MAPREDUCE-2263 depend on these changes.
> 
>  Common is not promoted to HDFS and MR because
> Hadoop-Common-trunk-Commit
>  build is broken. See here.
> 
> >>
> https://hudson.apache.org/hudson/view/G-L/view/Hadoop/job/Hadoop-Common-trunk-Commit/
> 
>  As I see the last successful build was on 01/19, which integrated
>  HADOOP-6864.
>  I think this is when JNI changes were introduced, which cannot be
> >> digested
>  by Hudson since then.
> 
>  Anybody with gcc active could you please verify if the problem is
> caused
> >> by
>  HADOOP-6864.
> 
>  Thanks,
>  --Konstantin
> 
>  On Mon, Jan 31, 2011 at 1:36 PM, Ted Dunning 
> >> wrote:
> 
> > There has been a problem with more than one build failing (Mahout is
> the
> >> one
> > that I saw first) due to a change in maven version which meant that
> the
> > clover license isn't being found properly.  At least, that is the
> tale
> >> I
> > heard from infra.
> >
> > On Mon, Jan 31, 2011 at 1:31 PM, Eli Collins 
> wrote:
> >
> >> Hey Konstantin,
> >>
> >> The only build breakage I saw from HADOOP-6904 is MAPREDUCE-2290,
> >> which was fixed.  Trees from trunk are compiling against each other
> >> for me (eg each installed to a local maven repo), perhaps the
> upstream
> >> maven repo hasn't been updated with the latest bits yet.
> >>
> >> Thanks,
> >> Eli
> >>
> >> On Mon, Jan 31, 2011 at 12:14 PM, Konstantin Shvachko
> >>  wrote:
> >>> Sending this to general to attract urgent attention.
> >>> Both HDFS and MapReduce are not compiling since
> >>> HADOOP-6904 and its hdfs and MP counterparts were committed.
> >>> The problem is not with this patch as described below, but I think
> > those
> >>> commits should be reversed if Common integration build cannot be
> >>> restored promptly.
> >>>
> >>> Thanks,
> >>> --Konstantin
> >>>
> >>>
> >>> On Fri, Jan 28, 2011 at 5:53 PM, Konstantin Shvachko
> >>> wrote:
> >>>
>  I see Hadoop-common-trunk-Commit is failing and not sending any
> > emails.
>  It times out on native compilation and aborts.
>  Therefore changes are not integrated, and now it has led to hdfs and
> >> mapreduce
>  both not compiling.
>  Can somebody please take a look at this.
>  The last few lines of the build are below.
> 
>  Thanks
>  --Konstantin
> 
> [javah] [Loaded
> >>
> >
> >>
> /grid/0/hudson/hudson-slave/workspace/Hadoop-Common-trunk-Commit/trunk/build/classes/org/apache/hadoop/security/JniBasedUnixGroupsMapping.class]
> 
> [javah] [Loaded
> >>
> >
> >>
> /homes/hudson/tools/java/jdk1.6.0_11-32/jre/lib/rt.jar(java/lang/Object.class)]
> [javah] [Forcefully writing file
> >>
> >
> >>
> /grid/0/hudson/hudson-slave/workspace/Hadoop-Common-trunk-Commit/trunk/build/native/Linux-i386-32/src/org/apache/hadoop/security/org_apache_hadoop_security_JniBasedUnixGroupsNetgroup