Giving filename as key to mapper ?

2011-07-15 Thread praveenesh kumar
Hi,
How can I give the filename as the key to the mapper?
I want to count the occurrences of words in a set of docs, so I want to keep the
filename as the key. Is it possible to get the filename as the input key in the map function?
Thanks,
Praveenesh


Re: Giving filename as key to mapper ?

2011-07-15 Thread Harsh J
You can retrieve the filename in the new API as described here:

http://search-hadoop.com/m/ZOmmJ1PZJqt1/map+input+filenamesubj=Retrieving+Filename

In the old API, it's available in the configuration instance of the
mapper under the key map.input.file. See the table below this section
http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html#Task+JVM+Reuse
for more such goodies.
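[Editor's note: the new-API approach above can be sketched as follows. This is a minimal, illustrative mapper, assuming a FileInputFormat-based job so that the split can be cast to FileSplit; the class and variable names are made up for the example.]

```java
// Sketch of a new-API (org.apache.hadoop.mapreduce) mapper that emits the
// input file's name as the output key. Illustrative only.
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class FilenameKeyMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text filename = new Text();

  @Override
  protected void setup(Context context) {
    // Each map task processes a single split, so the file name is
    // fixed for the lifetime of the task.
    FileSplit split = (FileSplit) context.getInputSplit();
    filename.set(split.getPath().getName());
  }

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    for (String word : line.toString().split("\\s+")) {
      if (!word.isEmpty()) {
        // Emits (filename, 1) per word; a composite key of
        // filename + word would give per-file word counts.
        context.write(filename, ONE);
      }
    }
  }
}
```

For per-file word counts, the reducer would sum the values for each key as in the standard word-count example.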

On Fri, Jul 15, 2011 at 5:44 PM, praveenesh kumar praveen...@gmail.com wrote:
 Hi,
 How can I give filename as key to mapper ?
 I want to know the occurence of word in set of docs, so I want to keep key
 as filename. Is it possible to give input key as filename in map function ?
 Thanks,
 Praveenesh




-- 
Harsh J


RE: Which release to use?

2011-07-15 Thread Isaac Dooley
Will 0.23 include Kerberos authentication? Will this finally unite the Yahoo 
and Apache branches?

-Original Message-
From: Arun C Murthy [mailto:a...@hortonworks.com] 
Sent: Thursday, July 14, 2011 7:43 PM
To: common-user@hadoop.apache.org
Subject: Re: Which release to use?

Hi,

 0.20.203 is the latest stable release, which includes a ton of features 
(security: Kerberos-based authentication) and fixes. It's currently deployed on 
over 50k machines at Yahoo too.
 So, yes, I'd encourage you to use 0.20.203. We, the community, are currently 
working on hadoop-0.23 and hope to get it out soon.

thanks,
Arun

On Jul 14, 2011, at 4:33 PM, Teruhiko Kurosaka wrote:

 I'm a newbie and I am confused by the Hadoop releases.
 I thought 0.21.0 is the latest & greatest release that I
 should be using, but I noticed 0.20.203 has been released
 lately, and 0.21.X is marked unstable, unsupported.
 
 Should I be using 0.20.203?
 
 T. Kuro Kurosaka
 
 



Re: Which release to use?

2011-07-15 Thread Jonathan Coveney
Isaac: there is no more yahoo branch. They are committing all of their code
to apache.

2011/7/15 Isaac Dooley isaac.doo...@twosigma.com

 Will 0.23 include Kerberos authentication? Will this finally unite the
 Yahoo and Apache branches?

 -Original Message-
 From: Arun C Murthy [mailto:a...@hortonworks.com]
 Sent: Thursday, July 14, 2011 7:43 PM
 To: common-user@hadoop.apache.org
 Subject: Re: Which release to use?

 Hi,

  0.20.203 is the latest stable release which includes a ton of features
 (security - kerberos based authentication) and fixes. Its currently deployed
 at over 50k machines at Yahoo too.
  So, yes, I'd encourage you to use 0.20.203. We, the community, are
 currently working on hadoop-0.23 and hope to get it out soon.

 thanks,
 Arun

 On Jul 14, 2011, at 4:33 PM, Teruhiko Kurosaka wrote:

  I'm a newbie and I am confused by the Hadoop releases.
  I thought 0.21.0 is the latest  greatest release that I
  should be using but I noticed 0.20.203 has been released
  lately, and 0.21.X is marked unstable, unsupported.
 
  Should I be using 0.20.203?
  
  T. Kuro Kurosaka
 
 




Re: Which release to use?

2011-07-15 Thread Robert Evans
Adarsh,

Yahoo! no longer has its own distribution of Hadoop.  It has been merged into 
the 0.20.2XX line, so 0.20.203 is what Yahoo is running internally right now, 
and we are moving towards 0.20.204, which should be out soon.  I am not an 
expert on Cloudera so I cannot really map its releases to the Apache releases, 
but their distro is based off of Apache Hadoop with a few bug fixes, and maybe a 
few features like append added on top of it; you would need to talk to 
Cloudera about the exact details.  For the most part they are all very similar. 
 You need to think most about support; there are several companies that can 
sell you support if you want/need it.  You also need to think about features 
vs. stability.  The 0.20.203 release has been tested on a lot of machines by 
many different groups, but may be missing some features that are needed in some 
situations.

--Bobby


On 7/14/11 11:49 PM, Adarsh Sharma adarsh.sha...@orkash.com wrote:

Hadoop releases are issued time by time. But one more thing related to
hadoop usage,

There are so many providers that provides the distribution of Hadoop ;

1. Apache Hadoop
2. Cloudera
3. Yahoo

etc.
Which distribution is best among them for production usage?
I think Cloudera's is best among them.


Best Regards,
Adarsh
Owen O'Malley wrote:
 On Jul 14, 2011, at 4:33 PM, Teruhiko Kurosaka wrote:


 I'm a newbie and I am confused by the Hadoop releases.
 I thought 0.21.0 is the latest  greatest release that I
 should be using but I noticed 0.20.203 has been released
 lately, and 0.21.X is marked unstable, unsupported.

 Should I be using 0.20.203?


 Yes. I apologize for the confusing release numbering, but the best release to use 
 is 0.20.203.0. It includes security, job limits, and many other improvements 
 over 0.20.2 and 0.21.0. Unfortunately, it doesn't have the new sync support, 
 so it isn't suitable for use with HBase. Most large clusters use a separate 
 version of HDFS for HBase.

 -- Owen






Re: Giving filename as key to mapper ?

2011-07-15 Thread Robert Evans
To add to that: if you really want the file name to be the key, instead of just 
calling a different API in your map to get it, you will probably need to write 
your own input format.  It should be fairly simple, and you can base it 
off of an existing input format.

--Bobby
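
[Editor's note: such an input format might look like the hypothetical sketch below, which delegates line reading to the stock new-API LineRecordReader and substitutes the file name for the byte-offset key. The class name and structure are illustrative, not a standard Hadoop class.]

```java
// Hypothetical input format that presents the file name as the key for
// every record, wrapping the stock LineRecordReader.
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class FileNameTextInputFormat extends FileInputFormat<Text, Text> {

  @Override
  public RecordReader<Text, Text> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    return new RecordReader<Text, Text>() {
      private final LineRecordReader lines = new LineRecordReader();
      private final Text key = new Text();

      @Override
      public void initialize(InputSplit s, TaskAttemptContext c)
          throws IOException, InterruptedException {
        lines.initialize(s, c);
        // The key is the same for every record in this split: the file name.
        Path file = ((FileSplit) s).getPath();
        key.set(file.getName());
      }

      @Override
      public boolean nextKeyValue() throws IOException, InterruptedException {
        return lines.nextKeyValue();
      }

      @Override
      public Text getCurrentKey() { return key; }

      @Override
      public Text getCurrentValue() throws IOException, InterruptedException {
        return lines.getCurrentValue();
      }

      @Override
      public float getProgress() throws IOException, InterruptedException {
        return lines.getProgress();
      }

      @Override
      public void close() throws IOException { lines.close(); }
    };
  }
}
```

The job would then set this class via job.setInputFormatClass(FileNameTextInputFormat.class), and the mapper would receive (filename, line) pairs.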

On 7/15/11 7:40 AM, Harsh J ha...@cloudera.com wrote:

You can retrieve the filename in the new API as described here:

http://search-hadoop.com/m/ZOmmJ1PZJqt1/map+input+filenamesubj=Retrieving+Filename

In the old API, its available in the configuration instance of the
mapper as key map.input.file. See the table below this section
http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html#Task+JVM+Reuse
for more such goodies.

On Fri, Jul 15, 2011 at 5:44 PM, praveenesh kumar praveen...@gmail.com wrote:
 Hi,
 How can I give filename as key to mapper ?
 I want to know the occurence of word in set of docs, so I want to keep key
 as filename. Is it possible to give input key as filename in map function ?
 Thanks,
 Praveenesh




--
Harsh J



Re: Giving filename as key to mapper ?

2011-07-15 Thread praveenesh kumar
I am new to this Hadoop API. Can anyone give me some tutorial or code snippet
on how to write your own input format to do this kind of thing?
Thanks.

On Fri, Jul 15, 2011 at 8:07 PM, Robert Evans ev...@yahoo-inc.com wrote:

 To add to that if you really want the file name to be the key instead of
 just calling a different API in your map to get it you will probably need to
 write your own input format to do it.  It should be fairly simple and you
 can base it off of an existing input format to do it.

 --Bobby

 On 7/15/11 7:40 AM, Harsh J ha...@cloudera.com wrote:

 You can retrieve the filename in the new API as described here:


 http://search-hadoop.com/m/ZOmmJ1PZJqt1/map+input+filenamesubj=Retrieving+Filename

 In the old API, its available in the configuration instance of the
 mapper as key map.input.file. See the table below this section

 http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html#Task+JVM+Reuse
 for more such goodies.

 On Fri, Jul 15, 2011 at 5:44 PM, praveenesh kumar praveen...@gmail.com
 wrote:
  Hi,
  How can I give filename as key to mapper ?
  I want to know the occurence of word in set of docs, so I want to keep
 key
  as filename. Is it possible to give input key as filename in map function
 ?
  Thanks,
  Praveenesh
 



 --
 Harsh J




RE: Giving filename as key to mapper ?

2011-07-15 Thread GOEKE, MATTHEW (AG/1000)
If you have the source downloaded (and if you don't, I would suggest you get it), 
you can do a search for *InputFormat.java and you will have all the references 
you need. Also, you might want to check out http://codedemigod.com/blog/?p=120 
or take a look at the books "Hadoop in Action" or "Hadoop: The Definitive 
Guide".

Matt

-Original Message-
From: praveenesh kumar [mailto:praveen...@gmail.com] 
Sent: Friday, July 15, 2011 9:42 AM
To: common-user@hadoop.apache.org
Subject: Re: Giving filename as key to mapper ?

I am new to this hadoop API. Can anyone give me some tutorial or code snipet
on how to write your own input format to do these kind of things.
Thanks.

On Fri, Jul 15, 2011 at 8:07 PM, Robert Evans ev...@yahoo-inc.com wrote:

 To add to that if you really want the file name to be the key instead of
 just calling a different API in your map to get it you will probably need to
 write your own input format to do it.  It should be fairly simple and you
 can base it off of an existing input format to do it.

 --Bobby

 On 7/15/11 7:40 AM, Harsh J ha...@cloudera.com wrote:

 You can retrieve the filename in the new API as described here:


 http://search-hadoop.com/m/ZOmmJ1PZJqt1/map+input+filenamesubj=Retrieving+Filename

 In the old API, its available in the configuration instance of the
 mapper as key map.input.file. See the table below this section

 http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html#Task+JVM+Reuse
 for more such goodies.

 On Fri, Jul 15, 2011 at 5:44 PM, praveenesh kumar praveen...@gmail.com
 wrote:
  Hi,
  How can I give filename as key to mapper ?
  I want to know the occurence of word in set of docs, so I want to keep
 key
  as filename. Is it possible to give input key as filename in map function
 ?
  Thanks,
  Praveenesh
 



 --
 Harsh J


This e-mail message may contain privileged and/or confidential information, and 
is intended to be received only by persons entitled
to receive such information. If you have received this e-mail in error, please 
notify the sender immediately. Please delete it and
all attachments from any servers, hard drives or any other media. Other use of 
this e-mail by you is strictly prohibited.

All e-mails and attachments sent and received are subject to monitoring, 
reading and archival by Monsanto, including its
subsidiaries. The recipient of this e-mail is solely responsible for checking 
for the presence of Viruses or other Malware.
Monsanto, along with its subsidiaries, accepts no liability for any damage 
caused by any such code transmitted by or accompanying
this e-mail or any attachment.


The information contained in this email may be subject to the export control 
laws and regulations of the United States, potentially
including but not limited to the Export Administration Regulations (EAR) and 
sanctions regulations issued by the U.S. Department of
Treasury, Office of Foreign Asset Controls (OFAC).  As a recipient of this 
information you are obligated to comply with all
applicable U.S. export laws and regulations.



Re: Issue with MR code not scaling correctly with data sizes

2011-07-15 Thread Robert Evans
Please don't cross post.  I put common-user in BCC.

I really don't know for sure what is happening, especially without the code or 
more to go on, and debugging something remotely over e-mail is extremely 
difficult.  You are essentially doing a cross join, which is going to be very 
expensive no matter what you do. But I do have a few questions for you.


 1.  How large is the IDs file(s) you are using?  Have you updated the amount 
of heap the JVM has and the number of slots to accommodate it?
 2.  How are you storing the IDs in RAM to do the join?
 3.  Have you tried logging in your map/reduce code to verify that the number of 
entries you expect are being loaded at each stage?
 4.  Along with that, have you looked at the counters for your map/reduce 
program to verify that the number of records flowing through the 
system is as expected?

--Bobby

On 7/14/11 5:14 PM, GOEKE, MATTHEW (AG/1000) matthew.go...@monsanto.com 
wrote:

All,

I have a MR program that I feed a list of IDs and it generates the unique 
comparison set as a result. Example: if I have the list {1,2,3,4,5} then the 
resulting output would be {2x1, 3x2, 3x1, 4x3, 4x2, 4x1, 5x4, 5x3, 5x2, 5x1}, or 
(n^2-n)/2 comparisons. My code works just fine on smaller sets 
(I can verify fewer than 1000 fairly easily) but fails when I try to push the 
set to 10-20k IDs, which is annoying when the end goal is 1-10 million.

The flow of the program is:
1) Partition the IDs evenly, based on amount of output per value, into 
a set of keys equal to the number of reduce slots we currently have
2) Use the distributed cache to push the ID file out to the various 
reducers
3) In the setup of the reducer, populate an int array with the values 
from the ID file in distributed cache
4) Output a comparison only if the current ID from the values iterator 
is greater than the current iterator through the int array
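
[Editor's note: the pairing logic itself can be sanity-checked outside Hadoop. The stand-alone sketch below is illustrative, not the poster's actual job; it just demonstrates the (n^2-n)/2 "emit a comparison only when the first ID is greater" rule described in step 4.]

```java
import java.util.ArrayList;
import java.util.List;

// Stand-alone check of the unique-comparison-set logic: for each ID emit a
// pair with every ID that precedes it, yielding (n^2 - n) / 2 comparisons.
public class ComparisonSet {

  public static List<int[]> pairs(int[] ids) {
    List<int[]> out = new ArrayList<int[]>();
    for (int i = 0; i < ids.length; i++) {
      for (int j = 0; j < i; j++) {
        // e.g. for {1,2,3,4,5}: 2x1, 3x1, 3x2, 4x1, ... 5x4
        out.add(new int[] { ids[i], ids[j] });
      }
    }
    return out;
  }

  public static void main(String[] args) {
    List<int[]> p = pairs(new int[] { 1, 2, 3, 4, 5 });
    // (5*5 - 5) / 2 = 10 comparisons, matching the example in the thread
    System.out.println(p.size());
  }
}
```

Comparing this expected count against the job's reduce output-record counter is one quick way to see how much of the result set is being lost.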

I realize that this could be done many other ways, but this will be part of an 
Oozie workflow, so it made sense to just do it in MR for now. My issue is that 
when I try the larger ID files it only outputs part of the resulting data set, 
and there are no errors to be found. Part of me thinks that I need to tweak 
some site configuration properties, due to the size of data that is spilling to 
disk, but after scanning through all 3 site files I am having trouble pinpointing 
anything I think could be causing this. I moved from reading the file from HDFS 
to using the distributed cache for the join read, thinking that might solve my 
problem, but there seems to be something else I am overlooking.

Any advice is greatly appreciated!

Matt




RE: Which release to use?

2011-07-15 Thread Michael Segel

Unfortunately the picture is a bit more confusing.

Yahoo! is now HortonWorks. Their stated goal is to not have their own 
derivative release but to sell commercial support for the official Apache 
release.
So those selling commercial support are:
*Cloudera
*HortonWorks
*MapRTech
*EMC (reselling MapRTech, but had announced their own)
*IBM (not sure what they are selling exactly... still seems like smoke and 
mirrors...)
*DataStax 

So while you can use the Apache release, it may not make sense for your 
organization to do so. (Said as I don the flame retardant suit...)

The issue is that outside of HortonWorks, which is stating that it will 
support the official Apache release, everything else is a derivative work of 
Apache's Hadoop. From what I have seen, Cloudera's release is the closest to 
the Apache release.

Like I said, things are getting interesting.

HTH

-Mike



 From: ev...@yahoo-inc.com
 To: common-user@hadoop.apache.org
 Date: Fri, 15 Jul 2011 07:35:45 -0700
 Subject: Re: Which release to use?
 
 Adarsh,
 
 Yahoo! no longer has its own distribution of Hadoop.  It has been merged into 
 the 0.20.2XX line so 0.20.203 is what Yahoo is running internally right now, 
 and we are moving towards 0.20.204 which should be out soon.  I am not an 
 expert on Cloudera so I cannot really map its releases to the Apache 
 Releases, but their distro is based off of Apache Hadoop with a few bug fixes 
 and maybe a few features like append added in on top of it, but you need to 
 talk to Cloudera about the exact details.  For the most part they are all 
 very similar.  You need to think most about support, there are several 
 companies that can sell you support if you want/need it.  You also need to 
 think about features vs. stability.  The 0.20.203 release has been tested on 
 a lot of machines by many different groups, but may be missing some features 
 that are needed in some situations.
 
 --Bobby
 
 
 On 7/14/11 11:49 PM, Adarsh Sharma adarsh.sha...@orkash.com wrote:
 
 Hadoop releases are issued time by time. But one more thing related to
 hadoop usage,
 
 There are so many providers that provides the distribution of Hadoop ;
 
 1. Apache Hadoop
 2. Cloudera
 3. Yahoo
 
 etc.
 Which distribution is best among them on production usage.
 I think Cloudera's  is best among them.
 
 
 Best Regards,
 Adarsh
 Owen O'Malley wrote:
  On Jul 14, 2011, at 4:33 PM, Teruhiko Kurosaka wrote:
 
 
  I'm a newbie and I am confused by the Hadoop releases.
  I thought 0.21.0 is the latest  greatest release that I
  should be using but I noticed 0.20.203 has been released
  lately, and 0.21.X is marked unstable, unsupported.
 
  Should I be using 0.20.203?
 
 
  Yes, I apologize for confusing release numbering, but the best release to 
  use is 0.20.203.0. It includes security, job limits, and many other 
  improvements over 0.20.2 and 0.21.0. Unfortunately, it doesn't have the new 
  sync support so it isn't suitable for using with HBase. Most large clusters 
  use a separate version of HDFS for HBase.
 
  -- Owen
 
 
 
 
  

Re: Which release to use?

2011-07-15 Thread Owen O'Malley

On Jul 15, 2011, at 7:58 AM, Michael Segel wrote:

 So while you can use the Apache release, it may not make sense for your 
 organization to do so. (Said as I don the flame retardant suit...)

I obviously disagree. *grin* Apache Hadoop 0.20.203.0 is the most stable and 
well-tested release; it has been deployed on Yahoo's 40,000 Hadoop machines, in 
clusters of up to 4,500 machines, and has been used extensively for running 
production workloads. We are actively working to make the install and 
deployment of Apache Hadoop easier.

In terms of commercial support, HortonWorks is absolutely supporting the Apache 
releases. IBM is also supporting the Apache releases:

http://davidmenninger.ventanaresearch.com/2011/05/18/ibm-chooses-hadoop-unity-not-shipping-the-elephant/

So lack of commercial support isn't a problem...

-- Owen

RE: Which release to use?

2011-07-15 Thread Tom Deutsch
One quick clarification: IBM GA'd a product called BigInsights in 2Q. It 
faithfully uses the Hadoop stack and many related projects, but provides 
a number of extensions (that are compatible) based on customer requests. 
It's not appropriate to say any more on this list, but the info on it is all 
publicly available.



Tom Deutsch
Program Director
CTO Office: Information Management
Hadoop Product Manager / Customer Exec
IBM
3565 Harbor Blvd
Costa Mesa, CA 92626-1420
tdeut...@us.ibm.com




On 07/15/2011 07:58 AM, Michael Segel michael_se...@hotmail.com wrote
to common-user@hadoop.apache.org (RE: Which release to use?):

Unfortunately the picture is a bit more confusing.

Yahoo! is now HortonWorks. Their stated goal is to not have their own 
derivative release but to sell commercial support for the official Apache 
release.
So those selling commercial support are:
*Cloudera
*HortonWorks
*MapRTech
*EMC (reselling MapRTech, but had announced their own)
*IBM (not sure what they are selling exactly... still seems like smoke and 
mirrors...)
*DataStax 

So while you can use the Apache release, it may not make sense for your 
organization to do so. (Said as I don the flame retardant suit...)

The issue is that outside of HortonWorks which is stating that they will 
support the official Apache release, everything else is a derivative work 
of Apache's Hadoop. From what I have seen, Cloudera's release is the 
closest to the Apache release.

Like I said, things are getting interesting.

HTH

 
  


Re: Which release to use?

2011-07-15 Thread Arun C Murthy
Apache Hadoop is a volunteer-driven, open-source project. The contributors to 
Apache Hadoop, both individuals and folks across a diverse set of 
organizations, are committed to driving the project forward and making timely 
releases - see the discussion on hadoop-0.23, with a raft of newer features such as 
HDFS Federation, NextGen MapReduce, and plans for an HA NameNode, etc. 

As with most successful projects there are several options for commercial 
support to Hadoop or its derivatives.

However, Apache Hadoop has thrived before there was any commercial support 
(I've personally been involved in over 20 releases of Apache Hadoop and 
deployed them while at Yahoo) and I'm sure it will in this new world order. 

We, the Apache Hadoop community, are committed to keeping Apache Hadoop 'free', 
providing support to our users and to move it forward at a rapid rate. 

Arun

On Jul 15, 2011, at 7:58 AM, Michael Segel wrote:

 
 Unfortunately the picture is a bit more confusing.
 
 Yahoo! is now HortonWorks. Their stated goal is to not have their own 
 derivative release but to sell commercial support for the official Apache 
 release.
 So those selling commercial support are:
 *Cloudera
 *HortonWorks
 *MapRTech
 *EMC (reselling MapRTech, but had announced their own)
 *IBM (not sure what they are selling exactly... still seems like smoke and 
 mirrors...)
 *DataStax 
 
 So while you can use the Apache release, it may not make sense for your 
 organization to do so. (Said as I don the flame retardant suit...)
 
 The issue is that outside of HortonWorks which is stating that they will 
 support the official Apache release, everything else is a derivative work of 
 Apache's Hadoop. From what I have seen, Cloudera's release is the closest to 
 the Apache release.
 
 Like I said, things are getting interesting.
 
 HTH
 
 -Mike
 
 
 
 From: ev...@yahoo-inc.com
 To: common-user@hadoop.apache.org
 Date: Fri, 15 Jul 2011 07:35:45 -0700
 Subject: Re: Which release to use?
 
 Adarsh,
 
 Yahoo! no longer has its own distribution of Hadoop.  It has been merged 
 into the 0.20.2XX line so 0.20.203 is what Yahoo is running internally right 
 now, and we are moving towards 0.20.204 which should be out soon.  I am not 
 an expert on Cloudera so I cannot really map its releases to the Apache 
 Releases, but their distro is based off of Apache Hadoop with a few bug 
 fixes and maybe a few features like append added in on top of it, but you 
 need to talk to Cloudera about the exact details.  For the most part they 
 are all very similar.  You need to think most about support, there are 
 several companies that can sell you support if you want/need it.  You also 
 need to think about features vs. stability.  The 0.20.203 release has been 
 tested on a lot of machines by many different groups, but may be missing 
 some features that are needed in some situations.
 
 --Bobby
 
 
 On 7/14/11 11:49 PM, Adarsh Sharma adarsh.sha...@orkash.com wrote:
 
 Hadoop releases are issued time by time. But one more thing related to
 hadoop usage,
 
 There are so many providers that provides the distribution of Hadoop ;
 
 1. Apache Hadoop
 2. Cloudera
 3. Yahoo
 
 etc.
 Which distribution is best among them on production usage.
 I think Cloudera's  is best among them.
 
 
 Best Regards,
 Adarsh
 Owen O'Malley wrote:
 On Jul 14, 2011, at 4:33 PM, Teruhiko Kurosaka wrote:
 
 
 I'm a newbie and I am confused by the Hadoop releases.
 I thought 0.21.0 is the latest  greatest release that I
 should be using but I noticed 0.20.203 has been released
 lately, and 0.21.X is marked unstable, unsupported.
 
 Should I be using 0.20.203?
 
 
 Yes, I apologize for confusing release numbering, but the best release to 
 use is 0.20.203.0. It includes security, job limits, and many other 
 improvements over 0.20.2 and 0.21.0. Unfortunately, it doesn't have the new 
 sync support so it isn't suitable for using with HBase. Most large clusters 
 use a separate version of HDFS for HBase.
 
 -- Owen
 
 
 
 
 



Re: Cluster Tuning

2011-07-15 Thread Steve Loughran

On 08/07/2011 16:25, Juan P. wrote:

Here's another thought. I realized that the reduce operation in my
map/reduce jobs is a flash. But it goes really slowly until the
mappers end. Is there a way to configure the cluster to make the reduce wait
for the map operations to complete? Especially considering my hardware
constraints.


Take a look to see if it's usually the same machine that's taking too 
long; test your HDDs to see if there are any signs of problems in the 
SMART messages. Then turn on speculation. It could be that the problem with a 
slow mapper is caused by disk problems or an overloaded server.
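
[Editor's note: if the goal is to keep reducers from launching until most maps have finished, the relevant knob in the 0.20 line is the slowstart fraction. The value below is illustrative; the default is 0.05, so reducers normally start (and begin shuffling) almost immediately.]

```xml
<!-- mapred-site.xml: don't start reducers until 90% of maps have completed,
     freeing their slots for map tasks on a small cluster. -->
<property>
  <name>mapred.reduce.slowstart.completed.maps</name>
  <value>0.90</value>
</property>
```

Note the trade-off: a higher slowstart value frees slots for mappers but delays the shuffle, so total job time can go up on clusters with spare capacity.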




Re: Which release to use?

2011-07-15 Thread Steve Loughran

On 15/07/2011 15:58, Michael Segel wrote:


Unfortunately the picture is a bit more confusing.

Yahoo! is now HortonWorks. Their stated goal is to not have their own 
derivative release but to sell commercial support for the official Apache 
release.
So those selling commercial support are:
*Cloudera
*HortonWorks
*MapRTech
*EMC (reselling MapRTech, but had announced their own)
*IBM (not sure what they are selling exactly... still seems like smoke and 
mirrors...)
*DataStax


+ Amazon, indirectly, who do their own derivative work of some release 
of Hadoop (which version is it based on?)


I've used 0.21, which was the first with the new APIs and, with MRUnit, 
has the best test framework. For my small-cluster uses, it worked well. 
(Oh, and I didn't care about security.)





Re: Which release to use?

2011-07-15 Thread Steve Loughran

On 15/07/2011 18:06, Arun C Murthy wrote:

Apache Hadoop is a volunteer driven, open-source project. The contributors to 
Apache Hadoop, both individuals and folks across a diverse set of 
organizations, are committed to driving the project forward and making timely 
releases - see discussion on hadoop-0.23 with a raft newer features such as 
HDFS Federation, NextGen MapReduce and plans for HA NameNode etc.

As with most successful projects there are several options for commercial 
support to Hadoop or its derivatives.

However, Apache Hadoop has thrived before there was any commercial support 
(I've personally been involved in over 20 releases of Apache Hadoop and 
deployed them while at Yahoo) and I'm sure it will in this new world order.

We, the Apache Hadoop community, are committed to keeping Apache Hadoop 'free', 
providing support to our users and to move it forward at a rapid rate.



Arun makes a good point which is that the Apache project depends on 
contributions from the community to thrive. That includes


 -bug reports
 -patches to fix problems
 -more tests
 -documentation improvements: more examples, more on getting started, 
troubleshooting, etc.


If there's something lacking in the codebase, and you think you can fix 
it, please do so. Helping with the documentation is a good start, as it 
can be improved, and you aren't going to break anything.


Once you get into changing the code, you'll end up working with the head 
of whichever branch you are targeting.


The other area everyone can contribute on is testing. Yes, Y! and FB can 
test at scale, and yes, other people can test large clusters too, but nobody 
but you has a network that looks like yours. And Hadoop does care about 
network configurations. Testing beta and release-candidate releases in 
your infrastructure helps verify that the final release will work at 
your site, and you don't end up getting all the phone calls about 
something not working.


Re: Which release to use?

2011-07-15 Thread Mark Kerzner
Steve,

this is so well said, do you mind if I repeat it here,
http://shmsoft.blogspot.com/2011/07/hadoop-commercial-support-options.html

Thank you,
Mark

On Fri, Jul 15, 2011 at 4:00 PM, Steve Loughran ste...@apache.org wrote:

 On 15/07/2011 15:58, Michael Segel wrote:


 Unfortunately the picture is a bit more confusing.

 Yahoo! is now HortonWorks. Their stated goal is to not have their own
 derivative release but to sell commercial support for the official Apache
 release.
 So those selling commercial support are:
 *Cloudera
 *HortonWorks
 *MapRTech
 *EMC (reselling MapRTech, but had announced their own)
 *IBM (not sure what they are selling exactly... still seems like smoke and
 mirrors...)
 *DataStax


 + Amazon, indirectly, that do their own derivative work of some release of
 Hadoop (which version is it based on?)

 I've used 0.21, which was the first with the new APIs and, with MRUnit, has
 the best test framework. For my small-cluster uses, it worked well. (oh, and
 I didn't care about security)





RE: Which release to use?

2011-07-15 Thread Michael Segel

See, I knew there was something that I forgot. 

It all goes back to the question... 'which release to use'... 

2 years ago it was a very simple decision. Now, not so much. :-)

And while Arun and Owen work for a vendor, I do not, and I try to follow each 
company and their offering. 

As Hadoop goes mainstream, the question of which vendor to choose gets 
interesting. 
Just like in the 90's during the database vendor wars, it looks like the vendor 
with the best sales force and PR will win.
(Not necessarily the best product.)

JMHO

-Mike


 Date: Fri, 15 Jul 2011 16:25:55 -0500
 Subject: Re: Which release to use?
 From: markkerz...@gmail.com
 To: common-user@hadoop.apache.org
 
 Steve,
 
 this is so well said, do you mind if I repeat it here,
 http://shmsoft.blogspot.com/2011/07/hadoop-commercial-support-options.html
 
 Thank you,
 Mark
 
 On Fri, Jul 15, 2011 at 4:00 PM, Steve Loughran ste...@apache.org wrote:
 
  On 15/07/2011 15:58, Michael Segel wrote:
 
 
  Unfortunately the picture is a bit more confusing.
 
  Yahoo! is now HortonWorks. Their stated goal is to not have their own
  derivative release but to sell commercial support for the official Apache
  release.
  So those selling commercial support are:
  *Cloudera
  *HortonWorks
  *MapRTech
  *EMC (reselling MapRTech, but had announced their own)
  *IBM (not sure what they are selling exactly... still seems like smoke and
  mirrors...)
  *DataStax
 
 
  + Amazon, indirectly, that do their own derivative work of some release of
  Hadoop (which version is it based on?)
 
  I've used 0.21, which was the first with the new APIs and, with MRUnit, has
  the best test framework. For my small-cluster uses, it worked well. (oh, and
  I didn't care about security)