Giving filename as key to mapper ?
Hi, how can I give the filename as the key to the mapper? I want to count the occurrences of words across a set of documents, so I want to keep the filename as the key. Is it possible to pass the filename as the input key in the map function? Thanks, Praveenesh
Re: Giving filename as key to mapper ?
You can retrieve the filename in the new API as described here: http://search-hadoop.com/m/ZOmmJ1PZJqt1/map+input+filenamesubj=Retrieving+Filename In the old API, it's available in the mapper's configuration instance under the key map.input.file. See the table below this section http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html#Task+JVM+Reuse for more such goodies. On Fri, Jul 15, 2011 at 5:44 PM, praveenesh kumar praveen...@gmail.com wrote: [snip] -- Harsh J
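In the new (org.apache.hadoop.mapreduce) API the input split itself carries the filename, so the mapper can read it directly. A minimal sketch (the class name and whitespace tokenization are illustrative, not from the thread):

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Emits (filename, word) pairs so a reducer can count words per document.
public class WordByFileMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // With FileInputFormat-derived inputs, the split is a FileSplit
        // and knows which file this record came from.
        Path file = ((FileSplit) context.getInputSplit()).getPath();
        for (String word : line.toString().split("\\s+")) {
            if (!word.isEmpty()) {
                context.write(new Text(file.getName()), new Text(word));
            }
        }
    }
}
```

Under the old (mapred) API, the equivalent is reading conf.get("map.input.file") inside configure(), as noted above.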
RE: Which release to use?
Will 0.23 include Kerberos authentication? Will this finally unite the Yahoo and Apache branches? -Original Message- From: Arun C Murthy [mailto:a...@hortonworks.com] Sent: Thursday, July 14, 2011 7:43 PM To: common-user@hadoop.apache.org Subject: Re: Which release to use? Hi, 0.20.203 is the latest stable release, which includes a ton of features (security - Kerberos-based authentication) and fixes. It's currently deployed on over 50k machines at Yahoo too. So, yes, I'd encourage you to use 0.20.203. We, the community, are currently working on hadoop-0.23 and hope to get it out soon. thanks, Arun On Jul 14, 2011, at 4:33 PM, Teruhiko Kurosaka wrote: I'm a newbie and I am confused by the Hadoop releases. I thought 0.21.0 was the latest, greatest release that I should be using, but I noticed 0.20.203 has been released lately, and 0.21.X is marked unstable, unsupported. Should I be using 0.20.203? T. Kuro Kurosaka
Re: Which release to use?
Isaac: there is no longer a Yahoo branch. They are committing all of their code to Apache. 2011/7/15 Isaac Dooley isaac.doo...@twosigma.com: [snip]
Re: Which release to use?
Adarsh, Yahoo! no longer has its own distribution of Hadoop. It has been merged into the 0.20.2XX line, so 0.20.203 is what Yahoo is running internally right now, and we are moving towards 0.20.204, which should be out soon. I am not an expert on Cloudera so I cannot really map its releases to the Apache releases, but their distro is based off of Apache Hadoop with a few bug fixes and maybe a few features like append added on top of it; you need to talk to Cloudera about the exact details. For the most part they are all very similar. You need to think most about support; there are several companies that can sell you support if you want/need it. You also need to think about features vs. stability. The 0.20.203 release has been tested on a lot of machines by many different groups, but may be missing some features that are needed in some situations. --Bobby On 7/14/11 11:49 PM, Adarsh Sharma adarsh.sha...@orkash.com wrote: Hadoop releases are issued from time to time. But one more thing related to Hadoop usage: there are many providers of Hadoop distributions: 1. Apache Hadoop 2. Cloudera 3. Yahoo etc. Which distribution is best among them for production usage? I think Cloudera's is best. Best Regards, Adarsh Owen O'Malley wrote: On Jul 14, 2011, at 4:33 PM, Teruhiko Kurosaka wrote: [snip] Yes. I apologize for the confusing release numbering, but the best release to use is 0.20.203.0. It includes security, job limits, and many other improvements over 0.20.2 and 0.21.0. Unfortunately, it doesn't have the new sync support, so it isn't suitable for use with HBase. Most large clusters use a separate version of HDFS for HBase. -- Owen
Re: Giving filename as key to mapper ?
To add to that: if you really want the file name to be the key, instead of just calling a different API in your map to get it, you will probably need to write your own input format. It should be fairly simple, and you can base it off of an existing input format. --Bobby On 7/15/11 7:40 AM, Harsh J ha...@cloudera.com wrote: [snip]
Re: Giving filename as key to mapper ?
I am new to this Hadoop API. Can anyone give me a tutorial or code snippet on how to write your own input format to do this kind of thing? Thanks. On Fri, Jul 15, 2011 at 8:07 PM, Robert Evans ev...@yahoo-inc.com wrote: [snip]
RE: Giving filename as key to mapper ?
If you have the source downloaded (and if you don't, I would suggest you get it) you can do a search for *InputFormat.java and you will have all the references you need. Also you might want to check out http://codedemigod.com/blog/?p=120 or take a look at the books Hadoop in Action or Hadoop: The Definitive Guide. Matt -Original Message- From: praveenesh kumar [mailto:praveen...@gmail.com] Sent: Friday, July 15, 2011 9:42 AM To: common-user@hadoop.apache.org Subject: Re: Giving filename as key to mapper ? [snip] This e-mail message may contain privileged and/or confidential information, and is intended to be received only by persons entitled to receive such information. 
If you have received this e-mail in error, please notify the sender immediately. Please delete it and all attachments from any servers, hard drives or any other media. Other use of this e-mail by you is strictly prohibited. All e-mails and attachments sent and received are subject to monitoring, reading and archival by Monsanto, including its subsidiaries. The recipient of this e-mail is solely responsible for checking for the presence of Viruses or other Malware. Monsanto, along with its subsidiaries, accepts no liability for any damage caused by any such code transmitted by or accompanying this e-mail or any attachment. The information contained in this email may be subject to the export control laws and regulations of the United States, potentially including but not limited to the Export Administration Regulations (EAR) and sanctions regulations issued by the U.S. Department of Treasury, Office of Foreign Asset Controls (OFAC). As a recipient of this information you are obligated to comply with all applicable U.S. export laws and regulations.
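A minimal sketch of the custom input format discussed in this thread, written against the new (org.apache.hadoop.mapreduce) API (class names are illustrative, not taken from the references above). It wraps the stock LineRecordReader and swaps the usual byte-offset key for the source filename:

```java
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Key = source filename, value = the line itself.
public class FileNameInputFormat extends FileInputFormat<Text, Text> {
    @Override
    public RecordReader<Text, Text> createRecordReader(InputSplit split,
            TaskAttemptContext context) {
        return new FileNameRecordReader();
    }

    static class FileNameRecordReader extends RecordReader<Text, Text> {
        private final LineRecordReader lines = new LineRecordReader();
        private Text fileName;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context)
                throws IOException, InterruptedException {
            lines.initialize(split, context);
            // The filename is fixed for the whole split.
            fileName = new Text(((FileSplit) split).getPath().getName());
        }

        @Override
        public boolean nextKeyValue() throws IOException, InterruptedException {
            return lines.nextKeyValue();
        }

        @Override
        public Text getCurrentKey() {
            return fileName;
        }

        @Override
        public Text getCurrentValue() throws IOException, InterruptedException {
            return lines.getCurrentValue();
        }

        @Override
        public float getProgress() throws IOException, InterruptedException {
            return lines.getProgress();
        }

        @Override
        public void close() throws IOException {
            lines.close();
        }
    }
}
```

It would be wired in with job.setInputFormatClass(FileNameInputFormat.class) in the driver; the stock *InputFormat.java sources mentioned above follow the same shape.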
Re: Issue with MR code not scaling correctly with data sizes
Please don't cross-post. I put common-user in BCC. I really don't know for sure what is happening, especially without the code or more to go on, and debugging something remotely over e-mail is extremely difficult. You are essentially doing a cross, which is going to be very expensive no matter what you do. But I do have a few questions for you.
1. How large is the ID file(s) you are using? Have you updated the amount of heap the JVM has, and the number of slots, to accommodate it?
2. How are you storing the IDs in RAM to do the join?
3. Have you tried logging in your map/reduce code to verify that the number of entries you expect are being loaded at each stage?
4. Along with that, have you looked at the counters for your map/reduce program to verify that the expected number of records is flowing through the system?
--Bobby On 7/14/11 5:14 PM, GOEKE, MATTHEW (AG/1000) matthew.go...@monsanto.com wrote: All, I have a MR program that I feed in a list of IDs and it generates the unique comparison set as a result. Example: if I have a list {1,2,3,4,5} then the resulting output would be {2x1, 3x2, 3x1, 4x3, 4x2, 4x1, 5x4, 5x3, 5x2, 5x1}, or (n^2-n)/2 comparisons. My code works just fine on smaller scaled sets (I can verify less than 1000 fairly easily) but fails when I try to push the set to 10-20k IDs, which is annoying when the end goal is 1-10 million. 
The flow of the program is:
1) Partition the IDs evenly, based on amount of output per value, into a set of keys equal to the number of reduce slots we currently have
2) Use the distributed cache to push the ID file out to the various reducers
3) In the setup of the reducer, populate an int array with the values from the ID file in distributed cache
4) Output a comparison only if the current ID from the values iterator is greater than the current iterator through the int array
I realize that this could be done many other ways, but this will be part of an Oozie workflow so it made sense to just do it in MR for now. My issue is that when I try the larger ID files it only outputs part of the resulting data set, and there are no errors to be found. Part of me thinks that I need to tweak some site configuration properties, due to the size of data that is spilling to disk, but after scanning through all 3 sites I am having trouble pinpointing anything I think could be causing this. I moved from reading the file from HDFS to using the distributed cache for the join read, thinking that might solve my problem, but there seems to be something else I am overlooking. Any advice is greatly appreciated! Matt
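The (n^2-n)/2 arithmetic can be sanity-checked locally before chasing cluster configuration. A self-contained sketch (plain Java, illustrative rather than the poster's code) that generates the unique comparison set for a small ID list and checks its size against the closed form:

```java
import java.util.ArrayList;
import java.util.List;

// Generates the unique comparison set for a list of IDs and checks its
// size against the closed form (n^2 - n) / 2.
public class UniquePairs {

    static long pairCount(long n) {
        return (n * n - n) / 2;
    }

    static List<int[]> pairs(int[] ids) {
        List<int[]> out = new ArrayList<int[]>();
        for (int i = 0; i < ids.length; i++) {
            for (int j = 0; j < i; j++) {
                // Emit each unordered pair exactly once, later ID first,
                // mirroring the "output only if the current ID is greater"
                // rule described in the thread.
                out.add(new int[] { ids[i], ids[j] });
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] ids = { 1, 2, 3, 4, 5 };
        System.out.println(pairs(ids).size());      // 10 pairs for n = 5
        System.out.println(pairCount(ids.length));  // matches the closed form
    }
}
```

For 20,000 IDs the formula already gives 199,990,000 output records, so comparing the job's output-record counters against that number is a quick first check for the truncated output described above.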
RE: Which release to use?
Unfortunately the picture is a bit more confusing. Yahoo! is now HortonWorks. Their stated goal is to not have their own derivative release but to sell commercial support for the official Apache release. So those selling commercial support are:
* Cloudera
* HortonWorks
* MapRTech
* EMC (reselling MapRTech, but had announced their own)
* IBM (not sure what they are selling exactly... still seems like smoke and mirrors...)
* DataStax
So while you can use the Apache release, it may not make sense for your organization to do so. (Said as I don the flame-retardant suit...) The issue is that outside of HortonWorks, which is stating that they will support the official Apache release, everything else is a derivative work of Apache's Hadoop. From what I have seen, Cloudera's release is the closest to the Apache release. Like I said, things are getting interesting. HTH -Mike From: ev...@yahoo-inc.com To: common-user@hadoop.apache.org Date: Fri, 15 Jul 2011 07:35:45 -0700 Subject: Re: Which release to use? [snip]
Re: Which release to use?
On Jul 15, 2011, at 7:58 AM, Michael Segel wrote: So while you can use the Apache release, it may not make sense for your organization to do so. (Said as I don the flame retardant suit...) I obviously disagree. *grin* Apache Hadoop 0.20.203.0 is the most stable and well-tested release; it has been deployed on Yahoo's 40,000 Hadoop machines, in clusters of up to 4,500 machines, and has been used extensively for running production workloads. We are actively working to make the install and deployment of Apache Hadoop easier. In terms of commercial support, HortonWorks is absolutely supporting the Apache releases. IBM is also supporting the Apache releases: http://davidmenninger.ventanaresearch.com/2011/05/18/ibm-chooses-hadoop-unity-not-shipping-the-elephant/ So lack of commercial support isn't a problem... -- Owen
RE: Which release to use?
One quick clarification - IBM GA'd a product called BigInsights in 2Q. It faithfully uses the Hadoop stack and many related projects, but provides a number of (compatible) extensions based on customer requests. Not appropriate to say any more on this list, but the info on it is all publicly available. Tom Deutsch Program Director CTO Office: Information Management Hadoop Product Manager / Customer Exec IBM 3565 Harbor Blvd Costa Mesa, CA 92626-1420 tdeut...@us.ibm.com Michael Segel michael_se...@hotmail.com 07/15/2011 07:58 AM wrote: [snip]
Re: Which release to use?
Apache Hadoop is a volunteer-driven, open-source project. The contributors to Apache Hadoop, both individuals and folks across a diverse set of organizations, are committed to driving the project forward and making timely releases - see the discussion on hadoop-0.23, with a raft of newer features such as HDFS Federation, NextGen MapReduce, and plans for an HA NameNode. As with most successful projects, there are several options for commercial support for Hadoop or its derivatives. However, Apache Hadoop thrived before there was any commercial support (I've personally been involved in over 20 releases of Apache Hadoop and deployed them while at Yahoo) and I'm sure it will in this new world order. We, the Apache Hadoop community, are committed to keeping Apache Hadoop 'free', providing support to our users, and moving it forward at a rapid rate. Arun On Jul 15, 2011, at 7:58 AM, Michael Segel wrote: [snip]
Re: Cluster Tuning
On 08/07/2011 16:25, Juan P. wrote: Here's another thought. I realized that the reduce operation in my map/reduce jobs is a flash. But it goes really slow until the mappers end. Is there a way to configure the cluster to make the reduce wait for the map operations to complete? Especially considering my hardware constraints. Take a look to see if it's usually the same machine that's taking too long; test your HDDs to see if there are any signs of problems in the SMART messages. Then turn on speculation. The problem with a slow mapper could be caused by disk problems or an overloaded server.
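Turning the question above into configuration: in the 0.20 line, both the reducer launch point and speculation are job properties. A sketch of the relevant knobs (0.20-era property names; whether these values suit any particular cluster is an assumption):

```java
import org.apache.hadoop.conf.Configuration;

// Illustrative job-setup fragment, 0.20-era property names.
public class TuningSketch {
    public static Configuration tuned() {
        Configuration conf = new Configuration();
        // Reducers normally launch once 5% of maps are done and then sit
        // in shuffle; raising this to 1.0 holds them back until every map
        // has finished, freeing slots for the slow map phase.
        conf.setFloat("mapred.reduce.slowstart.completed.maps", 1.0f);
        // Speculative execution re-runs straggling tasks on another node,
        // which masks a single slow disk or an overloaded machine.
        conf.setBoolean("mapred.map.tasks.speculative.execution", true);
        conf.setBoolean("mapred.reduce.tasks.speculative.execution", true);
        return conf;
    }
}
```

The same keys can equally be set in mapred-site.xml for a cluster-wide default.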
Re: Which release to use?
On 15/07/2011 15:58, Michael Segel wrote: [list of commercial support options snipped] + Amazon, indirectly, who do their own derivative work of some release of Hadoop (which version is it based on?). I've used 0.21, which was the first with the new APIs and, with MRUnit, has the best test framework. For my small-cluster uses, it worked well. (Oh, and I didn't care about security.)
Re: Which release to use?
On 15/07/2011 18:06, Arun C Murthy wrote: [snip] Arun makes a good point, which is that the Apache project depends on contributions from the community to thrive. That includes:
- bug reports
- patches to fix problems
- more tests
- documentation improvements: more examples, more on getting started, troubleshooting, etc.
If there's something lacking in the codebase, and you think you can fix it, please do so. Helping with the documentation is a good start, as it can be improved, and you aren't going to break anything. Once you get into changing the code, you'll end up working with the head of whichever branch you are targeting. The other area everyone can contribute on is testing. Yes, Y! and FB can test at scale, and yes, other people can test large clusters too - but nobody has a network that looks like yours but you. And Hadoop does care about network configurations. Testing beta and release-candidate builds in your infrastructure helps verify that the final release will work on your site, and you don't end up getting all the phone calls about something not working.
Re: Which release to use?
Steve, this is so well said. Do you mind if I repeat it here: http://shmsoft.blogspot.com/2011/07/hadoop-commercial-support-options.html ? Thank you, Mark On Fri, Jul 15, 2011 at 4:00 PM, Steve Loughran ste...@apache.org wrote: [snip]
RE: Which release to use?
See, I knew there was something that I forgot. It all goes back to the question... 'which release to use'... 2 years ago it was a very simple decision. Now, not so much. :-) And while Arun and Owen work for a vendor, I do not, and I try to follow each company and its offering. As Hadoop goes mainstream, the question of which vendor to choose gets interesting. Just like in the 90's during the database vendor wars, it looks like the vendor with the best sales force and PR will win. (Not necessarily the best product.) JMHO -Mike Date: Fri, 15 Jul 2011 16:25:55 -0500 Subject: Re: Which release to use? From: markkerz...@gmail.com To: common-user@hadoop.apache.org [snip]