Re: Hadoop Cookbook?
Mark Kerzner wrote: Hi, guys, I think that there is a need for a collection of Hadoop exercises. The great books out there teach you how to use Hadoop, but the Hadoop Cookbook is missing. If people can submit their solutions, I can become an editor - or a group of editors can do it - but there are lots of people out there who have designed interesting solutions that they could share. Cheers, Mark

Would be good on the Apache Hadoop wiki.
Re: Hadoop Cookbook?
Thank you. On Tue, May 4, 2010 at 4:52 AM, Steve Loughran ste...@apache.org wrote: Mark Kerzner wrote: Hi, guys, I think that there is a need for a collection of Hadoop exercises. The great books out there teach you how to use Hadoop, but the Hadoop Cookbook is missing. If people can submit their solutions, I can become an editor - or a group of editors can do it - but there are lots of people out there who have designed interesting solutions that they could share. Cheers, Mark

Would be good on the Apache Hadoop wiki.
Doubt: Using PBS to run mapreduce jobs.
Hi, I have been given an account on a cluster that uses OpenPBS as the cluster management software. The only way I can run a job is by submitting it to OpenPBS. How can I run MapReduce programs on it? Is there a possible workaround? Thanks, Udaya.
Need a Jira?
Hi, I came across something ugly. I'm using the latest Hadoop version in Cloudera's CDH2: Hadoop 0.20.1+169.68 (at least I think it's the latest version in CDH2). I noticed that when I instantiate a JobClient() passing in a Configuration object, I have to cast it to the deprecated class (JobConf). Is this something that should be updated, or is this fixed in the next Cloudera (CDH3) release? Thx -Mike
Re: having a directory as input split
One way to do this: create a DirectoryInputFormat that accepts the list of directories as input and emits each directory path as one split. Your custom RecordReader can then read this split and generate appropriate input for your mapper. Thanks and Regards, Sonal www.meghsoft.com On Fri, Apr 30, 2010 at 11:48 AM, akhil1988 akhilan...@gmail.com wrote: How can I make a directory an InputSplit rather than a file? I want the input split available to a map task to be a directory, not a file, and I will implement my own record reader which will read appropriate data from the directory and thus give the records to the map tasks. To put it another way: I have a list of directories distributed over HDFS, and I know that each of these directories is small enough to be present on a single node. I want one directory to be given to each map task rather than the files present in it. How can I do this? Thanks, Akhil
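A minimal sketch of that approach against the old (org.apache.hadoop.mapred) API follows. DirectoryInputFormat is the name Sonal suggests; the toy record reader here just emits one record per file in the directory (file name -> full path) and is an illustrative stand-in for whatever per-directory parsing the job actually needs. It also passes no locality hints; a real implementation would hand the FileSplit the hosts holding the directory's blocks.

  import java.io.IOException;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.*;

  // One split per input directory, so each map task receives a single directory.
  public class DirectoryInputFormat extends FileInputFormat<Text, Text> {

    @Override
    public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
      Path[] dirs = getInputPaths(job); // the input paths are the directories themselves
      InputSplit[] splits = new InputSplit[dirs.length];
      for (int i = 0; i < dirs.length; i++) {
        splits[i] = new FileSplit(dirs[i], 0, 0, new String[0]); // no locality hints
      }
      return splits;
    }

    @Override
    public RecordReader<Text, Text> getRecordReader(InputSplit split, JobConf job,
        Reporter reporter) throws IOException {
      final Path dir = ((FileSplit) split).getPath();
      final FileStatus[] files = dir.getFileSystem(job).listStatus(dir);
      return new RecordReader<Text, Text>() {
        private int idx = 0;
        public boolean next(Text key, Text value) throws IOException {
          if (idx >= files.length) return false;
          key.set(files[idx].getPath().getName());
          value.set(files[idx].getPath().toString());
          idx++;
          return true;
        }
        public Text createKey() { return new Text(); }
        public Text createValue() { return new Text(); }
        public long getPos() { return idx; }
        public float getProgress() {
          return files.length == 0 ? 1.0f : (float) idx / files.length;
        }
        public void close() {}
      };
    }
  }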
Re: Need a Jira?
On Tue, May 4, 2010 at 10:50 AM, Michael Segel michael_se...@hotmail.com wrote: Hi, I came across something ugly. I'm using the latest Hadoop version in Cloudera's CDH2: Hadoop 0.20.1+169.68 (at least I think it's the latest version in CDH2). I noticed that when I instantiate a JobClient() passing in a Configuration object, I have to cast it to the deprecated class (JobConf). Is this something that should be updated, or is this fixed in the next Cloudera (CDH3) release? The reason for the problem here is that JobClient is from the old (0.18) API and thus has no understanding of Configuration. You can initialize a JobConf from a Configuration, which avoids the cast: JobConf conf = new JobConf(new Configuration()); This isn't a bug so much as confusion between the new and old APIs. As the new APIs become more feature-complete (probably at or around 0.21), the recommendation will be to prefer those. There has been discussion around un-deprecating the old APIs. -- Eric Sammer phone: +1-917-287-2675 twitter: esammer data: www.cloudera.com
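In code, the pattern Eric describes looks like the following minimal sketch, assuming the 0.20 mapred API (the class name JobClientExample is illustrative):

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;

  public class JobClientExample {
    public static void main(String[] args) throws IOException {
      Configuration conf = new Configuration();
      // Wrap instead of casting: JobConf has a copy constructor taking a Configuration.
      JobConf jobConf = new JobConf(conf);
      JobClient jc = new JobClient(jobConf); // connects using the wrapped settings
      jc.close();
    }
  }

The deprecation warning on JobConf remains, but wrapping rather than casting at least makes the old/new API boundary explicit.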
RE: Need a Jira?
Date: Tue, 4 May 2010 11:03:48 -0400 Subject: Re: Need a Jira? From: esam...@cloudera.com To: common-user@hadoop.apache.org The reason for the problem here is that JobClient is from the old (0.18) API and thus has no understanding of Configuration. You can initialize a JobConf from a Configuration, which avoids the cast: JobConf conf = new JobConf(new Configuration()); This isn't a bug so much as confusion between the new and old APIs. As the new APIs become more feature-complete (probably at or around 0.21), the recommendation will be to prefer those. There has been discussion around un-deprecating the old APIs. Well, that's why I asked about creating a Jira. Here's the code: jc = new JobClient(new JobConf(conf)); conf is actually an instance of Configuration, which is what we are *supposed* to use. ;-) Of course, JobConf has an ugly 'strikeout' through it, and that's what I meant by ugly. I wonder whether a better interface to the JobTracker than JobClient is planned? (Not that I'm complaining. It does what I need...) I would hope that JobClient gets refactored to know about Configuration. :-) Thx -Mike
Applying HDFS-630 patch to hadoop-0.20.2 tarball release?
I am currently testing a rollout of HBase 0.20.3 on top of Hadoop 0.20.2. The HBase documentation recommends that the HDFS-630 patch be applied. I realize this is a newbie-ish question, but has anyone done this with the tarball Hadoop 0.20.2 release? Since this is a specific recommendation by the HBase release, I think a walk-through would be quite useful for anyone else similarly coming up the Hadoop + HBase learning curve. (I'm afraid I've been away from the Linux / DB / systems world for far too long, nearly a decade, and I've come back to work to a very changed landscape. But I digress...) Thanks in advance. Joseph
Re: Doubt: Using PBS to run mapreduce jobs.
HOD supports a PBS environment, namely Torque. Torque is the vastly improved fork of OpenPBS. You may be able to get HOD working on OpenPBS, or better still persuade your cluster admins to upgrade to a more recent version of Torque (e.g. at least 2.1.x). Craig On May 4, 2010, at 7:46 AM, Udaya Lakshmi wrote: Hi, I have been given an account on a cluster that uses OpenPBS as the cluster management software. The only way I can run a job is by submitting it to OpenPBS. How can I run MapReduce programs on it? Is there a possible workaround? Thanks, Udaya.
Re: Applying HDFS-630 patch to hadoop-0.20.2 tarball release?
Hi Joseph, You'll have to apply the patch with patch -p0 < foo.patch and then recompile using ant. If you want to avoid this you can grab the CDH2 tarball here: http://archive.cloudera.com/cdh/2/ - it includes the HDFS-630 patch. Thanks -Todd On Tue, May 4, 2010 at 9:38 AM, Joseph Chiu joec...@joechiu.com wrote: I am currently testing a rollout of HBase 0.20.3 on top of Hadoop 0.20.2. The HBase documentation recommends that the HDFS-630 patch be applied. I realize this is a newbie-ish question, but has anyone done this with the tarball Hadoop 0.20.2 release? Since this is a specific recommendation by the HBase release, I think a walk-through would be quite useful for anyone else similarly coming up the Hadoop + HBase learning curve. (I'm afraid I've been away from the Linux / DB / systems world for far too long, nearly a decade, and I've come back to work to a very changed landscape. But I digress...) Thanks in advance. Joseph -- Todd Lipcon Software Engineer, Cloudera
Re: Applying HDFS-630 patch to hadoop-0.20.2 tarball release?
Thanks, Todd. Where I really need help is in getting up to speed on the process of recompiling (and re-installing the build outputs) with ant. Cheers, Joseph On Tue, May 4, 2010 at 9:48 AM, Todd Lipcon t...@cloudera.com wrote: Hi Joseph, You'll have to apply the patch with patch -p0 < foo.patch and then recompile using ant. If you want to avoid this you can grab the CDH2 tarball here: http://archive.cloudera.com/cdh/2/ - it includes the HDFS-630 patch. Thanks -Todd On Tue, May 4, 2010 at 9:38 AM, Joseph Chiu joec...@joechiu.com wrote: I am currently testing a rollout of HBase 0.20.3 on top of Hadoop 0.20.2. The HBase documentation recommends that the HDFS-630 patch be applied. I realize this is a newbie-ish question, but has anyone done this with the tarball Hadoop 0.20.2 release? (I'm afraid I've been away from the Linux / DB / systems world for far too long, nearly a decade, and I've come back to work to a very changed landscape. But I digress...) Thanks in advance. Joseph -- Todd Lipcon Software Engineer, Cloudera
Re: Doubt: Using PBS to run mapreduce jobs.
Thank you, Craig. My cluster has Torque. Can you please point me to something with a detailed explanation of using HOD on Torque? On Tue, May 4, 2010 at 10:17 PM, Craig Macdonald cra...@dcs.gla.ac.uk wrote: HOD supports a PBS environment, namely Torque. Torque is the vastly improved fork of OpenPBS. You may be able to get HOD working on OpenPBS, or better still persuade your cluster admins to upgrade to a more recent version of Torque (e.g. at least 2.1.x). Craig On May 4, 2010, at 7:46 AM, Udaya Lakshmi wrote: Hi, I have been given an account on a cluster that uses OpenPBS as the cluster management software. The only way I can run a job is by submitting it to OpenPBS. How can I run MapReduce programs on it? Is there a possible workaround? Thanks, Udaya.
Re: Doubt: Using PBS to run mapreduce jobs.
Udaya, The following link will help you with HOD on Torque: http://hadoop.apache.org/common/docs/r0.20.0/hod_user_guide.html Thanks, --- Peeyush On Tue, 2010-05-04 at 22:49 +0530, Udaya Lakshmi wrote: Thank you, Craig. My cluster has Torque. Can you please point me to something with a detailed explanation of using HOD on Torque? On Tue, May 4, 2010 at 10:17 PM, Craig Macdonald cra...@dcs.gla.ac.uk wrote: HOD supports a PBS environment, namely Torque. Torque is the vastly improved fork of OpenPBS. You may be able to get HOD working on OpenPBS, or better still persuade your cluster admins to upgrade to a more recent version of Torque (e.g. at least 2.1.x). Craig On May 4, 2010, at 7:46 AM, Udaya Lakshmi wrote: Hi, I have been given an account on a cluster that uses OpenPBS as the cluster management software. The only way I can run a job is by submitting it to OpenPBS. How can I run MapReduce programs on it? Is there a possible workaround? Thanks, Udaya.
Re: Applying HDFS-630 patch to hadoop-0.20.2 tarball release?
On Tue, May 4, 2010 at 10:03 AM, Joseph Chiu joec...@joechiu.com wrote: Thanks, Todd. Where I really need help is in getting up to speed on the process of recompiling (and re-installing the build outputs) with ant. The place to look is the wiki: http://wiki.apache.org/hadoop/HowToRelease It walks through the build process very well. -- Owen
Re: Applying HDFS-630 patch to hadoop-0.20.2 tarball release?
Thanks! On Tue, May 4, 2010 at 11:14 AM, Owen O'Malley owen.omal...@gmail.com wrote: On Tue, May 4, 2010 at 10:03 AM, Joseph Chiu joec...@joechiu.com wrote: Thanks, Todd. Where I really need help is in getting up to speed on the process of recompiling (and re-installing the build outputs) with ant. The place to look is the wiki: http://wiki.apache.org/hadoop/HowToRelease It walks through the build process very well. -- Owen
RE: Hadoop User Group - May 19th at Yahoo!
Hi, The agenda is available for the upcoming HUG. Hope to see you all there. http://www.meetup.com/hadoop/calendar/13048582/ Thanks, Dekel Register today for Hadoop Summit 2010, June 29th, Hyatt, Santa Clara, CA: http://hadoopsummit2010.eventbrite.com/ Presentation submission deadline extended until May 10th: http://developer.yahoo.com/events/hadoopsummit2010/presentationguidelines.html
Re: Doubt: Using PBS to run mapreduce jobs.
On May 4, 2010, at 7:46 AM, Udaya Lakshmi wrote: Hi, I have been given an account on a cluster that uses OpenPBS as the cluster management software. The only way I can run a job is by submitting it to OpenPBS. How can I run MapReduce programs on it? Is there a possible workaround? Take a look at Hadoop on Demand. It was built with Torque in mind, but any PBS system should work with few changes.
Re: Doubt: Using PBS to run mapreduce jobs.
Thank you. Udaya. On Wed, May 5, 2010 at 12:23 AM, Allen Wittenauer awittena...@linkedin.com wrote: On May 4, 2010, at 7:46 AM, Udaya Lakshmi wrote: Hi, I have been given an account on a cluster that uses OpenPBS as the cluster management software. The only way I can run a job is by submitting it to OpenPBS. How can I run MapReduce programs on it? Is there a possible workaround? Take a look at Hadoop on Demand. It was built with Torque in mind, but any PBS system should work with few changes.
about CombineFileInputFormat
Hi, I tried to use CombineFileInputFormat in 0.20.2. It seems I need to extend it because it is an abstract class, and I need to implement the getRecordReader method in my subclass. May I ask how to implement this getRecordReader method? I tried something like this:

  public RecordReader getRecordReader(InputSplit genericSplit, JobConf job,
      Reporter reporter) throws IOException {
    reporter.setStatus(genericSplit.toString());
    return new CombineFileRecordReader(job, (CombineFileSplit) genericSplit,
        reporter, CombineFileRecordReader.class);
  }

It doesn't seem to be working. I would appreciate it if someone could shed some light on this. Thanks, zhenyu
new to hadoop
Hi, I am trying to set up a small Hadoop cluster with 6 machines. The problem I have now is that if I set the memory allocated to a task low (e.g. -Xmx512m), the application does not run; if I set it higher, some machines in the cluster have very little memory (1 or 2 GB), and when the computation gets intensive Hadoop creates so many tasks and sends them to these weaker machines that it brings the whole cluster down. My question is whether it is possible to specify -Xmx for each machine in the cluster and how many tasks can run on a machine - or what is the optimal setting in this situation? Thanks for your help, Tom
Accepting contributions for the Hadoop in Practice book
Hi, guys, I am working on this book for Manning (http://www.manning.com/), and I need your solutions. If you had a specific problem that you solved with Hadoop and you can share your solution, even in general terms, I will accept it from you and put it in the book. You will be mentioned as the person/company who contributed that specific solution. Contributions about Pig, Hive, scaling, etc. are also welcome. It does not have to be a formal documented description; a few written ideas are enough, a phone conversation where you explain the problem and the solution will also work, and if you can point me to something already out on the web, that would be great; for example, dealing with many small files (http://www.cloudera.com/blog/2009/02/the-small-files-problem/). Thank you. Sincerely, Mark
Re: new to hadoop
How much RAM? With 6-8 GB of RAM you can go for 4 mappers and 2 reducers (this is my personal guess). - Ravi On 5/4/10 4:33 PM, Tamas Jambor jambo...@googlemail.com wrote: Thank you. So what would be the optimal setting for mapred.map.tasks and mapred.reduce.tasks, say, on a dual-core machine? Tom On 05/05/2010 00:12, Ravi Phulari wrote: You can configure the configuration files (conf/hadoop-env.sh) on each node to specify -Xmx values. You can use conf/mapred-site.xml to configure the default mappers and reducers running on a node:

  <property>
    <name>mapred.map.tasks</name>
    <value>2</value>
    <description>The default number of map tasks per job. Ignored when mapred.job.tracker is "local".</description>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>1</value>
    <description>The default number of reduce tasks per job. Typically set to 99% of the cluster's reduce capacity, so that if a node fails the reduces can still be executed in a single wave. Ignored when mapred.job.tracker is "local".</description>
  </property>

- Ravi On 5/4/10 3:54 PM, jamborta jambo...@gmail.com wrote: Hi, I am trying to set up a small Hadoop cluster with 6 machines. The problem I have now is that if I set the memory allocated to a task low (e.g. -Xmx512m), the application does not run; if I set it higher, some machines in the cluster have very little memory (1 or 2 GB), and when the computation gets intensive Hadoop creates so many tasks and sends them to these weaker machines that it brings the whole cluster down. My question is whether it is possible to specify -Xmx for each machine in the cluster and how many tasks can run on a machine - or what is the optimal setting in this situation? Thanks for your help, Tom
Re: new to hadoop
Thank you. So what would be the optimal setting for mapred.map.tasks and mapred.reduce.tasks, say, on a dual-core machine? Tom On 05/05/2010 00:12, Ravi Phulari wrote: You can configure the configuration files (conf/hadoop-env.sh) on each node to specify -Xmx values. You can use conf/mapred-site.xml to configure the default mappers and reducers running on a node:

  <property>
    <name>mapred.map.tasks</name>
    <value>2</value>
    <description>The default number of map tasks per job. Ignored when mapred.job.tracker is "local".</description>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>1</value>
    <description>The default number of reduce tasks per job. Typically set to 99% of the cluster's reduce capacity, so that if a node fails the reduces can still be executed in a single wave. Ignored when mapred.job.tracker is "local".</description>
  </property>

- Ravi On 5/4/10 3:54 PM, jamborta jambo...@gmail.com wrote: Hi, I am trying to set up a small Hadoop cluster with 6 machines. The problem I have now is that if I set the memory allocated to a task low (e.g. -Xmx512m), the application does not run; if I set it higher, some machines in the cluster have very little memory (1 or 2 GB), and when the computation gets intensive Hadoop creates so many tasks and sends them to these weaker machines that it brings the whole cluster down. My question is whether it is possible to specify -Xmx for each machine in the cluster and how many tasks can run on a machine - or what is the optimal setting in this situation? Thanks for your help, Tom
Re: new to hadoop
Great, thank you. I'll set it up that way. Tom On 05/05/2010 00:37, Ravi Phulari wrote: How much RAM? With 6-8 GB of RAM you can go for 4 mappers and 2 reducers (this is my personal guess). - Ravi On 5/4/10 4:33 PM, Tamas Jambor jambo...@googlemail.com wrote: Thank you. So what would be the optimal setting for mapred.map.tasks and mapred.reduce.tasks, say, on a dual-core machine? Tom On 05/05/2010 00:12, Ravi Phulari wrote: You can configure the configuration files (conf/hadoop-env.sh) on each node to specify -Xmx values. You can use conf/mapred-site.xml to configure the default mappers and reducers running on a node:

  <property>
    <name>mapred.map.tasks</name>
    <value>2</value>
    <description>The default number of map tasks per job. Ignored when mapred.job.tracker is "local".</description>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>1</value>
    <description>The default number of reduce tasks per job. Typically set to 99% of the cluster's reduce capacity, so that if a node fails the reduces can still be executed in a single wave. Ignored when mapred.job.tracker is "local".</description>
  </property>

- Ravi On 5/4/10 3:54 PM, jamborta jambo...@gmail.com wrote: Hi, I am trying to set up a small Hadoop cluster with 6 machines. The problem I have now is that if I set the memory allocated to a task low (e.g. -Xmx512m), the application does not run; if I set it higher, some machines in the cluster have very little memory (1 or 2 GB), and when the computation gets intensive Hadoop creates so many tasks and sends them to these weaker machines that it brings the whole cluster down. My question is whether it is possible to specify -Xmx for each machine in the cluster and how many tasks can run on a machine - or what is the optimal setting in this situation? Thanks for your help, Tom
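For reference, a rough job-level equivalent of those properties through the 0.20 JobConf API is sketched below; the heap value and class name are illustrative. Note that mapred.map.tasks is only a hint (the actual number of maps follows the input splits), and the per-node task maximums and per-node -Xmx discussed above belong in each node's own conf/mapred-site.xml and conf/hadoop-env.sh rather than in job code.

  import org.apache.hadoop.mapred.JobConf;

  public class TaskSettingsExample {
    public static void main(String[] args) {
      JobConf conf = new JobConf();
      conf.setNumMapTasks(2);    // sets mapred.map.tasks; a hint, not a hard limit
      conf.setNumReduceTasks(1); // sets mapred.reduce.tasks
      // Per-task child JVM heap (illustrative value).
      conf.set("mapred.child.java.opts", "-Xmx512m");
    }
  }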
Re: about CombineFileInputFormat
See the patch on https://issues.apache.org/jira/browse/MAPREDUCE-364 as an example. -Amareshwari On 5/5/10 1:52 AM, Zhenyu Zhong zhongresea...@gmail.com wrote: Hi, I tried to use CombineFileInputFormat in 0.20.2. It seems I need to extend it because it is an abstract class, and I need to implement the getRecordReader method in my subclass. May I ask how to implement this getRecordReader method? I tried something like this:

  public RecordReader getRecordReader(InputSplit genericSplit, JobConf job,
      Reporter reporter) throws IOException {
    reporter.setStatus(genericSplit.toString());
    return new CombineFileRecordReader(job, (CombineFileSplit) genericSplit,
        reporter, CombineFileRecordReader.class);
  }

It doesn't seem to be working. I would appreciate it if someone could shed some light on this. Thanks, zhenyu
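To make that concrete, here is a minimal sketch in the old (mapred) API, modeled loosely on the MultiFileWordCount example from that patch; MyCombineInputFormat and SingleFileLineReader are illustrative names. The detail the original code misses is that the class handed to CombineFileRecordReader must be a per-file RecordReader exposing a (CombineFileSplit, Configuration, Reporter, Integer) constructor, which CombineFileRecordReader instantiates reflectively for each file in the split; passing CombineFileRecordReader.class itself cannot satisfy that.

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.*;
  import org.apache.hadoop.mapred.lib.CombineFileInputFormat;
  import org.apache.hadoop.mapred.lib.CombineFileRecordReader;
  import org.apache.hadoop.mapred.lib.CombineFileSplit;

  public class MyCombineInputFormat extends CombineFileInputFormat<LongWritable, Text> {

    @Override
    @SuppressWarnings("unchecked")
    public RecordReader<LongWritable, Text> getRecordReader(InputSplit split,
        JobConf job, Reporter reporter) throws IOException {
      // The fourth argument names the per-file reader, not CombineFileRecordReader itself.
      return new CombineFileRecordReader<LongWritable, Text>(job,
          (CombineFileSplit) split, reporter, (Class) SingleFileLineReader.class);
    }

    // Per-file reader: instantiated reflectively, once per file in the combined split,
    // so it must have exactly this constructor signature. Here it simply delegates to
    // LineRecordReader for the single file at index idx.
    public static class SingleFileLineReader implements RecordReader<LongWritable, Text> {
      private final LineRecordReader reader;

      public SingleFileLineReader(CombineFileSplit split, Configuration conf,
          Reporter reporter, Integer idx) throws IOException {
        FileSplit fileSplit = new FileSplit(split.getPath(idx), split.getOffset(idx),
            split.getLength(idx), split.getLocations());
        reader = new LineRecordReader(conf, fileSplit);
      }

      public boolean next(LongWritable key, Text value) throws IOException {
        return reader.next(key, value);
      }
      public LongWritable createKey() { return reader.createKey(); }
      public Text createValue() { return reader.createValue(); }
      public long getPos() throws IOException { return reader.getPos(); }
      public float getProgress() throws IOException { return reader.getProgress(); }
      public void close() throws IOException { reader.close(); }
    }
  }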