Re: Reduce doesn't start until map finishes
So, is there currently no solution to my problem? Should I live with it? Or should we file a JIRA for this? What do you think?

2009/3/4 Nick Cen
> Thanks. About the "Secondary Sort", can you provide an example? What do the intermediate keys stand for?
>
> Assume I have two mappers, m1 and m2. The output of m1 is (k1,v1),(k2,v2) and the output of m2 is (k1,v3),(k2,v4). Assume k1 and k2 belong to the same partition and k1 < k2, so I think the order inside the reducer may be:
> (k1,v1)
> (k1,v3)
> (k2,v2)
> (k2,v4)
>
> Can the Secondary Sort change this order?
>
> 2009/3/4 Chris Douglas
> > The output of each map is sorted by partition and by key within that partition. The reduce merges sorted map output assigned to its partition into the reduce. The following may be helpful:
> >
> > http://hadoop.apache.org/core/docs/current/mapred_tutorial.html
> >
> > If your job requires total order, consider o.a.h.mapred.lib.TotalOrderPartitioner. -C
> >
> > On Mar 3, 2009, at 7:24 PM, Nick Cen wrote:
> >
> >> Can you provide more info about sorting? Does the sort happen on the whole data set, or just on the specified partition?
> >>
> >> 2009/3/4 Mikhail Yakshin
> >>
> >>> On Wed, Mar 4, 2009 at 2:09 AM, Chris Douglas wrote:
> >>> > This is normal behavior. The Reducer is guaranteed to receive all the results for its partition in sorted order. No reduce can start until all the maps are completed, since any running map could emit a result that would violate the order for the results it currently has. -C
> >>>
> >>> _Reducers_ usually start almost immediately and start downloading data emitted by mappers as they go. This is their first phase. Their second phase can start only after completion of all mappers. In their second phase, they're sorting received data, and in their third phase they're doing real reduction.
> >>>
> >>> --
> >>> WBR, Mikhail Yakshin

--
M. Raşit ÖZDAŞ
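For what it's worth, the ordering question in this thread can be sketched without the Hadoop API. By default only keys are sorted, so the order of values within one key is arbitrary; a secondary sort makes the framework order records by (key, value) while still grouping reducer calls by key alone. The plain-Java sketch below (illustrative only — it is not the o.a.h.mapred composite-key/grouping-comparator machinery) shows the reduce-side order this produces for Nick Cen's example:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Plain-Java sketch of what a Hadoop "secondary sort" arranges: records
// are ordered by (key, value), then grouped by key alone, so each reduce
// call sees its values in a deterministic value order. This mimics the
// effect of a composite key plus a grouping comparator; it is NOT the
// Hadoop API itself.
public class SecondarySortSketch {
    public static List<String> reduceOrder(List<String[]> records) {
        // Sort by key first, then by value (the "secondary" sort).
        records.sort((a, b) -> {
            int c = a[0].compareTo(b[0]);
            return c != 0 ? c : a[1].compareTo(b[1]);
        });
        List<String> out = new ArrayList<>();
        for (String[] r : records) out.add("(" + r[0] + "," + r[1] + ")");
        return out;
    }

    public static void main(String[] args) {
        // Nick Cen's example: m1 emits (k1,v1),(k2,v2); m2 emits (k1,v3),(k2,v4).
        List<String[]> recs = new ArrayList<>(Arrays.asList(
                new String[]{"k1", "v1"}, new String[]{"k2", "v2"},
                new String[]{"k1", "v3"}, new String[]{"k2", "v4"}));
        System.out.println(reduceOrder(recs));
        // prints [(k1,v1), (k1,v3), (k2,v2), (k2,v4)]
    }
}
```

In this example the order happens to match Nick's guess; the point of the secondary sort is that without it, the relative order of v1 and v3 under k1 is not guaranteed.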
Fetch errors. 2 node cluster.
Hello to all. I have 2 nodes in the cluster: master + slave. The names "master1" and "slave1" are stored in /etc/hosts on both hosts and they are 100% correct.

conf/masters: master1
conf/slaves: master1 slave1

"conf/slaves" and "conf/masters" are empty on the "slave1" node. I tried to fill them in many ways - it didn't help. "master1" is AMD-64, "slave1" is Xeon-32. I compiled one C++ wordcount-simple binary on the 32-bit machine and put it on HDFS. The binary successfully runs on both machines.

I have 5 files in "/input" on HDFS:
i1.txt - 2 MB
i2.txt - 2 MB
i3.txt - 2 MB
i4.txt ~ 50 MB
i5.txt ~ 50 MB

I have tried 0.18.3, 0.19.1, the "trunk" svn dir, and the "branch-0.20" svn dir. The result is the same. Running the job on "master1":

localhost$> bin/hadoop pipes -conf src/examples/pipes/conf/word.xml -input /input -output /o1

word.xml: http://pastebin.com/m25577ea4
conf/hadoop-default.xml: http://pastebin.com/m199c08f0
conf/hadoop-site.xml: http://pastebin.com/m321ead97
conf/hadoop-env.sh: http://pastebin.com/m41c36f2f

Console output on "master1" contains WARN messages about fetching errors:

09/03/06 09:44:23 WARN mapred.JobClient: Error reading task outputhttp://localhost:50060/tasklog?plaintext=true&taskid=attempt_200903060939_0001_m_00_0&filter=stdout

[master1] logs/hadoop-hadoop-tasktracker-localhost.log contains this many times:

2009-03-06 09:41:51,178 WARN org.apache.hadoop.mapred.TaskTracker: getMapOutput(attempt_200903060939_0001_m_00_0,1) failed : org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/job_200903060939_0001/attempt_200903060939_0001_m_00_0/output/file.out.index in any of the configured local directories
2009-03-06 09:41:51,179 WARN org.apache.hadoop.mapred.TaskTracker: Unknown child with bad map output: attempt_200903060939_0001_m_00_0. Ignored.
2009-03-06 09:41:51,224 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 127.0.0.1:50060, dest: 127.0.0.1:53917, bytes: 0, op: MAPRED_SHUFFLE, cliID: attempt_200903060939_0001_m_00_0
2009-03-06 09:41:51,224 WARN org.mortbay.log: /mapOutput: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/job_200903060939_0001/attempt_200903060939_0001_m_00_0/output/file.out.index in any of the configured local directories

[slave1] logs/hadoop-hadoop-tasktracker-srv.log contains this:
...
2009-03-06 09:40:50,094 INFO org.apache.hadoop.mapred.TaskTracker: attempt_200903060939_0001_m_00_0 0.61383915% hdfs://master1:9000/inputi4.txt:0+5336
2009-03-06 09:40:50,188 INFO org.apache.hadoop.mapred.TaskTracker: attempt_200903060939_0001_m_01_0 0.59977823% hdfs://master1:9000/inputi5.txt:0+5336
2009-03-06 09:40:53,097 INFO org.apache.hadoop.mapred.TaskTracker: attempt_200903060939_0001_m_00_0 0.66882175% hdfs://master1:9000/inputi4.txt:0+5336
2009-03-06 09:40:53,191 INFO org.apache.hadoop.mapred.TaskTracker: attempt_200903060939_0001_m_01_0 0.64430434% hdfs://master1:9000/inputi5.txt:0+5336
2009-03-06 09:40:56,100 INFO org.apache.hadoop.mapred.TaskTracker: attempt_200903060939_0001_m_00_0 0.7192957% hdfs://master1:9000/inputi4.txt:0+5336
2009-03-06 09:40:56,194 INFO org.apache.hadoop.mapred.TaskTracker: attempt_200903060939_0001_m_01_0 0.68883044% hdfs://master1:9000/inputi5.txt:0+5336
2009-03-06 09:40:59,103 INFO org.apache.hadoop.mapred.TaskTracker: attempt_200903060939_0001_m_00_0 0.7661652% hdfs://master1:9000/inputi4.txt:0+5336
2009-03-06 09:40:59,212 INFO org.apache.hadoop.mapred.TaskTracker: attempt_200903060939_0001_m_01_0 0.7263261% hdfs://master1:9000/inputi5.txt:0+5336
2009-03-06 09:41:02,106 INFO org.apache.hadoop.mapred.TaskTracker: attempt_200903060939_0001_m_00_0 0.80600435% hdfs://master1:9000/inputi4.txt:0+5336
2009-03-06 09:41:02,271 INFO org.apache.hadoop.mapred.TaskTracker: attempt_200903060939_0001_m_01_0 0.7802261% hdfs://master1:9000/inputi5.txt:0+5336
...

I have read some mailing lists and saw discussions about the ability of nodes to make network connections to each other, but I can't imagine where my error is. iptables is empty and I can SSH from master to slave and from slave to master. I also checked TCP connections from one node to the other on ports 9000, 9001 and others (by running "nc").

Just another description of this problem: http://dramele.livejournal.com/101634.html

Pavel.
Re: The cpu preemption between MPI and Hadoop programs on Same Cluster
Song, you should be able to use 'nice' to reprioritize the MPI task below that of your Hadoop jobs.
- Aaron

On Thu, Mar 5, 2009 at 8:26 PM, 柳松 wrote:
> Dear all:
> I run my Hadoop program with another MPI program on the same cluster. Here is the result of "top".
>
>   PID USER     PR  NI  VIRT  RES  SHR S %CPU %MEM   TIME+  COMMAND
> 11750 qianglv  25   0  233m  99m 6100 R 99.7  2.5 116:05.59 rosetta.mpich
> 18094 cip      17   0 3136m  68m  15m S  0.5  1.7   0:12.69 java
> 18244 cip      17   0 3142m  80m  15m S  0.2  2.0   0:17.61 java
> 18367 cip      18   0 2169m  88m  15m S  0.1  2.3   0:17.46 java
> 18012 cip      18   0 3141m  77m  15m S  0.1  2.0   0:14.49 java
> 18584 cip      21   0     m  46m  15m S  0.1  1.2   0:05.12 java
>
> My Hadoop program can get no more than 1 percent of CPU time in total, compared with the rosetta.mpich program's 99.7%.
>
> I'm sure my program is making progress, since the log files tell me they are running normally.
>
> Someone told me it's the nature of a Java program: low CPU priority, especially compared with a C program.
>
> Is that true?
>
> Regards
> Song Liu, Suzhou University.
Re: Throw an exception if the configure method fails
Try throwing RuntimeException, or any other unchecked exception (e.g., any descendant class of RuntimeException).
- Aaron

On Thu, Mar 5, 2009 at 4:24 PM, Saptarshi Guha wrote:
> Hello,
> I'm not that comfortable with Java, so here is my question. In the MapReduceBase class, I have implemented the configure method, which does not throw an exception. Suppose I detect an error in some options and I wish to raise an exception (in the configure method) - is there a way to do that? Is there a way to stop the job in case the configure method fails?
> Saptarshi Guha
>
> [1] My map extends MapReduceBase and my reduce extends MapReduceBase - two separate classes.
>
> Thank you
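Aaron's suggestion works because `configure()` declares no checked exceptions, so an unchecked exception is the only way to abort from it; the framework then fails the task attempt. A minimal sketch of the pattern — note that a plain `Properties` object stands in for Hadoop's `JobConf` here so the example runs without Hadoop on the classpath, and the option name is hypothetical:

```java
import java.util.Properties;

// Sketch: aborting from configure() with an unchecked exception.
// configure() cannot throw checked exceptions, so a RuntimeException
// subclass (here IllegalArgumentException) is used to signal a fatal
// misconfiguration. "my.required.option" is an invented option name,
// and Properties stands in for JobConf.
public class ConfigureCheck {
    public static void configure(Properties conf) {
        String v = conf.getProperty("my.required.option");
        if (v == null) {
            // Unchecked exception: no "throws" clause needed.
            throw new IllegalArgumentException("my.required.option is not set");
        }
    }

    public static void main(String[] args) {
        try {
            configure(new Properties());  // missing option -> exception
        } catch (IllegalArgumentException e) {
            System.out.println("configure failed: " + e.getMessage());
        }
    }
}
```

In a real job, the task that throws from `configure()` fails, and after the configured number of retries the job itself is failed.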
Re: Repartitioned Joins
Richa,

Since the mappers run independently, you'd have a hard time determining whether a record in mapper A would be joined by a record in mapper B. The solution, as it were, would be to do this in two separate MapReduce passes:

* Take an educated guess at which table is the smaller data set.
* Run a MapReduce over this dataset, building up a bloom filter for the record ids. Set entries in the filter to 1 for each record id you see; leave the rest as 0.
* The bloom filter now has 1 meaning "maybe joinable" and 0 meaning "definitely not joinable."
* Run a second MapReduce job over both datasets. Use the distributed cache to send the filter to all mappers. Mappers emit all records where filter[hash(record_id)] == 1.

- Aaron

On Wed, Mar 4, 2009 at 11:18 AM, Richa Khandelwal wrote:
> Hi All,
> Does anyone know of any tweaking of map-reduce joins that optimizes them further by moving only those tuples to the reduce phase that actually join across the two tables? There are replicated-join and semi-join strategies, but they are more database-oriented than map-reduce.
>
> Thanks,
> Richa Khandelwal
> University Of California, Santa Cruz.
> Ph:425-241-7763
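The filter in Aaron's outline can be sketched in a few lines. This is a minimal, self-contained Bloom filter (real jobs would use a tested implementation; the bit-array size and hash count below are illustrative, and the double-hashing scheme is one of several reasonable choices): the first pass `add()`s every record id from the smaller table, the second pass forwards a record to the shuffle only if `mightContain()` is true. False positives cost a little extra shuffle traffic; false negatives cannot happen.

```java
import java.util.BitSet;

// Minimal Bloom filter for the semi-join sketch: bits set in pass one,
// tested in pass two. 1 = "maybe joinable", 0 = "definitely not".
public class RecordIdFilter {
    private final BitSet bits;
    private final int size;
    private final int hashes;

    public RecordIdFilter(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    // Derive the i-th bit index from one hash code (double hashing).
    private int index(String id, int i) {
        int h = id.hashCode();
        int h2 = (h >>> 16) | 1;  // force the second hash to be odd
        return Math.floorMod(h + i * h2, size);
    }

    public void add(String id) {
        for (int i = 0; i < hashes; i++) bits.set(index(id, i));
    }

    public boolean mightContain(String id) {
        for (int i = 0; i < hashes; i++)
            if (!bits.get(index(id, i))) return false;
        return true;
    }

    public static void main(String[] args) {
        RecordIdFilter f = new RecordIdFilter(1 << 16, 3);
        f.add("order-42");
        System.out.println(f.mightContain("order-42")); // true, guaranteed
    }
}
```

The serialized `BitSet` is small enough to ship to every mapper via the distributed cache, which is what makes this a practical pre-filter for the reduce-side join.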
Re: Running 0.19.2 branch in production before release
Right, there's no sense in freezing your Hadoop version forever :) But if you're an ops team tasked with keeping a production cluster running 24/7, running on 0.19 (or even more daringly, TRUNK) is not something that I would consider a best practice.

Ideally you'll be able to carve out some spare capacity (maybe 3--5 nodes) to use as a staging cluster that runs on 0.19 or TRUNK and that you can use to evaluate the next version. Then when you are convinced that it's stable, and your staging cluster passes your internal tests (e.g., running test versions of your critical nightly jobs successfully), you can move that to production.

- Aaron

On Thu, Mar 5, 2009 at 2:33 AM, Steve Loughran wrote:
> Aaron Kimball wrote:
>> I recommend 0.18.3 for production use and avoid the 19 branch entirely. If your priority is stability, then stay a full minor version behind, not just a revision.
>
> Of course, if everyone stays that far behind, they don't get to find the bugs for other people.
>
> * If you play with the latest releases early, while they are in the beta phase, you will encounter the problems specific to your applications/datacentres, and get them fixed fast.
>
> * If you work with stuff further back you get stability, but not only are you behind on features, you can't be sure that all "fixes" that matter to you get pushed back.
>
> * If you plan on making changes, or adding features, get onto SVN_HEAD.
>
> * If you want to catch changes being made that break your site, SVN_HEAD. Better yet, have a private Hudson server checking out SVN_HEAD hadoop *then* building and testing your app against it.
>
> Normally I work with stable releases of things I don't depend on, and SVN_HEAD of OSS stuff whose code I have any intent to change; there is a price - merge time, the odd change breaking your code - but you get to make changes that help you long term.
>
> Where Hadoop is different is that it is a filesystem, and you don't want to hit bugs that delete files that matter. I'm only bringing up transient clusters on VMs, pulling in data from elsewhere, so this isn't an issue. All that remains is changing APIs.
>
> -Steve
Re: Throwing an IOException in Map, yet task does not fail
I meant, "not marked as failed" ...

On 3/6/09 10:37 AM, "Jothi Padmanabhan" wrote:
> Just trying to understand this better: are you observing that the task, which failed with the IOException, is not getting marked as killed? If yes, that does not look right...
>
> Jothi
>
> On 3/6/09 8:12 AM, "Saptarshi Guha" wrote:
>> Hello,
>> I have set up a case where my mapper should fail. That is, based on a result it throws an exception:
>> if(res==0) throw new IOException("Error in code!, see stderr/out");
>> When I go to the JobTracker website, e.g.
>> http://tracker.com:50030/jobdetails.jsp?jobid=job_200903051709_0024&refresh=30
>> and click on one of the running tasks, I see an IOException in the errors column. But on the JobTracker page for the job, it doesn't fail - it stays in the running column, never moving to the failed/killed columns (not even after 10 minutes).
>>
>> Why so?
>> Regards
>>
>> Saptarshi Guha
Re: Throwing an IOException in Map, yet task does not fail
Is your job a streaming job? If so, which version of Hadoop are you using? What is the configured value for stream.non.zero.exit.is.failure? Can you set stream.non.zero.exit.is.failure to true and try again?

Thanks
Amareshwari

Saptarshi Guha wrote:
> Hello,
> I have set up a case where my mapper should fail. That is, based on a result it throws an exception:
> if(res==0) throw new IOException("Error in code!, see stderr/out");
> When I go to the JobTracker website, e.g.
> http://tracker.com:50030/jobdetails.jsp?jobid=job_200903051709_0024&refresh=30
> and click on one of the running tasks, I see an IOException in the errors column. But on the JobTracker page for the job, it doesn't fail - it stays in the running column, never moving to the failed/killed columns (not even after 10 minutes).
>
> Why so?
> Regards
> Saptarshi Guha
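For reference, the property Amareshwari names can be set in the job configuration. A sketch of the XML fragment, assuming it goes in the job's config file alongside the other job properties (the property name is taken from the message above; it can also be passed on the streaming command line via -jobconf):

```xml
<!-- Sketch: make a streaming job treat a non-zero exit code from the
     mapper/reducer as a task failure. Property name as given above. -->
<property>
  <name>stream.non.zero.exit.is.failure</name>
  <value>true</value>
</property>
```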
Re: Throwing an IOException in Map, yet task does not fail
Just trying to understand this better: are you observing that the task, which failed with the IOException, is not getting marked as killed? If yes, that does not look right...

Jothi

On 3/6/09 8:12 AM, "Saptarshi Guha" wrote:
> Hello,
> I have set up a case where my mapper should fail. That is, based on a result it throws an exception:
> if(res==0) throw new IOException("Error in code!, see stderr/out");
> When I go to the JobTracker website, e.g.
> http://tracker.com:50030/jobdetails.jsp?jobid=job_200903051709_0024&refresh=30
> and click on one of the running tasks, I see an IOException in the errors column. But on the JobTracker page for the job, it doesn't fail - it stays in the running column, never moving to the failed/killed columns (not even after 10 minutes).
>
> Why so?
> Regards
>
> Saptarshi Guha
The cpu preemption between MPI and Hadoop programs on Same Cluster
Dear all:
I run my Hadoop program with another MPI program on the same cluster. Here is the result of "top".

  PID USER     PR  NI  VIRT  RES  SHR S %CPU %MEM   TIME+  COMMAND
11750 qianglv  25   0  233m  99m 6100 R 99.7  2.5 116:05.59 rosetta.mpich
18094 cip      17   0 3136m  68m  15m S  0.5  1.7   0:12.69 java
18244 cip      17   0 3142m  80m  15m S  0.2  2.0   0:17.61 java
18367 cip      18   0 2169m  88m  15m S  0.1  2.3   0:17.46 java
18012 cip      18   0 3141m  77m  15m S  0.1  2.0   0:14.49 java
18584 cip      21   0     m  46m  15m S  0.1  1.2   0:05.12 java

My Hadoop program can get no more than 1 percent of CPU time in total, compared with the rosetta.mpich program's 99.7%.

I'm sure my program is making progress, since the log files tell me they are running normally.

Someone told me it's the nature of a Java program: low CPU priority, especially compared with a C program.

Is that true?

Regards
Song Liu, Suzhou University.
Re: wordcount getting slower with more mappers and reducers?
As I mentioned above, you should at least try something like this:

map2 reduce1
map4 reduce1
map8 reduce1

map4 reduce1
map4 reduce2
map4 reduce4

instead of:

map2 reduce2
map4 reduce4
map8 reduce8

2009/3/6 Sandy
> I was trying to control the maximum number of tasks per tasktracker by using the mapred.tasktracker.tasks.maximum parameter.
>
> I am interpreting your comment to mean that maybe this parameter is malformed and should read:
> mapred.tasktracker.map.tasks.maximum = 8
> mapred.tasktracker.reduce.tasks.maximum = 8
>
> I did that, and reran on a 428MB input, and got the same results as before. I also ran it on a 3.3G dataset, and got the same pattern.
>
> I am still trying to run it on a 20 GB input. This should confirm if the filesystem cache thing is true.
>
> -SM
>
> On Thu, Mar 5, 2009 at 12:22 PM, Sandy wrote:
> > Arun,
> >
> > How can I check the number of slots per tasktracker? Which parameter controls that?
> >
> > Thanks,
> > -SM
> >
> > On Thu, Mar 5, 2009 at 12:14 PM, Arun C Murthy wrote:
> >> I assume you have only 2 map and 2 reduce slots per tasktracker - which totals to 2 maps/reduces for your cluster. This means with more maps/reduces they are serialized to 2 at a time.
> >>
> >> Also, the -m is only a hint to the JobTracker; you might see fewer/more than the number of maps you have specified on the command line. The -r however is followed faithfully.
> >>
> >> Arun
> >>
> >> On Mar 4, 2009, at 2:46 PM, Sandy wrote:
> >>> Hello all,
> >>>
> >>> For the sake of benchmarking, I ran the standard Hadoop wordcount example on an input file using 2, 4, and 8 mappers and reducers for my job. In other words, I do:
> >>>
> >>> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 2 -r 2 sample.txt output
> >>> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 4 -r 4 sample.txt output2
> >>> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 8 -r 8 sample.txt output3
> >>>
> >>> Strangely enough, this increase in mappers and reducers results in slower running times!
> >>> - On 2 mappers and reducers it ran for 40 seconds
> >>> - On 4 mappers and reducers it ran for 60 seconds
> >>> - On 8 mappers and reducers it ran for 90 seconds!
> >>>
> >>> Please note that the "sample.txt" file is identical in each of these runs.
> >>>
> >>> I have the following questions:
> >>> - Shouldn't wordcount get -faster- with additional mappers and reducers, instead of slower?
> >>> - If it does get faster for other people, why does it become slower for me? I am running Hadoop in pseudo-distributed mode on a single 64-bit Mac Pro with 2 quad-core processors, 16 GB of RAM and 4 1TB HDs.
> >>>
> >>> I would greatly appreciate it if someone could explain this behavior to me, and tell me if I'm running this wrong. How can I change my settings (if at all) to get wordcount running faster when I increase the number of maps and reduces?
> >>>
> >>> Thanks,
> >>> -SM
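Arun's point in the thread above can be put into a back-of-the-envelope model: with a fixed number of task slots, extra tasks run in waves, so adding tasks adds per-task startup overhead without adding parallelism. The numbers below are illustrative (2 slots, 40s of total map work, an assumed 5s of startup overhead per task), not measurements from Sandy's cluster:

```java
// Toy model of task slots: waves = ceil(tasks / slots), and each wave
// costs (totalWork / tasks) of real work plus a fixed per-task overhead.
// With slots fixed, more tasks means more waves and more total overhead.
public class SlotModel {
    public static double makespan(int tasks, int slots,
                                  double totalWork, double overhead) {
        int waves = (tasks + slots - 1) / slots;      // ceil division
        return waves * (totalWork / tasks + overhead);
    }

    public static void main(String[] args) {
        for (int tasks : new int[]{2, 4, 8}) {
            System.out.printf("%d tasks: %.0fs%n", tasks,
                    makespan(tasks, 2, 40.0, 5.0));
        }
        // prints 25s, 30s, 40s: the same slowdown shape Sandy reports.
    }
}
```

The model reproduces the qualitative trend (more tasks on the same two slots takes longer); the fix is to raise the per-tasktracker slot counts, which is what the thread goes on to discuss.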
Throwing an IOException in Map, yet task does not fail
Hello,
I have set up a case where my mapper should fail. That is, based on a result it throws an exception:

if(res==0) throw new IOException("Error in code!, see stderr/out");

When I go to the JobTracker website, e.g.
http://tracker.com:50030/jobdetails.jsp?jobid=job_200903051709_0024&refresh=30
and click on one of the running tasks, I see an IOException in the errors column. But on the JobTracker page for the job, it doesn't fail - it stays in the running column, never moving to the failed/killed columns (not even after 10 minutes).

Why so?
Regards

Saptarshi Guha
Throw an exception if the configure method fails
Hello,
I'm not that comfortable with Java, so here is my question. In the MapReduceBase class, I have implemented the configure method, which does not throw an exception. Suppose I detect an error in some options and I wish to raise an exception (in the configure method) - is there a way to do that? Is there a way to stop the job in case the configure method fails?

Saptarshi Guha

[1] My map extends MapReduceBase and my reduce extends MapReduceBase - two separate classes.

Thank you
Re: Mapreduce jobconf options:webpage
Thank you
Saptarshi Guha

On Thu, Mar 5, 2009 at 6:56 PM, james warren wrote:
> Are you referring to
>
> http://hadoop.apache.org/core/docs/current/hadoop-default.html
>
> ? The default settings are also available in the conf/ directory of your hadoop installation.
>
> cheers,
> -jw
>
> On Thu, Mar 5, 2009 at 3:51 PM, Saptarshi Guha wrote:
>> Hello,
>> I came across a page, I think on the Hadoop website, listing all the mapreduce options. Does anyone have a link?
>>
>> Regards
>>
>> Saptarshi Guha
Re: Mapreduce jobconf options:webpage
Are you referring to

http://hadoop.apache.org/core/docs/current/hadoop-default.html

? The default settings are also available in the conf/ directory of your hadoop installation.

cheers,
-jw

On Thu, Mar 5, 2009 at 3:51 PM, Saptarshi Guha wrote:
> Hello,
> I came across a page, I think on the Hadoop website, listing all the mapreduce options. Does anyone have a link?
>
> Regards
>
> Saptarshi Guha
Mapreduce jobconf options:webpage
Hello,
I came across a page, I think on the Hadoop website, listing all the mapreduce options. Does anyone have a link?

Regards

Saptarshi Guha
Re: wordcount getting slower with more mappers and reducers?
I was trying to control the maximum number of tasks per tasktracker by using the mapred.tasktracker.tasks.maximum parameter.

I am interpreting your comment to mean that maybe this parameter is malformed and should read:
mapred.tasktracker.map.tasks.maximum = 8
mapred.tasktracker.reduce.tasks.maximum = 8

I did that, and reran on a 428MB input, and got the same results as before. I also ran it on a 3.3G dataset, and got the same pattern.

I am still trying to run it on a 20 GB input. This should confirm if the filesystem cache thing is true.

-SM

On Thu, Mar 5, 2009 at 12:22 PM, Sandy wrote:
> Arun,
>
> How can I check the number of slots per tasktracker? Which parameter controls that?
>
> Thanks,
> -SM
>
> On Thu, Mar 5, 2009 at 12:14 PM, Arun C Murthy wrote:
>> I assume you have only 2 map and 2 reduce slots per tasktracker - which totals to 2 maps/reduces for your cluster. This means with more maps/reduces they are serialized to 2 at a time.
>>
>> Also, the -m is only a hint to the JobTracker; you might see fewer/more than the number of maps you have specified on the command line. The -r however is followed faithfully.
>>
>> Arun
>>
>> On Mar 4, 2009, at 2:46 PM, Sandy wrote:
>>> Hello all,
>>>
>>> For the sake of benchmarking, I ran the standard Hadoop wordcount example on an input file using 2, 4, and 8 mappers and reducers for my job. In other words, I do:
>>>
>>> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 2 -r 2 sample.txt output
>>> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 4 -r 4 sample.txt output2
>>> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 8 -r 8 sample.txt output3
>>>
>>> Strangely enough, this increase in mappers and reducers results in slower running times!
>>> - On 2 mappers and reducers it ran for 40 seconds
>>> - On 4 mappers and reducers it ran for 60 seconds
>>> - On 8 mappers and reducers it ran for 90 seconds!
>>>
>>> Please note that the "sample.txt" file is identical in each of these runs.
>>>
>>> I have the following questions:
>>> - Shouldn't wordcount get -faster- with additional mappers and reducers, instead of slower?
>>> - If it does get faster for other people, why does it become slower for me? I am running Hadoop in pseudo-distributed mode on a single 64-bit Mac Pro with 2 quad-core processors, 16 GB of RAM and 4 1TB HDs.
>>>
>>> I would greatly appreciate it if someone could explain this behavior to me, and tell me if I'm running this wrong. How can I change my settings (if at all) to get wordcount running faster when I increase the number of maps and reduces?
>>>
>>> Thanks,
>>> -SM
Batch processing map reduce jobs
Hi All,
Does anyone know how to run map-reduce jobs using pipes, or how to batch-process map-reduce jobs?

Thanks,
Richa Khandelwal
University Of California, Santa Cruz.
Ph:425-241-7763
Re: Avoiding Ganglia NPE on EC2
News from the ScaleUnlimited bootcamp (where I am now): use hadoop-0.17.2.1.

On Thu, Mar 5, 2009 at 3:53 PM, Stuart Sierra wrote:
> Hi all,
>
> I'm getting this NPE on Hadoop 0.18.3, using the EC2 contrib scripts:
>
> Exception in thread "Timer thread for monitoring dfs" java.lang.NullPointerException
>         at org.apache.hadoop.metrics.ganglia.GangliaContext.xdr_string(GangliaContext.java:195)
>
> This is reported as: https://issues.apache.org/jira/browse/HADOOP-4137
>
> What's the easiest workaround? Switch to another Hadoop version (which one)? Or disable Ganglia entirely (how)?
>
> Thanks,
> -Stuart
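On the "disable Ganglia entirely (how)?" part of the question: metrics contexts are chosen in conf/hadoop-metrics.properties, and switching the contexts back to the no-op NullContext stops Hadoop from emitting to Ganglia at all. A sketch, assuming the stock context names from that file:

```properties
# Sketch: disable Ganglia metrics by reverting conf/hadoop-metrics.properties
# to the no-op context (context names as in the stock configuration file).
dfs.class=org.apache.hadoop.metrics.spi.NullContext
mapred.class=org.apache.hadoop.metrics.spi.NullContext
jvm.class=org.apache.hadoop.metrics.spi.NullContext
```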
Avoiding Ganglia NPE on EC2
Hi all,

I'm getting this NPE on Hadoop 0.18.3, using the EC2 contrib scripts:

Exception in thread "Timer thread for monitoring dfs" java.lang.NullPointerException
        at org.apache.hadoop.metrics.ganglia.GangliaContext.xdr_string(GangliaContext.java:195)

This is reported as: https://issues.apache.org/jira/browse/HADOOP-4137

What's the easiest workaround? Switch to another Hadoop version (which one)? Or disable Ganglia entirely (how)?

Thanks,
-Stuart
Live Datanodes only 1; all the time
Hi All,

Very interesting behavior: http://machine2.xxx.xxx.xxx:50070/dfshealth.jsp shows that only one Live Node exists. Every time I refresh this page it shows a different node as alive. But the JobTracker shows there are 8 nodes in the cluster summary. Any idea what could be going on here with the following detailed setup I am trying?

I am trying to configure Hadoop as follows:

Cluster setup: version 0.18.3

1) I want every user working on the login nodes of our cluster to have their own config dir. Hence I edited the following in $HADOOP_HOME/conf/hadoop-env.sh:
   HADOOP_CONF_DIR=$HOME/hadoop/conf
   Similarly HADOOP_LOG_DIR=$HOME/hadoop/logs
   Note: $HADOOP_HOME is a shared NFS Hadoop install folder on the cluster head node. There are three login nodes for our cluster, excluding the head node. The head node is inaccessible to users.

2) Every user will have his own 'masters' and 'slaves' files under their $HADOOP_CONF_DIR.
   a. When I had this setup and removed the masters file from $HADOOP_HOME, it complained that it could not start the SecondaryNameNode. Hence I replaced the 'masters' file with an entry for our login node. This worked and the SecondaryNameNode starts without any error.

3) As a user, I chose one of the login boxes as the entry in my $HOME/hadoop/conf/masters file. The 'slaves' file includes a few compute nodes.

4) I don't see any errors when I start the Hadoop daemons using start-dfs.sh and start-mapred.sh.

5) Only when I try to 'bin/hadoop fs -put conf input' files onto HDFS does it complain, as shown below in the snip section. NOTE: "grep ERROR *" in the logs directory had no results.

Does any of the below error messages ring a bell? Please help me understand what I could be doing wrong.
Thank you,
Amit

[ahku...@machine2 ~/hadoop]$ $hbin/hadoop fs -put conf input
09/03/05 15:20:26 INFO dfs.DFSClient: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/ahkumar/input/hadoop-metrics.properties could only be replicated to 0 nodes, instead of 1
        at org.apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1123)
        at org.apache.hadoop.dfs.NameNode.addBlock(NameNode.java:330)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:585)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:890)
        at org.apache.hadoop.ipc.Client.call(Client.java:716)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
        at org.apache.hadoop.dfs.$Proxy0.addBlock(Unknown Source)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:585)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
        at org.apache.hadoop.dfs.$Proxy0.addBlock(Unknown Source)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2450)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2333)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1800(DFSClient.java:1745)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1922)
09/03/05 15:20:26 WARN dfs.DFSClient: NotReplicatedYetException sleeping /user/ahkumar/input/hadoop-metrics.properties retries left 4
09/03/05 15:20:26 INFO dfs.DFSClient: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/ahkumar/input/hadoop-metrics.properties could only be replicated to 0 nodes, instead of 1
<... same as above>
09/03/05 15:20:26 WARN dfs.DFSClient: NotReplicatedYetException sleeping /user/ahkumar/input/hadoop-metrics.properties retries left 3
09/03/05 15:20:27 INFO dfs.DFSClient: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/ahkumar/input/hadoop-metrics.properties could only be replicated to 0 nodes, instead of 1
<... same as above>
09/03/05 15:20:27 WARN dfs.DFSClient: NotReplicatedYetException sleeping /user/ahkumar/input/hadoop-metrics.properties retries left 2
09/03/05 15:20:29 INFO dfs.DFSClient: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/ahkumar/input/hadoop-metrics.properties could only be replicated to 0 nodes, instead of 1
<... same as above>
09/03/05 15:20:29 WARN dfs.DFSClient: NotReplicatedYetException sleeping /use
Re: DataNode stops cleaning disk?
Igor Bolotin wrote:
> That's what I saw just yesterday on one of the data nodes with this situation (will confirm also next time it happens):
> - tmp and current were either empty or almost empty last time I checked.
> - du on the entire data directory matched exactly the used space reported in the NameNode web UI, and it did report that it uses most of the available disk space.
> - Nothing else was using disk space (actually, it's a dedicated DFS cluster).

If the 'du' command (which you can run in the shell) counts properly, then you should be able to see which files are taking space. If 'du' can't, but 'df' reports much less space available, then it is possible (though I've never seen it) that the datanode is keeping a lot of these files open. 'ls -l /proc/<datanode pid>/fd' lists these files. If it is not the datanode, then check lsof to find who is holding these files.

hope this helps.
Raghu.

> Thank you for help!
> Igor
>
> -Original Message-
> From: Raghu Angadi [mailto:rang...@yahoo-inc.com]
> Sent: Thursday, March 05, 2009 11:05 AM
> To: core-user@hadoop.apache.org
> Subject: Re: DataNode stops cleaning disk?
>
> This is unexpected unless some other process is eating up space. Couple of things to collect next time (along with the log):
> - All the contents under datanode-directory/ (especially including 'tmp' and 'current')
> - Does 'du' of this directory match what is reported to the NameNode (shown on the web UI) by this DataNode?
> - Is there anything else taking disk space on the machine?
>
> Raghu.
>
> Igor Bolotin wrote:
> > Normally I dislike writing about problems without being able to provide some more information, but unfortunately in this case I just can't find anything.
> >
> > Here is the situation: a DFS cluster running Hadoop version 0.19.0. The cluster is running on multiple servers with practically identical hardware. Everything works perfectly well, except for one thing - from time to time one of the data nodes (every time it's a different node) starts to consume more and more disk space. The node keeps going and if we don't do anything, it runs out of space completely (ignoring the 20GB reserved space setting). Once restarted, it cleans the disk rapidly and goes back to approximately the same utilization as the rest of the data nodes in the cluster.
> >
> > Scanning datanode and namenode logs and comparing thread dumps (stacks) from nodes experiencing the problem and those that run normally didn't produce any clues. Running the balancer tool didn't help at all. FSCK shows that everything is healthy and the number of over-replicated blocks is not significant.
> >
> > To me it just looks like at some point the data node stops cleaning invalidated/deleted blocks, but keeps reporting the space consumed by these blocks as "not used". But I'm not familiar enough with the internals and just plain don't have enough free time to start digging deeper.
> >
> > Anyone have an idea what is wrong, or what else we can do to find out what's wrong, or maybe where to start looking in the code?
> >
> > Thanks,
> > Igor
RE: DataNode stops cleaning disk?
That's what I saw just yesterday on one of the data nodes with this situation (will confirm also next time it happens): - Tmp and current were either empty or almost empty last time I checked. - du on the entire data directory matched exactly with reported used space in NameNode web UI and it did report that it uses most of the available disk space. - nothing else was using disk space (actually - it's a dedicated DFS cluster). Thank you for help! Igor -Original Message- From: Raghu Angadi [mailto:rang...@yahoo-inc.com] Sent: Thursday, March 05, 2009 11:05 AM To: core-user@hadoop.apache.org Subject: Re: DataNode stops cleaning disk? This is unexpected unless some other process is eating up space. Couple of things to collect next time (along with log): - All the contents under datanode-directory/ (especially including 'tmp' and 'current') - Does 'du' of this directory match with what is reported to NameNode (shown on webui) by this DataNode. - Is there anything else taking disk space on the machine? Raghu. Igor Bolotin wrote: > Normally I dislike writing about problems without being able to provide > some more information, but unfortunately in this case I just can't find > anything. > > > > Here is the situation - DFS cluster running Hadoop version 0.19.0. The > cluster is running on multiple servers with practically identical > hardware. Everything works perfectly well, except for one thing - from > time to time one of the data nodes (every time it's a different node) > starts to consume more and more disk space. The node keeps going and if > we don't do anything - it runs out of space completely (ignoring 20GB > reserved space settings). Once restarted - it cleans disk rapidly and > goes back to approximately the same utilization as the rest of data > nodes in the cluster. > > > > Scanning datanodes and namenode logs and comparing thread dumps (stacks) > from nodes experiencing problem and those that run normally didn't > produce any clues. 
Running balancer tool didn't help at all. FSCK shows > that everything is healthy and number of over-replicated blocks is not > significant. > > > > To me - it just looks like at some point the data node stops cleaning > invalidated/deleted blocks, but keeps reporting space consumed by these > blocks as "not used", but I'm not familiar enough with the internals and > just plain don't have enough free time to start digging deeper. > > > > Anyone has an idea what is wrong or what else we can do to find out > what's wrong or maybe where to start looking in the code? > > > > Thanks, > > Igor > > > >
Re: DataNode stops cleaning disk?
This is unexpected unless some other process is eating up space. Couple of things to collect next time (along with log): - All the contents under datanode-directory/ (especially including 'tmp' and 'current') - Does 'du' of this directory match with what is reported to NameNode (shown on webui) by this DataNode. - Is there anything else taking disk space on the machine? Raghu. Igor Bolotin wrote: Normally I dislike writing about problems without being able to provide some more information, but unfortunately in this case I just can't find anything. Here is the situation - DFS cluster running Hadoop version 0.19.0. The cluster is running on multiple servers with practically identical hardware. Everything works perfectly well, except for one thing - from time to time one of the data nodes (every time it's a different node) starts to consume more and more disk space. The node keeps going and if we don't do anything - it runs out of space completely (ignoring 20GB reserved space settings). Once restarted - it cleans disk rapidly and goes back to approximately the same utilization as the rest of data nodes in the cluster. Scanning datanodes and namenode logs and comparing thread dumps (stacks) from nodes experiencing problem and those that run normally didn't produce any clues. Running balancer tool didn't help at all. FSCK shows that everything is healthy and number of over-replicated blocks is not significant. To me - it just looks like at some point the data node stops cleaning invalidated/deleted blocks, but keeps reporting space consumed by these blocks as "not used", but I'm not familiar enough with the internals and just plain don't have enough free time to start digging deeper. Anyone has an idea what is wrong or what else we can do to find out what's wrong or maybe where to start looking in the code? Thanks, Igor
Re: Recommend JSON Library? net.sf.json has memory leak
Ian Swett wrote: We've used Jackson (http://jackson.codehaus.org/), which we've found to be easy to use and faster than any other option. I also use Jackson and recommend it. Doug
Re: wordcount getting slower with more mappers and reducers?
Arun, How can I check the number of slots per tasktracker? Which parameter controls that? Thanks, -SM On Thu, Mar 5, 2009 at 12:14 PM, Arun C Murthy wrote: > I assume you have only 2 map and 2 reduce slots per tasktracker - which > totals to 2 maps/reduces for you cluster. This means with more maps/reduces > they are serialized to 2 at a time. > > Also, the -m is only a hint to the JobTracker, you might see less/more than > the number of maps you have specified on the command line. > The -r however is followed faithfully. > > Arun > > > On Mar 4, 2009, at 2:46 PM, Sandy wrote: > > Hello all, >> >> For the sake of benchmarking, I ran the standard hadoop wordcount example >> on >> an input file using 2, 4, and 8 mappers and reducers for my job. >> In other words, I do: >> >> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 2 -r 2 >> sample.txt output >> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 4 -r 4 >> sample.txt output2 >> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 8 -r 8 >> sample.txt output3 >> >> Strangely enough, when this increase in mappers and reducers result in >> slower running times! >> -On 2 mappers and reducers it ran for 40 seconds >> on 4 mappers and reducers it ran for 60 seconds >> on 8 mappers and reducers it ran for 90 seconds! >> >> Please note that the "sample.txt" file is identical in each of these runs. >> >> I have the following questions: >> - Shouldn't wordcount get -faster- with additional mappers and reducers, >> instead of slower? >> - If it does get faster for other people, why does it become slower for >> me? >> I am running hadoop on psuedo-distributed mode on a single 64-bit Mac Pro >> with 2 quad-core processors, 16 GB of RAM and 4 1TB HDs >> >> I would greatly appreciate it if someone could explain this behavior to >> me, >> and tell me if I'm running this wrong. 
How can I change my settings (if at >> all) to get wordcount running faster when i increases that number of maps >> and reduces? >> >> Thanks, >> -SM >> > >
Re: wordcount getting slower with more mappers and reducers?
I assume you have only 2 map and 2 reduce slots per tasktracker - which totals to 2 maps/reduces for your cluster. This means with more maps/reduces they are serialized to 2 at a time. Also, the -m is only a hint to the JobTracker; you might see fewer or more than the number of maps you have specified on the command line. The -r however is followed faithfully. Arun On Mar 4, 2009, at 2:46 PM, Sandy wrote: Hello all, For the sake of benchmarking, I ran the standard hadoop wordcount example on an input file using 2, 4, and 8 mappers and reducers for my job. In other words, I do: time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 2 -r 2 sample.txt output time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 4 -r 4 sample.txt output2 time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 8 -r 8 sample.txt output3 Strangely enough, this increase in mappers and reducers results in slower running times! -On 2 mappers and reducers it ran for 40 seconds on 4 mappers and reducers it ran for 60 seconds on 8 mappers and reducers it ran for 90 seconds! Please note that the "sample.txt" file is identical in each of these runs. I have the following questions: - Shouldn't wordcount get -faster- with additional mappers and reducers, instead of slower? - If it does get faster for other people, why does it become slower for me? I am running hadoop in pseudo-distributed mode on a single 64-bit Mac Pro with 2 quad-core processors, 16 GB of RAM and 4 1TB HDs I would greatly appreciate it if someone could explain this behavior to me, and tell me if I'm running this wrong. How can I change my settings (if at all) to get wordcount running faster when I increase the number of maps and reduces? Thanks, -SM
DataNode stops cleaning disk?
Normally I dislike writing about problems without being able to provide some more information, but unfortunately in this case I just can't find anything. Here is the situation - DFS cluster running Hadoop version 0.19.0. The cluster is running on multiple servers with practically identical hardware. Everything works perfectly well, except for one thing - from time to time one of the data nodes (every time it's a different node) starts to consume more and more disk space. The node keeps going and if we don't do anything - it runs out of space completely (ignoring 20GB reserved space settings). Once restarted - it cleans disk rapidly and goes back to approximately the same utilization as the rest of data nodes in the cluster. Scanning datanodes and namenode logs and comparing thread dumps (stacks) from nodes experiencing problem and those that run normally didn't produce any clues. Running balancer tool didn't help at all. FSCK shows that everything is healthy and number of over-replicated blocks is not significant. To me - it just looks like at some point the data node stops cleaning invalidated/deleted blocks, but keeps reporting space consumed by these blocks as "not used", but I'm not familiar enough with the internals and just plain don't have enough free time to start digging deeper. Anyone has an idea what is wrong or what else we can do to find out what's wrong or maybe where to start looking in the code? Thanks, Igor
Re: Recommend JSON Library? net.sf.json has memory leak
I had discovered a memory leak in net.sf.json as well. I filed an issue and it got fixed in the latest release: http://sourceforge.net/tracker/?func=detail&atid=857928&aid=2063201&group_id=171425 Have you tried the latest version 2.2.3? On Thu, Mar 5, 2009 at 9:48 AM, Kevin Peterson wrote: > We're using JSON serialization for all our data, but we can't seem to find > a > good library. We just discovered that the root cause of out of memory > errors > is a leak in the net.sf.json library. Can anyone out there recommend a java > json library that they have actually used successfully within Hadoop? >
Re: Recommend JSON Library? net.sf.json has memory leak
We've used Jackson (http://jackson.codehaus.org/), which we've found to be easy to use and faster than any other option. We've also had problems with net.sf in terms of memory and performance. You can see a performance comparison here: http://www.cowtowncoder.com/blog/archives/2009/02/entry_204.html -Ian --- On Thu, 3/5/09, Kevin Peterson wrote: > From: Kevin Peterson > Subject: Recommend JSON Library? net.sf.json has memory leak > To: core-user@hadoop.apache.org > Date: Thursday, March 5, 2009, 9:48 AM > We're using JSON serialization for all our data, but we > can't seem to find a > good library. We just discovered that the root cause of out > of memory errors > is a leak in the net.sf.json library. Can anyone out there > recommend a java > json library that they have actually used successfully > within Hadoop?
Recommend JSON Library? net.sf.json has memory leak
We're using JSON serialization for all our data, but we can't seem to find a good library. We just discovered that the root cause of out of memory errors is a leak in the net.sf.json library. Can anyone out there recommend a java json library that they have actually used successfully within Hadoop?
Re: wordcount getting slower with more mappers and reducers?
I specified a directory containing my 428MB file split into 8 files. Same results. I should summarize my hadoop-site.xml file:

mapred.tasktracker.tasks.maximum = 4
mapred.line.input.format.linespermap = 1
mapred.task.timeout = 0
mapred.min.split.size = 1
mapred.child.java.opts = -Xmx2M
io.sort.factor = 200
io.sort.mb = 100
fs.inmemory.size.mb = 200
mapred.inmem.merge.threshold = 1000
dfs.replication = 1
mapred.reduce.parallel.copies = 5

I know the mapred.child.java.opts parameter is a little ridiculous, but I was just playing around and seeing what could possibly make things faster. For some reason, that did. Nick, I'm going to try larger files and get back to you. -SM On Thu, Mar 5, 2009 at 10:37 AM, Nick Cen wrote: > Try to split your sample.txt into multi files. and try it again. > For text input format , the number of task is equals to the input size. > > > 2009/3/6 Sandy > > > I used three different sample.txt files, and was able to replicate the > > error. The first was 1.5MB, the second 66MB, and the last 428MB. I get > the > > same problem despite what size of input file I use: the running time of > > wordcount increases with the number of mappers and reducers specified. If > > it > > is the problem of the input file, how big do I have to go before it > > disappears entirely? > > > > If it is psuedo-distributed mode that's the issue, what mode should I be > > running on my machine, given it's specs? Once again, it is a SINGLE > MacPro > > with 16GB of RAM, 4 1TB hard disks, and 2 quad-core processors. 
> > > > I'm not sure if it's HADOOP-2771, since the sort/merge(shuffle) is what > > seems to be taking the longest: > > 2 M/R ==> map: 18 sec, shuffle: 15 sec, reduce: 9 sec > > 4 M/R ==> map: 19 sec, shuffle: 37 sec, reduce: 2 sec > > 8 M/R ==> map: 21 sec, shuffle: 1 min 10 sec, 1 sec > > > > To make sure it's not because of the combiner, I removed it and reran > > everything again, and got the same bottom-line: With increasing maps and > > reducers, running time goes up, with majority of time seeming to be in > > sort/merge. > > > > Also, another thing we noticed is that the CPUs seem to be very active > > during the map phase, but when the map phase reaches 100%, and only > reduce > > appears to be running, the CPUs all become idle. Furthermore, despite the > > number of mappers I specify, all the CPUs become very active when a job > is > > running. Why is this so? If I specify 2 mappers and 2 reducers, won't > there > > be just 2 or 4 CPUs that should be active? Why are all 8 active? > > > > Since I can reproduce this error using Hadoop's standard word count > > example, > > I was hoping that someone else could tell me if they can reproduce this > > too. > > Is it true that when you increase the number of mappers and reducers on > > your > > systems, the running time of wordcount goes up? > > > > Thanks for the help! I'm looking forward to your responses. > > > > -SM > > > > On Thu, Mar 5, 2009 at 2:57 AM, Amareshwari Sriramadasu < > > amar...@yahoo-inc.com> wrote: > > > > > Are you hitting HADOOP-2771? > > > -Amareshwari > > > > > > Sandy wrote: > > > > > >> Hello all, > > >> > > >> For the sake of benchmarking, I ran the standard hadoop wordcount > > example > > >> on > > >> an input file using 2, 4, and 8 mappers and reducers for my job. 
> > >> In other words, I do: > > >> > > >> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 2 -r 2 > > >> sample.txt output > > >> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 4 -r 4 > > >> sample.txt output2 > > >> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 8 -r 8 > > >> sample.txt output3 > > >> > > >> Strangely enough, when this increase in mappers and reducers result in > > >> slower running times! > > >> -On 2 mappers and reducers it ran for 40 seconds > > >> on 4 mappers and reducers it ran for 60 seconds > > >> on 8 mappers and reducers it ran for 90 seconds! > > >> > > >> Please note that the "sample.txt" file is identical in each of these > > runs. > > >> > > >> I have the following questions: > > >> - Shouldn't wordcount get -faster- with additional mappers and > reducers, > > >> instead of slower? > > >> - If it does get faster for other people, why does it become slower > for > > >> me? > > >> I am running hadoop on psuedo-distributed mode on a single 64-bit Mac > > Pro > > >> with 2 quad-core processors, 16 GB of RAM and 4 1TB HDs > > >> > > >> I would greatly appreciate it if someone could explain this behavior > to > > >> me, > > >> and tell me if I'm running this wrong. How can I change my settings > (if > > at > > >> all) to get wordcount running faster when i increases that number of > > maps > > >> and reduces? > > >> > > >> Thanks, > > >> -SM > > >> > > >> > > >> > > > > > > > > > > > > -- > http://daily.appspot.com/food/ >
Re: System Layout Best Practices
Thank you - that certainly is useful, and I would love to see more information and discussion on that sort of thing. However, I'm also looking for some lower-level configuration, such as disk partitioning. David On Thu, Mar 5, 2009 at 11:36 AM, Sandy wrote: > Hi David, > > I don't know if you've seen this already, but this might be of some help: > http://hadoop.apache.org/core/docs/r0.18.3/cluster_setup.html > > Near the bottom, there is a section called "Real-World Cluster > Configurations" with some sample configuration parameters that were used to > run a very large sort benchmark. > > All the best, > -SM > > On Thu, Mar 5, 2009 at 10:20 AM, David Ritch > wrote: > > > Are there any published guidelines on system configuration for Hadoop? > > > > I've seen hardware suggestions, but I'm really interested in > > recommendations > > on disk layout and partitioning. The defaults, as shipped and defined in > > hadoop-default.xml, may be appropriate for testing, but are not really > > appropriate for sustained use. For example, data and metadata are both > > stored in /tmp. In typical use on a cluster with a couple hundred nodes, > > the NameNode can generate 3-5GB of logs per day. If you configure your > > namenode host badly, it's easy to fill up the partition used by dfs for > > metadata, and clobber your dfs filesystem. I would think that > thresholding > > logs on WARN would be preferable to INFO. > > > > On a datanode, we would like to reserve as much space as we can for data, > > but we know that map-reduce jobs need some local storage. How do people > > generally estimate the amount of space required for temporary storage? I > > would assume that it would be good to partition it from data storage, to > > prevent running out of temp space on some nodes. I would also think that > > it > > would be preferable for performance to have temp space on a different > > spindle, so it and hdfs data can be accessed independently. 
> > > > I would be interested to know how other sites configure their systems, > and > > I > > would love to see some guidelines for system configuration for Hadoop. > > > > Thank you! > > > > David > > >
Re: wordcount getting slower with more mappers and reducers?
Sandy wrote: I used three different sample.txt files, and was able to replicate the error. The first was 1.5MB, the second 66MB, and the last 428MB. I get the same problem despite what size of input file I use: the running time of wordcount increases with the number of mappers and reducers specified. If it is the problem of the input file, how big do I have to go before it disappears entirely? Keep in mind that as long as the file is smaller than memory, it's likely coming straight out of the filesystem cache. In your kind of system configuration, running as fast as possible, a core or two can saturate the memory controller, and then the contention means no speedup with more mappers. If you really want a feel for what this would be like, you should probably have much more input data. It will entirely change as soon as you have to wait on disk IO. Hope that helps, - Matt If it is pseudo-distributed mode that's the issue, what mode should I be running on my machine, given its specs? Once again, it is a SINGLE MacPro with 16GB of RAM, 4 1TB hard disks, and 2 quad-core processors. I'm not sure if it's HADOOP-2771, since the sort/merge(shuffle) is what seems to be taking the longest: 2 M/R ==> map: 18 sec, shuffle: 15 sec, reduce: 9 sec 4 M/R ==> map: 19 sec, shuffle: 37 sec, reduce: 2 sec 8 M/R ==> map: 21 sec, shuffle: 1 min 10 sec, reduce: 1 sec To make sure it's not because of the combiner, I removed it and reran everything again, and got the same bottom-line: With increasing maps and reducers, running time goes up, with majority of time seeming to be in sort/merge. Also, another thing we noticed is that the CPUs seem to be very active during the map phase, but when the map phase reaches 100%, and only reduce appears to be running, the CPUs all become idle. Furthermore, despite the number of mappers I specify, all the CPUs become very active when a job is running. Why is this so? If I specify 2 mappers and 2 reducers, won't there be just 2 or 4 CPUs that should be active? 
Why are all 8 active? Since I can reproduce this error using Hadoop's standard word count example, I was hoping that someone else could tell me if they can reproduce this too. Is it true that when you increase the number of mappers and reducers on your systems, the running time of wordcount goes up? Thanks for the help! I'm looking forward to your responses. -SM On Thu, Mar 5, 2009 at 2:57 AM, Amareshwari Sriramadasu < amar...@yahoo-inc.com> wrote: Are you hitting HADOOP-2771? -Amareshwari Sandy wrote: Hello all, For the sake of benchmarking, I ran the standard hadoop wordcount example on an input file using 2, 4, and 8 mappers and reducers for my job. In other words, I do: time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 2 -r 2 sample.txt output time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 4 -r 4 sample.txt output2 time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 8 -r 8 sample.txt output3 Strangely enough, when this increase in mappers and reducers result in slower running times! -On 2 mappers and reducers it ran for 40 seconds on 4 mappers and reducers it ran for 60 seconds on 8 mappers and reducers it ran for 90 seconds! Please note that the "sample.txt" file is identical in each of these runs. I have the following questions: - Shouldn't wordcount get -faster- with additional mappers and reducers, instead of slower? - If it does get faster for other people, why does it become slower for me? I am running hadoop on psuedo-distributed mode on a single 64-bit Mac Pro with 2 quad-core processors, 16 GB of RAM and 4 1TB HDs I would greatly appreciate it if someone could explain this behavior to me, and tell me if I'm running this wrong. How can I change my settings (if at all) to get wordcount running faster when i increases that number of maps and reduces? Thanks, -SM
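Matt's page-cache point is easy to see with a crude timing test: the second read of a file that fits in RAM is served from the filesystem cache and typically runs much faster than the first. A sketch — the 64MB size is made up for illustration:

```shell
# Crude illustration of the page-cache effect: once a file fits in memory,
# re-reading it does essentially no disk IO.
f=$(mktemp)
dd if=/dev/zero of="$f" bs=1M count=64 2>/dev/null   # create a 64MB test file
time dd if="$f" of=/dev/null bs=1M 2>/dev/null       # first read (may hit disk)
time dd if="$f" of=/dev/null bs=1M 2>/dev/null       # second read (from cache)
rm -f "$f"
```

On a benchmark box the same reasoning suggests using input data several times larger than RAM before drawing conclusions about scaling.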
Re: wordcount getting slower with more mappers and reducers?
Try to split your sample.txt into multiple files and try it again. With the text input format, the number of map tasks is determined by the input size. 2009/3/6 Sandy > I used three different sample.txt files, and was able to replicate the > error. The first was 1.5MB, the second 66MB, and the last 428MB. I get the > same problem despite what size of input file I use: the running time of > wordcount increases with the number of mappers and reducers specified. If > it > is the problem of the input file, how big do I have to go before it > disappears entirely? > > If it is psuedo-distributed mode that's the issue, what mode should I be > running on my machine, given it's specs? Once again, it is a SINGLE MacPro > with 16GB of RAM, 4 1TB hard disks, and 2 quad-core processors. > > I'm not sure if it's HADOOP-2771, since the sort/merge(shuffle) is what > seems to be taking the longest: > 2 M/R ==> map: 18 sec, shuffle: 15 sec, reduce: 9 sec > 4 M/R ==> map: 19 sec, shuffle: 37 sec, reduce: 2 sec > 8 M/R ==> map: 21 sec, shuffle: 1 min 10 sec, 1 sec > > To make sure it's not because of the combiner, I removed it and reran > everything again, and got the same bottom-line: With increasing maps and > reducers, running time goes up, with majority of time seeming to be in > sort/merge. > > Also, another thing we noticed is that the CPUs seem to be very active > during the map phase, but when the map phase reaches 100%, and only reduce > appears to be running, the CPUs all become idle. Furthermore, despite the > number of mappers I specify, all the CPUs become very active when a job is > running. Why is this so? If I specify 2 mappers and 2 reducers, won't there > be just 2 or 4 CPUs that should be active? Why are all 8 active? > > Since I can reproduce this error using Hadoop's standard word count > example, > I was hoping that someone else could tell me if they can reproduce this > too. 
> Is it true that when you increase the number of mappers and reducers on > your > systems, the running time of wordcount goes up? > > Thanks for the help! I'm looking forward to your responses. > > -SM > > On Thu, Mar 5, 2009 at 2:57 AM, Amareshwari Sriramadasu < > amar...@yahoo-inc.com> wrote: > > > Are you hitting HADOOP-2771? > > -Amareshwari > > > > Sandy wrote: > > > >> Hello all, > >> > >> For the sake of benchmarking, I ran the standard hadoop wordcount > example > >> on > >> an input file using 2, 4, and 8 mappers and reducers for my job. > >> In other words, I do: > >> > >> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 2 -r 2 > >> sample.txt output > >> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 4 -r 4 > >> sample.txt output2 > >> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 8 -r 8 > >> sample.txt output3 > >> > >> Strangely enough, when this increase in mappers and reducers result in > >> slower running times! > >> -On 2 mappers and reducers it ran for 40 seconds > >> on 4 mappers and reducers it ran for 60 seconds > >> on 8 mappers and reducers it ran for 90 seconds! > >> > >> Please note that the "sample.txt" file is identical in each of these > runs. > >> > >> I have the following questions: > >> - Shouldn't wordcount get -faster- with additional mappers and reducers, > >> instead of slower? > >> - If it does get faster for other people, why does it become slower for > >> me? > >> I am running hadoop on psuedo-distributed mode on a single 64-bit Mac > Pro > >> with 2 quad-core processors, 16 GB of RAM and 4 1TB HDs > >> > >> I would greatly appreciate it if someone could explain this behavior to > >> me, > >> and tell me if I'm running this wrong. How can I change my settings (if > at > >> all) to get wordcount running faster when i increases that number of > maps > >> and reduces? > >> > >> Thanks, > >> -SM > >> > >> > >> > > > > > -- http://daily.appspot.com/food/
Re: System Layout Best Practices
Hi David, I don't know if you've seen this already, but this might be of some help: http://hadoop.apache.org/core/docs/r0.18.3/cluster_setup.html Near the bottom, there is a section called "Real-World Cluster Configurations" with some sample configuration parameters that were used to run a very large sort benchmark. All the best, -SM On Thu, Mar 5, 2009 at 10:20 AM, David Ritch wrote: > Are there any published guidelines on system configuration for Hadoop? > > I've seen hardware suggestions, but I'm really interested in > recommendations > on disk layout and partitioning. The defaults, as shipped and defined in > hadoop-default.xml, may be appropriate for testing, but are not really > appropriate for sustained use. For example, data and metadata are both > stored in /tmp. In typical use on a cluster with a couple hundred nodes, > the NameNode can generate 3-5GB of logs per day. If you configure your > namenode host badly, it's easy to fill up the partition used by dfs for > metadata, and clobber your dfs filesystem. I would think that thresholding > logs on WARN would be preferable to INFO. > > On a datanode, we would like to reserve as much space as we can for data, > but we know that map-reduce jobs need some local storage. How do people > generally estimate the amount of space required for temporary storage? I > would assume that it would be good to partition it from data storage, to > prevent running out of temp space on some nodes. I would also think that > it > would be preferable for performance to have temp space on a different > spindle, so it and hdfs data can be accessed independently. > > I would be interested to know how other sites configure their systems, and > I > would love to see some guidelines for system configuration for Hadoop. > > Thank you! > > David >
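On David's separate-spindle point, one common approach is to point dfs.data.dir and mapred.local.dir at different mounts in hadoop-site.xml, so HDFS blocks and MapReduce temp space do not compete for the same disk. A hedged sketch — the mount paths are hypothetical, and a temp directory stands in for real disks:

```shell
# Stage a config that splits HDFS block storage and MapReduce local/temp
# space across two (here simulated) mount points.
root=$(mktemp -d)
mkdir -p "$root/disk1/dfs/data" "$root/disk2/mapred/local"
cat > "$root/hadoop-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>/disk1/dfs/data</value>   <!-- hypothetical mount for HDFS blocks -->
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/disk2/mapred/local</value>   <!-- hypothetical mount for temp space -->
  </property>
</configuration>
EOF
```

Both properties also accept comma-separated lists, which is how multiple spindles per node are usually exploited.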
Re: wordcount getting slower with more mappers and reducers?
I used three different sample.txt files, and was able to replicate the error. The first was 1.5MB, the second 66MB, and the last 428MB. I get the same problem despite what size of input file I use: the running time of wordcount increases with the number of mappers and reducers specified. If it is the problem of the input file, how big do I have to go before it disappears entirely? If it is pseudo-distributed mode that's the issue, what mode should I be running on my machine, given its specs? Once again, it is a SINGLE MacPro with 16GB of RAM, 4 1TB hard disks, and 2 quad-core processors. I'm not sure if it's HADOOP-2771, since the sort/merge(shuffle) is what seems to be taking the longest: 2 M/R ==> map: 18 sec, shuffle: 15 sec, reduce: 9 sec 4 M/R ==> map: 19 sec, shuffle: 37 sec, reduce: 2 sec 8 M/R ==> map: 21 sec, shuffle: 1 min 10 sec, reduce: 1 sec To make sure it's not because of the combiner, I removed it and reran everything again, and got the same bottom-line: With increasing maps and reducers, running time goes up, with majority of time seeming to be in sort/merge. Also, another thing we noticed is that the CPUs seem to be very active during the map phase, but when the map phase reaches 100%, and only reduce appears to be running, the CPUs all become idle. Furthermore, despite the number of mappers I specify, all the CPUs become very active when a job is running. Why is this so? If I specify 2 mappers and 2 reducers, won't there be just 2 or 4 CPUs that should be active? Why are all 8 active? Since I can reproduce this error using Hadoop's standard word count example, I was hoping that someone else could tell me if they can reproduce this too. Is it true that when you increase the number of mappers and reducers on your systems, the running time of wordcount goes up? Thanks for the help! I'm looking forward to your responses. -SM On Thu, Mar 5, 2009 at 2:57 AM, Amareshwari Sriramadasu < amar...@yahoo-inc.com> wrote: > Are you hitting HADOOP-2771? 
> -Amareshwari > > Sandy wrote: > >> Hello all, >> >> For the sake of benchmarking, I ran the standard hadoop wordcount example >> on >> an input file using 2, 4, and 8 mappers and reducers for my job. >> In other words, I do: >> >> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 2 -r 2 >> sample.txt output >> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 4 -r 4 >> sample.txt output2 >> time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 8 -r 8 >> sample.txt output3 >> >> Strangely enough, when this increase in mappers and reducers result in >> slower running times! >> -On 2 mappers and reducers it ran for 40 seconds >> on 4 mappers and reducers it ran for 60 seconds >> on 8 mappers and reducers it ran for 90 seconds! >> >> Please note that the "sample.txt" file is identical in each of these runs. >> >> I have the following questions: >> - Shouldn't wordcount get -faster- with additional mappers and reducers, >> instead of slower? >> - If it does get faster for other people, why does it become slower for >> me? >> I am running hadoop on psuedo-distributed mode on a single 64-bit Mac Pro >> with 2 quad-core processors, 16 GB of RAM and 4 1TB HDs >> >> I would greatly appreciate it if someone could explain this behavior to >> me, >> and tell me if I'm running this wrong. How can I change my settings (if at >> all) to get wordcount running faster when i increases that number of maps >> and reduces? >> >> Thanks, >> -SM >> >> >> > >
Re: Hadoop AMI for EC2
Hi All, I am trying to log map reduce jobs in HADOOP_LOG_DIR by setting its value in hadoop-env.sh, but the directory has no log records when the job finishes running. I am adding JobConf.setProfileEnabled(true) in my job. Can anyone point out how to get logging working in Hadoop? Thanks, Richa On Thu, Mar 5, 2009 at 8:20 AM, Richa Khandelwal wrote: > That's pretty cool. Thanks > > > On Thu, Mar 5, 2009 at 8:17 AM, tim robertson > wrote: >> Yeps, >> >> A good starting read: http://wiki.apache.org/hadoop/AmazonEC2 >> >> These are the AMIs: >> >> [AMI listing snipped] >> >> Cheers, >> >> Tim >> >> On Thu, Mar 5, 2009 at 5:13 PM, Richa Khandelwal >> wrote: >> > Hi All, >> > Is there an existing Hadoop AMI for EC2 which has Hadoop set up on it? >> > >> > Thanks, >> > Richa Khandelwal >> > >> > University Of California, >> > Santa Cruz. >> > Ph:425-241-7763 >> > > > -- > Richa Khandelwal > > University Of California, > Santa Cruz. > Ph:425-241-7763 > -- Richa Khandelwal University Of California, Santa Cruz. Ph:425-241-7763
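On the logging question above: note that JobConf.setProfileEnabled(true) controls task profiling output rather than the ordinary logs, and HADOOP_LOG_DIR is only read when the daemons start. A minimal sketch, with the log directory path being an assumption to adjust for your install:

```shell
# Hypothetical log directory; pick any path the hadoop user can write to.
HADOOP_LOG_DIR="${TMPDIR:-/tmp}/hadoop-logs"
export HADOOP_LOG_DIR
mkdir -p "$HADOOP_LOG_DIR"

# The daemons only pick the variable up on (re)start, so after editing
# conf/hadoop-env.sh you must bounce them, e.g.:
#   bin/stop-all.sh && bin/start-all.sh
echo "daemon and task logs should land under $HADOOP_LOG_DIR"
```

Per-task logs (stdout, stderr, syslog) also appear under the userlogs subdirectory of this path, and are easiest to browse via the JobTracker web UI.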
Re: Hadoop AMI for EC2
That's pretty cool. Thanks On Thu, Mar 5, 2009 at 8:17 AM, tim robertson wrote: > Yeps, > > A good starting read: http://wiki.apache.org/hadoop/AmazonEC2 > > These are the AMIs: > > [AMI listing snipped] > > Cheers, > > Tim > > On Thu, Mar 5, 2009 at 5:13 PM, Richa Khandelwal > wrote: > > Hi All, > > Is there an existing Hadoop AMI for EC2 which has Hadoop set up on it? > > > > Thanks, > > Richa Khandelwal > > > > University Of California, > > Santa Cruz. > > Ph:425-241-7763 > > > -- Richa Khandelwal University Of California, Santa Cruz. Ph:425-241-7763
System Layout Best Practices
Are there any published guidelines on system configuration for Hadoop? I've seen hardware suggestions, but I'm really interested in recommendations on disk layout and partitioning.

The defaults, as shipped and defined in hadoop-default.xml, may be appropriate for testing, but are not really appropriate for sustained use. For example, data and metadata are both stored in /tmp. In typical use on a cluster with a couple hundred nodes, the NameNode can generate 3-5GB of logs per day. If you configure your namenode host badly, it's easy to fill up the partition used by dfs for metadata, and clobber your dfs filesystem. I would think that thresholding logs at WARN would be preferable to INFO.

On a datanode, we would like to reserve as much space as we can for data, but we know that map-reduce jobs need some local storage. How do people generally estimate the amount of space required for temporary storage? I would assume that it would be good to partition it from data storage, to prevent running out of temp space on some nodes. I would also think that it would be preferable for performance to have temp space on a different spindle, so it and hdfs data can be accessed independently.

I would be interested to know how other sites configure their systems, and I would love to see some guidelines for system configuration for Hadoop. Thank you! David
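One way to implement the separation David describes (metadata, block data, and map-reduce temp space on distinct partitions, off /tmp) is to override the relevant properties in hadoop-site.xml. A sketch, where every path is an illustrative assumption rather than a recommendation:

```xml
<!-- Sketch: keep NameNode metadata, DataNode blocks, and map-reduce
     temp space off /tmp and on separate partitions/spindles.
     Comma-separated values mean redundant copies for dfs.name.dir
     and round-robin across disks for dfs.data.dir. -->
<property>
  <name>dfs.name.dir</name>
  <value>/meta1/dfs/name,/meta2/dfs/name</value>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/data1/dfs/data,/data2/dfs/data</value>
</property>
<property>
  <name>mapred.local.dir</name>
  <value>/scratch/mapred/local</value>
</property>
<property>
  <name>dfs.datanode.du.reserved</name>
  <!-- reserve 10 GB per volume for non-DFS use (bytes) -->
  <value>10737418240</value>
</property>
```

Reducing the NameNode log volume is a separate change: the threshold is set in conf/log4j.properties (lowering the root logger from INFO to WARN), not in hadoop-site.xml.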
Re: Hadoop AMI for EC2
Yeps,

A good starting read: http://wiki.apache.org/hadoop/AmazonEC2

These are the AMIs:

$ ec2-describe-images -a | grep hadoop
IMAGE  ami-245db94d  cloudbase-1.1-hadoop-fc64/image.manifest.xml  247610401714  available  public  x86_64  machine
IMAGE  ami-791ffb10  cloudbase-hadoop-fc64/cloudbase-hadoop-fc64.manifest.xml  247610401714  available  public  x86_64  machine
IMAGE  ami-f73adf9e  cs345-hadoop-EC2-0.15.3/hadoop-0.15.3.manifest.xml  825431212034  available  public  i386  machine
IMAGE  ami-c55db8ac  fedora8-hypertable-hadoop-kfs/image.manifest.xml  291354417104  available  public  x86_64  machine  aki-b51cf9dc  ari-b31cf9da
IMAGE  ami-ce6b8fa7  hachero-hadoop/hadoop-0.19.0-i386.manifest.xml  118946012109  available  public  i386  machine  aki-a71cf9ce  ari-a51cf9cc
IMAGE  ami-dd48acb4  hachero-hadoop/hadoop-0.19.0-x86_64.manifest.xml  118946012109  available  public  x86_64  machine  aki-b51cf9dc  ari-b31cf9da
IMAGE  ami-ee53b687  hadoop-ec2-images/hadoop-0.17.0-i386.manifest.xml  111560892610  available  public  i386  machine  aki-a71cf9ce  ari-a51cf9cc
IMAGE  ami-f853b691  hadoop-ec2-images/hadoop-0.17.0-x86_64.manifest.xml  111560892610  available  public  x86_64  machine  aki-b51cf9dc  ari-b31cf9da
IMAGE  ami-65987c0c  hadoop-images/hadoop-0.17.1-i386.manifest.xml  914733919441  available  public  i386  machine  aki-a71cf9ce  ari-a51cf9cc
IMAGE  ami-4b987c22  hadoop-images/hadoop-0.17.1-x86_64.manifest.xml  914733919441  available  public  x86_64  machine  aki-b51cf9dc  ari-b31cf9da
IMAGE  ami-b0fe1ad9  hadoop-images/hadoop-0.18.0-i386.manifest.xml  914733919441  available  public  i386  machine  aki-a71cf9ce  ari-a51cf9cc
IMAGE  ami-90fe1af9  hadoop-images/hadoop-0.18.0-x86_64.manifest.xml  914733919441  available  public  x86_64  machine  aki-b51cf9dc  ari-b31cf9da
IMAGE  ami-ea36d283  hadoop-images/hadoop-0.18.1-i386.manifest.xml  914733919441  available  public  i386  machine  aki-a71cf9ce  ari-a51cf9cc
IMAGE  ami-fe37d397  hadoop-images/hadoop-0.18.1-x86_64.manifest.xml  914733919441  available  public  x86_64  machine  aki-b51cf9dc  ari-b31cf9da
IMAGE  ami-fa6a8e93  hadoop-images/hadoop-0.19.0-i386.manifest.xml  914733919441  available  public  i386  machine  aki-a71cf9ce  ari-a51cf9cc
IMAGE  ami-cd6a8ea4  hadoop-images/hadoop-0.19.0-x86_64.manifest.xml  914733919441  available  public  x86_64  machine  aki-b51cf9dc  ari-b31cf9da
IMAGE  ami-15e80f7c  hadoop-images/hadoop-base-20090210-i386.manifest.xml  914733919441  available  public  i386  machine  aki-a71cf9ce  ari-a51cf9cc
IMAGE  ami-1ee80f77  hadoop-images/hadoop-base-20090210-x86_64.manifest.xml  914733919441  available  public  x86_64  machine  aki-b51cf9dc  ari-b31cf9da
IMAGE  ami-4de30724  hbase-ami/hbase-0.2.0-hadoop-0.17.1-i386.manifest.xml  834125115996  available  public  i386  machine  aki-a71cf9ce  ari-a51cf9cc
IMAGE  ami-fe7c9997  radlab-hadoop-4-large/image.manifest.xml  117716615155  available  public  x86_64  machine
IMAGE  ami-7f7f9a16  radlab-hadoop-4/image.manifest.xml  117716615155  available  public  i386  machine
$

Cheers,

Tim

On Thu, Mar 5, 2009 at 5:13 PM, Richa Khandelwal wrote: > Hi All, > Is there an existing Hadoop AMI for EC2 which has Hadoop set up on it? > > Thanks, > Richa Khandelwal > > > University Of California, > Santa Cruz. > Ph:425-241-7763 >
Re: Hadoop AMI for EC2
Hi Richa, Yes, there is. Please see http://wiki.apache.org/hadoop/AmazonEC2. Tom On Thu, Mar 5, 2009 at 4:13 PM, Richa Khandelwal wrote: > Hi All, > Is there an existing Hadoop AMI for EC2 which has Hadoop set up on it? > > Thanks, > Richa Khandelwal > > > University Of California, > Santa Cruz. > Ph:425-241-7763 >
Re: contrib EC2 with hadoop 0.17
I haven't used Eucalyptus, but you could start by trying out the Hadoop EC2 scripts (http://wiki.apache.org/hadoop/AmazonEC2) with your Eucalyptus installation. Cheers, Tom On Tue, Mar 3, 2009 at 2:51 PM, falcon164 wrote: > > I am new to hadoop. I want to run hadoop on eucalyptus. Please let me know > how to do this. > -- > View this message in context: > http://www.nabble.com/contrib-EC2-with-hadoop-0.17-tp17711758p22310068.html > Sent from the Hadoop core-user mailing list archive at Nabble.com. > >
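For reference, the contrib EC2 scripts mentioned above live under src/contrib/ec2 in the Hadoop tree and are driven by credentials exported in the environment (or in hadoop-ec2-env.sh). A sketch of the general shape; every value below is a placeholder, and whether the scripts work unmodified against a Eucalyptus endpoint is exactly the open question in this thread:

```shell
# Placeholder credentials; never commit real keys.
export AWS_ACCESS_KEY_ID="your-access-key"
export AWS_SECRET_ACCESS_KEY="your-secret-key"

CLUSTER=test-cluster
NUM_WORKERS=2

# The actual calls, per the wiki page (commented out here):
#   bin/hadoop-ec2 launch-cluster "$CLUSTER" "$NUM_WORKERS"
#   bin/hadoop-ec2 login "$CLUSTER"
echo "would launch $CLUSTER with $NUM_WORKERS workers"
```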
Re: Running 0.19.2 branch in production before release
Aaron Kimball wrote: I recommend 0.18.3 for production use and avoid the 19 branch entirely. If your priority is stability, then stay a full minor version behind, not just a revision. Of course, if everyone stays that far behind, they don't get to find the bugs for other people.

* If you play with the latest releases early, while they are in the beta phase, you will encounter the problems specific to your applications/datacentres, and get them fixed fast.
* If you work with stuff further back, you get stability, but not only are you behind on features, you can't be sure that all "fixes" that matter to you get pushed back.
* If you plan on making changes, or adding features, get onto SVN_HEAD.
* If you want to catch changes being made that break your site, SVN_HEAD. Better yet, have a private Hudson server checking out SVN_HEAD Hadoop, *then* building and testing your app against it.

Normally I work with stable releases of things I don't depend on, and SVN_HEAD of OSS stuff whose code I have any intent to change; there is a price (merge time, the odd change breaking your code) but you get to make changes that help you long term. Where Hadoop is different is that it is a filesystem, and you don't want to hit bugs that delete files that matter. I'm only bringing up transient clusters on VMs, pulling in data from elsewhere, so this isn't an issue. All that remains is changing APIs. -Steve
Re: wordcount getting slower with more mappers and reducers?
Are you hitting HADOOP-2771? -Amareshwari

Sandy wrote: Hello all, For the sake of benchmarking, I ran the standard hadoop wordcount example on an input file using 2, 4, and 8 mappers and reducers for my job. In other words, I do:

time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 2 -r 2 sample.txt output
time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 4 -r 4 sample.txt output2
time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 8 -r 8 sample.txt output3

Strangely enough, this increase in mappers and reducers results in slower running times!

On 2 mappers and reducers it ran for 40 seconds
On 4 mappers and reducers it ran for 60 seconds
On 8 mappers and reducers it ran for 90 seconds!

Please note that the "sample.txt" file is identical in each of these runs. I have the following questions:

- Shouldn't wordcount get -faster- with additional mappers and reducers, instead of slower?
- If it does get faster for other people, why does it become slower for me?

I am running hadoop in pseudo-distributed mode on a single 64-bit Mac Pro with 2 quad-core processors, 16 GB of RAM and 4 1TB HDs. I would greatly appreciate it if someone could explain this behavior to me, and tell me if I'm running this wrong. How can I change my settings (if at all) to get wordcount running faster when I increase the number of maps and reduces? Thanks, -SM