Re: Spilled Records
Thank you Saurabh, but the following settings didn't change the # of spilled records:

conf.set("mapred.job.shuffle.merge.percent", ".9");  // instead of .66
conf.set("mapred.inmem.merge.threshold", "1000");    // instead of 1000

Is it because my memory is 4GB? I'm using the pseudo-distributed mode.

Thank you,
Maha

On Feb 21, 2011, at 7:46 PM, Saurabh Dutta wrote:

> Hi Maha,
>
> Spilled records have to do with the transient data written during the map and
> reduce operations. Note that it's not just the map operations that generate
> spilled records. When the in-memory buffer (controlled by
> mapred.job.shuffle.merge.percent) runs out or reaches the threshold number of
> map outputs (mapred.inmem.merge.threshold), it is merged and spilled to disk.
>
> You are going in the right direction by tuning the io.sort.mb parameter;
> try increasing it further. If it still doesn't work out, try
> io.sort.factor and fs.inmemory.size.mb. Also try the other two variables
> that I mentioned earlier.
>
> Let us know what worked for you.
>
> Sincerely,
> Saurabh Dutta
> Impetus Infotech India Pvt. Ltd.,
> Sarda House, 24-B, Palasia, A.B. Road, Indore - 452 001
> Phone: +91-731-4269200 4623
> Fax: +91-731-4071256
> Email: saurabh.du...@impetus.co.in
> www.impetus.com
>
> From: maha [m...@umail.ucsb.edu]
> Sent: Tuesday, February 22, 2011 8:21 AM
> To: common-user
> Subject: Spilled Records
>
> Hello everyone,
>
> Do "spilled records" mean that the sort buffer is not large enough to sort
> all the input records, so some records are written to local disk?
>
> If so: I tried setting io.sort.mb from the default 100 to 200 and there
> was still the same # of spilled records. Why?
>
> Might changing io.sort.record.percent to .9 instead of .8 produce
> unexpected exceptions?
>
> Thank you,
> Maha
>
> Impetus to Present Big Data -- Analytics Solutions and Strategies at TDWI
> World Conference (Feb 13-18) in Las Vegas. We are also bringing cloud experts
> together at CloudCamp, Delhi on Feb 12. CloudCamp is an unconference where
> early adopters of Cloud Computing technologies exchange ideas.
>
> Click http://www.impetus.com to know more.
>
> NOTE: This message may contain information that is confidential, proprietary,
> privileged or otherwise protected by law. The message is intended solely for
> the named addressee. If received in error, please destroy and notify the
> sender. Any use of this email is prohibited when received in error. Impetus
> does not represent, warrant and/or guarantee that the integrity of this
> communication has been maintained nor that the communication is free of
> errors, virus, interception or interference.
Re: multiple hadoop instances on same cluster
Make sure the instances' ports aren't conflicting and all directories (NN, JT, etc.) are unique. That should do it.
--
Take care,
Konstantin (Cos) Boudnik

On Mon, Feb 21, 2011 at 20:09, Gang Luo wrote:
> Hello folks,
> I am trying to run multiple Hadoop instances on the same cluster and find it
> hard to make them share. First I tried two instances, each running with the
> same master and slaves; only one of them would work. Then I tried to divide
> the cluster so that Hadoop 1 uses machines 0-9 and Hadoop 2 uses machines
> 10-19. Still, only one of them works. The HDFS of the second instance is
> working well, but start-mapred.sh results in the exception
> "java.io.IOException: Connection reset by peer" in the log.
>
> Any ideas on this, or suggestions on how to run multiple Hadoop instances on
> one cluster? I can totally divide up the cluster so that different instances
> run on different sets of machines.
>
> Thanks.
>
> -Gang
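For concreteness, here is a rough sketch of the kind of per-instance overrides Cos is describing for a second instance's conf directory. The property names are the 0.20-era ones; all host names, port numbers, and paths below are made-up examples, not recommendations.

```xml
<!-- Example overrides for a SECOND Hadoop instance (0.20-era property
     names; hosts, ports, and paths are illustrative only). -->
<configuration>
  <!-- core-site.xml: a NameNode port different from instance 1's -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:9100</value>
  </property>
  <!-- hdfs-site.xml: on-disk directories unique to this instance -->
  <property>
    <name>dfs.name.dir</name>
    <value>/data/hadoop2/dfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/data/hadoop2/dfs/data</value>
  </property>
  <!-- mapred-site.xml: a JobTracker port different from instance 1's -->
  <property>
    <name>mapred.job.tracker</name>
    <value>master:9101</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/data/hadoop2/mapred/local</value>
  </property>
</configuration>
```

The web-UI and slave daemon ports (dfs.http.address, mapred.job.tracker.http.address, and the datanode/tasktracker ports) also need to differ wherever two instances share a machine. A "Connection reset by peer" from start-mapred.sh is often a symptom of the second instance's daemons talking to ports already owned by the first.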
multiple hadoop instances on same cluster
Hello folks,

I am trying to run multiple Hadoop instances on the same cluster and find it hard to make them share. First I tried two instances, each running with the same master and slaves; only one of them would work. Then I tried to divide the cluster so that Hadoop 1 uses machines 0-9 and Hadoop 2 uses machines 10-19. Still, only one of them works. The HDFS of the second instance is working well, but start-mapred.sh results in the exception "java.io.IOException: Connection reset by peer" in the log.

Any ideas on this, or suggestions on how to run multiple Hadoop instances on one cluster? I can totally divide up the cluster so that different instances run on different sets of machines.

Thanks.

-Gang
RE: Spilled Records
Hi Maha,

Spilled records have to do with the transient data written during the map and reduce operations. Note that it's not just the map operations that generate spilled records. When the in-memory buffer (controlled by mapred.job.shuffle.merge.percent) runs out or reaches the threshold number of map outputs (mapred.inmem.merge.threshold), it is merged and spilled to disk.

You are going in the right direction by tuning the io.sort.mb parameter; try increasing it further. If it still doesn't work out, try io.sort.factor and fs.inmemory.size.mb. Also try the other two variables that I mentioned earlier.

Let us know what worked for you.

Sincerely,
Saurabh Dutta
Impetus Infotech India Pvt. Ltd.,
Sarda House, 24-B, Palasia, A.B. Road, Indore - 452 001
Phone: +91-731-4269200 4623
Fax: +91-731-4071256
Email: saurabh.du...@impetus.co.in
www.impetus.com

From: maha [m...@umail.ucsb.edu]
Sent: Tuesday, February 22, 2011 8:21 AM
To: common-user
Subject: Spilled Records

Hello everyone,

Do "spilled records" mean that the sort buffer is not large enough to sort all the input records, so some records are written to local disk?

If so: I tried setting io.sort.mb from the default 100 to 200 and there was still the same # of spilled records. Why?

Might changing io.sort.record.percent to .9 instead of .8 produce unexpected exceptions?

Thank you,
Maha
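To make the two knobs above concrete, here is a toy model (plain Java, not Hadoop's actual code) of the reduce-side trigger: an in-memory merge fires when buffered bytes pass merge.percent of the shuffle buffer, or when the count of buffered map outputs reaches inmem.merge.threshold, whichever comes first.

```java
// Simplified model of the reduce-side in-memory merge trigger. This
// mirrors the *idea* behind mapred.job.shuffle.merge.percent and
// mapred.inmem.merge.threshold; it is not Hadoop's implementation.
public class MergeTriggerSketch {
    final long shuffleBufferBytes;   // size of the in-memory shuffle buffer
    final double mergePercent;       // mapred.job.shuffle.merge.percent
    final int mergeThreshold;        // mapred.inmem.merge.threshold

    long bytesInMemory = 0;
    int segmentsInMemory = 0;
    int merges = 0;

    MergeTriggerSketch(long bufferBytes, double mergePercent, int threshold) {
        this.shuffleBufferBytes = bufferBytes;
        this.mergePercent = mergePercent;
        this.mergeThreshold = threshold;
    }

    void receiveMapOutput(long bytes) {
        bytesInMemory += bytes;
        segmentsInMemory++;
        // Either condition triggers a merge-and-spill to disk.
        if (bytesInMemory >= mergePercent * shuffleBufferBytes
                || segmentsInMemory >= mergeThreshold) {
            merges++;
            bytesInMemory = 0;
            segmentsInMemory = 0;
        }
    }
}
```

One possible reading of Maha's result under this model: with many small map outputs, the segment-count condition fires long before the byte condition, in which case raising merge.percent from .66 to .9 changes nothing. Note also that the posted snippet sets mapred.inmem.merge.threshold to 1000, which (if 1000 is the default, as I recall for this era) is a no-op.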
Spilled Records
Hello everyone,

Do "spilled records" mean that the sort buffer is not large enough to sort all the input records, so some records are written to local disk?

If so: I tried setting io.sort.mb from the default 100 to 200 and there was still the same # of spilled records. Why?

Might changing io.sort.record.percent to .9 instead of .8 produce unexpected exceptions?

Thank you,
Maha
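For intuition, a back-of-the-envelope model (plain Java, not Hadoop code) of the 0.20-era map-side sort buffer: io.sort.record.percent carves an accounting region out of io.sort.mb, each record costs 16 bytes of accounting metadata, and a spill begins when a region passes io.sort.spill.percent full. The 16-byte figure is my understanding of the 0.20 internals; treat it as an assumption.

```java
// Rough model of the map-side sort buffer. Illustrative only.
public class SortBufferSketch {
    static final int ACCT_BYTES_PER_RECORD = 16;

    // Max records before the *accounting* region alone forces a spill.
    static long recordsBeforeSpill(int ioSortMb, double recordPercent,
                                   double spillPercent) {
        long acctBytes = (long) (ioSortMb * 1024L * 1024L * recordPercent);
        return (long) (acctBytes * spillPercent) / ACCT_BYTES_PER_RECORD;
    }
}
```

With the defaults (100 MB, .05, .80) the accounting region alone forces a spill roughly every 262,144 records, so record-heavy maps spill no matter how big the data region is; raising io.sort.record.percent toward .9 should, as far as I can tell, shrink the data region rather than throw exceptions. One more point that may explain Maha's unchanged counter: every map output record is written to disk at least once for the final merge, so "Spilled Records" can never drop below the number of map output records; tuning only removes extra re-spills.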
Re: how many output files can MultipleOutputs support?
Hi,

I think the third error pattern is not caused by the xcievers setting:

org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#5
        at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:124)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:362)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:742)
        at org.apache.hadoop.mapred.Child.main(Child.java:211)
Caused by: java.lang.OutOfMemoryError: Java heap space
        at org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:58)
        at org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:45)
        at org.apache.hadoop.mapreduce.task.reduce.MapOutput.<init>(MapOutput.java:104)
        at org.apache.hadoop.mapreduce.task.reduce.MergeManager.unconditionalReserve(MergeManager.java:267)
        at org.apache.hadoop.mapreduce.task.re

According to Google, this is caused by wrong IP entries for one of the cluster's machines, but I've checked several times and the IP addresses of my cluster are normal. My cluster size is 9 (1 master, 8 slaves).

This is my mapred-site.xml:

mapreduce.job.tracker = thadpm01.scast:54311
    (The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.)
mapreduce.jobtracker.taskscheduler = org.apache.hadoop.mapred.FairScheduler
mapreduce.child.java.opts = -Xmx1024m (final)
mapreduce.map.java.opts = -Xmx1024m (final)
mapreduce.reduce.java.opts = -Xmx1024m (final)
mapreduce.tasktracker.map.tasks.maximum = 83 (final)
mapreduce.tasktracker.reduce.tasks.maximum = 11 (final)
mapreduce.jobtracker.handler.count = 20 (final)
mapreduce.reduce.shuffle.parallelcopies = 10 (final)
mapreduce.task.io.sort.factor = 100 (final)
mapreduce.task.io.sort.mb = 400 (final)

Error log on stdout:

attempt_201102181827_0113_r_00_1: 2011-02-22 10:24:28[WARN ][Child.java]main()(234) : Exception running child : org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#8
attempt_201102181827_0113_r_00_1: at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:124)
attempt_201102181827_0113_r_00_1: at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:362)
attempt_201102181827_0113_r_00_1: at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
attempt_201102181827_0113_r_00_1: at java.security.AccessController.doPrivileged(Native Method)
attempt_201102181827_0113_r_00_1: at javax.security.auth.Subject.doAs(Subject.java:396)
attempt_201102181827_0113_r_00_1: at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:742)
attempt_201102181827_0113_r_00_1: at org.apache.hadoop.mapred.Child.main(Child.java:211)
attempt_201102181827_0113_r_00_1: Caused by: java.lang.OutOfMemoryError: Java heap space
attempt_201102181827_0113_r_00_1: at org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:58)
attempt_201102181827_0113_r_00_1: at org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:45)
attempt_201102181827_0113_r_00_1: at org.apache.hadoop.mapreduce.task.reduce.MapOutput.<init>(MapOutput.java:104)
attempt_201102181827_0113_r_00_1: at org.apache.hadoop.mapreduce.task.reduce.MergeManager.unconditionalReserve(MergeManager.java:267)
attempt_201102181827_0113_r_00_1: at org.apache.hadoop.mapreduce.task.reduce.MergeManager.reserve(MergeManager.java:257)
attempt_201102181827_0113_r_00_1: at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyMapOutput(Fetcher.java:305)
attempt_201102181827_0113_r_00_1: at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:251)
attempt_201102181827_0113_r_00_1: at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:149)
attempt_201102181827_0113_r_00_1: 2011-02-22 10:24:28[INFO ][Task.java]taskCleanup()(996) : Runnning cleanup for the task
11/02/22 10:24:44 INFO mapreduce.Job: map 21% reduce 0%
11/02/22 10:24:54 INFO mapreduce.Job: map 22% reduce 0%

Thanks.

Junyoung Kim (juneng...@gmail.com)

On 02/21/2011 10:47 AM, Yifeng Jiang wrote:
We were using 0.20.2 when the issue occurred; then we set it to 2048, and the failure was fixed. Now we are using 0.20-append (HBase requires it), and it works well too.

On 2011/02/21 10:35, Jun Young Kim wrote:
Hi, Yifeng. Could I know which version of Hadoop you are using? Thanks for your response.

Junyoung Kim (juneng...@gmail.com)

On 02/21/2011 10:28 AM, Yifeng Jiang wrote:
Hi,
We have met the same issue. It seems that this error occurs when the threads connected to the Datanode reach the maximum
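This fetcher OOM pattern often has less to do with DNS/IP entries than with reduce-side shuffle memory. Here is a rough sketch (plain Java, not Hadoop code) of the arithmetic: the in-memory shuffle buffer is roughly shuffle.input.buffer.percent of the reducer heap, and map outputs larger than a single-shuffle fraction of that buffer go straight to disk. The defaults used below (~0.70 for mapred.job.shuffle.input.buffer.percent, ~0.25 for the single-shuffle limit) are my recollection of 0.20/0.21-era Hadoop; treat them as assumptions and verify against your build.

```java
// Rough estimate of reduce-side shuffle memory pressure.
// Not Hadoop code; property names/defaults in the comments are
// recollections of the 0.20/0.21 era and may differ in your build.
public class ShuffleMemorySketch {
    // Bytes available for in-memory map outputs in a reducer.
    static long shuffleBufferBytes(long heapBytes, double inputBufferPercent) {
        return (long) (heapBytes * inputBufferPercent);
    }

    // Whether a single map output is small enough to be held in memory.
    static boolean fitsInMemory(long mapOutputBytes, long shuffleBuffer,
                                double singleShuffleLimit) {
        return mapOutputBytes < (long) (shuffleBuffer * singleShuffleLimit);
    }
}
```

With -Xmx1024m and those assumed defaults, the buffer would be roughly 717 MB and any single map output over roughly 179 MB would bypass memory; if OOMs persist, lowering the input-buffer percent or raising the reducer heap are the usual levers. Separately, mapreduce.tasktracker.map.tasks.maximum = 83 with 1 GB per child task looks like it could oversubscribe a slave's RAM, which is worth checking here.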
Re: benchmark choices
I wonder what companies like Amazon, Cloudera, Rackspace, Facebook, Yahoo etc. look at for the purpose of benchmarking. I guess GridMix v3 might be of more interest to Yahoo. I would appreciate it if someone could comment more on this.

Thanks,
-Shrinivas

On Fri, Feb 18, 2011 at 4:50 PM, Konstantin Boudnik wrote:
> On Fri, Feb 18, 2011 at 14:35, Ted Dunning wrote:
> > I just read the MalStone report. They report times for a Java version that
> > is many (5x) times slower than a streaming implementation. That single
> > fact indicates that the Java code is so appallingly bad that this is a very
> > bad benchmark.
>
> Slow Java code? That's funny ;) Running with HotSpot on by any chance?
>
> > On Fri, Feb 18, 2011 at 2:27 PM, Jim Falgout wrote:
> >
> >> We use MalStone and TeraSort. For Hive, you can use TPC-H, at least the
> >> data and the queries, if not the query generator. There is a Jira issue in
> >> Hive that discusses the TPC-H "benchmark" if you're interested. Sorry, I
> >> don't remember the issue number offhand.
> >>
> >> -----Original Message-----
> >> From: Shrinivas Joshi [mailto:jshrini...@gmail.com]
> >> Sent: Friday, February 18, 2011 3:32 PM
> >> To: common-user@hadoop.apache.org
> >> Subject: benchmark choices
> >>
> >> Which workloads are used for serious benchmarking of Hadoop clusters? Do
> >> you care about any of the following workloads:
> >> TeraSort; GridMix v1, v2, or v3; MalStone; CloudBurst; MRBench; NNBench;
> >> sample apps shipped with the Hadoop distro like PiEstimator, dbcount, etc.
> >>
> >> Thanks,
> >> -Shrinivas
Re: ObjectWritable
Thank you for the explanation. Avro is a good serialization tool. I haven't looked at the code yet, but I will probably dig into it very soon.

On Mon, Feb 21, 2011 at 10:20 AM, Harsh J wrote:
> Hello,
>
> On Mon, Feb 21, 2011 at 9:33 PM, Weishung Chung wrote:
> > What is the main use of org.apache.hadoop.io.ObjectWritable ? Thank you :)
>
> To use any primitive Java object as a Writable without requiring it to
> implement that interface. When serializing, it writes out a class name for
> every type of object you put into it along with the object itself, so that
> it can deserialize properly.
>
> Maybe not so offtopic: the more I see Writables being used, the more I
> feel like promoting the use of Apache's Avro instead.
>
> --
> Harsh J
> www.harshj.com
Re: Quick question
How then can I produce one output file per mapper, not per map task?

Thank you,
Maha

On Feb 20, 2011, at 10:22 PM, Ted Dunning wrote:
> This is the most important thing that you have said. The map function
> is called once per unit of input, but the mapper object persists across
> many units of input.
>
> You have a little bit of control over how many mapper objects there
> are, how many machines they are created on, and how many pieces your
> input is broken into. That control is limited, however, unless you
> build your own input format. The standard input formats are optimized
> for very large inputs and may not give you the flexibility that you
> want for your experiments. That is unfortunate for the purpose of
> learning about Hadoop, but Hadoop is designed mostly for dealing with
> very large data and isn't usually designed to be easy to understand.
> Where easy coincides with powerful, easy is good, but powerful
> isn't always easy.
>
> On Sunday, February 20, 2011, maha wrote:
>> So first question: is there a difference between Mappers and maps ?
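Ted's distinction can be shown with a toy harness (plain Java, not Hadoop's API): one mapper object services many map() calls. The common trick for one output per mapper *object* is to buffer inside the object and emit once at the end, which in real Hadoop would be done from the Mapper's close() (old API) or cleanup() (new API).

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration: ONE mapper object, MANY map() calls. To get one
// output per mapper object (not per record), buffer in the object and
// flush once at teardown. This is not Hadoop code, just the pattern.
public class PerMapperOutputSketch {
    static class Mapper {
        final List<String> buffered = new ArrayList<>();
        int mapCalls = 0;

        void map(String record) {   // invoked once per input record
            mapCalls++;
            buffered.add(record.toUpperCase());
        }

        String close() {            // invoked once per mapper object
            return String.join(",", buffered);
        }
    }
}
```

In an actual job, map() would save the OutputCollector/Context rather than a String, and close()/cleanup() would write the single combined record through it.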
measure the time taken by stragglers
Hi,

Is there a way to measure the execution time of straggler and non-straggler tasks separately in Hadoop MapReduce?

-bikash
Re: ObjectWritable
Hello,

On Mon, Feb 21, 2011 at 9:33 PM, Weishung Chung wrote:
> What is the main use of org.apache.hadoop.io.ObjectWritable ? Thank you :)

To use any primitive Java object as a Writable without requiring it to implement that interface. When serializing, it writes out a class name for every type of object you put into it along with the object itself, so that it can deserialize properly.

Maybe not so offtopic: the more I see Writables being used, the more I feel like promoting the use of Apache's Avro instead.

--
Harsh J
www.harshj.com
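For illustration, here is a simplified, self-contained sketch of the idea Harsh describes: prefix every serialized value with its class name so the reader knows what to instantiate. Hadoop's real ObjectWritable handles primitives, arrays, nulls, and arbitrary Writables; this toy covers only two types.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Simplified sketch of the ObjectWritable idea: write the class name
// before each value. (Hadoop's actual ObjectWritable is more involved.)
public class ObjectWritableSketch {
    static void write(DataOutput out, Object o) throws IOException {
        out.writeUTF(o.getClass().getName());  // per-value type tag
        if (o instanceof Integer)      out.writeInt((Integer) o);
        else if (o instanceof String)  out.writeUTF((String) o);
        else throw new IOException("unsupported: " + o.getClass());
    }

    static Object read(DataInput in) throws IOException {
        String className = in.readUTF();       // what to deserialize
        if (className.equals("java.lang.Integer")) return in.readInt();
        if (className.equals("java.lang.String"))  return in.readUTF();
        throw new IOException("unsupported: " + className);
    }
}
```

That repeated per-value class name is exactly the overhead that makes schema-based systems like Avro attractive: Avro stores the schema once instead of tagging every value.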
ObjectWritable
What is the main use of org.apache.hadoop.io.ObjectWritable ? Thank you :)
Re: Quick question
Thanks for your answers Ted and Jim :)

Maha

On Feb 21, 2011, at 6:41 AM, Jim Falgout wrote:
> Your scenario matches the capability of NLineInputFormat exactly, so that
> looks to be the best solution. If you wrote your own input format, it would
> have to do basically what NLineInputFormat is already doing for you.
>
> -----Original Message-----
> From: maha [mailto:m...@umail.ucsb.edu]
> Sent: Sunday, February 20, 2011 2:00 PM
> To: common-user@hadoop.apache.org
> Subject: Re: Quick question
>
> Actually the following solved my problem, but I'm a little suspicious of
> the side effects of doing this instead of using my own InputSplit of 5 lines:
>
> conf.setInputFormat(org.apache.hadoop.mapred.lib.NLineInputFormat.class); // # of maps = # lines
> conf.setInt("mapred.line.input.format.linespermap", 5); // # of lines per mapper = 5
>
> If you have any thoughts on whether the above solution is worse than writing
> my own InputSplit of about 5 lines, let me know.
>
> Thanks everyone!
>
> Maha
>
> On Feb 20, 2011, at 11:47 AM, maha wrote:
>
>> Hi again Jim and Ted,
>>
>> I understood that each mapper will be getting a block of lines... but even
>> though I had only 2 mappers for a 16-line input file and TextInputFormat
>> is used, a map function is executed for each of those 16 lines!
>>
>> I wanted a block of lines per map, hence something like map1 has 8 lines
>> and map2 has 8 lines.
>>
>> So first question: is there a difference between Mappers and maps ?
>>
>> Second: does that mean I need to write my own InputFormat to make the
>> InputSplit equal to multiple lines ???
>>
>> Thank you,
>>
>> Maha
>>
>> On Feb 18, 2011, at 11:55 AM, Jim Falgout wrote:
>>
>>> That's right. The TextInputFormat handles situations where records cross
>>> split boundaries. What your mapper will see is "whole" records.
>>>
>>> -----Original Message-----
>>> From: maha [mailto:m...@umail.ucsb.edu]
>>> Sent: Friday, February 18, 2011 1:14 PM
>>> To: common-user
>>> Subject: Quick question
>>>
>>> Hi all,
>>>
>>> I want to check whether the following statement is right:
>>>
>>> If I use TextInputFormat to process a text file with 2000 lines (each
>>> ending with \n) with 20 mappers, then each map will have a sequence of
>>> COMPLETE LINES.
>>>
>>> In other words, the input is not split byte-wise but by lines.
>>>
>>> Is that right?
>>>
>>> Thank you,
>>> Maha
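The "byte-wise split, whole lines" behavior Jim confirms can be sketched with a toy model (plain Java; a simplification of what LineRecordReader actually does): each reader skips a partial first line unless its split starts at byte 0, and reads through the end of the line that straddles its split's last byte.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of how TextInputFormat turns byte-wise splits into whole
// lines. Simplified from LineRecordReader's actual logic; split
// boundary semantics here are approximate.
public class LineSplitSketch {
    static List<String> readSplit(String data, int start, int end) {
        List<String> lines = new ArrayList<>();
        int pos = start;
        if (start != 0) {                       // skip the partial first line
            pos = data.indexOf('\n', start) + 1;
            if (pos == 0) return lines;         // no newline after start
        }
        while (pos < data.length() && pos <= end) {
            int nl = data.indexOf('\n', pos);
            if (nl < 0) { lines.add(data.substring(pos)); break; }
            lines.add(data.substring(pos, nl)); // whole line, may pass 'end'
            pos = nl + 1;
        }
        return lines;
    }
}
```

Run over adjacent byte ranges, every line comes out exactly once, in exactly one split. That is also why Maha's 16-line file still produced 16 map() calls: the split decides which mapper gets a line, not whether map() runs once per line.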
task scheduling based on slots in Hadoop
Hi,

Can anyone throw some more light on resource-based scheduling in Hadoop? Specifically, are resources like CPU and memory partitioned across slots?

From the blog by Arun on the capacity scheduler, http://developer.yahoo.com/blogs/hadoop/posts/2011/02/capacity-scheduler/ I understand that memory is the only resource supported; does that mean both memory and CPU are partitioned across map/reduce tasks in slots?

Thanks in advance.

-bikash
RE: Quick question
Your scenario matches the capability of NLineInputFormat exactly, so that looks to be the best solution. If you wrote your own input format, it would have to do basically what NLineInputFormat is already doing for you.

-----Original Message-----
From: maha [mailto:m...@umail.ucsb.edu]
Sent: Sunday, February 20, 2011 2:00 PM
To: common-user@hadoop.apache.org
Subject: Re: Quick question

Actually the following solved my problem, but I'm a little suspicious of the side effects of doing this instead of using my own InputSplit of 5 lines:

conf.setInputFormat(org.apache.hadoop.mapred.lib.NLineInputFormat.class); // # of maps = # lines
conf.setInt("mapred.line.input.format.linespermap", 5); // # of lines per mapper = 5

If you have any thoughts on whether the above solution is worse than writing my own InputSplit of about 5 lines, let me know.

Thanks everyone!

Maha

On Feb 20, 2011, at 11:47 AM, maha wrote:

> Hi again Jim and Ted,
>
> I understood that each mapper will be getting a block of lines... but even
> though I had only 2 mappers for a 16-line input file and TextInputFormat
> is used, a map function is executed for each of those 16 lines!
>
> I wanted a block of lines per map, hence something like map1 has 8 lines
> and map2 has 8 lines.
>
> So first question: is there a difference between Mappers and maps ?
>
> Second: does that mean I need to write my own InputFormat to make the
> InputSplit equal to multiple lines ???
>
> Thank you,
>
> Maha
>
> On Feb 18, 2011, at 11:55 AM, Jim Falgout wrote:
>
>> That's right. The TextInputFormat handles situations where records cross
>> split boundaries. What your mapper will see is "whole" records.
>>
>> -----Original Message-----
>> From: maha [mailto:m...@umail.ucsb.edu]
>> Sent: Friday, February 18, 2011 1:14 PM
>> To: common-user
>> Subject: Quick question
>>
>> Hi all,
>>
>> I want to check whether the following statement is right:
>>
>> If I use TextInputFormat to process a text file with 2000 lines (each
>> ending with \n) with 20 mappers, then each map will have a sequence of
>> COMPLETE LINES.
>>
>> In other words, the input is not split byte-wise but by lines.
>>
>> Is that right?
>>
>> Thank you,
>> Maha