Behaviour of reducer's Iterable in MR unit.
Hi,

This relates to a bug we had a while back.

When running a reducer, if you want to buffer the values, you normally need to take a copy of each value as you iterate through them. This is because the iterator always returns the same object; the contents of the object get filled in with each value as the iterator steps through.

However, *this behaviour is not reproduced by the reducer drivers in MR unit*. Even if you give the reduce driver a List (why do we have to give a List when the reducer specifies merely an Iterable?) designed to behave this way, MR unit copies the values into a normal List before presenting them to the reducer. At least this is the case with the 0.20.1 install we have.

Anyway, in order to test our bug fix we extended the ReduceDriver class to copy the values into an iterable that does reproduce the behaviour, so that we can test for bugs caused by failing to copy the values. In more recent versions of Hadoop (we use 0.20.1), is the behaviour of the reduce drivers altered to match that of actual running reducers in this respect? Are there any plans to do this? Alternatively, I'd be willing to fix this in the Hadoop codebase myself if necessary.

Regards,

James

--
James Hammerton | Senior Data Mining Engineer
www.mendeley.com/profiles/james-hammerton

Mendeley Limited | London, UK | www.mendeley.com
Registered in England and Wales | Company Number 6419015
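For anyone wanting to reproduce this behaviour outside Hadoop: the object reuse James describes can be sketched with a plain Java Iterable whose iterator refills a single shared object. The classes and names below are hypothetical stand-ins for Hadoop's Writable reuse, not MR unit code:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Mimics the reducer's Iterable: next() always returns the SAME holder
// object, refilled with the next value (a stand-in for Writable reuse).
class ReusingIterable implements Iterable<int[]> {
    private final int[] values;
    ReusingIterable(int... values) { this.values = values; }
    public Iterator<int[]> iterator() {
        return new Iterator<int[]>() {
            private final int[] holder = new int[1]; // single reused object
            private int pos = 0;
            public boolean hasNext() { return pos < values.length; }
            public int[] next() { holder[0] = values[pos++]; return holder; }
            public void remove() { throw new UnsupportedOperationException(); }
        };
    }
}

public class ReuseDemo {
    public static void main(String[] args) {
        // Buffering the raw references: every element ends up holding the
        // LAST value, because they are all the same object.
        List<int[]> wrong = new ArrayList<int[]>();
        List<int[]> right = new ArrayList<int[]>();
        for (int[] v : new ReusingIterable(1, 2, 3)) {
            wrong.add(v);         // keeps the shared holder
            right.add(v.clone()); // defensive copy, as you must in a reducer
        }
        System.out.println(wrong.get(0)[0] + " " + right.get(0)[0]); // 3 1
    }
}
```

A reduce driver that wraps its input List this way would exercise exactly the missing-copy bug described above.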
Memory Manager in Hadoop MR
Hi,

1 - Hadoop MR contains a TaskMemoryManagerThread class that is used to manage the memory usage of tasks running under a TaskTracker. Why does Hadoop MR need a class to manage memory? Why couldn't it rely on the JVM, or is this class here for another purpose?

2 - How does the JT know that a Map or Reduce Task has finished? Is it through the heartbeat?

Thanks

--
Pedro
Map-Reduce Applicability With All-In Memory Data
Hi All,

We have a problem in hand which we would like to solve using distributed and parallel processing.

*Problem context*: We have a Map (Entity, Value). An entity can have a parent, which in turn has its own parent, and so on until we reach the head. I have to traverse this tree and do some calculations at every step using the value from the Map. The final output will again be a map containing the aggregated results of the computation (Entity, Computed Value). The tree structure can be quite deep, and we have a huge number of entries in the Map to process before arriving at the final result. Processing them sequentially takes quite a long time. We were thinking of using Map-Reduce to split the computation across multiple nodes in a Hadoop cluster and then aggregate the results to get the final output.

Having had a quick read of the documentation and the samples, I see that both Mapper and Reducer work with implementations of InputFormat and OutputFormat respectively. Most of the implementations appeared to me to be either file or DB based. Is there some input/output format which directly reads/updates things from/into memory, or do I need to provide my own custom InputFormat/OutputFormat and RecordReader/RecordWriter implementations for the purpose?

Based upon your experiences, do you think Map-Reduce is the appropriate platform for this kind of scenario, or should we think of it more for huge file-based data only?

Best Regards
Narinder
Re: Memory Manager in Hadoop MR
For 1, the TaskMemoryManagerThread uses a ProcessTree to check for tasks that are running beyond their memory limits, and kills them.

On Thu, Dec 9, 2010 at 3:05 AM, Pedro Costa wrote:
> Hi,
>
> 1 - Hadoop MR contains a TaskMemoryManagerThread class that is used to
> manage memory usage of tasks running under a TaskTracker. Why Hadoop
> MR needs a class to manage memory? Why it couldn't rely on the JVM, or
> this class is here for another purpose?
>
> 2 - How the JT knows that a Map or Reduce Task finished? Is through
> the heartbeat?
>
> Thanks
>
> --
> Pedro
Re: Map-Reduce Applicability With All-In Memory Data
Take a look at NLineInputFormat. You might want to use it in combination with DistributedCache.

Sent from my iPhone

On Dec 9, 2010, at 5:02 AM, Narinder Kumar wrote:
> Hi All,
>
> We have a problem in hand which we would like to solve using Distributed and
> Parallel Processing.
>
> Problem context : We have a Map (Entity, Value). The entity can have a parent
> which in turn will have its parent and so on till we reach the head. I have
> to traverse this tree and do some calculations at every step using the value
> of the Map. The final output will again be a map containing the aggregated
> results of the computation (Entity, Computed Value). The tree structure can
> be quite deep and we have a huge number of entries in Map to process before
> coming to the final result. Processing them sequentially takes quite long
> time. We were thinking of using Map-Reduce to split the computation across
> multiple nodes in a Hadoop Cluster and then aggregate the results to get the
> final output.
>
> Having a quick read at the documentation and the samples, I see that both
> Mapper and Reducer work with implementations of InputFormat and OutPutFormat
> respectively. Most of the implementations appeared to me to be either File or
> DB based. Do we have some input-output format which directly takes/updates
> things from/into Memory ? or I need to provide my own Custom Input/Output
> Format and Record Reader/Writer implementations for the purpose ?
>
> Based upon your experiences, do you think whether Map-Reduce is the
> appropriate platform for these kind of scenarios or we should think of it
> more for huge File based data only ?
>
> Best Regards
> Narinder
MultipleInputs and Paths Containing Commas
Hello. I'm unsure whether this is a bug or an oversight, but since I've not found any reference to it anywhere, I figured I might bring it to light.

I've been using MultipleInputs for several of my MapReduce jobs, where I am joining together different forms of data. However, I have encountered the following exception with some uses of MultipleInputs in Hadoop 0.20.2:

java.lang.ArrayIndexOutOfBoundsException: 1
        at org.apache.hadoop.mapred.lib.MultipleInputs.getInputFormatMap(MultipleInputs.java:94)
        at org.apache.hadoop.mapred.lib.DelegatingInputFormat.getSplits(DelegatingInputFormat.java:51)
        at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
        at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)

After tracing through the source code, it appears that this occurs when the input path specified in MultipleInputs#addInputPath() contains a comma, which happens most often when using globs (for example, "/months/{March,April,May}.txt"). Because the path itself contains commas (one of the two special delimiters used in MultipleInputs#getInputFormatMap()), the path/input-format data is parsed incorrectly when the input format map is being created.

Could someone verify this behavior in other versions of Hadoop? And, possibly the more important question: should this actually be considered a bug in MultipleInputs?

Thanks.
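For reference, the delimiter clash is easy to reproduce outside Hadoop. A minimal sketch, assuming (as the 0.20 source suggests) that the configuration value is stored as comma-joined "path;format" pairs; the names here are simplified stand-ins, not the actual MultipleInputs code:

```java
public class CommaBugDemo {
    public static void main(String[] args) {
        // Roughly what addInputPath stores in the job conf:
        // "path;inputFormatClass", with multiple entries joined by commas.
        String stored = "/months/{March,April,May}.txt;org.apache.hadoop.mapred.TextInputFormat";
        // getInputFormatMap-style parsing: split entries on ',' first,
        // then each entry on ';'. A comma inside the glob breaks this.
        for (String pair : stored.split(",")) {
            try {
                // The first fragment is "/months/{March" with no ';' in it,
                // so split(";")[1] throws ArrayIndexOutOfBoundsException,
                // matching the stack trace above.
                String format = pair.split(";")[1];
                System.out.println(pair + " -> " + format);
            } catch (ArrayIndexOutOfBoundsException e) {
                System.out.println("parse failure on fragment: " + pair);
            }
        }
    }
}
```

A practical workaround until this is fixed would be expanding the glob yourself and calling addInputPath() once per comma-free path.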
RE: distcp just fails (was:distcp fails with ConnectException)
Thanks everyone. It turned out that I was using the wrong port. The issue is resolved.

From: hadoopman [mailto:hadoop...@gmail.com]
Sent: Monday, December 06, 2010 6:26 PM
To: mapreduce-user@hadoop.apache.org
Subject: Re: distcp just fails (was: distcp fails with ConnectException)

On 12/06/2010 06:52 PM, jingguo yao wrote:

A few things are worth a check.

1. Can hadoop-A connect to hadoop-B using port 8020? You can check your firewall settings. Maybe your firewall settings allow ssh and ping, but port 8020 is disallowed.
2. Which user are you using to run distcp? Does this user have access rights to hadoop-A?

I'm curious. I've run into a problem with distcp also, and we're using ports 9000, 9001 and 50010 for the two hadoop clusters we're running. It looks like we're connecting; however, when it gets to 80% it bails. I've verified our firewall is open to those ports between the two clusters.

What are you referring to regarding access rights to hadoop-A? I'm a little confused by that statement, I'm afraid :-)

Below is what we're seeing. Thanks!!
--==--==--==--
# u...@hnn1:~$ hadoop distcp hdfs://hnn1:9000/user/testing hdfs://hnn2:9000/user
10/12/03 15:58:10 INFO tools.DistCp: srcPaths=[hdfs://hnn1:9000/user/testing]
10/12/03 15:58:10 INFO tools.DistCp: destPath=hdfs://hnn2:9000/user
10/12/03 15:58:11 INFO tools.DistCp: srcCount=6
10/12/03 15:58:11 INFO mapred.JobClient: Running job: job_201011221457_0019
10/12/03 15:58:12 INFO mapred.JobClient:  map 0% reduce 0%
10/12/03 15:58:36 INFO mapred.JobClient:  map 19% reduce 0%
10/12/03 15:58:45 INFO mapred.JobClient:  map 39% reduce 0%
10/12/03 15:59:03 INFO mapred.JobClient:  map 60% reduce 0%
10/12/03 15:59:12 INFO mapred.JobClient:  map 80% reduce 0%
10/12/03 15:59:32 INFO mapred.JobClient: Task Id : attempt_201011221457_0019_m_00_0, Status : FAILED
java.io.IOException: Copied: 0 Skipped: 0 Failed: 5
        at org.apache.hadoop.tools.DistCp$CopyFilesMapper.close(DistCp.java:572)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)
10/12/03 15:59:33 INFO mapred.JobClient:  map 0% reduce 0%
10/12/03 15:59:55 INFO mapred.JobClient:  map 19% reduce 0%
10/12/03 16:00:04 INFO mapred.JobClient:  map 39% reduce 0%
10/12/03 16:00:22 INFO mapred.JobClient:  map 60% reduce 0%
10/12/03 16:00:31 INFO mapred.JobClient:  map 80% reduce 0%
10/12/03 16:00:51 INFO mapred.JobClient: Task Id : attempt_201011221457_0019_m_00_1, Status : FAILED
java.io.IOException: Copied: 0 Skipped: 0 Failed: 5
        at org.apache.hadoop.tools.DistCp$CopyFilesMapper.close(DistCp.java:572)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)
Re: Memory Manager in Hadoop MR
> 2 - How the JT knows that a Map or Reduce Task finished? Is through
> the heartbeat?

Exactly. Tasks communicate with their TTs through the umbilical protocol, and each TT communicates with the JT via the heartbeat (and heartbeat response).

Greg
Re: Behaviour of reducer's Iterable in MR unit.
Hi James,

The ReduceDriver is configured to receive a list of inputs because lists have ordering guarantees whereas other Iterables/Collections do not; for determinism's sake, it is best to guarantee that you're calling reduce() with an ordered set of values when testing.

It would be stellar if you could improve the ReduceDriver to reuse a Writable instance between calls. You'll need to infer the appropriate container class type from the first instance you see in the reducer's output, and use the serialization API to make a copy. If you look at o.a.h.mrunit.mock.MockOutputCollector, you'll see a pattern you can work from.

Cheers,
- Aaron

On Thu, Dec 9, 2010 at 2:21 AM, James Hammerton wrote:
> Hi,
>
> This relates to a bug we had a while back.
>
> When running a reducer, if you want to buffer the values, you normally need
> to take a copy of each value as you iterate through them. This is because
> the iterator always returns the same object but the contents of the object
> get filled with each value as the iterator steps through.
>
> However this behaviour is not reproduced by the reducer drivers in MR unit.
> Even if you give the reduce driver a List (why do we have to give a List
> when reducer specifies merely an Iterable?) designed to behave this way, MR
> unit copies the values into a normal List before presenting them to the
> reducer. At least this is the case with the 0.20.1 install we have.
>
> Anyway, in order to test our bug fix we extended the ReduceDriver class to
> actually copy the values into an iterable that does reproduce the behaviour
> so that we can test for bugs caused by failing to copy the values. In more
> recent versions of Hadoop (we use 0.20.1) is the behaviour of the reduce
> drivers altered to match that of actual running reducers in this respect?
> Are there any plans to do this? Alternatively, I'd be willing to fix this in
> the Hadoop codebase myself if necessary.
Re: Memory Manager in Hadoop MR
Hi,

On Thu, Dec 9, 2010 at 4:35 PM, Pedro Costa wrote:
> Hi,
>
> 1 - Hadoop MR contains a TaskMemoryManagerThread class that is used to
> manage memory usage of tasks running under a TaskTracker. Why Hadoop
> MR needs a class to manage memory? Why it couldn't rely on the JVM, or
> this class is here for another purpose?

There are streaming and pipes map/reduce applications that launch native processes from the map/reduce tasks; these are outside the control of the JVM. Indeed, even regular Java map/reduce programs could fork/exec other programs. All of these processes could consume memory that would not be accounted for if we relied only on the JVM to report memory usage. Hence a separate class that looks at the entire process tree of the map/reduce task to account for the memory consumed.

> 2 - How the JT knows that a Map or Reduce Task finished? Is through
> the heartbeat?

Yes.

> Thanks
>
> --
> Pedro
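For context, the limits that the TaskMemoryManagerThread enforces come from configuration. A sketch of a mapred-site.xml fragment follows; the property names are from the 0.20-era line and vary by release, so treat the exact names and values as assumptions to verify against your version's docs:

```xml
<!-- Hypothetical mapred-site.xml fragment; check property names for your release. -->
<property>
  <name>mapred.job.map.memory.mb</name>
  <value>1536</value> <!-- per-map-task limit, covering the whole process tree -->
</property>
<property>
  <name>mapred.job.reduce.memory.mb</name>
  <value>2048</value> <!-- per-reduce-task limit, covering the whole process tree -->
</property>
<property>
  <name>mapred.tasktracker.taskmemorymanager.monitoring-interval</name>
  <value>5000</value> <!-- how often (ms) the TT walks the task process trees -->
</property>
```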
How to share Same Counter in Multiple Jobs?
Hi,
I chain multiple jobs in my program. Job 1's reduce function has a counter. I want Job 3's reduce function to read Job 1's counter. How can I do that?
Thanks.
Re: How to share Same Counter in Multiple Jobs?
I wrote the following code today. We have our own flow-execution logic, which calls the following to collect counters.

enum COUNT_COLLECTION {
    LOG,         // log the counters
    ADD_TO_CONF  // add counters to JobConf
}

protected static void collectCounters(RunningJob running, JobConf jobConf,
        EnumSet<COUNT_COLLECTION> collFlags) {
    try {
        Counters counters = running.getCounters();
        Collection<String> counterGroupNames = counters.getGroupNames();
        if (counterGroupNames == null) {
            LOG.warn("No counters returned from job " + running.getJobName());
        } else {
            String[] groupsToCollect = { "Map-Reduce Framework", "FileSystemCounters" };
            for (String counterGroupName : groupsToCollect) {
                for (Counters.Counter counter : counters.getGroup(counterGroupName)) {
                    String counterName = counters.getGroup(counterGroupName).getDisplayName()
                            + "." + counter.getDisplayName();
                    if (collFlags.contains(COUNT_COLLECTION.LOG)) {
                        LOG.info(counterName + ": " + counter.getCounter());
                    }
                }
            }
        }
    } catch (IOException e) {
        LOG.error("unable to retrieve counters", e);
    }
}

You can pass the counter value from Job 1 to Job 3 via the JobConf.

On Thu, Dec 9, 2010 at 9:45 PM, Savannah Beckett <savannah_becket...@yahoo.com> wrote:
> Hi,
> I chain multiple jobs in my program. Job 1's reduce function has a
> counter. I want job 3's reduce function to read this Job 1's counter.
> How?
> Thanks.
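To make the JobConf hand-off concrete, here is a minimal, cluster-free sketch of the pattern. java.util.Properties stands in for JobConf so the snippet runs without Hadoop, and the key name is hypothetical; with real Hadoop you would read running.getCounters() after Job 1 completes and call setLong()/getLong() on Job 3's JobConf (JobConf inherits both from Configuration):

```java
import java.util.Properties;

// Sketch: publish a Job 1 counter into the shared configuration, then
// read it back in Job 3. Properties stands in for JobConf here.
public class CounterHandOff {
    static final String KEY = "job1.records.written"; // hypothetical key name

    // After Job 1 finishes: stash the counter value in the conf.
    static void publish(Properties conf, long counterValue) {
        conf.setProperty(KEY, Long.toString(counterValue));
    }

    // In Job 3's configure()/setup(): read it back, defaulting to 0.
    static long read(Properties conf) {
        return Long.parseLong(conf.getProperty(KEY, "0"));
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        publish(conf, 12345L);          // e.g. counter.getCounter() from Job 1
        System.out.println(read(conf)); // 12345
    }
}
```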
-libjars?
disclaimer: a newbie!!!

Howdy? Got a quick question. The -libjars option doesn't seem to work for me in pretty much my first (or maybe second) MapReduce job. Here's what I'm doing:

$ bin/hadoop jar sherlock.jar somepkg.FindSchoolsJob -libjars HStats-1A18.jar input output

sherlock.jar has my main class (of course), FindSchoolsJob, which runs just fine by itself until I add a dependency on a class in HStats-1A18.jar. When I run the above command with -libjars specified, it fails to find the classes that *are* inside the HStats jar file:

Exception in thread "main" java.lang.NoClassDefFoundError: com/*/HAgent
        at com.*.FindSchoolsJob.run(FindSchoolsJob.java:46)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at com.**.FindSchoolsJob.main(FindSchoolsJob.java:101)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Caused by: java.lang.ClassNotFoundException: com/*/HAgent
        at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
        ... 8 more

My main class is defined as below:

public class FindSchoolsJob extends Configured implements Tool {
    :
    public int run(String[] args) throws Exception {
        :
        :
    }
    :
    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new FindSchoolsJob(), args);
        System.exit(res);
    }
}

Any hint would be highly appreciated. Thank You!
~V