Furthermore, the 'syslog' file of a failing reduce task attempt doesn't
show any warning; here is its tail:

2010-12-18 15:49:29,773 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 60 segments left of total size: 35288842 bytes
2010-12-18 15:49:31,012 INFO org.apache.hadoop.mapred.ReduceTask: Merged 60 segments, 35288842 bytes to disk to satisfy reduce memory limit
2010-12-18 15:49:31,013 INFO org.apache.hadoop.mapred.ReduceTask: Merging 1 files, 35288768 bytes from disk
2010-12-18 15:49:31,013 INFO org.apache.hadoop.mapred.ReduceTask: Merging 0 segments, 0 bytes from memory into reduce
2010-12-18 15:49:31,013 INFO org.apache.hadoop.mapred.Merger: Merging 1 sorted segments
2010-12-18 15:49:31,016 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 35288764 bytes
2010-12-18 15:59:47,795 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=CLEANUP, sessionId=
2010-12-18 15:59:47,899 INFO org.apache.hadoop.mapred.TaskRunner: Runnning cleanup for the task
2010-12-18 15:59:47,912 INFO org.apache.hadoop.mapred.TaskRunner: Task:attempt_201012181440_0009_r_000009_0 is done. And is in the process of commiting
2010-12-18 15:59:47,921 INFO org.apache.hadoop.mapred.TaskRunner: Task 'attempt_201012181440_0009_r_000009_0' done.
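
Note the gap in that tail: the last merge-pass entry is at 15:49:31, and the
next entry (the CLEANUP JVM starting) only appears at 15:59:47. So reduce()
runs for more than 10 minutes without reporting any status, which matches the
600-second timeout behind the kills quoted below. If the computation
legitimately needs that long, one option is raising the task timeout. A
minimal per-job sketch, assuming the 0.20-era property name
mapred.task.timeout (default 600000 ms; 0 disables the check) and a made-up
job name:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    Configuration conf = new Configuration();
    // Default is 600000 ms (10 min); give the long reduce() calls headroom.
    conf.setLong("mapred.task.timeout", 1800000L); // 30 minutes
    Job job = new Job(conf, "graph-estimate");     // hypothetical job name

The same value could instead be set cluster-wide in mapred-site.xml, in the
same style as the properties quoted below:

<property>
<name>mapred.task.timeout</name><value>1800000</value>
</property>

The cleaner fix, though, is to keep reporting progress while the long
computation runs; see the keep-alive sketch at the very end of this message,
after the quoted thread.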
2010/12/18 Ivan Leonardi <ivanleona...@gmail.com>:
> Yes, I'm performing computation on large graphs (3.7M nodes) and I
> set numReduceTasks to 60. The code is below. I'm using a grouping
> function grouper.getGroup(u) that computes the group ID of the node
> whose ID is u: the number of groups actually corresponds to the
> numReduceTasks parameter. Am I doing it wrong, maybe? If I use e.g.
> 20 reduce tasks, I get Java heap space problems, since a reducer
> task receives all the information about a group (thus on the order
> of 3.7M/${groups_count} nodes).
> /etc/hosts is ok, the hosts know each other.
>
> public static class IEMapper extends Mapper<LongWritable, Vertex,
>         LongWritable, Message> {
>     private Grouper grouper;
>
>     public void map(LongWritable key, Vertex value, Context context)
>             throws IOException, InterruptedException {
>         long u = key.get();
>         long e = value.getEstimate();
>         LongWritable groupId = new LongWritable();
>
>         Message indexMessage = new Message(Global.MSG_INDEX);
>         indexMessage.setPayload(new PairOfLongs(u, e));
>         groupId.set(grouper.getGroup(u));
>         context.write(groupId, indexMessage);
>
>         Message updateMessage = new Message(Global.MSG_UPDATE);
>         updateMessage.setPayload(new TripleOfLongs());
>
>         for (Long v : value.getNeighbors().getAll()) {
>             groupId.set(grouper.getGroup(v));
>             updateMessage.getPayload(new TripleOfLongs[0]).set(v, u, e);
>             context.write(groupId, updateMessage);
>             context.progress();
>         }
>     }
> }
>
> public static class IEReducer extends Reducer<LongWritable, Message,
>         LongWritable, Vertex> {
>
>     public void reduce(LongWritable key, Iterable<Message> values,
>             Context context) throws IOException, InterruptedException {
>         TreeSet<Long> S = new TreeSet<Long>();
>         Hashtable<Long, Set<Long>> D = new Hashtable<Long, Set<Long>>();
>         Hashtable<Long, Long> est = new Hashtable<Long, Long>();
>
>         long dst, src, u, val;
>         for (Message msg : values) {
>             switch (msg.getType()) {
>             case Global.MSG_INDEX:
>                 PairOfLongs iPayload = msg.getPayload(new PairOfLongs[0]);
>                 u = iPayload.getLeftElement();
>                 val = iPayload.getRightElement();
>                 est.put(u, val);
>                 S.add(u);
>                 break;
>             case Global.MSG_UPDATE:
>                 TripleOfLongs uPayload = msg.getPayload(new TripleOfLongs[0]);
>                 dst = uPayload.getLeftElement();
>                 src = uPayload.getCenterElement();
>                 val = uPayload.getRightElement();
>                 est.put(src, val);
>                 Set<Long> s = D.get(dst);
>                 if (s == null) {
>                     s = new HashSet<Long>();
>                     D.put(dst, s);
>                 }
>                 s.add(src);
>                 break;
>             default:
>                 break;
>             }
>         }
>
>         boolean changed = Global.improveEstimate(est, S);
>
>         LongWritable wDest = new LongWritable();
>         Vertex vertex = new Vertex();
>         for (long s : S) {
>             wDest.set(s);
>             vertex.getNeighbors().set(D.get(s));
>             vertex.setEstimate(est.get(s));
>             context.write(wDest, vertex);
>             context.progress();
>         }
>
>         if (changed) {
>             context.getCounter(Global.COUNTERS_GROUP_NAME,
>                     Global.COUNTER_UPDATES).increment(1L);
>         }
>     }
> }
>
> 2010/12/18 nitin reddy <chintu.ni...@gmail.com>:
>> Hi Ivan,
>>
>> I faced the same thing; try adding the host names with their IP
>> addresses to /etc/hosts, like:
>>
>> <ip_master> master
>> <ip_master> masterhostname
>> <ip_slave> slave
>> <ip_slave> slavehostname
>>
>>
>> On Sat, Dec 18, 2010 at 7:54 AM, Ivan Leonardi <ivanleona...@gmail.com> wrote:
>>> Hi all,
>>> I'm running Hadoop 0.20.2 on a cluster of 4 machines, and
>>> I'm trying to run my own job.
>>> Here is my configuration:
>>>
>>> core-site.xml
>>> <configuration>
>>> <property>
>>> <name>fs.default.name</name><value>hdfs://net-server00:44310</value>
>>> </property>
>>> <property>
>>> <name>hadoop.tmp.dir</name><value>/tmp/hadoop-leonardi</value>
>>> </property>
>>> </configuration>
>>>
>>> hdfs-site.xml
>>> <configuration>
>>> <property>
>>> <name>dfs.replication</name><value>1</value>
>>> </property>
>>> </configuration>
>>>
>>> mapred-site.xml
>>> <configuration>
>>> <property>
>>> <name>mapred.job.tracker</name><value>net-server00:44311</value>
>>> </property>
>>> <property>
>>> <name>mapred.tasktracker.map.tasks.maximum</name><value>3</value>
>>> </property>
>>> <property>
>>> <name>mapred.tasktracker.reduce.tasks.maximum</name><value>3</value>
>>> </property>
>>> <property>
>>> <name>tasktracker.http.threads</name><value>60</value>
>>> </property>
>>> <property>
>>> <name>mapred.reduce.parallel.copies</name><value>50</value>
>>> </property>
>>> </configuration>
>>>
>>> My job runs normally at the beginning:
>>> 10/12/18 14:16:36 INFO input.FileInputFormat: Total input paths to process : 30
>>> 10/12/18 14:16:36 INFO mapred.JobClient: Running job: job_201012181359_0004
>>> 10/12/18 14:16:37 INFO mapred.JobClient: map 0% reduce 0%
>>> 10/12/18 14:16:51 INFO mapred.JobClient: map 20% reduce 0%
>>> 10/12/18 14:16:54 INFO mapred.JobClient: map 40% reduce 0%
>>> 10/12/18 14:17:00 INFO mapred.JobClient: map 60% reduce 0%
>>> 10/12/18 14:17:03 INFO mapred.JobClient: map 80% reduce 2%
>>> 10/12/18 14:17:06 INFO mapred.JobClient: map 80% reduce 5%
>>> 10/12/18 14:17:09 INFO mapred.JobClient: map 90% reduce 7%
>>> 10/12/18 14:17:12 INFO mapred.JobClient: map 100% reduce 9%
>>> 10/12/18 14:17:15 INFO mapred.JobClient: map 100% reduce 10%
>>> 10/12/18 14:17:18 INFO mapred.JobClient: map 100% reduce 11%
>>> 10/12/18 14:17:19 INFO mapred.JobClient: map 100% reduce 13%
>>> 10/12/18 14:17:21 INFO mapred.JobClient: map 100% reduce 21%
>>> 10/12/18 14:17:22 INFO mapred.JobClient: map 100% reduce 23%
>>> 10/12/18 14:17:24 INFO mapred.JobClient: map 100% reduce 28%
>>> 10/12/18 14:17:27 INFO mapred.JobClient: map 100% reduce 33%
>>> 10/12/18 14:17:29 INFO mapred.JobClient: map 100% reduce 35%
>>> 10/12/18 14:17:31 INFO mapred.JobClient: map 100% reduce 38%
>>> 10/12/18 14:17:34 INFO mapred.JobClient: map 100% reduce 40%
>>>
>>> At this point the percentages stay stuck for 10 minutes (600 secs).
>>> Meanwhile, looking at each machine's TaskTracker status web page, I
>>> see 3 running reduce tasks with Status=RUNNING and Progress=100%;
>>> here is a sample:
>>> TaskAttempts                          Status   Progress
>>> attempt_201012181359_0004_r_000001_0  RUNNING  100.00%
>>> attempt_201012181359_0004_r_000003_0  RUNNING  100.00%
>>> attempt_201012181359_0004_r_000006_0  RUNNING  100.00%
>>>
>>> Then, after 600 secs, the console resumes logging, emitting the
>>> following:
>>> 10/12/18 14:27:34 INFO mapred.JobClient: map 100% reduce 36%
>>> 10/12/18 14:27:36 INFO mapred.JobClient: Task Id :
>>> attempt_201012181359_0004_r_000006_0, Status : FAILED Task
>>> attempt_201012181359_0004_r_000006_0 failed to report status for 602
>>> seconds. Killing!
>>> 10/12/18 14:27:38 INFO mapred.JobClient: map 100% reduce 23%
>>> 10/12/18 14:27:40 INFO mapred.JobClient: map 100% reduce 6%
>>> 10/12/18 14:27:40 INFO mapred.JobClient: Task Id :
>>> attempt_201012181359_0004_r_000007_0, Status : FAILED
>>> Task attempt_201012181359_0004_r_000007_0 failed to report status for
>>> 601 seconds. Killing!
>>> 10/12/18 14:27:40 INFO mapred.JobClient: Task Id :
>>> attempt_201012181359_0004_r_000010_0, Status : FAILED
>>> Task attempt_201012181359_0004_r_000010_0 failed to report status for
>>> 601 seconds. Killing!
>>> 10/12/18 14:27:41 INFO mapred.JobClient: map 100% reduce 0%
>>> 10/12/18 14:27:42 INFO mapred.JobClient: Task Id :
>>> attempt_201012181359_0004_r_000000_0, Status : FAILED
>>> Task attempt_201012181359_0004_r_000000_0 failed to report status for
>>> 600 seconds. Killing!
>>> 10/12/18 14:27:42 INFO mapred.JobClient: Task Id :
>>> attempt_201012181359_0004_r_000009_0, Status : FAILED
>>> Task attempt_201012181359_0004_r_000009_0 failed to report status for
>>> 600 seconds. Killing!
>>> 10/12/18 14:27:43 INFO mapred.JobClient: Task Id :
>>> attempt_201012181359_0004_r_000001_0, Status : FAILED
>>> Task attempt_201012181359_0004_r_000001_0 failed to report status for
>>> 601 seconds. Killing!
>>> 10/12/18 14:27:43 INFO mapred.JobClient: Task Id :
>>> attempt_201012181359_0004_r_000003_0, Status : FAILED
>>> Task attempt_201012181359_0004_r_000003_0 failed to report status for
>>> 602 seconds. Killing!
>>> 10/12/18 14:27:45 INFO mapred.JobClient: Task Id :
>>> attempt_201012181359_0004_r_000004_0, Status : FAILED
>>> Task attempt_201012181359_0004_r_000004_0 failed to report status for
>>> 601 seconds. Killing!
>>> 10/12/18 14:27:45 INFO mapred.JobClient: Task Id :
>>> attempt_201012181359_0004_r_000008_0, Status : FAILED
>>> Task attempt_201012181359_0004_r_000008_0 failed to report status for
>>> 601 seconds. Killing!
>>> 10/12/18 14:27:46 INFO mapred.JobClient: map 100% reduce 1%
>>> 10/12/18 14:27:46 INFO mapred.JobClient: Task Id :
>>> attempt_201012181359_0004_r_000011_0, Status : FAILED
>>> Task attempt_201012181359_0004_r_000011_0 failed to report status for
>>> 601 seconds. Killing!
>>> 10/12/18 14:27:46 INFO mapred.JobClient: Task Id :
>>> attempt_201012181359_0004_r_000002_0, Status : FAILED
>>> Task attempt_201012181359_0004_r_000002_0 failed to report status for
>>> 602 seconds. Killing!
>>> 10/12/18 14:27:48 INFO mapred.JobClient: Task Id :
>>> attempt_201012181359_0004_r_000005_0, Status : FAILED
>>> Task attempt_201012181359_0004_r_000005_0 failed to report status for
>>> 602 seconds. Killing!
>>> 10/12/18 14:27:49 INFO mapred.JobClient: map 100% reduce 2%
>>> 10/12/18 14:27:55 INFO mapred.JobClient: map 100% reduce 4%
>>> 10/12/18 14:27:56 INFO mapred.JobClient: map 100% reduce 6%
>>> 10/12/18 14:27:58 INFO mapred.JobClient: map 100% reduce 9%
>>> 10/12/18 14:28:00 INFO mapred.JobClient: map 100% reduce 14%
>>> 10/12/18 14:28:02 INFO mapred.JobClient: map 100% reduce 22%
>>> 10/12/18 14:28:03 INFO mapred.JobClient: map 100% reduce 25%
>>> 10/12/18 14:28:05 INFO mapred.JobClient: map 100% reduce 31%
>>> 10/12/18 14:28:06 INFO mapred.JobClient: map 100% reduce 33%
>>> 10/12/18 14:28:08 INFO mapred.JobClient: map 100% reduce 36%
>>> 10/12/18 14:28:09 INFO mapred.JobClient: map 100% reduce 37%
>>> 10/12/18 14:28:11 INFO mapred.JobClient: map 100% reduce 38%
>>> 10/12/18 14:28:12 INFO mapred.JobClient: map 100% reduce 39%
>>> 10/12/18 14:28:14 INFO mapred.JobClient: map 100% reduce 40%
>>>
>>> and such behaviour continues.
>>>
>>> Has anyone had the same problem and can help me?
>>> Thank you all!
>>>
>>> Ivan
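
As mentioned above, here is the keep-alive idea for the reduce side: the
stall happens while Global.improveEstimate(est, S) runs, because during that
call nothing is read or written, so nothing pings the TaskTracker. This is
only a sketch under two assumptions: that the Context passed to reduce() can
be used through the org.apache.hadoop.util.Progressable interface it
implements in 0.20, and that calling progress() from a second thread is safe
(I have not found this documented, so treat it as unverified):

    import org.apache.hadoop.util.Progressable;

    // Hypothetical helper: pings the framework while a long local
    // computation runs, so the task is not killed for inactivity.
    public class KeepAlive extends Thread {
        private final Progressable target;
        private volatile boolean stopped = false;

        public KeepAlive(Progressable target) {
            this.target = target;
            setDaemon(true); // never keep the JVM alive on its own
        }

        @Override
        public void run() {
            while (!stopped) {
                target.progress();        // resets the 600 s inactivity timer
                try {
                    Thread.sleep(60000L); // ping once a minute
                } catch (InterruptedException e) {
                    return;
                }
            }
        }

        public void shutdown() {
            stopped = true;
            interrupt();
        }
    }

Inside IEReducer.reduce() it would wrap only the long call:

    boolean changed;
    KeepAlive keepAlive = new KeepAlive(context);
    keepAlive.start();
    try {
        changed = Global.improveEstimate(est, S);
    } finally {
        keepAlive.shutdown();
    }

An alternative that avoids the extra thread entirely would be to pass the
context into Global.improveEstimate and call context.progress() once per
internal iteration; whether that is feasible depends on improveEstimate's
internals, which are not shown above.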