log files not found
Hi all,

I am running a series of jobs one after another. While executing the 4th job, the job fails in the reducer --- the progress shows map 100%, reduce 99%. It gives the following message:

10/04/01 01:04:15 INFO mapred.JobClient: Task Id : attempt_201003240138_0110_r_18_1, Status : FAILED
Task attempt_201003240138_0110_r_18_1 failed to report status for 602 seconds. Killing!

It makes several further attempts to execute the task, but they fail with a similar message. I couldn't get anything from this error message, so I wanted to look at the logs (located in the default directory, ${HADOOP_HOME}/logs), but I don't find any files that match the timestamp of the job. I also did not find the history and userlogs directories in the logs folder. Should I look somewhere else for the logs? What could be the possible causes of the above error?

I am using Hadoop 0.20.2, running on a cluster with 14 nodes.

Thank you.

Regards,
Raghava.
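The "failed to report status for 602 seconds" message means the task ran longer than the task timeout (the mapred.task.timeout property, 600,000 ms by default in 0.20) without reporting any progress. If the reducer legitimately needs long stretches of computation between output records, one workaround is to raise the timeout -- a sketch only; pick a value that fits your job:

```xml
<!-- mapred-site.xml (or per-job configuration): raise the task timeout
     from the default 600000 ms (10 minutes) to 30 minutes. -->
<property>
  <name>mapred.task.timeout</name>
  <value>1800000</value>
</property>
```

The cleaner fix is usually to have the reducer call context.progress() (new API) or reporter.progress() (old API) periodically inside long-running work, so the framework knows the task is still alive.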
reduce takes too long
Hello Everyone,

One of our jobs has 4 reduce tasks, but we find that one of them runs normally while the others take too long. Following is the normal task's log:

2010-04-01 15:01:48,596 INFO org.apache.hadoop.mapred.Merger: Merging 1 sorted segments
2010-04-01 15:01:48,601 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 9907055 bytes
2010-04-01 15:01:48,605 WARN org.apache.hadoop.mapred.JobConf: The variable mapred.task.maxvmem is no longer used. Instead use mapred.job.map.memory.mb and mapred.job.reduce.memory.mb
2010-04-01 15:01:48,622 WARN org.apache.hadoop.mapred.JobConf: The variable mapred.task.maxvmem is no longer used. Instead use mapred.job.map.memory.mb and mapred.job.reduce.memory.mb
2010-04-01 15:01:48,672 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor
2010-04-01 15:02:03,744 INFO org.apache.hadoop.mapred.TaskRunner: Task:attempt_201003301656_0139_r_01_0 is done. And is in the process of commiting
2010-04-01 15:02:05,756 INFO org.apache.hadoop.mapred.TaskRunner: Task attempt_201003301656_0139_r_01_0 is allowed to commit now
2010-04-01 15:02:05,762 INFO org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: Saved output of task 'attempt_201003301656_0139_r_01_0' to /user/root/nginxlog/sessionjob/output/20100401140001-20100401150001
2010-04-01 15:02:05,765 INFO org.apache.hadoop.mapred.TaskRunner: Task 'attempt_201003301656_0139_r_01_0' done.

And following is one of the others:

2010-04-01 15:01:49,549 INFO org.apache.hadoop.mapred.Merger: Merging 1 sorted segments
2010-04-01 15:01:49,554 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 9793700 bytes
2010-04-01 15:01:49,563 WARN org.apache.hadoop.mapred.JobConf: The variable mapred.task.maxvmem is no longer used. Instead use mapred.job.map.memory.mb and mapred.job.reduce.memory.mb
2010-04-01 15:01:49,582 WARN org.apache.hadoop.mapred.JobConf: The variable mapred.task.maxvmem is no longer used. Instead use mapred.job.map.memory.mb and mapred.job.reduce.memory.mb
2010-04-01 15:04:49,690 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor
2010-04-01 15:05:07,103 INFO org.apache.hadoop.mapred.TaskRunner: Task:attempt_201003301656_0139_r_00_0 is done. And is in the process of commiting
2010-04-01 15:05:09,114 INFO org.apache.hadoop.mapred.TaskRunner: Task attempt_201003301656_0139_r_00_0 is allowed to commit now
2010-04-01 15:05:09,120 INFO org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: Saved output of task 'attempt_201003301656_0139_r_00_0' to /user/root/nginxlog/sessionjob/output/20100401140001-20100401150001
2010-04-01 15:05:09,123 INFO org.apache.hadoop.mapred.TaskRunner: Task 'attempt_201003301656_0139_r_00_0' done.

It looks like something is waiting (note the three-minute gap in the timestamps) before the line "2010-04-01 15:05:07,103 INFO org.apache.hadoop.mapred.TaskRunner: Task:attempt_201003301656_0139_r_00_0 is done. And is in the process of commiting". Any suggestions?

Regards,
LvZheng
Re: Errors reading lzo-compressed files from Hadoop
Hey Dmitriy,

This is very interesting (and worrisome in a way!). I'll try to take a look this afternoon.

-Todd

On Thu, Apr 1, 2010 at 12:16 AM, Dmitriy Ryaboy dmit...@twitter.com wrote:

Hi folks,
We write a lot of lzo-compressed files to HDFS -- some via scribe, some using internal tools. Occasionally, we discover that the created lzo files cannot be read from HDFS -- they get through some (often large) portion of the file, and then fail with the following stack trace:

Exception in thread "main" java.lang.InternalError: lzo1x_decompress_safe returned:
        at com.hadoop.compression.lzo.LzoDecompressor.decompressBytesDirect(Native Method)
        at com.hadoop.compression.lzo.LzoDecompressor.decompress(LzoDecompressor.java:303)
        at com.hadoop.compression.lzo.LzopDecompressor.decompress(LzopDecompressor.java:122)
        at com.hadoop.compression.lzo.LzopInputStream.decompress(LzopInputStream.java:223)
        at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:74)
        at java.io.InputStream.read(InputStream.java:85)
        at com.twitter.twadoop.jobs.LzoReadTest.main(LzoReadTest.java:51)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

The initial thought is of course that the lzo file is corrupt -- however, plain-jane lzop is able to read these files. Moreover, if we pull the files out of hadoop, uncompress them, compress them again, and put them back into HDFS, we can usually read them from HDFS as well.

We've been thinking that this strange behavior is caused by a bug in the hadoop-lzo libraries (we use the version with Twitter and Cloudera fixes, on github: http://github.com/kevinweil/hadoop-lzo ). However, today I discovered that, using the exact same environment, codec, and InputStreams, we can successfully read from the local file system but cannot read from HDFS. This appears to point at possible issues in FSDataInputStream or further down the stack.

Here's a small test class that tries to read the same file from HDFS and from the local FS, and the output of running it on our cluster. We are using the CDH2 distribution. https://gist.github.com/e1bf7e4327c7aef56303

Any ideas on what could be going on?

Thanks,
-Dmitriy

--
Todd Lipcon
Software Engineer, Cloudera
Re: OutOfMemoryError: Cannot create GC thread. Out of system resources
The default size of Java's young GC generation is 1/3 of the heap (-XX:NewRatio defaults to 2). You have told it to use 100MB for the in-memory file system, and there is a default setting of 64MB of sort space. If -Xmx is 128M, then the above sums to over 200MB and won't fit. Turning down any of the three above could help, or increasing -Xmx. Additionally, when a thread can't be allocated, it could also be due to an OS-side limit on resources per process or per user.

On Mar 31, 2010, at 11:48 AM, Edson Ramiro wrote:

Hi all,

When I run the pi Hadoop sample I get this error:

10/03/31 15:46:13 WARN mapred.JobClient: Error reading task output http://h04.ctinfra.ufpr.br:50060/tasklog?plaintext=true&taskid=attempt_201003311545_0001_r_02_0&filter=stdout
10/03/31 15:46:13 WARN mapred.JobClient: Error reading task output http://h04.ctinfra.ufpr.br:50060/tasklog?plaintext=true&taskid=attempt_201003311545_0001_r_02_0&filter=stderr
10/03/31 15:46:20 INFO mapred.JobClient: Task Id : attempt_201003311545_0001_m_06_1, Status : FAILED
java.io.IOException: Task process exit with nonzero status of 134.
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)

Maybe it's because the datanode can't create more threads:

ram...@lcpad:~/hadoop-0.20.2$ cat logs/userlogs/attempt_201003311457_0001_r_01_2/stdout
#
# A fatal error has been detected by the Java Runtime Environment:
#
# java.lang.OutOfMemoryError: Cannot create GC thread. Out of system resources.
#
# Internal Error (gcTaskThread.cpp:38), pid=28840, tid=140010745776400
# Error: Cannot create GC thread. Out of system resources.
#
# JRE version: 6.0_17-b04
# Java VM: Java HotSpot(TM) 64-Bit Server VM (14.3-b01 mixed mode linux-amd64)
# An error report file with more information is saved as:
# /var-host/tmp/hadoop-ramiro/mapred/local/taskTracker/jobcache/job_201003311457_0001/attempt_201003311457_0001_r_01_2/work/hs_err_pid28840.log
#
# If you would like to submit a bug report, please visit:
# http://java.sun.com/webapps/bugreport/crash.jsp
#

I configured the limits below, but I'm still getting the same error:

<property>
  <name>fs.inmemory.size.mb</name>
  <value>100</value>
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx128M</value>
</property>

Do you know which limit I should configure to fix it?

Thanks in advance,
Edson Ramiro
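Following the arithmetic in the reply above (100MB in-memory file system + 64MB sort space + the young generation, all inside a 128MB heap), one possible adjustment is to give the child JVMs more headroom and shrink the in-memory buffers. This is only a sketch; the right values depend on how much RAM each node can spare per task slot:

```xml
<!-- Give each task JVM a larger heap... -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx512M</value>
</property>
<!-- ...and/or shrink the buffers that must fit inside it. -->
<property>
  <name>fs.inmemory.size.mb</name>
  <value>50</value>
</property>
<property>
  <name>io.sort.mb</name>
  <value>50</value>
</property>
```

If the error persists even with a small heap, check the OS-side limits mentioned above (e.g. the per-user process/thread limit, `ulimit -u`), since "Cannot create GC thread" can also mean the OS refused to create a native thread.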
Re: swapping on hadoop
On Apr 1, 2010, at 8:38 AM, Vasilis Liaskovitis wrote:

In this example, which hadoop config parameters do the above 2 buffers refer to? io.sort.mb=250, but which parameter does the map-side join's 100MB refer to? Are you referring to the split size of the input data handled by a single map task? Apart from that question, the example is clear to me and useful, thanks.

A map-side join is just an example of one of many possible use cases where a particular map implementation may hold on to some semi-permanent data for the whole task. It could be anything that takes 100MB of heap and holds the data across individual calls to map().

Quoting Allen: "Java takes more RAM than just the heap size. Sometimes 2-3x as much."

Is there a clear indication that Java memory usage extends so far beyond its allocated heap? E.g. would java thread stacks really account for such a big increase, 2x to 3x? Tasks seem to be heavily threaded. What are the relevant config options to control the number of threads within a task?

Java typically uses 5MB to 60MB for classloader data (statics, classes) and some space for threads, etc. The default thread stack on most OS's is about 1MB, and the number of threads in a task process is on the order of a dozen. Getting 2-3x the space in a java process outside the heap would require either a huge thread count, a large native library loaded, or perhaps a non-java hadoop job using pipes. It would be rather obvious in 'top' if you sort by memory (shift-M on linux), or in vmstat, etc. To get the current size of the heap of a process, you can use jstat, or 'kill -3' to create a stack dump and heap summary.

With this new setup, I don't normally get swapping for a single job, e.g. terasort or a hive job. However, the problem in general is exacerbated if one spawns multiple independent hadoop jobs simultaneously. I've noticed that JVMs are not re-used across jobs, as described in an earlier post: http://www.mail-archive.com/common-...@hadoop.apache.org/msg01174.html
This implies that Java memory usage would blow up when submitting multiple independent jobs, so the multiple-job scenario sounds more susceptible to swapping.

The maximum number of map and reduce tasks per node applies no matter how many jobs are running.

A relevant question: in production environments, do people run jobs in parallel? Or is the majority of jobs a serial pipeline / cascade of jobs run back to back?

Jobs are absolutely run in parallel. I recommend using the fair scheduler with no config parameters other than 'assignmultiple = true' as the 'baseline' scheduler, and adjusting from there. The Capacity Scheduler has more tuning knobs for dealing with memory constraints if jobs have drastically different memory needs. The out-of-the-box FIFO scheduler tends to have a hard time keeping cluster utilization high when there are multiple jobs to run.

thanks,
- Vasilis
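For reference, the fair-scheduler baseline suggested above can be enabled with two properties in mapred-site.xml (property names as used by the 0.20-era fair scheduler; the fair scheduler jar must be on the JobTracker classpath):

```xml
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
<!-- Allow assigning both a map and a reduce per heartbeat. -->
<property>
  <name>mapred.fairscheduler.assignmultiple</name>
  <value>true</value>
</property>
```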
Re: Error converting WordCount to v0.20.x
I tried the same thing and noted that even the map function was not executed! Here are the logs:

$ hadoop jar wordcount.jar org.stebourbi.hadoop.training.WordCount input output
10/04/01 23:39:53 INFO security.Groups: Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping; cacheTimeout=30
10/04/01 23:39:53 WARN conf.Configuration: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
10/04/01 23:39:53 DEBUG mapreduce.JobSubmitter: Configuring job job_201004012334_0007 with hdfs://localhost:9000/tmp/hadoop-tebourbi/mapred/staging/tebourbi/.staging/job_201004012334_0007 as the submit dir
10/04/01 23:39:53 WARN mapreduce.JobSubmitter: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
10/04/01 23:39:53 DEBUG mapreduce.JobSubmitter: default FileSystem: hdfs://localhost:9000
10/04/01 23:39:54 DEBUG mapreduce.JobSubmitter: Creating splits at hdfs://localhost:9000/tmp/hadoop-tebourbi/mapred/staging/tebourbi/.staging/job_201004012334_0007
10/04/01 23:39:54 INFO input.FileInputFormat: Total input paths to process : 3
10/04/01 23:39:54 DEBUG input.FileInputFormat: Total # of splits: 3
10/04/01 23:39:54 WARN conf.Configuration: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
10/04/01 23:39:54 INFO mapreduce.JobSubmitter: number of splits:3
10/04/01 23:39:54 INFO mapreduce.JobSubmitter: adding the following namenodes' delegation tokens:null
10/04/01 23:39:54 INFO mapreduce.Job: Running job: job_201004012334_0007
10/04/01 23:39:55 INFO mapreduce.Job: map 0% reduce 0%
10/04/01 23:39:55 INFO mapreduce.Job: Job complete: job_201004012334_0007
10/04/01 23:39:55 INFO mapreduce.Job: Counters: 4
        Job Counters
                Total time spent by all maps waiting after reserving slots (ms)=0
                Total time spent by all reduces waiting after reserving slots (ms)=0
                SLOTS_MILLIS_MAPS=0
                SLOTS_MILLIS_REDUCES=0

However, the same code works well in eclipse as a simple java program!

Slim.
2010/3/28 Chris Williams chris.d.willi...@gmail.com:

I am working through the WordCount example to get rid of all the deprecation warnings. While running it, my reduce function isn't being called. Any ideas? The code below can also be found here: http://gist.github.com/346975

Thanks!
Chris

package hadoop.examples;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;

public class WordCount extends Configured implements Tool {

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new WordCount(), args);
        System.exit(res);
    }

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        Job job = new Job(conf, "wordcount");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        //job.setCombinerClass(Reduce.class);
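For what it's worth, the usual culprit in this ported example is the reduce signature: the new-API Reducer expects reduce(KEY, Iterable<VALUE>, Context), and a method taking Iterator<IntWritable> is a new overload rather than an override, so the framework silently runs the default identity reduce instead. Here is a minimal, Hadoop-free sketch of that Java pitfall (the class and method names below are made up for illustration):

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class OverrideDemo {
    static class Base<V> {
        // Stands in for Reducer.reduce(KEYIN, Iterable<VALUEIN>, Context).
        protected String reduce(Iterable<V> values) {
            return "default (identity) reduce";
        }
        // Stands in for the framework's run loop, which always calls
        // the Iterable variant.
        String run(List<V> values) {
            return reduce(values);
        }
    }

    static class Sub extends Base<Integer> {
        // Iterator instead of Iterable: this is an overload, NOT an
        // override, so Base.run never dispatches to it.
        protected String reduce(Iterator<Integer> values) {
            return "custom reduce";
        }
    }

    public static void main(String[] args) {
        // The custom method is never called.
        System.out.println(new Sub().run(Arrays.asList(1, 2, 3)));
        // prints: default (identity) reduce
    }
}
```

Adding @Override to the subclass method turns this silent bug into a compile error, which is a good habit when porting to the 0.20 API; changing the WordCount reducer's parameter from Iterator<IntWritable> to Iterable<IntWritable> (and iterating with a for-each loop) should make the reduce run.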
How to Recommission?
How do you recommission or decommission DataNode(s) in hadoop?

Decommission (removing some DataNodes): On a large cluster, removing one or two data-nodes will not lead to any data loss, because the name-node will replicate their blocks as soon as it detects that the nodes are dead. With a large number of nodes getting removed or dying, the probability of losing data is higher.

Hadoop offers the decommission feature to retire a set of existing data-nodes. The nodes to be retired should be listed in the exclude file, and the exclude file name should be specified via the configuration parameter dfs.hosts.exclude. This file should have been specified during name-node startup; it can be a zero-length file. You must use the full hostname, ip, or ip:port format in this file. Then the shell command

bin/hadoop dfsadmin -refreshNodes

should be run, which forces the name-node to re-read the exclude file and start the decommission process.

Decommission does not happen momentarily, since it requires replication of a potentially large number of blocks, and we do not want the cluster to be overwhelmed with just this one job. The decommission progress can be monitored on the name-node Web UI. Until all blocks are replicated, the node will be in the "Decommission In Progress" state; when decommission is done, the state changes to "Decommissioned". The nodes can be removed whenever decommission is finished.

But how to recommission? I'd appreciate your help. Thanks.
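Recommissioning is essentially the reverse of the decommission steps described above: remove the node's entry from the file named by dfs.hosts.exclude (and make sure it appears in the dfs.hosts include file, if one is configured), then run dfsadmin -refreshNodes again so the name-node re-reads both files. A sketch, where the hostname and the conf/excludes path are examples only (use whatever your dfs.hosts.exclude points at):

```shell
EXCLUDES=conf/excludes          # the file named by dfs.hosts.exclude
HOST=datanode14.example.com     # the node to bring back

# (setup for this sketch only: a sample exclude file listing two hosts)
mkdir -p conf
printf '%s\n%s\n' "$HOST" "other.example.com" > "$EXCLUDES"

# 1. Drop the recommissioned host from the exclude file
grep -v "^${HOST}\$" "$EXCLUDES" > "$EXCLUDES.tmp" && mv "$EXCLUDES.tmp" "$EXCLUDES"

# 2. Make the name-node re-read the include/exclude files:
#      bin/hadoop dfsadmin -refreshNodes
# 3. If the DataNode daemon on that host was stopped, restart it there:
#      bin/hadoop-daemon.sh start datanode
```

After the refresh, the node should drop out of the "Decommissioned" list on the name-node Web UI and start accepting blocks again once its DataNode daemon re-registers.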