What is the best way to terminate a Map job without it being retried
I have a collection of dirty data files, which I can detect during the setup() phase of my map task. It would be best if I could quit the task and prevent it from being retried. What is the best practice for doing this? Thanks in advance.
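One hedged sketch of a way to do this (assuming the dirty check can run in setup(); looksDirty() below is a hypothetical placeholder for that check): since setup() runs before any records are processed, set a flag there and make map() a no-op. The task then completes successfully, so the framework never schedules a retry.

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Sketch, not a drop-in class: skip dirty inputs instead of failing,
// so no retry is ever triggered.
public class SkippingMapper extends Mapper<LongWritable, Text, Text, Text> {
    private boolean skip = false;

    @Override
    protected void setup(Context context) throws IOException {
        Path file = ((FileSplit) context.getInputSplit()).getPath();
        skip = looksDirty(file);            // hypothetical dirty-file check
        if (skip) {
            // record the skip so it is visible in the job counters
            context.getCounter("audit", "dirty-files").increment(1);
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) {
        if (skip) return;                   // consume records, emit nothing
        // ... normal processing ...
    }

    private boolean looksDirty(Path file) { return false; /* placeholder */ }
}
```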
Re: Parallel out-of-order Map capability?
How about batching up calls from n lines, making a synchronous call to the server with the batch, getting the batch results, and going through the result set one by one? This assumes the server returns batched results in order.
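The batching idea above can be sketched in plain Java; callServer here is a stand-in for the real synchronous batched RPC, which is assumed to return one result per input line, in order:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class LineBatcher {
    // Groups lines into batches of at most n, sends each batch through one
    // synchronous call, and flattens the per-batch results in order.
    public static List<String> processInBatches(
            List<String> lines, int n,
            Function<List<String>, List<String>> callServer) {
        List<String> results = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += n) {
            List<String> batch = lines.subList(i, Math.min(i + n, lines.size()));
            results.addAll(callServer.apply(batch));  // one round trip per batch
        }
        return results;
    }
}
```

Larger n means fewer round trips but more latency before the first result; the right batch size depends on the server.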
Re: Access Error
This could be caused by different user accounts. Is the user "hadoop" when running the job on the master and "bhardy" on the remote client?
Re: How to config Map only job to read .gz input files and output result in .lzo
Thanks, Ed. It works like a charm.
How to config Map only job to read .gz input files and output result in .lzo
We have TBs worth of XML data in .gz format, where each file is about 20 MB. This dataset is not expected to change. My goal is to write a map-only job that reads in one .gz file at a time and outputs the result in .lzo format. Since there are a large number of .gz files, map parallelism is expected to be maximized. I am using Kevin Weil's LZO distribution, and there does not seem to be an LzoTextOutputFormat. When I got lzo to work before, I set the input format class to LzoTextInputFormat.class and the map output got lzo compressed automatically. What does one configure for LZO output? The current job configuration code listed below does not work. XmlInputFormat is my custom input format for reading XML files.

job.setInputFormatClass(XmlInputFormat.class);
job.setMapperClass(XmlAnalyzer.XmlAnalyzerMapper.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);

String mapredOutputCompress = conf.get("mapred.output.compress");
if ("true".equals(mapredOutputCompress))
    // this reads input and writes output in lzo format
    job.setInputFormatClass(LzoTextInputFormat.class);
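For anyone hitting the same question: in Kevin Weil's distribution there is indeed no LzoTextOutputFormat; output compression is configured on the output format instead. A minimal sketch (class names per hadoop-lzo; verify against your build):

```java
// Keep TextOutputFormat and tell it to compress with the lzop codec.
job.setOutputFormatClass(TextOutputFormat.class);
FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job,
    com.hadoop.compression.lzo.LzopCodec.class);
// Note: LzopCodec (not LzoCodec) writes .lzo files that lzop can read
// and that can later be indexed for splitting; LzoCodec writes
// .lzo_deflate files instead.
```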
Re: Best way to reduce an 8-node cluster in half and get hdfs to come out of safe mode
Thanks Allen for your advice.
Best way to reduce an 8-node cluster in half and get hdfs to come out of safe mode
As part of our experimentation, the plan is to pull 4 slave nodes out of an 8-slave/1-master cluster. With the replication factor set to 3, I thought losing half of the cluster might be too much for hdfs to recover from, so I copied all relevant data out of hdfs to local disk and reconfigured the cluster. The 4 slave nodes started okay, but hdfs never left safe mode. The nn.log has the following lines. What is the best way to deal with this? Shall I restart the cluster with 8 nodes and then delete /data/hadoop-hadoop/mapred/system? Or shall I reformat hdfs? Thanks.

2010-08-05 22:28:12,921 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=hadoop,hadoop ip=/10.128.135.100 cmd=listStatus src=/data/hadoop-hadoop/mapred/system dst=null perm=null
2010-08-05 22:28:12,923 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 9000, call delete(/data/hadoop-hadoop/mapred/system, true) from 10.128.135.100:52368: error: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot delete /data/hadoop-hadoop/mapred/system. Name node is in safe mode. The reported blocks 64 needs additional 3 blocks to reach the threshold 0.9990 of total blocks 68. Safe mode will be turned off automatically.
org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot delete /data/hadoop-hadoop/mapred/system. Name node is in safe mode. The reported blocks 64 needs additional 3 blocks to reach the threshold 0.9990 of total blocks 68. Safe mode will be turned off automatically.
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.deleteInternal(FSNamesystem.java:1741)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.delete(FSNamesystem.java:1721)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.delete(NameNode.java:565)
        at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:512)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:968)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:964)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:962)
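If the four missing blocks only ever had replicas on the removed nodes, the namenode will wait in safe mode indefinitely. One way out (a sketch; forcing safe mode off accepts that those blocks are lost for good) is:

```shell
# Check safe mode status, then force the namenode out of safe mode:
hadoop dfsadmin -safemode get
hadoop dfsadmin -safemode leave

# List files with missing/corrupt blocks so they can be deleted or restored:
hadoop fsck / -files -blocks -locations
```

Since /data/hadoop-hadoop/mapred/system only holds transient jobtracker state, deleting it after leaving safe mode should be harmless.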
Re: Problem with large .lzo files
Splitting the problem file into two smaller ones allowed the process to succeed. This, I believe, points to a bug in LzoDecompressor.
Re: Problem with large .lzo files
Here is the ArrayIndexOutOfBoundsException from LzoDecompressor. I have 30 .lzo files of about 800 MB each. 16 of them were processed by the mapper successfully, with around 60 splits each. File part-r-00012.lzo failed at split 45, so it's clear the failure was not size related. I have copied out part-r-00012.lzo, split it in half, and lzo'ed the smaller files. I am now running the same set of input but with the smaller files instead to see if the process completes. Below is the abridged stack trace.

2010-02-21 11:15:34,718 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId=
2010-02-21 11:15:34,909 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 100
2010-02-21 11:15:34,967 INFO org.apache.hadoop.mapred.MapTask: data buffer = 79691776/99614720
2010-02-21 11:15:34,967 INFO org.apache.hadoop.mapred.MapTask: record buffer = 262144/327680
2010-02-21 11:15:34,972 INFO com.hadoop.compression.lzo.GPLNativeCodeLoader: Loaded native gpl library
2010-02-21 11:15:34,977 INFO com.hadoop.compression.lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev fatal: Not a git repository (or any of the parent directories): .git]
2010-02-21 11:15:35,124 INFO com.foo.bar.mapreduce.baz.BazFaz: InputFile=hdfs://foo-ndbig03.lax1.foo.com:9000/user/hadoop/baz/faz_old/part-r-00012.lzo
2010-02-21 11:15:35,124 INFO com.foo.bar.mapreduce.baz.BazFaz: maxFazCount =1000
2010-02-21 11:15:38,438 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: record full = true
2010-02-21 11:15:38,438 INFO org.apache.hadoop.mapred.MapTask: bufstart = 0; bufend = 18821149; bufvoid = 99614720
2010-02-21 11:15:38,438 INFO org.apache.hadoop.mapred.MapTask: kvstart = 0; kvend = 262144; length = 327680
2010-02-21 11:15:39,125 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor
2010-02-21 11:15:39,738 INFO org.apache.hadoop.mapred.MapTask: Finished spill 0
2010-02-21 11:15:41,917 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: record full = true
2010-02-21 11:15:41,917 INFO org.apache.hadoop.mapred.MapTask: bufstart = 18821149; bufend = 37722479; bufvoid = 99614720
2010-02-21 11:15:41,918 INFO org.apache.hadoop.mapred.MapTask: kvstart = 262144; kvend = 196607; length = 327680
2010-02-21 11:15:42,642 INFO org.apache.hadoop.mapred.MapTask: Finished spill 1
2010-02-21 11:15:44,564 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: record full = true
2010-02-21 11:15:44,565 INFO org.apache.hadoop.mapred.MapTask: bufstart = 37722479; bufend = 56518058; bufvoid = 99614720
2010-02-21 11:15:44,565 INFO org.apache.hadoop.mapred.MapTask: kvstart = 196607; kvend = 131070; length = 327680
2010-02-21 11:15:45,485 INFO org.apache.hadoop.mapred.MapTask: Finished spill 2
...
2010-02-21 11:17:33,280 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: record full = true
2010-02-21 11:17:33,280 INFO org.apache.hadoop.mapred.MapTask: bufstart = 32585253; bufend = 51221187; bufvoid = 99614720
2010-02-21 11:17:33,280 INFO org.apache.hadoop.mapred.MapTask: kvstart = 65501; kvend = 327645; length = 327680
2010-02-21 11:17:34,056 INFO org.apache.hadoop.mapred.MapTask: Finished spill 44
2010-02-21 11:17:35,914 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: record full = true
2010-02-21 11:17:35,914 INFO org.apache.hadoop.mapred.MapTask: bufstart = 51221187; bufend = 70104763; bufvoid = 99614720
2010-02-21 11:17:35,914 INFO org.apache.hadoop.mapred.MapTask: kvstart = 327645; kvend = 262108; length = 327680
2010-02-21 11:17:36,731 INFO org.apache.hadoop.mapred.MapTask: Finished spill 45
2010-02-21 11:17:37,802 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
java.lang.ArrayIndexOutOfBoundsException
        at com.hadoop.compression.lzo.LzoDecompressor.setInput(LzoDecompressor.java:200)
        at com.hadoop.compression.lzo.LzopDecompressor.setInput(LzopDecompressor.java:98)
        at com.hadoop.compression.lzo.LzopInputStream.getCompressedData(LzopInputStream.java:297)
        at com.hadoop.compression.lzo.LzopInputStream.decompress(LzopInputStream.java:232)
        at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:74)
        at java.io.InputStream.read(InputStream.java:85)
        at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
        at org.apache.hadoop.util.LineReader.readLine(LineReader.java:187)
        at com.hadoop.mapreduce.LzoLineRecordReader.nextKeyValue(LzoLineRecordReader.java:126)
        at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:423)
        at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:583)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)
2010-02-21 11:17:37,809 INFO org.apache.hadoop.mapred.TaskRunner: Runnning cleanup for the task
Re: Problem with large .lzo files
> You should be able to downcast the InputSplit to FileSplit, if you're using the new API. From there you can get the start and length of the split.

Cool, let me give it a shot.

> Interesting. If you can somehow make a reproducible test case I'd be happy to look into this.

This sounds great. As the input file is 1G, let me do some work on my side to see if I can pinpoint it so as not to have to transfer a 1G file around. Thanks.
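For reference, the downcast suggested above might look like this under the new (o.a.h.mapreduce) API, placed wherever the task has a Context, e.g. in setup():

```java
// Log which byte range of which file this task is reading, so a failing
// split can be pinpointed.
FileSplit split = (FileSplit) context.getInputSplit();
System.err.println("file=" + split.getPath()
    + " start=" + split.getStart()
    + " length=" + split.getLength());
```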
Re: Problem with large .lzo files
On Sun, Feb 14, 2010 at 12:46 PM, Todd Lipcon wrote:
> Hi Steve,
>
> I'm not sure here whether you mean that the DistributedLzoIndexer job is failing, or if the job on the resulting split file is failing. Could you clarify?

The DistributedLzoIndexer job did complete successfully. It was one of the jobs on the resulting split file that always failed while the other splits succeeded. By the way, if all files have already been indexed, DistributedLzoIndexer does not detect that, and hadoop throws an exception complaining that the input dir (or file) does not exist. I work around this by catching the exception.

> > - It's possible to sacrifice parallelism by having hadoop work on each .lzo file without indexing. This worked well until the file size exceeded 30G, when an array indexing exception got thrown. Apparently the code processed the file in chunks and stored the references to the chunks in an array. When the number of chunks was greater than a certain number (around 256 was my recollection), the exception was thrown.
> > - My current workaround is to increase the number of reducers to keep the .lzo file size low.
> >
> > I would like to get advice on how people handle large .lzo files. Any pointers on the cause of the stack trace below and the best way to resolve it are greatly appreciated.
>
> Is this reproducible every time? If so, is it always at the same point in the LZO file that it occurs?

It's at the same point. Do you know how to print out the lzo index for the task? I only print out the input file now.

> Would it be possible to download that lzo file to your local box and use lzop -d to see if it decompresses successfully? That way we can isolate whether it's a compression bug or decompression.

Both the java LzoDecompressor and lzop -d were able to decompress the file correctly. As a matter of fact, my job does not index .lzo files now but processes each as a whole, and it works
Problem with large .lzo files
I am running a hadoop job that combines daily results with results from previous days. The reduce output is lzo compressed and growing in size daily.

- DistributedLzoIndexer is used to index the lzo files to provide parallelism. When the lzo files were small, everything went well. As the .lzo files grow, the chance that one of the partitions does not complete increases. The exception I got for one such case is listed at the end of the post.
- It's possible to sacrifice parallelism by having hadoop work on each .lzo file without indexing. This worked well until the file size exceeded 30G, when an array indexing exception got thrown. Apparently the code processed the file in chunks and stored the references to the chunks in an array. When the number of chunks was greater than a certain number (around 256 was my recollection), the exception was thrown.
- My current workaround is to increase the number of reducers to keep the .lzo file size low.

I would like to get advice on how people handle large .lzo files. Any pointers on the cause of the stack trace below and the best way to resolve it are greatly appreciated.
Task Logs: 'attempt_201002041203_0032_m_000134_3' syslog logs

2010-02-13 19:59:51,839 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId=
2010-02-13 19:59:52,045 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 100
2010-02-13 19:59:52,094 INFO org.apache.hadoop.mapred.MapTask: data buffer = 79691776/99614720
2010-02-13 19:59:52,094 INFO org.apache.hadoop.mapred.MapTask: record buffer = 262144/327680
2010-02-13 19:59:52,099 INFO com.hadoop.compression.lzo.GPLNativeCodeLoader: Loaded native gpl library
2010-02-13 19:59:52,111 INFO com.hadoop.compression.lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev fatal: Not a git repository (or any of the parent directories): .git]
2010-02-13 19:59:52,144 INFO com.baz.facetmr.mapreduce.foo.FooBar: InputFile=hdfs://ndbig03:9000/user/hadoop/foo/basr_old/part-r-0.lzo
2010-02-13 19:59:52,144 INFO com.baz.facetmr.mapreduce.foo.FooBar: maxBarCount =1000
2010-02-13 19:59:55,267 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: record full = true
2010-02-13 19:59:55,268 INFO org.apache.hadoop.mapred.MapTask: bufstart = 0; bufend = 18641936; bufvoid = 99614720
2010-02-13 19:59:55,268 INFO org.apache.hadoop.mapred.MapTask: kvstart = 0; kvend = 262144; length = 327680
2010-02-13 19:59:55,958 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor
2010-02-13 19:59:56,730 INFO org.apache.hadoop.mapred.MapTask: Finished spill 0
2010-02-13 19:59:58,567 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: record full = true
2010-02-13 19:59:58,567 INFO org.apache.hadoop.mapred.MapTask: bufstart = 18641936; bufend = 37532762; bufvoid = 99614720
2010-02-13 19:59:58,567 INFO org.apache.hadoop.mapred.MapTask: kvstart = 262144; kvend = 196607; length = 327680
2010-02-13 19:59:59,487 INFO org.apache.hadoop.mapred.MapTask: Finished spill 1
2010-02-13 20:00:01,382 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: record full = true
2010-02-13 20:00:01,382 INFO org.apache.hadoop.mapred.MapTask: bufstart = 37532762; bufend = 56161807; bufvoid = 99614720
2010-02-13 20:00:01,382 INFO org.apache.hadoop.mapred.MapTask: kvstart = 196607; kvend = 131070; length = 327680
2010-02-13 20:00:02,282 INFO org.apache.hadoop.mapred.MapTask: Finished spill 2
2010-02-13 20:00:04,100 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: record full = true
2010-02-13 20:00:04,100 INFO org.apache.hadoop.mapred.MapTask: bufstart = 56161807; bufend = 74935116; bufvoid = 99614720
2010-02-13 20:00:04,100 INFO org.apache.hadoop.mapred.MapTask: kvstart = 131070; kvend = 65533; length = 327680
2010-02-13 20:00:05,100 INFO org.apache.hadoop.mapred.MapTask: Finished spill 3
2010-02-13 20:00:06,177 FATAL org.apache.hadoop.mapred.TaskTracker: Error running child : java.lang.InternalError: lzo1x_decompress_safe returned:
        at com.hadoop.compression.lzo.LzoDecompressor.decompressBytesDirect(Native Method)
        at com.hadoop.compression.lzo.LzoDecompressor.decompress(LzoDecompressor.java:303)
        at com.hadoop.compression.lzo.LzopDecompressor.decompress(LzopDecompressor.java:104)
        at com.hadoop.compression.lzo.LzopInputStream.decompress(LzopInputStream.java:223)
        at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:74)
        at java.io.InputStream.read(InputStream.java:85)
        at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
        at org.apache.hadoop.util.LineReader.readLine(LineReader.java:187)
        at com.hadoop.mapreduce.LzoLineRecordReader.nextKeyValue(LzoLineRecordReader.java:126)
        at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:423)
        at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
        at org.
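The workaround of raising the reducer count to cap .lzo file sizes can be made systematic by deriving the count from the expected output size. A small sketch (the size target is an assumption to tune; the poster saw failures past roughly 30G per file, so a much smaller cap leaves headroom):

```java
public class ReducerSizer {
    // Picks a reducer count so each reducer's output stays under a target
    // size, assuming output is spread roughly evenly across reducers.
    public static int reducersFor(long totalOutputBytes, long maxBytesPerFile) {
        long n = (totalOutputBytes + maxBytesPerFile - 1) / maxBytesPerFile; // ceiling division
        return (int) Math.max(1, n);
    }
}
```

The result would then be passed to job.setNumReduceTasks(...) when configuring the job.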
Re: isSplitable() deprecated
Ted,

You may want to consider LZO compression, which allows a compressed file to be split for map tasks. Gzip, on the other hand, is not splittable. Check out these links.

http://www.cloudera.com/blog/2009/11/17/hadoop-at-twitter-part-1-splittable-lzo-compression/
http://wiki.apache.org/hadoop/UsingLzoCompression

On Fri, Jan 8, 2010 at 1:13 PM, Ted Yu wrote:
> The input file is in .gz format
> FYI
>
> On Fri, Jan 8, 2010 at 11:08 AM, Ted Yu wrote:
> > My current project processes input file of size 02161 bytes.
> > What I plan to do is to split the file into equal size pieces (and on blank line boundaries) to improve performance.
> >
> > I found 12 classes in the 0.20.1 source code which implement InputSplit.
> >
> > If someone has written code similar to what I plan to do, please share some hints.
> >
> > Thanks
> >
> > On Fri, Jan 8, 2010 at 2:27 AM, Amogh Vasekar wrote:
> > > Hi,
> > > The deprecation is due to the new evolving mapreduce (o.a.h.mapreduce) APIs. Old APIs are supported for available distributions. The equivalent of TextInputFormat is available in the new API:
> > >
> > > http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/lib/input/TextInputFormat.html
> > >
> > > Thanks,
> > > Amogh
> > >
> > > On 1/8/10 3:47 AM, "Ted Yu" wrote:
> > > According to:
> > > http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/TextInputFormat.html#isSplitable%28org.apache.hadoop.fs.FileSystem,%20org.apache.hadoop.fs.Path%29
> > > isSplitable() is deprecated.
> > > Which method should I use to replace it?
> > > Thanks
Re: Cannot pass dynamic values by Configuration.Set()
There seems to have been a change between the 0.19 and 0.20 APIs in that 0.20 no longer sets "map.input.file". config.set(), as far as I can tell, should work. I, however, use the following to pass parameters:

String[] params = new String[] { "-D", "tag1=string_value", ... };
ToolRunner.run(new Configuration(), new SomeJob(), params);

On Mon, Jan 4, 2010 at 9:52 AM, Farhan Husain wrote:
> Hello all,
>
> I am using hadoop-0.20.1. I need to know the input file name in my map processes and pass an integer and a string to my reducer processes. I used the following method calls for that:
>
> config.set("tag1", "string_value");
> config.setInt("tag2", int_value);
>
> In the setup() method of the mapper:
> String filename = context.getConfiguration().get("map.input.file"); // returns null
>
> In the setup() method of the reducer:
> String val = context.getConfiguration().get("tag1"); // returns null
> int n = context.getConfiguration().getInt("tag2", def_val); // returns def_val
>
> Can anyone please tell me what may be wrong with this code or anything related to it? Is it a bug in this version of Hadoop? Is there any alternative way to accomplish the same objective? I have been stuck with this problem for about one week. I would appreciate it if someone would shed some light on it.
>
> Thanks,
> Farhan
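One common cause of config.set() values "disappearing" (an assumption worth checking against the poster's code, since the snippet doesn't show where the Job is created) is setting them after the Job object is constructed; Job takes a copy of the Configuration at construction time, so later changes to the original conf are not seen by the tasks. A sketch:

```java
// Set values on the Configuration BEFORE creating the Job, because
// Job snapshots the configuration in its constructor.
Configuration conf = new Configuration();
conf.set("tag1", "string_value");
conf.setInt("tag2", 42);
Job job = new Job(conf, "my job");   // copy of conf taken here

// In the reducer's setup(), these are then visible:
//   String val = context.getConfiguration().get("tag1");
//   int n = context.getConfiguration().getInt("tag2", -1);
// For the input file name under 0.20's new API, read it from the split
// instead of "map.input.file":
//   Path p = ((FileSplit) context.getInputSplit()).getPath();
```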
Re: How to ensure LzoTextInputFormat is used to generate input splits for .lzo files
Digging around the new Job API with a rested brain came up with

job.setInputFormatClass(LzoTextInputFormat.class);

which solved the problem.

On Thu, Dec 31, 2009 at 9:53 AM, Steve Kuo wrote:
> I have followed http://www.cloudera.com/blog/2009/11/17/hadoop-at-twitter-part-1-splittable-lzo-compression/ and http://code.google.com/p/hadoop-gpl-compression/wiki/FAQ to build the requisite hadoop-lzo jar and native .so files. (The jar and .so files were built from Kevin Weil's git repository. Thanks Kevin.) I have configured core-site.xml and mapred-site.xml as instructed to enable lzo for both map and reduce output. The creation of the lzo index also worked. The last step was to replace TextInputFormat with LzoTextInputFormat. As I only have
>
> FileInputFormat.addInputPath(jobConf, new Path(inputPath));
>
> it was replaced with
>
> LzoTextInputFormat.addInputPath(job, new Path(inputPath));
>
> When I ran my MR job, I noticed that the new code was able to read in .lzo input files and decompressed them fine. The output was also lzo compressed. However, only one map task was created for each input .lzo file, indicating that input splitting was not done by LzoTextInputFormat but more likely by its parent, such as FileInputFormat. There must be a way to ensure LzoTextInputFormat is used in the map task. How can this be done?
>
> Thanks in advance.
How to ensure LzoTextInputFormat is used to generate input splits for .lzo files
I have followed http://www.cloudera.com/blog/2009/11/17/hadoop-at-twitter-part-1-splittable-lzo-compression/ and http://code.google.com/p/hadoop-gpl-compression/wiki/FAQ to build the requisite hadoop-lzo jar and native .so files. (The jar and .so files were built from Kevin Weil's git repository. Thanks Kevin.) I have configured core-site.xml and mapred-site.xml as instructed to enable lzo for both map and reduce output. The creation of the lzo index also worked. The last step was to replace TextInputFormat with LzoTextInputFormat. As I only have

FileInputFormat.addInputPath(jobConf, new Path(inputPath));

it was replaced with

LzoTextInputFormat.addInputPath(job, new Path(inputPath));

When I ran my MR job, I noticed that the new code was able to read in .lzo input files and decompressed them fine. The output was also lzo compressed. However, only one map task was created for each input .lzo file, indicating that input splitting was not done by LzoTextInputFormat but more likely by its parent, such as FileInputFormat. There must be a way to ensure LzoTextInputFormat is used in the map task. How can this be done? Thanks in advance.