Re: Problem with large .lzo files
Here is the ArrayIndexOutOfBoundsException from LzoDecompressor. I have 30 lzo files of about 800MB each. Sixteen of them were processed successfully by the Mapper, with around 60 splits each. File part-r-00012.lzo failed at split 45, so the failure was clearly not size related. I copied out part-r-00012.lzo, split it in half, and lzo'ed the smaller files. I am now running the same set of input, but with the smaller files instead, to see if the process completes. Below is the abridged stack trace.

2010-02-21 11:15:34,718 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId=
2010-02-21 11:15:34,909 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 100
2010-02-21 11:15:34,967 INFO org.apache.hadoop.mapred.MapTask: data buffer = 79691776/99614720
2010-02-21 11:15:34,967 INFO org.apache.hadoop.mapred.MapTask: record buffer = 262144/327680
2010-02-21 11:15:34,972 INFO com.hadoop.compression.lzo.GPLNativeCodeLoader: Loaded native gpl library
2010-02-21 11:15:34,977 INFO com.hadoop.compression.lzo.LzoCodec: Successfully loaded initialized native-lzo library [hadoop-lzo rev fatal: Not a git repository (or any of the parent directories): .git]
2010-02-21 11:15:35,124 INFO com.foo.bar.mapreduce.baz.BazFaz: InputFile=hdfs://foo-ndbig03.lax1.foo.com:9000/user/hadoop/baz/faz_old/part-r-00012.lzo
2010-02-21 11:15:35,124 INFO com.foo.bar.mapreduce.baz.BazFaz: maxFazCount =1000
2010-02-21 11:15:38,438 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: record full = true
2010-02-21 11:15:38,438 INFO org.apache.hadoop.mapred.MapTask: bufstart = 0; bufend = 18821149; bufvoid = 99614720
2010-02-21 11:15:38,438 INFO org.apache.hadoop.mapred.MapTask: kvstart = 0; kvend = 262144; length = 327680
2010-02-21 11:15:39,125 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor
2010-02-21 11:15:39,738 INFO org.apache.hadoop.mapred.MapTask: Finished spill 0
2010-02-21 11:15:41,917 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: record full = true
2010-02-21 11:15:41,917 INFO org.apache.hadoop.mapred.MapTask: bufstart = 18821149; bufend = 37722479; bufvoid = 99614720
2010-02-21 11:15:41,918 INFO org.apache.hadoop.mapred.MapTask: kvstart = 262144; kvend = 196607; length = 327680
2010-02-21 11:15:42,642 INFO org.apache.hadoop.mapred.MapTask: Finished spill 1
2010-02-21 11:15:44,564 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: record full = true
2010-02-21 11:15:44,565 INFO org.apache.hadoop.mapred.MapTask: bufstart = 37722479; bufend = 56518058; bufvoid = 99614720
2010-02-21 11:15:44,565 INFO org.apache.hadoop.mapred.MapTask: kvstart = 196607; kvend = 131070; length = 327680
2010-02-21 11:15:45,485 INFO org.apache.hadoop.mapred.MapTask: Finished spill 2
...
2010-02-21 11:17:33,280 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: record full = true
2010-02-21 11:17:33,280 INFO org.apache.hadoop.mapred.MapTask: bufstart = 32585253; bufend = 51221187; bufvoid = 99614720
2010-02-21 11:17:33,280 INFO org.apache.hadoop.mapred.MapTask: kvstart = 65501; kvend = 327645; length = 327680
2010-02-21 11:17:34,056 INFO org.apache.hadoop.mapred.MapTask: Finished spill 44
2010-02-21 11:17:35,914 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: record full = true
2010-02-21 11:17:35,914 INFO org.apache.hadoop.mapred.MapTask: bufstart = 51221187; bufend = 70104763; bufvoid = 99614720
2010-02-21 11:17:35,914 INFO org.apache.hadoop.mapred.MapTask: kvstart = 327645; kvend = 262108; length = 327680
2010-02-21 11:17:36,731 INFO org.apache.hadoop.mapred.MapTask: Finished spill 45
2010-02-21 11:17:37,802 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
java.lang.ArrayIndexOutOfBoundsException
        at com.hadoop.compression.lzo.LzoDecompressor.setInput(LzoDecompressor.java:200)
        at com.hadoop.compression.lzo.LzopDecompressor.setInput(LzopDecompressor.java:98)
        at com.hadoop.compression.lzo.LzopInputStream.getCompressedData(LzopInputStream.java:297)
        at com.hadoop.compression.lzo.LzopInputStream.decompress(LzopInputStream.java:232)
        at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:74)
        at java.io.InputStream.read(InputStream.java:85)
        at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
        at org.apache.hadoop.util.LineReader.readLine(LineReader.java:187)
        at com.hadoop.mapreduce.LzoLineRecordReader.nextKeyValue(LzoLineRecordReader.java:126)
        at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:423)
        at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:583)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)
2010-02-21 11:17:37,809 INFO org.apache.hadoop.mapred.TaskRunner: Runnning cleanup for the task
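The trace ends inside LzoDecompressor.setInput, and the "around 256 chunks" observation elsewhere in this thread is the kind of symptom you get when a count or offset is narrowed to a signed type somewhere. As a purely hypothetical illustration (this is NOT the actual hadoop-lzo code, and the class and method names below are invented), here is how narrowing a chunk count to a signed byte produces a negative array index and exactly this exception:

```java
// Hypothetical illustration only -- not the actual hadoop-lzo code.
// If a chunk count is ever narrowed to a signed byte, values past 127
// wrap negative, and using the result as an array index throws the same
// ArrayIndexOutOfBoundsException seen in the stack trace above.
public class NarrowedIndexDemo {

    // Look up the table entry for the given chunk number, but narrow the
    // count to a byte first (the hypothetical bug).
    public static int lookup(int[] table, int chunkCount) {
        byte idx = (byte) chunkCount;   // (byte) 200 wraps to -56
        return table[idx];              // negative index -> AIOOBE
    }

    public static void main(String[] args) {
        int[] table = new int[512];
        System.out.println(lookup(table, 100));   // fine: 100 fits in a signed byte
        try {
            lookup(table, 200);                   // wraps negative and throws
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("caught: " + e);
        }
    }
}
```

If something like this were the cause, it would also explain why splitting the file in half makes the problem disappear: fewer chunks per file keeps the count inside the narrow type's range.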
Re: Problem with large .lzo files
Splitting the problem file into two smaller ones allowed the process to succeed. This, I believe, points to a bug in LzoDecompressor.
Re: Problem with large .lzo files
On Sun, Feb 14, 2010 at 12:46 PM, Todd Lipcon t...@cloudera.com wrote:

> Hi Steve,
>
> I'm not sure here whether you mean that the DistributedLzoIndexer job is
> failing, or if the job on the resulting split file is failing. Could you
> clarify?

The DistributedLzoIndexer job did complete successfully. It was one of the jobs on the resulting split files that always failed, while the jobs on the other splits succeeded.

By the way, if all files have been indexed, DistributedLzoIndexer does not detect that, and hadoop throws an exception complaining that the input dir (or file) does not exist. I work around this by catching the exception.

> > - It's possible to sacrifice parallelism by having hadoop work on each
> > .lzo file without indexing. This worked well until the file size exceeded
> > 30G, when the array indexing exception got thrown. Apparently the code
> > processed the file in chunks and stored references to the chunks in an
> > array. When the number of chunks exceeded a certain limit (around 256, as
> > I recall), the exception was thrown.
> > - My current workaround is to increase the number of reducers to keep the
> > .lzo file size low.
> >
> > I would like to get advice on how people handle large .lzo files. Any
> > pointers on the cause of the stack trace below and the best way to
> > resolve it are greatly appreciated.
>
> Is this reproducible every time? If so, is it always at the same point in
> the LZO file that it occurs?

It's at the same point. Do you know how to print out the lzo index for the task? I only print out the input file now.

> Would it be possible to download that lzo file to your local box and use
> lzop -d to see if it decompresses successfully? That way we can isolate
> whether it's a compression bug or a decompression bug.

Both the java LzoDecompressor and lzop -d were able to decompress the file correctly. As a matter of fact, my job does not index .lzo files now but processes each as a whole, and it works.
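Rather than catching the exception when everything is already indexed, one alternative is to filter the input before submitting the indexer job. A minimal sketch, assuming the hadoop-lzo convention of writing the index as a sibling file named `<file>.lzo.index` (for files on HDFS you would do the same check with FileSystem.exists() instead of java.io.File):

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Sketch: skip .lzo files that already have a sibling .index file, so the
// indexer job is only submitted when there is real work left to do.
public class LzoIndexFilter {

    // True if this is a .lzo file with no "<file>.lzo.index" next to it.
    public static boolean needsIndex(File lzoFile) {
        return lzoFile.getName().endsWith(".lzo")
                && !new File(lzoFile.getPath() + ".index").exists();
    }

    // Collect the unindexed .lzo files in a directory.
    public static List<File> filesToIndex(File dir) {
        List<File> result = new ArrayList<File>();
        File[] children = dir.listFiles();
        if (children != null) {
            for (File f : children) {
                if (needsIndex(f)) {
                    result.add(f);
                }
            }
        }
        return result;
    }
}
```

If filesToIndex() returns an empty list, skip the indexer run entirely, and the "input dir does not exist" exception never comes up.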
Re: Problem with large .lzo files
On Mon, Feb 15, 2010 at 8:07 AM, Steve Kuo kuosen...@gmail.com wrote:

> On Sun, Feb 14, 2010 at 12:46 PM, Todd Lipcon t...@cloudera.com wrote:
>
> By the way, if all files have been indexed, DistributedLzoIndexer does not
> detect that and hadoop throws an exception complaining that the input dir
> (or file) does not exist. I work around this by catching the exception.

Just fixed that in my github repo. Thanks for the bug report.

> > > - It's possible to sacrifice parallelism by having hadoop work on each
> > > .lzo file without indexing. This worked well until the file size
> > > exceeded 30G, when the array indexing exception got thrown. Apparently
> > > the code processed the file in chunks and stored references to the
> > > chunks in an array. When the number of chunks exceeded a certain limit
> > > (around 256 was my recollection), the exception was thrown.
> > > - My current workaround is to increase the number of reducers to keep
> > > the .lzo file size low.
> > >
> > > I would like to get advice on how people handle large .lzo files. Any
> > > pointers on the cause of the stack trace below and the best way to
> > > resolve it are greatly appreciated.
> >
> > Is this reproducible every time? If so, is it always at the same point in
> > the LZO file that it occurs?
>
> It's at the same point. Do you know how to print out the lzo index for the
> task? I only print out the input file now.

You should be able to downcast the InputSplit to FileSplit, if you're using the new API. From there you can get the start and length of the split.

> > Would it be possible to download that lzo file to your local box and use
> > lzop -d to see if it decompresses successfully? That way we can isolate
> > whether it's a compression bug or decompression.
>
> Both the java LzoDecompressor and lzop -d were able to decompress the file
> correctly. As a matter of fact, my job does not index .lzo files now but
> processes each as a whole, and it works.

Interesting. If you can somehow make a reproducible test case I'd be happy to look into this.

Thanks
-Todd
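Todd's tip also answers the "how do I print the lzo index for the task" question: once a task knows its split's start and length (FileSplit.getStart() and FileSplit.getLength() in the new API), it can log which indexed block offsets the split covers. The following is a standalone sketch; it uses plain longs in place of the Hadoop types so it can run outside a cluster, and it assumes the index has been read into a sorted long[] of compressed-block start offsets:

```java
import java.util.Arrays;

// Sketch: given the sorted block offsets from a .lzo index and a split's
// (start, length) as reported by FileSplit, return the offsets of the
// compressed blocks that fall inside the split, so a failing task can log
// exactly which part of the file it was handed.
public class SplitIndexReport {

    public static long[] blocksInSplit(long[] indexOffsets, long start, long length) {
        long end = start + length;
        int lo = Arrays.binarySearch(indexOffsets, start);
        if (lo < 0) {
            lo = -lo - 1;               // insertion point: first offset >= start
        }
        int hi = lo;
        while (hi < indexOffsets.length && indexOffsets[hi] < end) {
            hi++;                       // advance past every block starting before end
        }
        return Arrays.copyOfRange(indexOffsets, lo, hi);
    }

    public static void main(String[] args) {
        long[] index = {0L, 1000L, 2500L, 4000L, 6000L};
        // A split covering bytes [1000, 4000) holds the blocks at 1000 and 2500.
        System.out.println(Arrays.toString(blocksInSplit(index, 1000L, 3000L)));
    }
}
```

Logging this from the record reader for the failing task would show whether the bad split always lines up with the same index entries.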
Re: Problem with large .lzo files
> You should be able to downcast the InputSplit to FileSplit, if you're
> using the new API. From there you can get the start and length of the
> split.

Cool, let me give it a shot.

> Interesting. If you can somehow make a reproducible test case I'd be happy
> to look into this.

This sounds great. As the input file is 1G, let me do some work on my side to see if I can pinpoint it, so as not to have to transfer a 1G file around. Thanks.
Re: Problem with large .lzo files
Hi Steve,

On Sun, Feb 14, 2010 at 12:11 PM, Steve Kuo kuosen...@gmail.com wrote:

> I am running a hadoop job that combines daily results with the results of
> previous days. The reduce output is lzo compressed and growing daily in
> size.
>
> - DistributedLzoIndexer is used to index lzo files to provide parallelism.
> When the lzo files were small, everything went well. As the size of the
> .lzo files grows, the chance that one of the partitions does not complete
> increases. The exception I got for one such case is listed at the end of
> the post.

I'm not sure here whether you mean that the DistributedLzoIndexer job is failing, or if the job on the resulting split file is failing. Could you clarify?

> - It's possible to sacrifice parallelism by having hadoop work on each
> .lzo file without indexing. This worked well until the file size exceeded
> 30G, when the array indexing exception got thrown. Apparently the code
> processed the file in chunks and stored references to the chunks in an
> array. When the number of chunks exceeded a certain limit (around 256 was
> my recollection), the exception was thrown.
> - My current workaround is to increase the number of reducers to keep the
> .lzo file size low.
>
> I would like to get advice on how people handle large .lzo files. Any
> pointers on the cause of the stack trace below and the best way to resolve
> it are greatly appreciated.

Is this reproducible every time? If so, is it always at the same point in the LZO file that it occurs?

Would it be possible to download that lzo file to your local box and use lzop -d to see if it decompresses successfully? That way we can isolate whether it's a compression bug or decompression.

Thanks
-Todd