Re: Problem with large .lzo files

2010-02-22 Thread Steve Kuo
Here is the ArrayIndexOutOfBoundsException from LzoDecompressor.  I have 30
lzo files of about 800MB each.  16 of them were processed by the Mapper
successfully, each with around 60 splits.  File part-r-00012.lzo failed at
split 45, so the failure is clearly not size related.  I have copied out
part-r-00012.lzo, split it in half, and lzo'ed the smaller files.  I am now
running the same set of input with the smaller files instead to see if the
process completes.

Below is the abridged stack trace.

2010-02-21 11:15:34,718 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
Initializing JVM Metrics with processName=MAP, sessionId=
2010-02-21 11:15:34,909 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb =
100
2010-02-21 11:15:34,967 INFO org.apache.hadoop.mapred.MapTask: data buffer =
79691776/99614720
2010-02-21 11:15:34,967 INFO org.apache.hadoop.mapred.MapTask: record buffer
= 262144/327680
2010-02-21 11:15:34,972 INFO com.hadoop.compression.lzo.GPLNativeCodeLoader:
Loaded native gpl library
2010-02-21 11:15:34,977 INFO com.hadoop.compression.lzo.LzoCodec:
Successfully loaded & initialized native-lzo library [hadoop-lzo rev fatal:
Not a git repository (or any of the parent directories): .git]
2010-02-21 11:15:35,124 INFO com.foo.bar.mapreduce.baz.BazFaz:
InputFile=hdfs://foo-ndbig03.lax1.foo.com:9000/user/hadoop/baz/faz_old/part-r-00012.lzo
2010-02-21 11:15:35,124 INFO com.foo.bar.mapreduce.baz.BazFaz: maxFazCount
=1000
2010-02-21 11:15:38,438 INFO org.apache.hadoop.mapred.MapTask: Spilling map
output: record full = true
2010-02-21 11:15:38,438 INFO org.apache.hadoop.mapred.MapTask: bufstart = 0;
bufend = 18821149; bufvoid = 99614720
2010-02-21 11:15:38,438 INFO org.apache.hadoop.mapred.MapTask: kvstart = 0;
kvend = 262144; length = 327680
2010-02-21 11:15:39,125 INFO org.apache.hadoop.io.compress.CodecPool: Got
brand-new compressor
2010-02-21 11:15:39,738 INFO org.apache.hadoop.mapred.MapTask: Finished
spill 0
2010-02-21 11:15:41,917 INFO org.apache.hadoop.mapred.MapTask: Spilling map
output: record full = true
2010-02-21 11:15:41,917 INFO org.apache.hadoop.mapred.MapTask: bufstart =
18821149; bufend = 37722479; bufvoid
 = 99614720
2010-02-21 11:15:41,918 INFO org.apache.hadoop.mapred.MapTask: kvstart =
262144; kvend = 196607; length = 327680
2010-02-21 11:15:42,642 INFO org.apache.hadoop.mapred.MapTask: Finished
spill 1
2010-02-21 11:15:44,564 INFO org.apache.hadoop.mapred.MapTask: Spilling map
output: record full = true
2010-02-21 11:15:44,565 INFO org.apache.hadoop.mapred.MapTask: bufstart =
37722479; bufend = 56518058; bufvoid
 = 99614720
2010-02-21 11:15:44,565 INFO org.apache.hadoop.mapred.MapTask: kvstart =
196607; kvend = 131070; length = 327680
2010-02-21 11:15:45,485 INFO org.apache.hadoop.mapred.MapTask: Finished
spill 2

...
2010-02-21 11:17:33,280 INFO org.apache.hadoop.mapred.MapTask: Spilling map
output: record full = true
2010-02-21 11:17:33,280 INFO org.apache.hadoop.mapred.MapTask: bufstart =
32585253; bufend = 51221187; bufvoid
 = 99614720
2010-02-21 11:17:33,280 INFO org.apache.hadoop.mapred.MapTask: kvstart =
65501; kvend = 327645; length = 327680
2010-02-21 11:17:34,056 INFO org.apache.hadoop.mapred.MapTask: Finished
spill 44
2010-02-21 11:17:35,914 INFO org.apache.hadoop.mapred.MapTask: Spilling map
output: record full = true
2010-02-21 11:17:35,914 INFO org.apache.hadoop.mapred.MapTask: bufstart =
51221187; bufend = 70104763; bufvoid
 = 99614720
2010-02-21 11:17:35,914 INFO org.apache.hadoop.mapred.MapTask: kvstart =
327645; kvend = 262108; length = 327680
2010-02-21 11:17:36,731 INFO org.apache.hadoop.mapred.MapTask: Finished
spill 45
2010-02-21 11:17:37,802 WARN org.apache.hadoop.mapred.TaskTracker: Error
running child
java.lang.ArrayIndexOutOfBoundsException
at
com.hadoop.compression.lzo.LzoDecompressor.setInput(LzoDecompressor.java:200)
at
com.hadoop.compression.lzo.LzopDecompressor.setInput(LzopDecompressor.java:98)
at
com.hadoop.compression.lzo.LzopInputStream.getCompressedData(LzopInputStream.java:297)
at
com.hadoop.compression.lzo.LzopInputStream.decompress(LzopInputStream.java:232)
at
org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:74)
at java.io.InputStream.read(InputStream.java:85)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:187)
at
com.hadoop.mapreduce.LzoLineRecordReader.nextKeyValue(LzoLineRecordReader.java:126)
at
org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:423)
at
org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:583)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
2010-02-21 11:17:37,809 INFO org.apache.hadoop.mapred.TaskRunner: Runnning
cleanup for the task


Re: Problem with large .lzo files

2010-02-22 Thread Steve Kuo
Splitting the problem file into two smaller ones allowed the process to
succeed.  This, I believe, points to a bug in LzoDecompressor.


Re: Problem with large .lzo files

2010-02-15 Thread Steve Kuo
On Sun, Feb 14, 2010 at 12:46 PM, Todd Lipcon t...@cloudera.com wrote:

 Hi Steve,

 I'm not sure here whether you mean that the DistributedLzoIndexer job
 is failing, or if the job on the resulting split file is failing. Could
 you clarify?


The DistributedLzoIndexer job did complete successfully.  It was the job on
one of the resulting splits that always failed, while the other splits
succeeded.

By the way, if all files have already been indexed, DistributedLzoIndexer
does not detect that, and hadoop throws an exception complaining that the
input dir (or file) does not exist.  I work around this by catching the
exception.
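A minimal sketch of that workaround, with a stand-in for the indexer call
(the class and method names here are illustrative, not the actual job code):

```java
import java.io.FileNotFoundException;

public class IndexerWorkaround {

    // Stand-in for the DistributedLzoIndexer invocation; the real indexer
    // fails when every .lzo file already has an index and it finds no input.
    static void runIndexer(boolean allFilesIndexed) throws FileNotFoundException {
        if (allFilesIndexed) {
            throw new FileNotFoundException("input dir (or file) does not exist");
        }
        // ... real indexing work would happen here ...
    }

    // Returns true if indexing ran, false if everything was already indexed.
    static boolean indexSafely(boolean allFilesIndexed) {
        try {
            runIndexer(allFilesIndexed);
            return true;
        } catch (FileNotFoundException e) {
            // All inputs were already indexed; treat this as a no-op.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(indexSafely(true));   // prints false: already indexed
        System.out.println(indexSafely(false));  // prints true: indexing ran
    }
}
```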


- It's possible to sacrifice parallelism by having hadoop work on each
.lzo file without indexing.  This worked well until the file size exceeded
30G, when an array indexing exception was thrown.  Apparently the code
processed the file in chunks and stored references to the chunks in an
array.  When the number of chunks exceeded a certain number (around 256,
as I recall), the exception was thrown.
- My current workaround is to increase the number of reducers to keep
the .lzo file size low.
 
  I would like to get advice on how people handle large .lzo files.  Any
  pointers on the cause of the stack trace below and the best way to resolve
  it are greatly appreciated.
 

 Is this reproducible every time? If so, is it always at the same point
 in the LZO file that it occurs?

 It's at the same point.  Do you know how to print out the lzo index for the
task?  I only print out the input file now.


 Would it be possible to download that lzo file to your local box and
 use lzop -d to see if it decompresses successfully? That way we can
 isolate whether it's a compression bug or decompression.

 Both the java LzoDecompressor and lzop -d were able to decompress the file
correctly.  As a matter of fact, my job no longer indexes .lzo files but
processes each as a whole, and it works.


Re: Problem with large .lzo files

2010-02-15 Thread Todd Lipcon
On Mon, Feb 15, 2010 at 8:07 AM, Steve Kuo kuosen...@gmail.com wrote:
 On Sun, Feb 14, 2010 at 12:46 PM, Todd Lipcon t...@cloudera.com wrote:

 By the way, if all files have already been indexed, DistributedLzoIndexer
 does not detect that, and hadoop throws an exception complaining that the
 input dir (or file) does not exist.  I work around this by catching the
 exception.


Just fixed that in my github repo. Thanks for the bug report.


    - It's possible to sacrifice parallelism by having hadoop work on each
    .lzo file without indexing.  This worked well until the file size
    exceeded 30G, when an array indexing exception was thrown.  Apparently
    the code processed the file in chunks and stored references to the
    chunks in an array.  When the number of chunks exceeded a certain number
    (around 256, as I recall), the exception was thrown.
    - My current work around is to increase the number of reducers to keep
    the .lzo file size low.
 
   I would like to get advice on how people handle large .lzo files.  Any
   pointers on the cause of the stack trace below and the best way to resolve
   it are greatly appreciated.
 

 Is this reproducible every time? If so, is it always at the same point
 in the LZO file that it occurs?

 It's at the same point.  Do you know how to print out the lzo index for the
 task?  I only print out the input file now.


You should be able to downcast the InputSplit to FileSplit, if you're
using the new API. From there you can get the start and length of the
split.
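For what it's worth, that downcast might look something like this in a
Mapper's setup() with the new API (a sketch; it assumes the job's input
format hands out FileSplits, and the Mapper's type parameters are only
placeholders):

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class SplitLoggingMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void setup(Context context)
            throws IOException, InterruptedException {
        InputSplit split = context.getInputSplit();
        if (split instanceof FileSplit) {
            FileSplit fileSplit = (FileSplit) split;
            // Log which byte range of which file this task reads,
            // so a failing split can be pinned down in the task logs.
            System.err.println("file=" + fileSplit.getPath()
                    + " start=" + fileSplit.getStart()
                    + " length=" + fileSplit.getLength());
        }
    }
}
```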


 Would it be possible to download that lzo file to your local box and
 use lzop -d to see if it decompresses successfully? That way we can
 isolate whether it's a compression bug or decompression.

 Both the java LzoDecompressor and lzop -d were able to decompress the file
 correctly.  As a matter of fact, my job no longer indexes .lzo files but
 processes each as a whole, and it works.


Interesting. If you can somehow make a reproducible test case I'd be
happy to look into this.

Thanks
-Todd


Re: Problem with large .lzo files

2010-02-15 Thread Steve Kuo
You should be able to downcast the InputSplit to FileSplit, if you're
 using the new API. From there you can get the start and length of the
 split.

 Cool, let me give it a shot.


 Interesting. If you can somehow make a reproducible test case I'd be
 happy to look into this.

 This sounds great.  As the input file is 1G, let me do some work on my side
to see if I can pinpoint it, so as not to have to transfer a 1G file around.

Thanks.


Re: Problem with large .lzo files

2010-02-14 Thread Todd Lipcon
Hi Steve,

On Sun, Feb 14, 2010 at 12:11 PM, Steve Kuo kuosen...@gmail.com wrote:
 I am running a hadoop job that combines daily results with those of
 previous days.  The reduce output is lzo compressed and growing daily in
 size.


   - DistributedLzoIndexer is used to index lzo files to provide
   parallelism.  When the lzo files were small, everything went well.  As
   the .lzo files grow, the chance that one of the partitions does not
   complete increases.  The exception I got for one such case is listed at
   the end of the post.

I'm not sure here whether you mean that the DistributedLzoIndexer job
is failing, or if the job on the resulting split file is failing. Could
you clarify?

   - It's possible to sacrifice parallelism by having hadoop work on each
   .lzo file without indexing.  This worked well until the file size
   exceeded 30G, when an array indexing exception was thrown.  Apparently
   the code processed the file in chunks and stored references to the chunks
   in an array.  When the number of chunks exceeded a certain number (around
   256, as I recall), the exception was thrown.
   - My current workaround is to increase the number of reducers to keep
   the .lzo file size low.

 I would like to get advice on how people handle large .lzo files.  Any
 pointers on the cause of the stack trace below and the best way to resolve
 it are greatly appreciated.


Is this reproducible every time? If so, is it always at the same point
in the LZO file that it occurs?

Would it be possible to download that lzo file to your local box and
use lzop -d to see if it decompresses successfully? That way we can
isolate whether it's a compression bug or decompression.

Thanks
-Todd