On Mon, Feb 15, 2010 at 8:07 AM, Steve Kuo <kuosen...@gmail.com> wrote:
> On Sun, Feb 14, 2010 at 12:46 PM, Todd Lipcon <t...@cloudera.com> wrote:
>
> By the way, if all files have been indexed, DistributedLzoIndexer does
> not detect that and Hadoop throws an exception complaining that the
> input dir (or file) does not exist. I work around this by catching the
> exception.

Just fixed that in my github repo. Thanks for the bug report.

>> > - It's possible to sacrifice parallelism by having Hadoop work on
>> > each .lzo file without indexing. This worked well until the file
>> > size exceeded 30G, when an array-indexing exception was thrown.
>> > Apparently the code processed the file in chunks and stored
>> > references to the chunks in an array; when the number of chunks
>> > exceeded a certain limit (around 256, as I recall), the exception
>> > was thrown.
>> > - My current workaround is to increase the number of reducers to
>> > keep the .lzo file sizes low.
>> >
>> > I would like advice on how people handle large .lzo files. Any
>> > pointers on the cause of the stack trace below and the best way to
>> > resolve it are greatly appreciated.
>>
>> Is this reproducible every time? If so, is it always at the same
>> point in the LZO file that it occurs?
>
> It's at the same point. Do you know how to print out the lzo index for
> the task? I only print out the input file now.

You should be able to downcast the InputSplit to FileSplit, if you're
using the new API. From there you can get the start and length of the
split.
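Something along these lines in the mapper's setup() should do it. This
is just a rough sketch (untested); the mapper class name and key/value
types are placeholders for your own:

  import java.io.IOException;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.InputSplit;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.lib.input.FileSplit;

  public class SplitLoggingMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void setup(Context context)
        throws IOException, InterruptedException {
      InputSplit split = context.getInputSplit();
      if (split instanceof FileSplit) {
        FileSplit fs = (FileSplit) split;
        // start/length identify which byte range of the .lzo file
        // (i.e. which indexed region) this task is reading.
        System.err.println("Processing " + fs.getPath()
            + " start=" + fs.getStart()
            + " length=" + fs.getLength());
      }
    }
  }

The task logs will then tell you which region of the file the failing
task was assigned.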
>> Would it be possible to download that lzo file to your local box and
>> use lzop -d to see if it decompresses successfully? That way we can
>> isolate whether it's a compression bug or a decompression bug.
>
> Both the Java LzoDecompressor and lzop -d were able to decompress the
> file correctly. As a matter of fact, my job does not index .lzo files
> now but processes each as a whole, and it works.

Interesting. If you can somehow make a reproducible test case I'd be
happy to look into this.

Thanks
-Todd
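P.S. For anyone on a build without the indexer fix above, the
workaround Steve described amounts to wrapping the indexer invocation
and ignoring the failure when there is nothing left to index. A rough
sketch (untested; assumes DistributedLzoIndexer from hadoop-lzo, which
implements Tool):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.util.ToolRunner;
  import com.hadoop.compression.lzo.DistributedLzoIndexer;

  public class SafeIndexer {
    public static void main(String[] args) throws Exception {
      try {
        ToolRunner.run(new Configuration(), new DistributedLzoIndexer(),
            args);
      } catch (Exception e) {
        // If every .lzo input already has an index, the indexer is
        // left with no input paths and job submission fails; treat
        // that case as a no-op.
        System.err.println("Ignoring indexer error (nothing to index?): "
            + e);
      }
    }
  }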