Concatenate multiple sequence files into 1 big sequence file

2013-09-10 Thread Jerry Lam
Hi Hadoop users, I have been trying to concatenate multiple sequence files into one. Since the total size of the sequence files is quite big (1TB), I won't use MapReduce because it requires 1TB on the reducer host to hold the temporary data. I ended up doing what has been suggested in this threa

Re: Concatenate multiple sequence files into 1 big sequence file

2013-09-10 Thread Adam Muise
Jerry, It might not help with this particular file, but you might consider the approach used at Blackberry when dealing with your data. They block-compressed the data into small Avro files and then concatenated those into large Avro files without decompressing. Check out the boom file format here: https://git

Re: Concatenate multiple sequence files into 1 big sequence file

2013-09-10 Thread John Meagher
Here's a great tool for exactly what you're looking for: https://github.com/edwardcapriolo/filecrush

Re: Concatenate multiple sequence files into 1 big sequence file

2013-09-10 Thread Jay Vyas
IIRC, sequence files can be concatenated as-is and read as one large file, but maybe I'm forgetting something.

Re: Concatenate multiple sequence files into 1 big sequence file

2013-09-10 Thread Jerry Lam
Hi guys, Thank you for all the advice here. I really appreciate it. I read through the code in filecrush and found that it does exactly what I'm currently doing. The logic resides in CrushReducer.java, with the following lines doing the concatenation: while (reader.next(key, value))
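
For readers who want to see the shape of that read-and-append loop, here is a minimal sketch. It does not use Hadoop's SequenceFile classes and is not the filecrush code: it assumes a hypothetical toy format (a 4-byte header followed by key/value strings) built on java.io only, and the names ConcatSketch, HEADER, write, read, appendAll, and concat are made up for illustration. The point it tries to show is that each input is streamed record by record into a single writer, so only one record is held in memory at a time and the output carries one header.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

public class ConcatSketch {
    // Hypothetical header marking the start of a toy record file.
    static final int HEADER = 0x544F5931;

    /** Writes a toy record file: one header, then key/value pairs. */
    public static byte[] write(String[][] records) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(HEADER);
        for (String[] kv : records) {
            out.writeUTF(kv[0]);
            out.writeUTF(kv[1]);
        }
        out.flush();
        return bytes.toByteArray();
    }

    /** Streams every record from one input into the shared writer. */
    public static long appendAll(InputStream source, DataOutputStream out)
            throws IOException {
        DataInputStream in = new DataInputStream(source);
        if (in.readInt() != HEADER) {
            throw new IOException("missing header");
        }
        long count = 0;
        while (true) {
            String key;
            try {
                key = in.readUTF();      // analogous to reader.next(key, value)
            } catch (EOFException end) {
                break;                   // no more records in this input
            }
            String value = in.readUTF();
            out.writeUTF(key);           // analogous to appending to the writer
            out.writeUTF(value);
            count++;
        }
        return count;
    }

    /** Concatenates toy files into one file with a single header. */
    public static byte[] concat(List<byte[]> inputs) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(HEADER);
        for (byte[] input : inputs) {
            appendAll(new ByteArrayInputStream(input), out);
        }
        out.flush();
        return bytes.toByteArray();
    }

    /** Reads all records back, mainly to check the merged result. */
    public static List<String[]> read(byte[] file) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(file));
        if (in.readInt() != HEADER) {
            throw new IOException("missing header");
        }
        List<String[]> records = new ArrayList<>();
        while (in.available() > 0) {
            records.add(new String[] { in.readUTF(), in.readUTF() });
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        byte[] a = write(new String[][] { { "k1", "v1" }, { "k2", "v2" } });
        byte[] b = write(new String[][] { { "k3", "v3" } });
        byte[] merged = concat(List.of(a, b));
        System.out.println(read(merged).size() + " records");
    }
}
```

In this toy format, simply appending the bytes of b to a would leave a second header in the middle of the record stream, which is why the sketch re-reads and re-writes each record instead. Whether the same holds for a given real file format depends on that format's header and framing.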