Re: Bzip2 files as an input to MR job

2014-09-22 Thread Georgi Ivanov
Hi Niels, Thanks for the reply. Changing the Avro files is not really an option for me, as it would require a lot of time (I have a lot). The Avro files themselves are already compressed a bit, but bzip2 still gives 50% compression on one Avro file. So what I want is to use a Bzip2 compressed file as an i

Re: Bzip2 files as an input to MR job

2014-09-22 Thread Niels Basjes
Hi, You can use GZip inside the Avro files and still have splittable Avro files. This has to do with the fact that there is a block structure inside the Avro file and these blocks are gzipped. I suggest you simply try it. Niels On Mon, Sep 22, 2014 at 4:40 PM, Georgi Ivanov wrote: > Hi guys,
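
To illustrate the point about block structure, here is a minimal sketch using only the Java standard library. It is not Avro's actual container format or API; the class and method names (BlockCompressionSketch, compressInBlocks, inflateBlock) are hypothetical and exist only for this example. It shows that when each block of records is compressed on its own, a reader can decompress any single block without touching the others, which is the property that keeps such a file splittable.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical illustration only: this is NOT the Avro file format.
class BlockCompressionSketch {

    // Compress one group of records independently of all other groups.
    static byte[] deflateBlock(List<String> records) {
        byte[] raw = String.join("\n", records).getBytes(StandardCharsets.UTF_8);
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Cut the record list into fixed-size groups and compress each group separately.
    static List<byte[]> compressInBlocks(List<String> records, int recordsPerBlock) {
        List<byte[]> blocks = new ArrayList<>();
        for (int start = 0; start < records.size(); start += recordsPerBlock) {
            int end = Math.min(start + recordsPerBlock, records.size());
            blocks.add(deflateBlock(records.subList(start, end)));
        }
        return blocks;
    }

    // Decompress a single block; no other block is needed to do this.
    static List<String> inflateBlock(byte[] block) {
        Inflater inflater = new Inflater();
        inflater.setInput(block);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported block");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("corrupt block", e);
        } finally {
            inflater.end();
        }
        String text = new String(out.toByteArray(), StandardCharsets.UTF_8);
        return Arrays.asList(text.split("\n"));
    }

    public static void main(String[] args) {
        List<String> records = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            records.add("record-" + i);
        }
        List<byte[]> blocks = compressInBlocks(records, 4);
        // A reader assigned only the second block can decode it in isolation.
        System.out.println(inflateBlock(blocks.get(1)));
        // prints [record-4, record-5, record-6, record-7]
    }
}
```

A whole-file gzip stream, by contrast, has to be read from the beginning, so a reader cannot start in the middle. Whether block-level compression inside the Avro files gets close to the 50% saving mentioned for bzip2 is something the sketch says nothing about and would need to be measured on the real data.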

Bzip2 files as an input to MR job

2014-09-22 Thread Georgi Ivanov
Hi guys, I would like to compress the files on HDFS to save some storage. As far as I can see, bzip2 is the only format that is splittable (and slow). The actual files are Avro. So in my driver class I have: job.setInputFormatClass(AvroKeyInputFormat.class); I have a number of jobs running processi

Re: To Generate Test Data in HDFS (PDGF)

2014-09-22 Thread Jay Vyas
While on the subject, you can also use the bigpetstore application in Apache Bigtop to do this. This data is well suited for HBase (semi-structured, transactional, and featuring some global patterns that can make for meaningful queries and so on). Clone apache/bigtop cd bigtop-bigpetstore gra

To Generate Test Data in HDFS (PDGF)

2014-09-22 Thread arthur.hk.c...@gmail.com
Hi, I need to generate a large amount of test data (4TB) into Hadoop. Has anyone used PDGF to do so? Could you share your cookbook for PDGF in Hadoop (or HBase)? Many thanks, Arthur