Georgi:
I think you misunderstood the original answer.
If you already use the Avro format, then the file will be splittable. If you want 
to add compression on top of that, feel free to go ahead.
If you read the Avro DataFileWriter API:
http://avro.apache.org/docs/1.7.6/api/java/org/apache/avro/file/DataFileWriter.html#setCodec(org.apache.avro.file.CodecFactory)
you will see there is a setCodec method, which allows you to specify a codec 
to compress your data.
The compression can be applied either per block or per record. Per block is 
recommended, as it is more efficient.
You can use bzip2, gzip, Snappy, or any other compression. You just need to 
use the above API, and make sure the compression codec is available on all 
your task nodes.
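As a minimal sketch of the API mentioned above (the schema, record fields, and 
output file name are made up for illustration; the Avro jar must be on the 
classpath, and Snappy also needs its native library on the task nodes):

```java
import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

import java.io.File;

public class AvroCompressedWriteSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical schema, just for this example.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Event\","
            + "\"fields\":[{\"name\":\"id\",\"type\":\"long\"}]}");

        DataFileWriter<GenericRecord> writer =
            new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema));
        // Codec must be set before create(); compression is applied per Avro
        // block, so the resulting container file remains splittable.
        writer.setCodec(CodecFactory.snappyCodec());
        writer.create(schema, new File("events.avro"));

        GenericRecord record = new GenericData.Record(schema);
        record.put("id", 42L);
        writer.append(record);
        writer.close();
    }
}
```

CodecFactory also offers deflateCodec(level) and, in recent 1.7.x releases, 
bzip2Codec(); swapping the codec is a one-line change.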
Whether a compression format is splittable or not doesn't matter in this case, 
as you are using Avro, which is splittable.
What you need to choose is which compression best fits your application's 
usage.
In our production we use Snappy, as it gives us a good balance between 
compression ratio, read/decompression speed, and CPU usage.
Each codec has trade-offs; you need to compare them for your own case.
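On the reading side, no code change should be needed either: the codec chosen at 
write time is recorded in the container file's header, so readers (including 
AvroKeyInputFormat in an MR job) detect it automatically. A standalone sketch, 
assuming a file like the hypothetical events.avro above and the Avro jar on the 
classpath:

```java
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

import java.io.File;

public class AvroReadSketch {
    public static void main(String[] args) throws Exception {
        // The codec is read from the file header; nothing codec-specific
        // is configured here, decompression happens transparently.
        try (DataFileReader<GenericRecord> reader = new DataFileReader<>(
                new File("events.avro"), new GenericDatumReader<GenericRecord>())) {
            for (GenericRecord record : reader) {
                System.out.println(record);
            }
        }
    }
}
```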
Yong

Date: Mon, 22 Sep 2014 17:21:29 +0200
From: iva...@vesseltracker.com
To: user@hadoop.apache.org
Subject: Re: Bzip2 files as an input to MR job
Hi Niels,

Thanks for the reply.

Changing the Avro files is not really an option for me, as it would require a 
lot of time (I have a lot of them).

The Avro files themselves are already somewhat compressed.

But bzip2 still gives 50% compression on top of an Avro file.

So what I want is to use bzip2-compressed files as input to my MR jobs.

Bzip2 is splittable.

It should be possible somehow, but I don't seem to find it at the moment.

On 22.09.2014 17:13, Niels Basjes wrote:

Hi,

You can use gzip inside the Avro files and still have splittable Avro files.
This has to do with the fact that there is a block structure inside Avro, and 
these blocks are gzipped.

I suggest you simply try it.

Niels

On Mon, Sep 22, 2014 at 4:40 PM, Georgi Ivanov <iva...@vesseltracker.com> wrote:

Hi guys,

I would like to compress the files on HDFS to save some storage.

As far as I can see, bzip2 is the only compression format which is splittable 
(and it is slow).

The actual files are Avro.

So in my driver class I have:

job.setInputFormatClass(AvroKeyInputFormat.class);

I have a number of jobs processing Avro files, so I would like to keep the 
code changes to a minimum.

Is it possible to compress these Avro files with bzip2 and keep the code of 
the MR jobs the same (or with little change)?
If it is, please give me some hints, as so far I don't seem to find any good 
resources on the Internet.

Georgi
-- 
Best regards / Met vriendelijke groeten,

Niels Basjes
