Hi, I have been trying to use Avro compression codecs to reduce the size of Avro output. The Java object being serialized is fairly large, and here are the results of applying different codecs.

Serialization     : Kilobytes
----------------- : ---------
Avro (No Codec)   : 57.3
Avro (Snappy)     : 52.0
Avro (Bzip2)      : 51.6
Avro (Deflate)    : 51.1
Avro (xzCodec)    : 51.0
Direct JSON       : 23.6  (just for comparison, since we use JSON heavily; this was done using Jackson)

The Java code I used to try the codecs is as follows:

---------------------------------------------------------------------------------------
import java.io.ByteArrayOutputStream;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.reflect.ReflectDatumWriter;

// rdata (the ReflectData instance) and schema (the Schema for productObj) are defined elsewhere
ReflectDatumWriter datumWriter = new ReflectDatumWriter(productObj.getClass(), rdata);
DataFileWriter fileWriter = new DataFileWriter(datumWriter);

// Try each one of these codecs, one at a time (setCodec must be called before create)
fileWriter.setCodec(CodecFactory.snappyCodec());
// fileWriter.setCodec(CodecFactory.bzip2Codec());
// fileWriter.setCodec(CodecFactory.deflateCodec(9));
// fileWriter.setCodec(CodecFactory.xzCodec(5)); // using 9 here caused out-of-memory

// Now check the output size
ByteArrayOutputStream baos = new ByteArrayOutputStream();
fileWriter.create(schema, baos);
fileWriter.append(productObj);
fileWriter.close();
System.out.println("Avro bytes = " + baos.toByteArray().length);
---------------------------------------------------------------------------------------

Then, on the command line, I applied the normal zip command to the uncompressed Avro output:

$ zip output.zip output.avr
$ ls -l output.*

This gives me the following output:

57339 output.avr
 9081 output.zip  (about 16% of the original size!)

So my questions are:
---------------------
1) Why am I not seeing a significant reduction in size when applying a codec? Am I using the API correctly?
2) I understand that the compression achieved by the normal zip command may be better than that of the Avro codecs, but is such a huge difference expected?

One thing I expected, and did notice, is that Avro truly shines when more than 10 objects are appended. This is because the schema is written only once and all the actual objects are appended as binary. So that was expected, but the output from the compression codecs looks questionable.

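To put the numbers above in perspective, here is a small sketch that expresses each output size as a percentage of the uncompressed Avro output. The class name SizeRatios and the helper percentOf are made up for illustration; the sizes are the ones reported above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SizeRatios {
    // Returns `size` as a percentage of `baseline`.
    public static double percentOf(double size, double baseline) {
        return 100.0 * size / baseline;
    }

    public static void main(String[] args) {
        // Uncompressed Avro output in kilobytes, as reported above.
        double baselineKb = 57.3;

        // Codec results in kilobytes, as reported above.
        Map<String, Double> sizesKb = new LinkedHashMap<>();
        sizesKb.put("Avro (Snappy)", 52.0);
        sizesKb.put("Avro (Bzip2)", 51.6);
        sizesKb.put("Avro (Deflate)", 51.1);
        sizesKb.put("Avro (xzCodec)", 51.0);

        for (Map.Entry<String, Double> e : sizesKb.entrySet()) {
            System.out.printf("%-15s %5.1f KB  %5.1f%% of uncompressed%n",
                    e.getKey(), e.getValue(), percentOf(e.getValue(), baselineKb));
        }

        // zip applied to the whole file: byte counts taken from the ls -l output.
        System.out.printf("%-15s %5d B   %5.1f%% of uncompressed%n",
                "zip", 9081, percentOf(9081, 57339));
    }
}
```

By this calculation every Avro codec leaves the output at roughly 89 to 91% of the uncompressed size, while zip brings the same file down to about 16%.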
Please let me know if I am doing something wrong.

Thanks,
Sachin