The schema itself is around 35 KB and the data is the rest: around 22 KB.
We are converting lots of Java objects into Avro and JSON format, and for us
the schema is almost always larger than the data (except for objects which
have very long arrays).

Is there a limitation that prevents compressing the schema within Avro itself?
For data streams carrying only a single object, I am inclined to think the
schema would always be bigger than the data, so compressing the schema too
could reduce the size a lot.
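
For what it's worth, one way to measure the data-only size (and, for
single-object payloads, to avoid writing the schema at all) is plain binary
encoding without the container file. A minimal sketch, assuming the reader
obtains the schema out of band and reusing productObj from the code quoted
below:

---------------------------------------------------------------------------
import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.reflect.ReflectData;
import org.apache.avro.reflect.ReflectDatumWriter;

// Sketch: raw binary encoding, no container file, so no schema or metadata
// is written with the payload. The reader must already have the same schema
// to decode these bytes.
Schema schema = ReflectData.get().getSchema(productObj.getClass());
ReflectDatumWriter<Object> datumWriter = new ReflectDatumWriter<>(schema);

ByteArrayOutputStream out = new ByteArrayOutputStream();
BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
datumWriter.write(productObj, encoder);
encoder.flush();

System.out.println("Data-only Avro bytes = " + out.size());
---------------------------------------------------------------------------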



From: Sean Busbey <bus...@cloudera.com>
Reply-To: "user@avro.apache.org" <user@avro.apache.org>
Date: Wednesday, July 9, 2014 at 12:15 AM
To: "user@avro.apache.org" <user@avro.apache.org>
Subject: Re: Avro compression doubt

Can you share the schema? How big is it?

The schema itself is not compressed, so given your small data size it might be 
dominating.
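
As a quick check, you can print the size of the schema text directly; this is
roughly what the container file stores, uncompressed, in its header. A small
sketch, reusing productObj from the code below:

---------------------------------------------------------------------------
import java.nio.charset.StandardCharsets;

import org.apache.avro.Schema;
import org.apache.avro.reflect.ReflectData;

// Sketch: measure the schema JSON that the container-file header carries
// uncompressed.
Schema schema = ReflectData.get().getSchema(productObj.getClass());
byte[] schemaJson = schema.toString().getBytes(StandardCharsets.UTF_8);
System.out.println("Schema JSON bytes = " + schemaJson.length);
---------------------------------------------------------------------------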


On Wed, Jul 9, 2014 at 1:20 AM, Sachin Goyal <sgo...@walmartlabs.com> wrote:
Hi,

I have been trying to use Avro compression codecs to reduce the size of the
Avro output. The Java object being serialized is pretty huge, and here are
the results of applying different codecs.


Serialization     : Kilo-Bytes
---------------   : ----------
Avro (No Codec)   :   57.3
Avro (Snappy)     :   52.0
Avro (Bzip2)      :   51.6
Avro (Deflate)    :   51.1
Avro (xzCodec)    :   51.0
Direct JSON       :   23.6   (just for comparison, since we also use JSON
                              heavily; this was done using Jackson)




The Java code I used to try the codecs is as follows:
---------------------------------------------------------------------------
import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.reflect.ReflectData;
import org.apache.avro.reflect.ReflectDatumWriter;

// Derive the schema for the object via reflection
ReflectData rdata = ReflectData.get();
Schema schema = rdata.getSchema(productObj.getClass());

ReflectDatumWriter<Object> datumWriter = new ReflectDatumWriter<>(schema, rdata);
DataFileWriter<Object> fileWriter = new DataFileWriter<>(datumWriter);

// Try each one of these codecs, one at a time (only the codec set last
// before create() takes effect)
fileWriter.setCodec(CodecFactory.snappyCodec());
fileWriter.setCodec(CodecFactory.bzip2Codec());
fileWriter.setCodec(CodecFactory.deflateCodec(9));
fileWriter.setCodec(CodecFactory.xzCodec(5));  // using 9 here caused out-of-memory

// Write a single object into the container file and check the output size
ByteArrayOutputStream baos = new ByteArrayOutputStream();
fileWriter.create(schema, baos);
fileWriter.append(productObj);
fileWriter.close();
System.out.println("Avro bytes = " + baos.toByteArray().length);
---------------------------------------------------------------------------



Then, on the command line, I applied the normal zip command:

  $ zip output.zip output.avr
  $ ls -l output.*

This gives me the following output:

  57339  output.avr
   9081  output.zip  (about 16% of the original size!)
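
For reference, the same whole-file compression can be reproduced in-process
with java.util.zip. A sketch operating on the baos bytes from the code above;
unlike the Avro codec, which compresses the data blocks, this deflates the
entire container file, header included:

---------------------------------------------------------------------------
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;

// Sketch: deflate the whole container-file bytes (header + data blocks),
// roughly what the external zip command does to output.avr.
byte[] avroBytes = baos.toByteArray();
ByteArrayOutputStream zipped = new ByteArrayOutputStream();
DeflaterOutputStream dos =
        new DeflaterOutputStream(zipped, new Deflater(Deflater.BEST_COMPRESSION));
dos.write(avroBytes);
dos.close();
System.out.println("Whole-file deflate bytes = " + zipped.size());
---------------------------------------------------------------------------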




So my questions are:
---------------------
1) Why am I not seeing a big reduction in size when applying the codecs? Am I
   using the API correctly?
2) I understand that the compression achieved by the normal zip command would
   be better than applying codecs within Avro, but is such a huge difference
   expected?


One thing I expected and did notice is that Avro truly shines when the number
of appended objects is more than about 10. This is because the schema is
written only once and all the actual objects are appended as compact binary.
So that was expected, but the output sizes with the compression codecs looked
a bit questionable.
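
For example, appending many records to a single DataFileWriter writes the
schema header once and compresses the records together in blocks. A sketch,
reusing schema and rdata from the code above; products and loadProducts() are
hypothetical stand-ins for a list of objects sharing the same schema:

---------------------------------------------------------------------------
import java.io.ByteArrayOutputStream;
import java.util.List;

import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.reflect.ReflectDatumWriter;

// Sketch: one schema header, many records, block-level compression.
List<Object> products = loadProducts();  // hypothetical source of objects

ByteArrayOutputStream baos = new ByteArrayOutputStream();
ReflectDatumWriter<Object> datumWriter = new ReflectDatumWriter<>(schema, rdata);
try (DataFileWriter<Object> fileWriter = new DataFileWriter<>(datumWriter)) {
    fileWriter.setCodec(CodecFactory.deflateCodec(9));
    fileWriter.create(schema, baos);
    for (Object product : products) {
        fileWriter.append(product);
    }
}
System.out.println("Avro bytes for " + products.size()
        + " records = " + baos.size());
---------------------------------------------------------------------------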

Please let me know if I am doing something wrong.

Thanks
Sachin







--
Sean
