Hi Sachin,

What's the use case for always serializing single objects at a time?
If you don't need to multiplex multiple schemas over the same channel,
you could negotiate the schema once and then send only the serialized
data. Another option is to store your schemas in a repository[1] and
serialize only a unique id alongside the data itself. There was also
talk of serializing the schema as an Avro record, or of just adding a
codec to data file headers, in AVRO-251[2].
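
For example, here is a rough, untested sketch (the class and method names
are my own, not an existing Avro API) of writing a single record as a
64-bit schema fingerprint followed by the plain Avro binary body; the
reader would resolve the fingerprint against a schema repository instead
of receiving the full schema text with every object:

---------------------------------------------------------------------------
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.SchemaNormalization;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class FingerprintedEncoder {

  // Serialize one record as [8-byte schema fingerprint][Avro binary body].
  public static byte[] encode(Schema schema, GenericRecord record) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();

    // 64-bit fingerprint of the schema's canonical form; the reader looks
    // the schema up by this id rather than parsing it from the payload.
    long fingerprint = SchemaNormalization.parsingFingerprint64(schema);
    for (int i = 0; i < 8; i++) {
      out.write((int) (fingerprint >>> (8 * i)) & 0xFF);
    }

    // Plain binary encoding of the datum -- no container file, no schema text.
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
    encoder.flush();
    return out.toByteArray();
  }
}
---------------------------------------------------------------------------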

If you can tell us more about your use case, maybe we can help you
contribute to one or more of those solutions!

-Joey

[1] https://issues.apache.org/jira/browse/AVRO-1124
[2] https://issues.apache.org/jira/browse/AVRO-251

On Wed, Jul 9, 2014 at 12:04 PM, Sachin Goyal <sgo...@walmartlabs.com> wrote:
>
> The schema itself is around 35 KB and the data is the rest, around 22 KB.
> We are converting lots of Java objects into Avro and JSON format,
> and for us the schema is almost always larger than the data (except for
> objects which have very long arrays).
>
> Is there a limitation that prevents compressing the schema within Avro itself?
> For data streams carrying only a single object, I am inclined to think the
> schema would always be bigger than the data, so compressing the schema as
> well could reduce the size a lot.
>
>
>
> From: Sean Busbey <bus...@cloudera.com>
> Reply-To: "user@avro.apache.org" <user@avro.apache.org>
> Date: Wednesday, July 9, 2014 at 12:15 AM
> To: "user@avro.apache.org" <user@avro.apache.org>
> Subject: Re: Avro compression doubt
>
> Can you share the schema? How big is it?
>
> The schema itself is not compressed, so given your small data size it might 
> be dominating.
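> 
> A quick way to check that (just a sketch; this line is not from the
> original code): the container file stores the schema as JSON text in its
> header, so the schema's contribution is roughly the length of that string:
> 
>   System.out.println("Schema JSON length = " + schema.toString().length());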
>
>
> On Wed, Jul 9, 2014 at 1:20 AM, Sachin Goyal <sgo...@walmartlabs.com> wrote:
> Hi,
>
> I have been trying to use Avro compression codecs to reduce the size of
> avro-output.
> The Java object being serialized is pretty huge and here are the results
> of applying different codecs.
>
>
>   Serialization     :  Kilobytes
>   ---------------   :  ---------
>   Avro (no codec)   :    57.3
>   Avro (Snappy)     :    52.0
>   Avro (Bzip2)      :    51.6
>   Avro (Deflate)    :    51.1
>   Avro (xzCodec)    :    51.0
>   Direct JSON       :    23.6   (just for comparison, since we also use
>                                  JSON heavily; this was done with Jackson)
>
>
>
>
> The Java code I used to try codecs is as follows:
> ---------------------------------------------------------------------------
> // productObj, rdata (the ReflectData instance) and schema are defined
> // elsewhere in the application.
> ReflectDatumWriter datumWriter =
>     new ReflectDatumWriter(productObj.getClass(), rdata);
> DataFileWriter fileWriter = new DataFileWriter(datumWriter);
>
> // Try each one of these codecs, one at a time
> fileWriter.setCodec(CodecFactory.snappyCodec());
> fileWriter.setCodec(CodecFactory.bzip2Codec());
> fileWriter.setCodec(CodecFactory.deflateCodec(9));
> fileWriter.setCodec(CodecFactory.xzCodec(5));  // using 9 here caused out-of-memory
>
> // Now check the output size
> ByteArrayOutputStream baos = new ByteArrayOutputStream();
> fileWriter.create(schema, baos);
> fileWriter.append(productObj);
> fileWriter.close();
> System.out.println("Avro bytes = " + baos.toByteArray().length);
> ---------------------------------------------------------------------------
>
>
>
> And then, on the command line, I applied the normal zip command:
>   $ zip output.zip output.avr
>   $ ls -l output.*
> This gives me the following output:
>
>   57339  output.avr
>    9081  output.zip  (about 16% of the original size!)
>
>
>
>
> So my questions are:
> ---------------------
> 1) Why am I not seeing a bigger reduction in size when applying a codec?
>    Am I using the API correctly?
> 2) I understand that the compression achieved by the normal zip command
>    would be better than applying codecs in Avro, but is such a huge
>    difference expected?
>
>
> One thing I expected, and did notice, is that Avro truly shines when the
> number of objects to be appended is more than about 10, because the schema
> is written only once and all the actual objects are appended as binary.
> So that was expected, but the output from the compression codecs looked a
> bit questionable.
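> 
> As a rough illustration of that amortization (a sketch only; "products" is
> an assumed collection, not part of the code above), appending many records
> to one DataFileWriter writes the schema header once and then only
> binary-encoded records:
> 
>   try (DataFileWriter<Object> writer = new DataFileWriter<Object>(datumWriter)) {
>       writer.setCodec(CodecFactory.deflateCodec(9));
>       writer.create(schema, baos);
>       for (Object product : products) {
>           writer.append(product);  // schema is not repeated per record
>       }
>   }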
>
> Please suggest if I am doing something wrong.
>
> Thanks
> Sachin
>
>
>
>
>
>
>
> --
> Sean



-- 
Joey Echeverria
