[ 
https://issues.apache.org/jira/browse/AVRO-196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778676#action_12778676
 ] 

Justin SB commented on AVRO-196:
--------------------------------

You're probably right that this is too big a change for Avro in its early 
stages.  In my use case I was storing floats, but I've switched to storing ints 
instead, so an empty value now costs 7 extra bits instead of 31.
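
The 7-versus-31 figure can be checked with a little arithmetic. The sketch below assumes that "extra" means the encoded size of an empty value minus the single presence bit a sparse bitmask would spend on it. Avro's binary encoding writes a float as 4 fixed bytes and an int as a zig-zag variable-length integer, so a zero int takes one byte. The class and method names are made up for illustration.

```java
public class EmptyValueCost {
    // Number of bytes a zig-zag variable-length encoding needs for n.
    static int zigZagVarintSize(int n) {
        int z = (n << 1) ^ (n >> 31);      // zig-zag: small magnitudes map to small codes
        int size = 1;
        while ((z & ~0x7F) != 0) {         // more than 7 significant bits remain
            z >>>= 7;
            size++;
        }
        return size;
    }

    // Bits wasted on an empty value compared with a 1-bit presence flag.
    static int extraBits(int encodedBytes) {
        return encodedBytes * 8 - 1;
    }

    public static void main(String[] args) {
        int floatBytes = Float.BYTES;              // floats are always 4 bytes
        int zeroIntBytes = zigZagVarintSize(0);    // a zero int fits in 1 byte
        System.out.println("float: " + extraBits(floatBytes) + " extra bits");
        System.out.println("int 0: " + extraBits(zeroIntBytes) + " extra bits");
    }
}
```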

Perhaps we should see what can be achieved through compression first 
(AVRO-135).  I'd like to see a per-record compression option, and I'd also like 
to have empty values compress well.  I think as long as we choose an algorithm 
where consecutive zeroes are highly compressed, compression would solve the 
issue here, while also being more generally applicable.
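
As a rough illustration of the point about consecutive zeroes, here is a sketch that deflates a single mostly-empty record on its own. DEFLATE via java.util.zip is only my stand-in here; AVRO-135 may settle on a different codec, and the 200-byte record is a made-up example rather than real data.

```java
import java.util.zip.Deflater;

public class ZeroRunCompression {
    // Returns the number of bytes produced by deflating the input as one unit,
    // which is roughly what a per-record compression option would do.
    static int deflatedSize(byte[] input) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(input);
        deflater.finish();
        byte[] buf = new byte[input.length + 64];   // scratch space; only the size is kept
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buf);
        }
        deflater.end();
        return total;
    }

    public static void main(String[] args) {
        byte[] emptyRecord = new byte[200];         // 200 empty single-byte fields, all zero
        System.out.println(emptyRecord.length + " bytes -> "
                + deflatedSize(emptyRecord) + " bytes");
    }
}
```

Note that the stream header and checksum are a fixed cost per compressed unit, so per-record compression only pays off once a record is reasonably large.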

> Add encoding for sparse records
> -------------------------------
>
>                 Key: AVRO-196
>                 URL: https://issues.apache.org/jira/browse/AVRO-196
>             Project: Avro
>          Issue Type: New Feature
>          Components: java
>            Reporter: Justin SB
>            Priority: Minor
>
> If we have a large Avro record with many fields, most of which are empty, 
> Avro currently still serializes every field, leading to big overhead.  We 
> could support a sparse record format for this case: before each record a 
> bitmask is serialized indicating the presence of the fields.  We could 
> specify the encoding type as a new attribute in the avpr e.g.  
> {"type":"record", "name":"Test", "encoding":"sparse", "fields":....}
> I've put an implementation of the idea on github:
> http://github.com/justinsb/avro/commit/7f6ad2532298127fcdd9f52ce90df21ff527f9d1
> This leads to big improvements in the serialization size in our case, when 
> we're using avro to serialize performance metrics, where most of the fields 
> are usually empty.
> The alternative of using a Map isn't a good idea because it (1) serializes 
> the names of the fields and (2) means we lose strong typing.
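
For readers who don't want to dig through the commit, the bitmask idea in the description can be sketched roughly as follows. This is not the code from the linked commit. It assumes a record made only of float fields, one presence bit per field packed least-significant-bit first, and 4-byte little-endian values for the fields that are present. All names here are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class SparseRecordSketch {
    // Writes a presence bitmask followed by only the non-null fields.
    static byte[] encode(Float[] fields) {
        byte[] mask = new byte[(fields.length + 7) / 8];
        ByteArrayOutputStream values = new ByteArrayOutputStream();
        for (int i = 0; i < fields.length; i++) {
            if (fields[i] != null) {
                mask[i / 8] |= (byte) (1 << (i % 8));   // mark field i as present
                byte[] raw = ByteBuffer.allocate(4)
                        .order(ByteOrder.LITTLE_ENDIAN)
                        .putFloat(fields[i])
                        .array();
                values.write(raw, 0, raw.length);
            }
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(mask, 0, mask.length);
        byte[] present = values.toByteArray();
        out.write(present, 0, present.length);
        return out.toByteArray();
    }

    // Reads the bitmask, then pulls one float per set bit; absent fields stay null.
    static Float[] decode(byte[] data, int fieldCount) {
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        buf.position((fieldCount + 7) / 8);             // skip past the bitmask
        Float[] fields = new Float[fieldCount];
        for (int i = 0; i < fieldCount; i++) {
            if ((data[i / 8] & (1 << (i % 8))) != 0) {
                fields[i] = buf.getFloat();
            }
        }
        return fields;
    }
}
```

With 16 float fields of which 2 are set, this layout takes 2 mask bytes plus 8 value bytes, against 64 bytes if every field were written as a 4-byte float.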

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
