[ https://issues.apache.org/jira/browse/AVRO-196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799863#action_12799863 ]
Jeff Hammerbacher commented on AVRO-196:
----------------------------------------

bq. Perhaps we should see what can be achieved through compression first (AVRO-135).

I think compression is the way to go here. We are missing a strategy for RPC compression, but it seems that sparse records in data files will compress well.

> Add encoding for sparse records
> -------------------------------
>
>                 Key: AVRO-196
>                 URL: https://issues.apache.org/jira/browse/AVRO-196
>             Project: Avro
>          Issue Type: New Feature
>          Components: java
>            Reporter: Justin SB
>            Priority: Minor
>
> If we have a large record in Avro with many fields, most of which are empty, Avro currently still serializes every field, which adds significant overhead. We could support a sparse record format for this case: before each record, a bitmask is serialized indicating which fields are present. The encoding type could be specified as a new attribute in the avpr, e.g.
> {"type":"record", "name":"Test", "encoding":"sparse", "fields":....}
> I've put an implementation of the idea on github:
> http://github.com/justinsb/avro/commit/7f6ad2532298127fcdd9f52ce90df21ff527f9d1
> This gives a large reduction in serialization size in our case, where we use Avro to serialize performance metrics and most of the fields are usually empty.
> The alternative of using a Map isn't a good idea because it (1) serializes the names of the fields and (2) loses strong typing.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
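A minimal sketch of the bitmask idea described in the issue, assuming a record modeled as a Long[] with nulls for absent fields; plain java.io stands in for Avro's encoder classes, and the class name SparseRecordSketch is illustrative rather than anything from the linked commit:

{code:java}
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

/**
 * Sketch of a sparse-record encoding: a presence bitmask is written before
 * the record body, and only non-null fields are serialized.
 */
public class SparseRecordSketch {

  /** Writes one presence bit per field, packed into bytes, then only the present fields. */
  static byte[] writeSparse(Long[] fields) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bytes);

    // Presence bitmask: ceil(fieldCount / 8) bytes, one bit per field.
    int maskLen = (fields.length + 7) / 8;
    byte[] mask = new byte[maskLen];
    for (int i = 0; i < fields.length; i++) {
      if (fields[i] != null) {
        mask[i / 8] |= (byte) (1 << (i % 8));
      }
    }
    out.write(mask);

    // Field values: only fields whose presence bit is set are written,
    // so an absent field costs one bit instead of a full encoded value.
    for (Long field : fields) {
      if (field != null) {
        out.writeLong(field);
      }
    }
    out.flush();
    return bytes.toByteArray();
  }

  public static void main(String[] args) throws IOException {
    // A "performance metrics"-style record: 16 fields, only 2 populated.
    Long[] record = new Long[16];
    record[3] = 42L;
    record[11] = 7L;

    byte[] encoded = writeSparse(record);
    // 2 bytes of bitmask + 2 * 8 bytes of values = 18 bytes,
    // versus 16 encoded fields if every field were written.
    System.out.println("Sparse encoding size: " + encoded.length + " bytes");
  }
}
{code}

The trade-off this sketch illustrates is that each absent field costs a single bit in the mask rather than a full encoded value, which is where the size win comes from when most fields are empty; a dense record pays a small constant overhead for the mask.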