[ 
https://issues.apache.org/jira/browse/AVRO-2307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix GV updated AVRO-2307:
---------------------------
    Description: 
We have a performance-sensitive project that leverages Avro for an online data 
store, and I wanted to report on a couple of Avro deserialization optimizations 
we have implemented. On one hand, it is great that Avro’s code is clean and 
modular enough to have allowed us to achieve this easily. But on the other 
hand, we are leveraging parts of the API which are probably not typically used 
by most users, and thus we are exposing ourselves to ongoing maintenance costs 
as those “ambiguously-public” APIs might change in future versions. For this 
reason, I wanted to gauge the appetite of the Avro community for taking these 
optimizations upstream into the main project.

The minor challenge is that the optimizations we’ve made are not completely 
invisible, and therefore should probably be presented as opt-in settings, 
rather than new defaults. Below is a summary of both changes.

*1. Re-use of byte arrays when instantiating ByteBuffers*

When deserializing a byte array that contains a ByteBuffer field, the relevant 
portion of the input byte array is copied into a new byte array, which is then 
used as the backing array of a new ByteBuffer. This is completely safe, but 
appears to be inefficient from our point of view, requiring extra CPU cycles 
and generating garbage.

In our case, we have a few schemas which contain some general metadata and an 
opaque byte array payload, which often ends up being a significant portion of 
the total byte length. Recopying these bytes results in up to 2x the byte 
allocation (assuming the schema contained just a ByteBuffer field and nothing 
else). The ByteBuffer API, however, provides an alternative behavior where the 
backing array can be larger than needed, with an offset and length provided to 
indicate the internal boundaries of the payload. In our implementation, we 
re-use the input byte array as the ByteBuffer’s backing array, therefore 
avoiding both the allocation and the copy.
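As a minimal sketch of the idea (not our actual implementation, which hooks into Avro’s decoder): the standard ByteBuffer.wrap(array, offset, length) overload keeps the input array as the backing array, so the payload slice is exposed without any allocation or copy.

```java
import java.nio.ByteBuffer;

public class ZeroCopyBytes {
    // Wrap the payload slice of the input directly, rather than copying it
    // into a fresh array. The returned buffer shares 'input' as its backing
    // array; position/limit mark the payload boundaries.
    static ByteBuffer wrapPayload(byte[] input, int offset, int length) {
        return ByteBuffer.wrap(input, offset, length);
    }

    public static void main(String[] args) {
        byte[] input = {0, 1, 2, 3, 4, 5, 6, 7};
        ByteBuffer payload = wrapPayload(input, 2, 4);
        // Same array object, no copy; 4 readable bytes starting at index 2.
        System.out.println(payload.array() == input);
        System.out.println(payload.remaining());
    }
}
```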

The caveat in this case is that this only works safely for use cases that don’t 
mutate the content of the bytes (neither the input nor the deserialized 
object). In our case this assumption is valid.
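To make the caveat concrete, here is a small hypothetical illustration: because the deserialized buffer aliases the input array, a later mutation of the input is visible through the payload.

```java
import java.nio.ByteBuffer;

public class AliasingCaveat {
    public static void main(String[] args) {
        byte[] input = {10, 20, 30, 40};
        // Zero-copy view over bytes [1, 3) of the input.
        ByteBuffer payload = ByteBuffer.wrap(input, 1, 2);
        // If the caller mutates the input afterwards...
        input[1] = 99;
        // ...the change shows through the "deserialized" payload.
        System.out.println(payload.get(1)); // no longer the original 20
    }
}
```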

If this were implemented in the open-source project, there are a few ways this 
could be achieved:
 # We could return a special read-only ByteBuffer implementation that throws an 
exception if any mutation is attempted, indicating that this special mode ought 
to be turned off to support mutations.
 # We could return a modified ByteBuffer implementation which defers the copy 
of the content lazily until (and only if) one of the mutation APIs is called. 
More user-friendly, perhaps, but this could introduce hidden overhead which the 
application designer might prefer to be alerted to rather than having it 
silently swept under the rug.

Either way, doing this cleanly probably requires a custom ByteBuffer 
implementation with special behavior. In both cases, however, I don’t see a 
way to guard against mutations to the input byte array; therefore, this should 
probably be an opt-in setting for users who know what they’re doing and are 
comfortable with the read-only assumption in their workloads.
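For what it’s worth, the standard API can already approximate option 1: asReadOnlyBuffer() yields a view whose mutation methods throw ReadOnlyBufferException. A sketch (note that a read-only view also hides its backing array, i.e. array() throws too, which is one reason a custom implementation might still be preferable):

```java
import java.nio.ByteBuffer;
import java.nio.ReadOnlyBufferException;

public class ReadOnlyWrap {
    // A read-only, zero-copy view over the payload slice: no copy is made,
    // and any attempt to write through the buffer throws.
    static ByteBuffer readOnlyPayload(byte[] input, int offset, int length) {
        return ByteBuffer.wrap(input, offset, length).asReadOnlyBuffer();
    }

    public static void main(String[] args) {
        ByteBuffer b = readOnlyPayload(new byte[]{1, 2, 3, 4}, 1, 2);
        try {
            b.put((byte) 9); // any mutation attempt fails fast
        } catch (ReadOnlyBufferException e) {
            System.out.println("mutation rejected");
        }
    }
}
```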

*2. Primitive (non-boxing) implementation of lists*

Another challenge we’ve come across is that lists of primitive types (floats in 
our case) are always boxed into Float objects by Avro. For us, this 
resulted in millions of objects per second getting allocated, causing 
pathological GC pressure.

To solve this, we have implemented an alternative version of the Avro Array 
class which, instead of holding an array of generically-typed Objects 
internally, holds an array of primitive floats. This causes no boxing at 
deserialization time, but there is a further challenge: since Avro array 
fields are exposed as instances of the Java List interface, the regular 
functions of the API all return Objects, therefore merely deferring the boxing 
to a slightly later point in time. To get around this further 
complication, we have added a getPrimitive(index) function which returns 
primitive items directly. In order to be able to use this more optimized 
function, it is necessary for us to cast the list into our own type, otherwise 
we wouldn’t see the new function. The end result is quite dramatic, 
performance-wise, reducing our p99 latencies down to a quarter to a third of 
their original values, depending on workloads.
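To give a rough shape of the idea (a simplified sketch, not our actual class, which mirrors Avro’s Array API): a List&lt;Float&gt; backed by a float[] so that deserialization stores primitives directly, with a getPrimitive(index) accessor that avoids boxing for callers willing to cast down to the concrete type.

```java
import java.util.AbstractList;

// Hypothetical simplified sketch of a "PrimitiveFloatArray"-style class.
public class PrimitiveFloatList extends AbstractList<Float> {
    private float[] elements;
    private int size;

    public PrimitiveFloatList(int capacity) {
        this.elements = new float[Math.max(1, capacity)];
    }

    // Boxing-free accessor; requires casting the List down to this type.
    public float getPrimitive(int index) {
        if (index < 0 || index >= size) throw new IndexOutOfBoundsException();
        return elements[index];
    }

    // Boxing-free append, used at deserialization time.
    public void addPrimitive(float value) {
        if (size == elements.length) {
            float[] bigger = new float[elements.length * 2];
            System.arraycopy(elements, 0, bigger, 0, size);
            elements = bigger;
        }
        elements[size++] = value;
    }

    @Override
    public Float get(int index) { // boxes, as the List interface requires
        return getPrimitive(index);
    }

    @Override
    public int size() {
        return size;
    }
}
```

Callers going through the List interface still get boxed Floats (merely deferred, as noted below); only call sites that cast to PrimitiveFloatList and use getPrimitive avoid boxing entirely.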

One challenge here is that the “PrimitiveFloatArray” class is an almost 
complete copy of Avro’s Array class, basically just stripping away the generics 
and adding the primitive-returning function. If we were to contribute this 
upstream to the open-source project, I imagine we might want to do this not 
only for floats but for boolean, int, long and double arrays as well. This 
would mean roughly 5x the same copy-pasted implementation, which is not ideal 
from a maintenance standpoint. The generic types are nicer in that sense, but 
unfortunately, Java generics do not support primitives (AFAIK, unless one of 
the latest Java releases introduced support for this recently?). In our case, 
we are willing to pay that maintenance cost in exchange for the dramatic GC 
reduction it gives us, but I don’t know if the Avro project would come to the 
same conclusion.

If implemented, this could potentially be the new default behavior, though 
boxing would still happen (merely deferred) unless users cast the lists into 
the new types and used the new primitive-returning functions. In use cases 
where an entire primitive array is deserialized but only a subset of its items 
are accessed, this boxing deferral could still provide benefits even 
without code changes. I imagine there might be a counter-case, though, where a 
primitive array gets deserialized and then fully accessed multiple times, in 
which case deferred on-demand boxing might be even worse...

—

To reiterate: the goal of this ticket is to share what we’ve found 
and what we’ve done about it, with the intent to understand if there would be 
interest in reviewing and merging a patch which provided the above opt-in 
benefits.

Thanks for your feedback.



> Opt-in setting to improve GC behavior during deserialization?
> -------------------------------------------------------------
>
>                 Key: AVRO-2307
>                 URL: https://issues.apache.org/jira/browse/AVRO-2307
>             Project: Apache Avro
>          Issue Type: Improvement
>          Components: compatibility, java
>    Affects Versions: 1.7.7
>            Reporter: Felix GV
>            Priority: Minor
>              Labels: performance



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
