David Mollitor created AVRO-4065:
------------------------------------
Summary: Do Not Copy Array Contents when Expanding UTF-8 Arrays
Key: AVRO-4065
URL: https://issues.apache.org/jira/browse/AVRO-4065
Project: Apache Avro
Issue Type: Improvement
Components: java
Affects Versions: 1.12.0
Reporter: David Mollitor
Assignee: David Mollitor
Fix For: 1.13.0
The following code snippet in UTF8.java shows that when an array is re-sized to
accept a new String, the contents of the original array are copied into the new
array.
I'm not sure the intent here, but the expectation is that the array is being
resized precisely because it is about to be overwritten, hence why the cache
String is blown away too.
{code:java}
/**
* Set length in bytes. Should called whenever byte content changes, even if
the
* length does not change, as this also clears the cached String.
*/
public Utf8 setByteLength(int newLength) {
SystemLimitException.checkMaxStringLength(newLength);
if (this.bytes.length < newLength) {
this.bytes = Arrays.copyOf(this.bytes, newLength);
}
this.length = newLength;
this.string = null;
this.hash = 0;
return this;
}
{code}
If we peek at the JDK code, it's simply creating a new array, and then copying
over the existing contents, but also zero-padding.
{code:java}
public static byte[] copyOf(byte[] original, int newLength) {
byte[] copy = new byte[newLength];
System.arraycopy(original, 0, copy, 0,
Math.min(original.length, newLength));
return copy;
}
{code}
So this is problematic for a few reasons... number one is that it is wasted CPU
cycle to copy the data and then immediately overwrite it. Second, it's not well
documented/understood/expected what the state of the String is after calling
this method. It makes sense that when truncating the string the value is a
prefix, but unexpected what the behavior is when expanding the String without
knowing the underlying code (i.e., padding of zeros).
--
This message was sent by Atlassian Jira
(v8.20.10#820010)