[jira] [Updated] (AVRO-4065) Do Not Copy Array Contents when Expanding UTF-8 Arrays

David Mollitor (Jira) Fri, 27 Sep 2024 20:05:19 -0700


     [ 
https://issues.apache.org/jira/browse/AVRO-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


David Mollitor updated AVRO-4065:
---------------------------------
    Description: 
The following code snippet in UTF8.java shows that when an array is re-sized to 
accept a new String, the contents of the original array are copied into the new 
array.

I'm not sure the intent here, but the expectation is that the array is being 
resized precisely because it is about to be overwritten, hence why the cache 
String is blown away too.
{code:java}
  /**
   * Set length in bytes. Should called whenever byte content changes, even if 
the
   * length does not change, as this also clears the cached String.
   */
  public Utf8 setByteLength(int newLength) {
    SystemLimitException.checkMaxStringLength(newLength);
    if (this.bytes.length < newLength) {
      this.bytes = Arrays.copyOf(this.bytes, newLength);
    }
    this.length = newLength;
    this.string = null;
    this.hash = 0;
    return this;
  }
{code}
If we peek at the JDK code, it's simply creating a new array, and then copying 
over the existing contents, but also zero-padding.
{code:java}
    public static byte[] copyOf(byte[] original, int newLength) {
        byte[] copy = new byte[newLength];
        System.arraycopy(original, 0, copy, 0,
                         Math.min(original.length, newLength));
        return copy;
    }
{code}
So this is problematic for a few reasons... number one is that it is wasted CPU 
cycle to copy the data and then immediately overwrite it. Second, it's not well 
documented/understood/expected what the state of the String is after calling 
this method. It makes sense that when truncating the string the value is a 
prefix, but unexpected what the behavior is when expanding the String without 
knowing the underlying code (i.e., padding of zeros).

 

So, the most sane thing is that  no assumptions should made by the caller what 
the side-effects of calling this method are. If they want to get a Substring, 
then they can call #subSequence.

  was:
The following code snippet in UTF8.java shows that when an array is re-sized to 
accept a new String, the contents of the original array are copied into the new 
array.

I'm not sure the intent here, but the expectation is that the array is being 
resized precisely because it is about to be overwritten, hence why the cache 
String is blown away too.
{code:java}
  /**
   * Set length in bytes. Should called whenever byte content changes, even if 
the
   * length does not change, as this also clears the cached String.
   */
  public Utf8 setByteLength(int newLength) {
    SystemLimitException.checkMaxStringLength(newLength);
    if (this.bytes.length < newLength) {
      this.bytes = Arrays.copyOf(this.bytes, newLength);
    }
    this.length = newLength;
    this.string = null;
    this.hash = 0;
    return this;
  }
{code}
If we peek at the JDK code, it's simply creating a new array, and then copying 
over the existing contents, but also zero-padding.
{code:java}
    public static byte[] copyOf(byte[] original, int newLength) {
        byte[] copy = new byte[newLength];
        System.arraycopy(original, 0, copy, 0,
                         Math.min(original.length, newLength));
        return copy;
    }
{code}
So this is problematic for a few reasons... number one is that it is wasted CPU 
cycle to copy the data and then immediately overwrite it. Second, it's not well 
documented/understood/expected what the state of the String is after calling 
this method. It makes sense that when truncating the string the value is a 
prefix, but unexpected what the behavior is when expanding the String without 
knowing the underlying code (i.e., padding of zeros).

 

So, there should be no assumptions made by the caller what the side-effects of 
calling this method are. If they want to get a Substring, then they can call 
#subSequence.


> Do Not Copy Array Contents when Expanding UTF-8 Arrays
> ------------------------------------------------------
>
>                 Key: AVRO-4065
>                 URL: https://issues.apache.org/jira/browse/AVRO-4065
>             Project: Apache Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.12.0
>            Reporter: David Mollitor
>            Assignee: David Mollitor
>            Priority: Minor
>             Fix For: 1.13.0
>
>
> The following code snippet in UTF8.java shows that when an array is re-sized 
> to accept a new String, the contents of the original array are copied into 
> the new array.
> I'm not sure the intent here, but the expectation is that the array is being 
> resized precisely because it is about to be overwritten, hence why the cache 
> String is blown away too.
> {code:java}
>   /**
>    * Set length in bytes. Should called whenever byte content changes, even 
> if the
>    * length does not change, as this also clears the cached String.
>    */
>   public Utf8 setByteLength(int newLength) {
>     SystemLimitException.checkMaxStringLength(newLength);
>     if (this.bytes.length < newLength) {
>       this.bytes = Arrays.copyOf(this.bytes, newLength);
>     }
>     this.length = newLength;
>     this.string = null;
>     this.hash = 0;
>     return this;
>   }
> {code}
> If we peek at the JDK code, it's simply creating a new array, and then 
> copying over the existing contents, but also zero-padding.
> {code:java}
>     public static byte[] copyOf(byte[] original, int newLength) {
>         byte[] copy = new byte[newLength];
>         System.arraycopy(original, 0, copy, 0,
>                          Math.min(original.length, newLength));
>         return copy;
>     }
> {code}
> So this is problematic for a few reasons... number one is that it is wasted 
> CPU cycle to copy the data and then immediately overwrite it. Second, it's 
> not well documented/understood/expected what the state of the String is after 
> calling this method. It makes sense that when truncating the string the value 
> is a prefix, but unexpected what the behavior is when expanding the String 
> without knowing the underlying code (i.e., padding of zeros).
>  
> So, the most sane thing is that  no assumptions should made by the caller 
> what the side-effects of calling this method are. If they want to get a 
> Substring, then they can call #subSequence.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Updated] (AVRO-4065) Do Not Copy Array Contents when Expanding UTF-8 Arrays

Reply via email to