Peter Bacsko created HADOOP-17901:
-------------------------------------

             Summary: Performance degradation in Text.append() after HADOOP-16951
                 Key: HADOOP-17901
                 URL: https://issues.apache.org/jira/browse/HADOOP-17901
             Project: Hadoop Common
          Issue Type: Bug
          Components: common
            Reporter: Peter Bacsko
            Assignee: Peter Bacsko


We discovered a serious performance degradation in {{Text.append()}}.

The logic that is supposed to grow the backing array does not work as intended.
The problem is hard to spot, so I added extra logging to see what happens.

Let's append 4096 bytes of textual data 100 times in a loop:
{noformat}
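  import java.nio.charset.StandardCharsets;
  import org.apache.commons.lang3.RandomStringUtils; // assumed: commons-lang3
  import org.apache.hadoop.io.Text;

  public class TextAppendRepro {
    public static void main(String[] args) {
      Text text = new Text();
      // 4096 ASCII characters encode to exactly 4096 bytes; encode once.
      byte[] toAppend = RandomStringUtils.randomAscii(4096)
          .getBytes(StandardCharsets.US_ASCII);

      for (int i = 0; i < 100; i++) {
        text.append(toAppend, 0, toAppend.length);
      }
    }
  }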
{noformat}

With some debug printouts, we can observe:
{noformat}
2021-09-08 13:35:29,528 INFO  [main] io.Text (Text.java:append(251)) - length: 24576,  len: 4096, utf8ArraySize: 4096, bytes.length: 30720
2021-09-08 13:35:29,528 INFO  [main] io.Text (Text.java:append(253)) - length + (length >> 1): 36864
2021-09-08 13:35:29,528 INFO  [main] io.Text (Text.java:append(254)) - length + len: 28672
2021-09-08 13:35:29,528 INFO  [main] io.Text (Text.java:ensureCapacity(287)) - >>> enhancing capacity from 30720 to 36864
2021-09-08 13:35:29,528 INFO  [main] io.Text (Text.java:append(251)) - length: 28672,  len: 4096, utf8ArraySize: 4096, bytes.length: 36864
2021-09-08 13:35:29,528 INFO  [main] io.Text (Text.java:append(253)) - length + (length >> 1): 43008
2021-09-08 13:35:29,529 INFO  [main] io.Text (Text.java:append(254)) - length + len: 32768
2021-09-08 13:35:29,529 INFO  [main] io.Text (Text.java:ensureCapacity(287)) - >>> enhancing capacity from 36864 to 43008
2021-09-08 13:35:29,529 INFO  [main] io.Text (Text.java:append(251)) - length: 32768,  len: 4096, utf8ArraySize: 4096, bytes.length: 43008
2021-09-08 13:35:29,529 INFO  [main] io.Text (Text.java:append(253)) - length + (length >> 1): 49152
2021-09-08 13:35:29,529 INFO  [main] io.Text (Text.java:append(254)) - length + len: 36864
2021-09-08 13:35:29,529 INFO  [main] io.Text (Text.java:ensureCapacity(287)) - >>> enhancing capacity from 43008 to 49152
...
{noformat}

After a certain number of {{append()}} calls, each subsequent capacity 
increment is small and constant.

This is because {{length}} grows by 4096 bytes per call, so two consecutive 
{{length + (length >> 1)}} values always differ by 6144 bytes. The backing 
array always trails the newly calculated value, so it also grows by only 6144 
bytes per call. As a result, every {{append()}} call allocates a new array and 
copies the existing data into it.
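
To make the arithmetic concrete, here is a minimal simulation of the growth rule as it can be inferred from the debug output above. It is only a sketch: the class name {{OldGrowthSimulation}} and the variable {{arraySize}} (standing in for {{bytes.length}}) are made up for illustration, and it does not reproduce the actual {{Text}} source.

```java
// Simulation only: models the growth rule inferred from the debug output,
// not the actual org.apache.hadoop.io.Text implementation.
class OldGrowthSimulation {

  /** Returns how many of the appends had to allocate a new backing array. */
  static int countReallocations(int appends, int len) {
    int length = 0;     // number of bytes currently stored
    int arraySize = 0;  // hypothetical stand-in for bytes.length
    int reallocations = 0;

    for (int i = 0; i < appends; i++) {
      // The requested capacity is derived from length, not from the array size.
      int capacity = Math.max(length + len, length + (length >> 1));
      if (arraySize < capacity) {
        // The new array has exactly the requested size, so it is already
        // too small for the value computed on the next call.
        arraySize = capacity;
        reallocations++;
      }
      length += len;
    }
    return reallocations;
  }

  public static void main(String[] args) {
    System.out.println(countReallocations(100, 4096)); // prints 100
  }
}
```

With 100 appends of 4096 bytes, the requested capacity strictly increases on every call while the array is only ever as large as the previous request, so all 100 appends reallocate.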

Suggested solution: don't calculate the capacity in advance based on 
{{length}}. Instead, pass the required minimum ({{length + len}}) to 
{{ensureCapacity()}}. If that minimum exceeds the current size of the byte 
array, the increment should be based on the actual array size rather than on 
{{length}}.
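
One possible shape of this suggestion, again as a simulation rather than a patch: the sketch below assumes the array grows to 1.5x its current size (or to the required minimum, whichever is larger). The 1.5x factor, the class name {{SuggestedGrowthSimulation}} and the variable {{arraySize}} are assumptions made for illustration only.

```java
// Simulation only: one way the suggested policy could look. The 1.5x growth
// factor is an assumption, not taken from an actual patch.
class SuggestedGrowthSimulation {

  /** Returns how many of the appends had to allocate a new backing array. */
  static int countReallocations(int appends, int len) {
    int length = 0;     // number of bytes currently stored
    int arraySize = 0;  // hypothetical stand-in for bytes.length
    int reallocations = 0;

    for (int i = 0; i < appends; i++) {
      int required = length + len; // only the minimum is requested
      if (arraySize < required) {
        // Grow relative to the current array size, capped at the int range,
        // but never below the required minimum.
        long grown = (long) arraySize + (arraySize >> 1);
        long capped = Math.min(grown, (long) Integer.MAX_VALUE);
        arraySize = (int) Math.max((long) required, capped);
        reallocations++;
      }
      length += len;
    }
    return reallocations;
  }

  public static void main(String[] args) {
    System.out.println(countReallocations(100, 4096));
  }
}
```

Because each reallocation leaves headroom proportional to the array size, most appends fit into the existing array and the number of reallocations grows roughly logarithmically with the total size instead of linearly with the number of calls.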


