[ 
https://issues.apache.org/jira/browse/HADOOP-17901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17412483#comment-17412483
 ] 

Peter Bacsko commented on HADOOP-17901:
---------------------------------------

Thanks [~elgoiri]. I was also thinking about the possible expansion of the 
array to the max size (which is often assumed to be Integer.MAX_VALUE - 8), but 
I think I'll do that in a different JIRA.

> Performance degradation in Text.append() after HADOOP-16951
> -----------------------------------------------------------
>
>                 Key: HADOOP-17901
>                 URL: https://issues.apache.org/jira/browse/HADOOP-17901
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: common
>            Reporter: Peter Bacsko
>            Assignee: Peter Bacsko
>            Priority: Critical
>              Labels: pull-request-available
>         Attachments: HADOOP-17901-001.patch
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> We discovered a serious performance degradation in {{Text.append()}}.
> The logic that is meant to grow the backing array does not work as 
> intended.
> The problem is hard to spot, so I added extra logging to see what happens.
> The following loop appends 4096 bytes of text 100 times:
> {noformat}
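>   // Also needs an import of java.nio.charset.StandardCharsets.
>   public static void main(String[] args) {
>     Text text = new Text();
>     // 4096 random ASCII characters encode to exactly 4096 bytes in UTF-8.
>     byte[] toAppend = RandomStringUtils.randomAscii(4096)
>         .getBytes(StandardCharsets.UTF_8);
>     for (int i = 0; i < 100; i++) {
>       text.append(toAppend, 0, toAppend.length);
>     }
>   }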
> {noformat}
> With some debug printouts, we can observe:
> {noformat}
> 2021-09-08 13:35:29,528 INFO  [main] io.Text (Text.java:append(251)) - length: 24576,  len: 4096, utf8ArraySize: 4096, bytes.length: 30720
> 2021-09-08 13:35:29,528 INFO  [main] io.Text (Text.java:append(253)) - length + (length >> 1): 36864
> 2021-09-08 13:35:29,528 INFO  [main] io.Text (Text.java:append(254)) - length + len: 28672
> 2021-09-08 13:35:29,528 INFO  [main] io.Text (Text.java:ensureCapacity(287)) - >>> enhancing capacity from 30720 to 36864
> 2021-09-08 13:35:29,528 INFO  [main] io.Text (Text.java:append(251)) - length: 28672,  len: 4096, utf8ArraySize: 4096, bytes.length: 36864
> 2021-09-08 13:35:29,528 INFO  [main] io.Text (Text.java:append(253)) - length + (length >> 1): 43008
> 2021-09-08 13:35:29,529 INFO  [main] io.Text (Text.java:append(254)) - length + len: 32768
> 2021-09-08 13:35:29,529 INFO  [main] io.Text (Text.java:ensureCapacity(287)) - >>> enhancing capacity from 36864 to 43008
> 2021-09-08 13:35:29,529 INFO  [main] io.Text (Text.java:append(251)) - length: 32768,  len: 4096, utf8ArraySize: 4096, bytes.length: 43008
> 2021-09-08 13:35:29,529 INFO  [main] io.Text (Text.java:append(253)) - length + (length >> 1): 49152
> 2021-09-08 13:35:29,529 INFO  [main] io.Text (Text.java:append(254)) - length + len: 36864
> 2021-09-08 13:35:29,529 INFO  [main] io.Text (Text.java:ensureCapacity(287)) - >>> enhancing capacity from 43008 to 49152
> ...
> {noformat}
> After the first few {{append()}} calls, every subsequent capacity increment 
> is small and constant.
> Each call adds 4096 bytes to {{length}}, so consecutive values of 
> {{length + (length >> 1)}} differ by 4096 + 2048 = 6144 bytes. Because the 
> size of the backing array always trails the calculated value, each call 
> finds the array too small and grows it by only 6144 bytes. In this test, 
> that means a new array is created on every {{append()}}.
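
The arithmetic above can be checked with a small simulation. This is only a sketch: it tracks the sizes and allocates nothing, the length-based policy is reconstructed from the log output rather than copied from the Hadoop source, and the class and method names are hypothetical.

```java
// Hypothetical model of the two growth policies. Only sizes are tracked.
class GrowthSimulation {

  // Policy inferred from the logs: request max(length + len, length * 1.5)
  // and reallocate to exactly that value whenever the array is smaller.
  public static int countReallocsLengthBased(int appends, int len) {
    int length = 0;
    int capacity = 0;
    int reallocs = 0;
    for (int i = 0; i < appends; i++) {
      int requested = Math.max(length + len, length + (length >> 1));
      if (capacity < requested) {
        capacity = requested;
        reallocs++;
      }
      length += len;
    }
    return reallocs;
  }

  // Alternative: request only length + len, and when the array is too small,
  // grow it relative to its current capacity (at least by 50%).
  public static int countReallocsCapacityBased(int appends, int len) {
    int length = 0;
    int capacity = 0;
    int reallocs = 0;
    for (int i = 0; i < appends; i++) {
      int required = length + len;
      if (capacity < required) {
        capacity = Math.max(required, capacity + (capacity >> 1));
        reallocs++;
      }
      length += len;
    }
    return reallocs;
  }

  public static void main(String[] args) {
    System.out.println("length-based reallocations:   "
        + countReallocsLengthBased(100, 4096));
    System.out.println("capacity-based reallocations: "
        + countReallocsCapacityBased(100, 4096));
  }
}
```

With 100 appends of 4096 bytes, the length-based model reallocates on every single call, while the capacity-based model needs far fewer reallocations because each growth step gets larger as the array grows.
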
> Suggested solution: don't calculate the capacity in advance based on 
> {{length}}. Instead, pass only the required minimum to {{ensureCapacity()}}. 
> If that minimum is larger than the current array, the increment should be 
> derived from the actual size of the byte array.
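
For illustration, here is a minimal sketch of the suggested approach. It is not the attached patch: {{GrowableBytes}} is a hypothetical stand-in for the relevant part of {{Text}}, the 50% growth factor is an assumption, and integer overflow near the maximum array size is deliberately left unhandled.

```java
import java.util.Arrays;

// Hypothetical stand-in for the buffer handling in Text; not the real class.
class GrowableBytes {
  private byte[] bytes = new byte[0];
  private int length = 0;

  public void append(byte[] src, int start, int len) {
    // Pass only the required minimum; the growth policy lives in one place.
    ensureCapacity(length + len);
    System.arraycopy(src, start, bytes, length, len);
    length += len;
  }

  // Returns true if the backing array was reallocated.
  private boolean ensureCapacity(int minCapacity) {
    if (bytes.length >= minCapacity) {
      return false;
    }
    // Grow relative to the actual array size (assumed factor: 1.5), but
    // never below the required minimum. Overflow is not handled here.
    int grown = bytes.length + (bytes.length >> 1);
    bytes = Arrays.copyOf(bytes, Math.max(minCapacity, grown));
    return true;
  }

  public int length() {
    return length;
  }

  public int capacity() {
    return bytes.length;
  }

  public byte[] copyBytes() {
    return Arrays.copyOf(bytes, length);
  }
}
```

Because the new size is computed from {{bytes.length}} instead of {{length}}, each reallocation leaves headroom proportional to the array itself, so the number of reallocations grows logarithmically with the total amount of appended data.
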


