[ https://issues.apache.org/jira/browse/HADOOP-17901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Peter Bacsko updated HADOOP-17901:
----------------------------------
    Attachment: HADOOP-17901-001.patch

> Performance degradation in Text.append() after HADOOP-16951
> -----------------------------------------------------------
>
>                 Key: HADOOP-17901
>                 URL: https://issues.apache.org/jira/browse/HADOOP-17901
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: common
>            Reporter: Peter Bacsko
>            Assignee: Peter Bacsko
>            Priority: Critical
>         Attachments: HADOOP-17901-001.patch
>
>
> We discovered a serious performance degradation in {{Text.append()}}.
> The problem is that the logic meant to grow the backing array does not work as intended.
> It's very difficult to spot, so I added extra logs to see what happens.
> Let's add 4096 bytes of textual data in a loop:
> {noformat}
> public static void main(String[] args) {
>   Text text = new Text();
>   String toAppend = RandomStringUtils.randomAscii(4096);
>   for (int i = 0; i < 100; i++) {
>     text.append(toAppend.getBytes(), 0, 4096);
>   }
> }
> {noformat}
> With some debug printouts, we can observe:
> {noformat}
> 2021-09-08 13:35:29,528 INFO [main] io.Text (Text.java:append(251)) - length: 24576, len: 4096, utf8ArraySize: 4096, bytes.length: 30720
> 2021-09-08 13:35:29,528 INFO [main] io.Text (Text.java:append(253)) - length + (length >> 1): 36864
> 2021-09-08 13:35:29,528 INFO [main] io.Text (Text.java:append(254)) - length + len: 28672
> 2021-09-08 13:35:29,528 INFO [main] io.Text (Text.java:ensureCapacity(287)) - >>> enhancing capacity from 30720 to 36864
> 2021-09-08 13:35:29,528 INFO [main] io.Text (Text.java:append(251)) - length: 28672, len: 4096, utf8ArraySize: 4096, bytes.length: 36864
> 2021-09-08 13:35:29,528 INFO [main] io.Text (Text.java:append(253)) - length + (length >> 1): 43008
> 2021-09-08 13:35:29,529 INFO [main] io.Text (Text.java:append(254)) - length + len: 32768
> 2021-09-08 13:35:29,529 INFO [main] io.Text (Text.java:ensureCapacity(287)) - >>> enhancing capacity from 36864 to 43008
> 2021-09-08 13:35:29,529 INFO [main] io.Text (Text.java:append(251)) - length: 32768, len: 4096, utf8ArraySize: 4096, bytes.length: 43008
> 2021-09-08 13:35:29,529 INFO [main] io.Text (Text.java:append(253)) - length + (length >> 1): 49152
> 2021-09-08 13:35:29,529 INFO [main] io.Text (Text.java:append(254)) - length + len: 36864
> 2021-09-08 13:35:29,529 INFO [main] io.Text (Text.java:ensureCapacity(287)) - >>> enhancing capacity from 43008 to 49152
> ...
> {noformat}
> After a certain number of {{append()}} calls, subsequent capacity increments are small.
> This is because the difference between two consecutive {{length + (length >> 1)}} values is always 6144 bytes (1.5 times the 4096 bytes appended). Because the size of the backing array trails behind the calculated value, the increment is also 6144 bytes. This means that a new array is created on every call.
> Suggested solution: don't calculate the capacity in advance based on {{length}}.
> Instead, pass the required minimum to {{ensureCapacity()}}. Then, if the desired capacity is larger than the current array, the increment should depend on the actual size of the byte array.
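
To make the arithmetic above concrete, here is a minimal, dependency-free sketch that only models the capacity bookkeeping. It is not the Hadoop code and not the attached patch: the class name GrowthSketch and both method names are hypothetical, and the first policy (requesting max(length + len, length + (length >> 1)) on every append) is inferred from the log output quoted above. The second policy is one possible reading of the suggested solution, where growth is derived from the current array size.

```java
public class GrowthSketch {

    /**
     * Models the behaviour seen in the logs (assumption): the requested capacity
     * is computed from the logical length on every append, so it always exceeds
     * the current array size and forces a reallocation.
     */
    static int reallocationsLengthBased(int chunk, int appends) {
        int capacity = 0;       // stands in for bytes.length
        int length = 0;         // stands in for the logical length
        int reallocations = 0;
        for (int i = 0; i < appends; i++) {
            int requested = Math.max(length + chunk, length + (length >> 1));
            if (capacity < requested) {
                capacity = requested;   // a new array would be allocated here
                reallocations++;
            }
            length += chunk;
        }
        return reallocations;
    }

    /**
     * Models the suggested approach (assumption): only the required minimum is
     * requested, and growth is 1.5x the current array size when that is larger.
     */
    static int reallocationsArrayBased(int chunk, int appends) {
        int capacity = 0;
        int length = 0;
        int reallocations = 0;
        for (int i = 0; i < appends; i++) {
            int required = length + chunk;
            if (capacity < required) {
                capacity = Math.max(required, capacity + (capacity >> 1));
                reallocations++;
            }
            length += chunk;
        }
        return reallocations;
    }

    public static void main(String[] args) {
        System.out.println("length-based: " + reallocationsLengthBased(4096, 100));
        System.out.println("array-based:  " + reallocationsArrayBased(4096, 100));
    }
}
```

For the scenario in the report (100 appends of 4096 bytes), this model reallocates on every one of the 100 appends under the length-based policy, and 12 times under the array-based one. Overflow handling for very large arrays is deliberately left out of the sketch.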