[ https://issues.apache.org/jira/browse/HDFS-7435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Daryn Sharp updated HDFS-7435:
------------------------------
    Attachment: HDFS-7435.patch

I don't need to bump the DN's min NN version. I leveraged the auto-detection I did before.

So here's the deal. As wasteful as PBs can be in some cases, they are smart in other cases about avoiding unnecessary copying. I've leveraged those APIs to their maximum. I’ve also optimized {{BlockListAsLongs}} to remove intermediate encodings.

Instead of a pre-allocated array, a {{ByteString.Output}} stream can be used with a custom buffer size. Internally the {{ByteString}} builds up a balanced tree of rope buffers of the given size. Extracting the buffer via {{Output#toByteString}} doesn't really copy anything. So on the DN we have cheaply built up a blocks buffer as a {{ByteString}} that is internally segmented.

I relented to sending the blocks buffer in a repeating field. {{ByteString}} provides no direct API for accessing its internal roped strings. However, we can slice it up with {{ByteString#substring}}. It prefers to create bounded instances (offset+len), much like {{String}} did in earlier JDKs. Alas, a wasted instantiation. However, a substring that aligns with an entire rope buffer gets the actual buffer reference, not one wrapped in a bounded instance. So I slice it up that way.

On the NN, if the new repeating field is not present, it decodes the report as old-style format. Otherwise it uses {{ByteString#concat}} to reassemble the blocks buffer fragments. Concat doesn't copy, but again builds a balanced tree for later decoding.

The best part is the encoding/decoding. {{BlockListAsLongs}} directly encodes the {{Replicas}} into a {{ByteString}}, not into a wasted intermediate {{long[]}}. To support the unlikely and surely short-lived case of a new DN reporting to an old NN, the PB layer will decode the {{ByteString}} and re-encode it into old-style longs. Not efficient, but it’s the least ugly way to handle something that will probably never happen.

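To make the idea concrete, here is a minimal, stdlib-only Java sketch of a segmented buffer that is written once, handed out as whole-segment views without copying, and decoded lazily by an iterator. It is an analogy, not the patch's code: it does not use protobuf's {{ByteString}}, and the names {{SegmentedBlockBuffer}}, {{Writer}} and {{Reader}} are hypothetical. It also assumes fixed 8-byte longs and a layout of a block count followed by four longs per block (taken here to be block id, length, generation stamp and state); the real encoding may differ.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

public class SegmentedBlockBuffer {
    // Assumed per-block layout: block id, length, generation stamp, replica state.
    public static final int LONGS_PER_BLOCK = 4;

    /** Appends longs into fixed-size segments, so earlier data is never reallocated or copied. */
    public static final class Writer {
        private final int segmentBytes;
        private final List<ByteBuffer> segments = new ArrayList<>();
        private ByteBuffer current;

        public Writer(int segmentBytes) {
            if (segmentBytes <= 0 || segmentBytes % Long.BYTES != 0) {
                throw new IllegalArgumentException("segment size must be a positive multiple of 8");
            }
            this.segmentBytes = segmentBytes;
        }

        public void writeLong(long value) {
            if (current == null || !current.hasRemaining()) {
                current = ByteBuffer.allocate(segmentBytes); // start a new segment
                segments.add(current);
            }
            current.putLong(value);
        }

        /** Returns read-only views aligned to whole segments; no bytes are copied. */
        public List<ByteBuffer> toSegments() {
            List<ByteBuffer> views = new ArrayList<>();
            for (ByteBuffer segment : segments) {
                ByteBuffer view = segment.duplicate(); // shares the segment's content
                view.flip();                           // expose only the written bytes
                views.add(view.asReadOnlyBuffer());
            }
            return views;
        }
    }

    /** Decodes blocks one at a time from the segments instead of building one big long[]. */
    public static final class Reader implements Iterator<long[]> {
        private final Iterator<ByteBuffer> segments;
        private ByteBuffer current;
        private long remainingBlocks;

        public Reader(List<ByteBuffer> segments) {
            this.segments = segments.iterator();
            this.remainingBlocks = nextLong(); // header: total block count
        }

        private long nextLong() {
            while (current == null || !current.hasRemaining()) {
                current = segments.next().duplicate(); // private cursor over shared bytes
            }
            return current.getLong();
        }

        @Override
        public boolean hasNext() {
            return remainingBlocks > 0;
        }

        @Override
        public long[] next() {
            if (remainingBlocks <= 0) {
                throw new NoSuchElementException();
            }
            remainingBlocks--;
            long[] block = new long[LONGS_PER_BLOCK];
            for (int i = 0; i < LONGS_PER_BLOCK; i++) {
                block[i] = nextLong();
            }
            return block;
        }
    }

    public static void main(String[] args) {
        Writer writer = new Writer(32); // tiny segments to force several of them
        writer.writeLong(2);            // total block count
        writer.writeLong(1001); writer.writeLong(4096); writer.writeLong(7); writer.writeLong(0);
        writer.writeLong(1002); writer.writeLong(512);  writer.writeLong(8); writer.writeLong(1);

        Reader reader = new Reader(writer.toSegments());
        while (reader.hasNext()) {
            long[] b = reader.next();
            System.out.println("block " + b[0] + " len=" + b[1] + " gs=" + b[2] + " state=" + b[3]);
        }
    }
}
```

The segment size is required to be a multiple of 8 so that a long never straddles two segments, which keeps the reader trivial. With variable-length encoding that guarantee would not hold, and the reader would have to handle values that cross segment boundaries.
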
{{BlockListAsLongs}} doesn’t decode the buffer into a wasted intermediate {{long[]}}. Instead, the iterator decodes on demand.

I also changed the format of the encoded buffer. Might as well while the patient is on the table. It’s documented in the code, but finalized and under-construction (uc) blocks are no longer treated specially. The old layout (the finalized and uc counts, followed by 3-long finalized blocks, a 3-long delimiter, and 4-long uc blocks) is now just a total block count followed by 4-long blocks. Every block encodes its {{ReplicaState}}. One compelling reason is the capability to encode additional bits into the state field. One anticipated use is encoding the pinned block status.

Phew. I need to write some new tests to verify old/new compatibility. I’ll also manually test on a perf cluster.

> PB encoding of block reports is very inefficient
> ------------------------------------------------
>
>                 Key: HDFS-7435
>                 URL: https://issues.apache.org/jira/browse/HDFS-7435
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: datanode, namenode
>    Affects Versions: 2.0.0-alpha, 3.0.0
>            Reporter: Daryn Sharp
>            Assignee: Daryn Sharp
>            Priority: Critical
>         Attachments: HDFS-7435.000.patch, HDFS-7435.001.patch, HDFS-7435.002.patch, HDFS-7435.patch, HDFS-7435.patch
>
>
> Block reports are encoded as a PB repeating long. Repeating fields use an
> {{ArrayList}} with a default capacity of 10. A block report containing tens or
> hundreds of thousands of longs (3 for each replica) is extremely expensive
> since the {{ArrayList}} must realloc many times. Also, decoding repeating
> fields will box the primitive longs, which must then be unboxed.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)