[ 
https://issues.apache.org/jira/browse/HDFS-7435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daryn Sharp updated HDFS-7435:
------------------------------
    Attachment: HDFS-7435.patch

I don't need to bump the DN's min NN version.  I leveraged the auto-detection I 
did before.

So here's the deal.  As wasteful as PBs can be in some cases, they are smart in 
other cases about avoiding unnecessary copying.  I've leveraged those APIs to 
their maximum.  I’ve also optimized {{BlockListAsLongs}} to remove intermediate 
encodings.

Instead of a pre-allocated array, a {{ByteString.Output}} stream can be used 
with a custom-specified buffer size.  Internally the {{ByteString}} is building up a 
balanced tree of rope buffers of the given size.  Extracting the buffer via 
{{Output#toByteString}} doesn't really copy anything.  So on the DN we have 
cheaply built up a blocks buffer of {{ByteString}} that is internally segmented.

I relented to sending the blocks buffer in a repeating field.  {{ByteString}} 
provides no direct API for internally accessing its roped strings.  However, we 
can slice it up with {{ByteString#substring}}.  It prefers to create bounded 
instances (offset+len) much like {{String}} did in earlier JDKs.  Alas, a 
wasted instantiation.  However, a substring that aligns with an entire rope 
buffer will get the actual buffer reference – not wrapped with a bounded 
instance.  So I slice it up that way.

On the NN, if the new repeating field is not present, it decodes as old-style 
format.  Otherwise it uses {{ByteString#concat}} to reassemble the blocks 
buffer fragments.  Concat doesn't copy, but again builds a balanced tree for 
later decoding.

The best part is the encoding/decoding.  {{BlockListAsLongs}} directly encodes 
the {{Replicas}} into a {{ByteString}}, not into a wasted intermediate 
{{long[]}}.  To support the unlikely and surely short-lived case of a new DN 
reporting to an old NN, the PB layer will decode the {{ByteString}} and 
re-encode into old-style longs.  Not efficient, but it’s the least ugly way to 
handle something that will probably never happen.

{{BlockListAsLongs}} doesn’t decode the buffer into a wasted and intermediate 
{{long[]}}.  Instead, the iterator decodes on-demand.
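
To illustrate the on-demand decoding idea, here is a minimal, stdlib-only 
sketch.  It is not the code from the patch: the class name 
{{LazyLongDecoder}}, the {{writeVarint}} helper, and the use of a plain 
base-128 varint encoding are all assumptions made for illustration.  It only 
shows that values can be pulled from the byte buffer one at a time, without 
first materializing a {{long[]}} and without boxing.

```java
import java.io.ByteArrayOutputStream;
import java.util.NoSuchElementException;
import java.util.PrimitiveIterator;

/** Hypothetical sketch: longs stored as varints and decoded only when asked for. */
final class LazyLongDecoder {
  private final byte[] buf;  // encoded buffer, never expanded into a long[]
  private final int count;   // number of longs encoded in buf

  LazyLongDecoder(byte[] buf, int count) {
    this.buf = buf;
    this.count = count;
  }

  /** Appends one long as a base-128 varint (assumed wire encoding). */
  static void writeVarint(ByteArrayOutputStream out, long value) {
    while ((value & ~0x7FL) != 0) {
      out.write((int) ((value & 0x7F) | 0x80)); // low 7 bits plus continuation bit
      value >>>= 7;
    }
    out.write((int) value);
  }

  /** Returns a primitive iterator, so callers avoid boxing each value. */
  PrimitiveIterator.OfLong iterator() {
    return new PrimitiveIterator.OfLong() {
      private int pos = 0;
      private int remaining = count;

      @Override
      public boolean hasNext() {
        return remaining > 0;
      }

      @Override
      public long nextLong() {
        if (remaining <= 0) {
          throw new NoSuchElementException();
        }
        long result = 0;
        int shift = 0;
        while (true) {
          byte b = buf[pos++];
          result |= (long) (b & 0x7F) << shift;
          if ((b & 0x80) == 0) {
            break; // last byte of this varint
          }
          shift += 7;
        }
        remaining--;
        return result;
      }
    };
  }
}
```

A caller would loop with {{hasNext()}}/{{nextLong()}}; each call touches only 
the bytes of the next value.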

I also changed the format of the encoded buffer.  Might as well while the 
patient is on the table.  It’s documented in the code, but finalized and uc 
blocks are no longer treated specially.  The old layout (finalized and uc 
counts, followed by 3-long finalized blocks, a 3-long delimiter, and 4-long uc 
blocks) is now just a total block count followed by 4-long blocks.  Every 
block encodes its {{ReplicaState}}.  One 
compelling reason is the capability to encode additional bits into the state 
field.  One anticipated use is encoding the pinned block status.
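
The uniform layout described above can be sketched roughly as follows.  This 
shows the logical layout only, using a plain {{long[]}} for readability 
rather than the {{ByteString}} encoding the patch actually uses.  The class 
name {{UniformBlockLayout}}, the field order (id, length, generation stamp, 
state), and the bit positions chosen for the state and pinned flag are 
assumptions for illustration, not taken from the patch.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of the layout: [count][id, len, genStamp, state] repeated. */
final class UniformBlockLayout {
  static final int LONGS_PER_BLOCK = 4;

  // Assumed split of the state field: low byte holds the replica state,
  // and one higher bit is free to carry an extra flag such as "pinned".
  static final long STATE_MASK = 0xFFL;
  static final long PINNED_BIT = 1L << 8;

  /** Each list entry is {id, length, genStamp, stateField}. */
  static long[] encode(List<long[]> blocks) {
    long[] out = new long[1 + blocks.size() * LONGS_PER_BLOCK];
    out[0] = blocks.size(); // total block count comes first
    int i = 1;
    for (long[] b : blocks) {
      System.arraycopy(b, 0, out, i, LONGS_PER_BLOCK);
      i += LONGS_PER_BLOCK;
    }
    return out;
  }

  /** Returns the four longs of the block at the given index. */
  static long[] blockAt(long[] encoded, int index) {
    int offset = 1 + index * LONGS_PER_BLOCK;
    return Arrays.copyOfRange(encoded, offset, offset + LONGS_PER_BLOCK);
  }

  /** Packs a replica state ordinal and a pinned flag into one state field. */
  static long packState(int replicaState, boolean pinned) {
    return (replicaState & STATE_MASK) | (pinned ? PINNED_BIT : 0L);
  }

  static int replicaState(long stateField) {
    return (int) (stateField & STATE_MASK);
  }

  static boolean isPinned(long stateField) {
    return (stateField & PINNED_BIT) != 0;
  }
}
```

Because every block has the same width and carries its own state, a reader 
needs no delimiter and no separate finalized/uc counts.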

Phew.  I need to write some new tests to verify old/new compatibility.  I’ll 
also manually test on a perf cluster.

> PB encoding of block reports is very inefficient
> ------------------------------------------------
>
>                 Key: HDFS-7435
>                 URL: https://issues.apache.org/jira/browse/HDFS-7435
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: datanode, namenode
>    Affects Versions: 2.0.0-alpha, 3.0.0
>            Reporter: Daryn Sharp
>            Assignee: Daryn Sharp
>            Priority: Critical
>         Attachments: HDFS-7435.000.patch, HDFS-7435.001.patch, 
> HDFS-7435.002.patch, HDFS-7435.patch, HDFS-7435.patch
>
>
> Block reports are encoded as a PB repeating long.  Repeating fields use an 
> {{ArrayList}} with default capacity of 10.  A block report containing tens or 
> hundreds of thousands of longs (3 for each replica) is extremely expensive 
> since the {{ArrayList}} must realloc many times.  Also, decoding repeating 
> fields will box the primitive longs which must then be unboxed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
