[ 
https://issues.apache.org/jira/browse/HDFS-7435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14345859#comment-14345859
 ] 

Jing Zhao commented on HDFS-7435:
---------------------------------

Thanks for working on this, Daryn. The patch looks good to me in general. Some 
comments below:
# Please add javadoc for BlockListAsLongs.
# I think we'd better not include the replica state for finalized replicas. 
Though it is encoded using a varint, considering most of the blocks are 
finalized in the report, this can still increase the total size of the report. 
How about not including the finalized state and write an extra number in the 
header to indicate the total number of finalized replicas?
# For {{BlockReportLongIterator}}, it can be more clean to only keep either 
{{iter}} or {{currentBlockIndex}}.
# Now the iterator becomes an internal field of a BlockListAsLongs. I guess the 
main motivation here is to support both new and old encoding but avoid having 
both a ByteString and a long[] inside of BlockListsAsLongs? But in the 
meanwhile, this means there can be only one user of {{iterator()}} in the whole 
life time of BlockListAsLongs? This usage pattern is true in the current code, 
but is it good for BlockListAsLongs as a pure data structure?
# Also in {{getBlockReportIterator}}, if {{iter}} is null, we're actually 
creating a new {{BlockListAsLongs}} instance and returning its iterator. Though 
this may not be hit by normal code path, this may not be very clean.
# To address the above two comments, maybe what we can do is:
1) declare both the ByteString and the long[] in BlockListAsLongs and build 
iterator on top
2) or keep the old BlockListAsLongs class (but annotate it as deprecated) and 
use it for handling the old style encoding. In this way we can have a more 
clean new BlockListAsLongs, and we can delete the old one in the future.





> PB encoding of block reports is very inefficient
> ------------------------------------------------
>
>                 Key: HDFS-7435
>                 URL: https://issues.apache.org/jira/browse/HDFS-7435
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: datanode, namenode
>    Affects Versions: 2.0.0-alpha, 3.0.0
>            Reporter: Daryn Sharp
>            Assignee: Daryn Sharp
>            Priority: Critical
>         Attachments: HDFS-7435.000.patch, HDFS-7435.001.patch, 
> HDFS-7435.002.patch, HDFS-7435.patch, HDFS-7435.patch, HDFS-7435.patch, 
> HDFS-7435.patch, HDFS-7435.patch
>
>
> Block reports are encoded as a PB repeating long.  Repeating fields use an 
> {{ArrayList}} with default capacity of 10.  A block report containing tens or 
> hundreds of thousand of longs (3 for each replica) is extremely expensive 
> since the {{ArrayList}} must realloc many times.  Also, decoding repeating 
> fields will box the primitive longs which must then be unboxed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to