[ 
https://issues.apache.org/jira/browse/HDFS-7339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14289868#comment-14289868
 ] 

Zhe Zhang commented on HDFS-7339:
---------------------------------

[~jingzhao] Thanks for the insightful review! I believe this discussion also 
addresses comments from [~szetszwo] under HDFS-7285, HDFS-7614, and HDFS-7652. 

The main reason for creating a BlockGroup class and the hierarchical block ID 
protocol is to _minimize NN memory overhead_. As shown in the [fsimage analysis 
| 
https://issues.apache.org/jira/secure/attachment/12690129/fsimage-analysis-20150105.pdf],
 the {{blocksMap}} size increases 3.5x~5.4x if the NN naively tracks every 
striped block individually -- this translates to tens of GB of memory usage. 
This is mainly caused by small blocks being striped into many even smaller blocks. 

bq. I think DataNode does not need to know the difference between contiguous 
blocks and stripped blocks (when doing recovery the datanode can learn the 
information from NameNode). The concept of BlockGroup should be known and used 
only internally in NameNode (and maybe also logically known by the client while 
writing). 
bq. Datanodes and their block reports do not distinguish stripped and 
contiguous blocks. And we do not need to distinguish them from the block ID. 
They are treated equally while storing and reporting in/from the DN.
Agreed. DN is indeed group-agnostic in the current design. The only DN code 
change will be for block recovery and conversion. It will probably be clearer 
when the client patch (HDFS-7545) is ready. As shown in the [design | 
https://issues.apache.org/jira/secure/attachment/12687886/DataStripingSupportinHDFSClient.pdf],
 after receiving a newly allocated block group, the client does the following:
# Calculates block IDs from the block group ID and the group layout (number of 
data and parity blocks) -- a block's ID is basically the group ID plus the 
block's index in the group.
# The {{DFSOutputStream}} starts _n_ {{DataStreamer}} threads, each writing one 
block to its destination DN. Note that even the {{DataStreamer}} is unaware of 
the group -- it just follows the regular client-DN block writing protocol. 
Therefore the DN just receives and processes regular block creation and write 
requests.
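
The ID derivation in step 1 can be sketched roughly as follows. This is only an illustration: the class name, the 4-bit index width, and the rule that group IDs keep their low bits clear are assumptions, not details taken from the patch or the client design.

```java
// Sketch only: the class name, the 4-bit index width, and the alignment rule
// are assumptions for illustration, not taken from the HDFS-7339 patch.
final class BlockGroupIdSketch {
    // Assumed: the low INDEX_BITS bits of every group ID are zero,
    // so adding a block index never collides with another group.
    static final int INDEX_BITS = 4;
    static final long INDEX_MASK = (1L << INDEX_BITS) - 1;

    /** Returns the IDs of all blocks in a group: group ID plus block index. */
    static long[] blockIds(long groupId, int numDataBlocks, int numParityBlocks) {
        int total = numDataBlocks + numParityBlocks;
        if ((groupId & INDEX_MASK) != 0) {
            throw new IllegalArgumentException("group ID must leave the index bits clear");
        }
        if (total < 1 || total > INDEX_MASK + 1) {
            throw new IllegalArgumentException("group size does not fit in the index bits");
        }
        long[] ids = new long[total];
        for (int i = 0; i < total; i++) {
            ids[i] = groupId + i; // data blocks first, then parity blocks (assumed order)
        }
        return ids;
    }
}
```

Under these assumptions, a group ID of 1024 with a 6+3 layout yields block IDs 1024 through 1032, and each {{DataStreamer}} would be handed one of them as an ordinary block ID.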

The DN then follows the regular block reporting protocol for all contiguous and 
striped blocks. Then the NN (with the logic from HDFS-7652) will parse the 
reported block ID, and store the reported info under either {{blocksMap}} or 
the map of block groups. Again, the benefit of having a separate map for block 
groups is to avoid the severalfold increase in {{blocksMap}} size noted above. 
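
A minimal sketch of that NN-side routing is shown here. The actual parsing logic belongs to HDFS-7652 and is not reproduced; the flag bit, the index mask, and all names in this example are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: STRIPED_FLAG, INDEX_MASK, and both maps are hypothetical
// stand-ins, not the real NameNode data structures.
final class BlockReportRoutingSketch {
    static final long STRIPED_FLAG = 1L << 62; // assumed marker bit for striped block IDs
    static final long INDEX_MASK = 0xFL;       // assumed: low 4 bits hold the index in the group

    // Contiguous blocks: block ID -> number of reported replicas.
    final Map<Long, Integer> blocksMap = new HashMap<>();
    // Striped blocks: group ID -> bitmask of block indices reported so far.
    final Map<Long, Integer> blockGroups = new HashMap<>();

    /** Routes one reported block ID to the per-block map or the per-group map. */
    void processReportedBlock(long blockId) {
        if ((blockId & STRIPED_FLAG) != 0) {
            long groupId = blockId & ~INDEX_MASK;
            int index = (int) (blockId & INDEX_MASK);
            blockGroups.merge(groupId, 1 << index, (a, b) -> a | b);
        } else {
            blocksMap.merge(blockId, 1, Integer::sum);
        }
    }
}
```

The point of the sketch is that nine reported blocks of one group collapse into a single entry in the group map, whereas tracking them individually would add nine entries to {{blocksMap}}.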

We can track at the granularity of block groups because data loss can only happen when 
the entire group is "under-replicated" -- i.e. the number of healthy blocks in 
the group falls below a threshold. This coarse-grained tracking also aligns 
with the plan to push some monitoring and recovery workload from NN to DN, as 
[~sureshms] also [proposed | 
https://issues.apache.org/jira/browse/HDFS-7285?focusedCommentId=14192480&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14192480]
 in the meetup. 
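
To illustrate the group-level check, here is a small sketch. It assumes an MDS code such as Reed-Solomon, where any {{numDataBlocks}} surviving blocks are enough to rebuild the group; the class, enum, and state names are made up for this example.

```java
// Sketch only: assumes an MDS code (e.g. Reed-Solomon), so the loss threshold
// is the number of data blocks. All names here are hypothetical.
final class GroupHealthSketch {
    enum GroupState { HEALTHY, DEGRADED, UNRECOVERABLE }

    /** Classifies a block group from a single per-group count of healthy blocks. */
    static GroupState classify(int healthyBlocks, int numDataBlocks, int numParityBlocks) {
        if (healthyBlocks < numDataBlocks) {
            return GroupState.UNRECOVERABLE; // below the threshold: data is lost
        }
        if (healthyBlocks < numDataBlocks + numParityBlocks) {
            return GroupState.DEGRADED; // still decodable, but recovery should be scheduled
        }
        return GroupState.HEALTHY;
    }
}
```

With a 6+3 layout, a group with 6 healthy blocks is still decodable under this assumption, while one with 5 is not. The NN only needs one counter per group to make that call.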

bq. Fundamentally BlockGroup is also a BlockCollection. We do not need to 
assign generation stamp to BlockGroup (and even its id can be omitted). What we 
need is only maintaining the mapping between block and blockgroup in the 
original blocksmap, recording the list of blocks in the blockgroup, and 
recording the blockgroups in INodeFile.
This is an interesting thought and does simplify the code. But it seems to me 
the added complexity of tracking block groups is necessary to avoid heavy NN 
overhead. The generation stamp of a block group will be used to derive the 
stamps for its blocks (this logic is not included in the patch yet).

bq. I think in this way we can simplify the current design and reuse most of 
the current block management code. 
Reusing block management code is a great point. While developing this patch I 
did have to take much of the {{Block}} management logic and create counterparts for 
{{BlockGroup}}. One possibility is to create a "common ancestor" class for 
{{Block}} and {{BlockGroup}} (e.g., {{GeneralizedBlock}}). The main commonalities 
are:
# Both represent a contiguous range of data in a file. Therefore each file 
consists of an array of {{GeneralizedBlock}}.
# Both are a separate unit for NN monitoring. Therefore {{BlocksMap}} can work 
with {{GeneralizedBlock}}.
# Both have a capacity and a set of storage locations.
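
The "common ancestor" idea could look roughly like the sketch below. Only the name {{GeneralizedBlock}} comes from the discussion above; the subclass names, fields, and the choice to count only data blocks toward a group's capacity are assumptions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch only: everything except the name GeneralizedBlock is hypothetical.
abstract class GeneralizedBlock {
    private final long id;
    private final List<String> storageLocations = new ArrayList<>();

    GeneralizedBlock(long id) { this.id = id; }

    long getId() { return id; }

    /** Bytes of file data this unit can hold. */
    abstract long getCapacity();

    void addStorageLocation(String storageId) { storageLocations.add(storageId); }

    List<String> getStorageLocations() { return Collections.unmodifiableList(storageLocations); }
}

// A regular contiguous block.
final class SketchBlock extends GeneralizedBlock {
    private final long blockSize;

    SketchBlock(long id, long blockSize) { super(id); this.blockSize = blockSize; }

    @Override long getCapacity() { return blockSize; }
}

// A striped block group; capacity counts data blocks only (assumption).
final class SketchBlockGroup extends GeneralizedBlock {
    private final int numDataBlocks;
    private final long blockSize;

    SketchBlockGroup(long id, int numDataBlocks, long blockSize) {
        super(id);
        this.numDataBlocks = numDataBlocks;
        this.blockSize = blockSize;
    }

    @Override long getCapacity() { return numDataBlocks * blockSize; }
}
```

A file would then hold an array of {{GeneralizedBlock}}, and monitoring code written against the base class would not need to know which kind of unit it is looking at.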

Another alternative to reuse block mgmt code is to treat each {{Block}} as a 
single-member {{BlockGroup}}. 

I discussed the two alternatives above offline with [~andrew.wang] and we are 
inclined to use separate block group management code in this JIRA and start a 
refactoring JIRA once more of the logic is fleshed out. At that time we'll see more 
clearly which option is easier.

bq. We can develop the addBlockgroup API and revisit how to handle under 
construction block and blockgroups (and whether we need to assign 
complete/commit state to block group and define BlockGroupUC) later in separate 
jiras.
This sounds totally OK to me. I included {{addBlockgroup}} in this patch to 
have a meaningful unit test.

> Allocating and persisting block groups in NameNode
> --------------------------------------------------
>
>                 Key: HDFS-7339
>                 URL: https://issues.apache.org/jira/browse/HDFS-7339
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>            Reporter: Zhe Zhang
>            Assignee: Zhe Zhang
>         Attachments: HDFS-7339-001.patch, HDFS-7339-002.patch, 
> HDFS-7339-003.patch, HDFS-7339-004.patch, HDFS-7339-005.patch, 
> HDFS-7339-006.patch, Meta-striping.jpg, NN-stripping.jpg
>
>
> All erasure codec operations center around the concept of _block group_; they 
> are formed in initial encoding and looked up in recoveries and conversions. A 
> lightweight class {{BlockGroup}} is created to record the original and parity 
> blocks in a coding group, as well as a pointer to the codec schema (pluggable 
> codec schemas will be supported in HDFS-7337). With the striping layout, the 
> HDFS client needs to operate on all blocks in a {{BlockGroup}} concurrently. 
> Therefore we propose to extend a file’s inode to switch between _contiguous_ 
> and _striping_ modes, with the current mode recorded in a binary flag. An 
> array of BlockGroups (or BlockGroup IDs) is added, which remains empty for 
> “traditional” HDFS files with contiguous block layout.
> The NameNode creates and maintains {{BlockGroup}} instances through the new 
> {{ECManager}} component; the attached figure has an illustration of the 
> architecture. As a simple example, when a {_Striping+EC_} file is created and 
> written to, it will serve requests from the client to allocate new 
> {{BlockGroups}} and store them under the {{INodeFile}}. In the current phase, 
> {{BlockGroups}} are allocated both in initial online encoding and in the 
> conversion from replication to EC. {{ECManager}} also facilitates the lookup 
> of {{BlockGroup}} information for block recovery work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
