[ https://issues.apache.org/jira/browse/HDFS-5318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13854540#comment-13854540 ]

Eric Sirianni commented on HDFS-5318:
-------------------------------------

bq. If there is just one read-write logical replica for a given physical 
replica at a time we may be able to support this more easily without needing 
pluggable interface

This seems workable.  The basic design would be:
* The R/W logical replica is used for all BlockManager processing (counting, 
tagging as excess/corrupt/decommissioned, etc.)
* R/O logical replicas are considered only as possible block locations for 
reads (both for clients and inter-datanode replication).
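
To make the split concrete, here is a minimal sketch (hypothetical names, not the actual {{BlockManager}} code) in which only distinct writable storages count as logical replicas for BlockManager processing, while read-only locations remain eligible as read targets:

```java
// Illustrative sketch only: count distinct writable storages as logical
// replicas; keep read-only locations usable for reads. Class and method
// names are hypothetical, not HDFS APIs.
import java.util.*;

class ReplicaCountSketch {
  enum State { NORMAL, READ_ONLY }          // mirrors DatanodeStorage.State

  static final class StorageInfo {
    final String storageId;                 // shared pools report the same ID
    final State state;
    StorageInfo(String storageId, State state) {
      this.storageId = storageId;
      this.state = state;
    }
  }

  /** Live count used for excess/corrupt/decommission processing. */
  static int liveReplicaCount(List<StorageInfo> locations) {
    Set<String> distinctWritable = new HashSet<>();
    for (StorageInfo s : locations) {
      if (s.state != State.READ_ONLY) {
        distinctWritable.add(s.storageId);  // one logical replica per pool
      }
    }
    return distinctWritable.size();
  }

  /** All locations, R/O included, are candidates for serving reads. */
  static List<StorageInfo> readLocations(List<StorageInfo> locations) {
    return locations;                       // R/O replicas stay readable
  }
}
```

Under this rule, a block with one R/W and any number of R/O logical replicas has a live count of 1, so with desired replication = 2 it is flagged under-replicated.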

I'm not sure I understand the definition of an "incomplete" block.  Does that 
mean a finalized block whose length is less than the block size (i.e., a block 
with remaining capacity available for append)?

bq. Read-only replicas don't count towards excess replicas. In fact read-only 
replicas of incomplete blocks should probably not be counted towards satisfying 
the target replication factor.
Read-only replicas should never count towards the target replication factor, 
regardless of whether the block is "complete" or "incomplete", right?  For 
example, in the following scenario the block should be classified as 
under-replicated:
* desired replication = 2 : the NameNode has 1 R/O logical replica and 1 R/W 
logical replica

bq. Incomplete blocks with just read-only replicas are corrupt since there is 
no append path to the block.
I don't fully understand the implications here.  In a repcount=1 situation, 
when there is a single R/W logical replica and the DataNode hosting it goes 
offline, what happens?  I would propose the following:
* replication count goes to 0
* NameNode detects block as under replicated, triggers a replication from one 
of the (still online) R/O replicas to a new DataNode
* replication count goes to 1
* DataNode hosting the original R/W replica comes back online
* replication count goes to 2
* NameNode detects block as over-replicated, prunes one of the R/W replicas
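
The sequence above can be sketched as a small simulation (illustrative only; {{S_A}} and {{S_NEW}} are hypothetical storage IDs, and none of this is NameNode code) that tracks the writable-replica count after each step:

```java
// Hypothetical walk-through of the repcount=1 failure/recovery sequence:
// returns the writable-replica count after each of the four steps.
import java.util.*;

class RepairSequenceSketch {
  static int[] simulate() {
    final int desired = 1;
    Set<String> writable = new HashSet<>(List.of("S_A")); // the one R/W replica
    boolean readOnlyAvailable = true;        // an R/O copy remains online
    int[] counts = new int[4];

    writable.remove("S_A");                  // DataNode hosting R/W replica dies
    counts[0] = writable.size();             // 0 -> block is under-replicated

    if (writable.size() < desired && readOnlyAvailable) {
      writable.add("S_NEW");                 // re-replicate from an R/O location
    }
    counts[1] = writable.size();             // 1

    writable.add("S_A");                     // original DataNode comes back
    counts[2] = writable.size();             // 2 -> over-replicated

    while (writable.size() > desired) {
      writable.remove(writable.iterator().next()); // prune an excess replica
    }
    counts[3] = writable.size();             // back to 1
    return counts;
  }
}
```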

bq. Some of these conditions should be already enforced by the NameNode today, 
we can fix the rest.
OK - the only place I see {{DatanodeStorageInfo.getState()}} currently 
referenced is in {{BlockPlacementPolicyDefault}}:
{code:title=BlockPlacementPolicyDefault.java}
    if (storage.getState() == State.READ_ONLY) {
      logNodeIsNotChosen(storage, "storage is read-only");
      return false;
    }
{code}

> Pluggable interface for replica counting
> ----------------------------------------
>
>                 Key: HDFS-5318
>                 URL: https://issues.apache.org/jira/browse/HDFS-5318
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>    Affects Versions: 2.4.0
>            Reporter: Eric Sirianni
>         Attachments: HDFS-5318.patch, hdfs-5318.pdf
>
>
> There are several use cases for using shared-storage for datanode block 
> storage in an HDFS environment (storing cold blocks on a NAS device, Amazon 
> S3, etc.).
> With shared-storage, there is a distinction between:
> # a distinct physical copy of a block
> # an access-path to that block via a datanode.  
> A single 'replication count' metric cannot accurately capture both aspects.  
> However, for most of the current uses of 'replication count' in the Namenode, 
> the "number of physical copies" aspect seems to be the appropriate semantic.
> I propose altering the replication counting algorithm in the Namenode to 
> accurately infer distinct physical copies in a shared storage environment.  
> With HDFS-5115, a {{StorageID}} is a UUID.  I propose associating some minor 
> additional semantics to the {{StorageID}} - namely that multiple datanodes 
> attaching to the same physical shared storage pool should report the same 
> {{StorageID}} for that pool.  A minor modification would be required in the 
> DataNode to enable the generation of {{StorageID}} s to be pluggable behind 
> the {{FsDatasetSpi}} interface.  
> With those semantics in place, the number of physical copies of a block in a 
> shared storage environment can be calculated as the number of _distinct_ 
> {{StorageID}} s associated with that block.
> Consider the following combinations for two {{(DataNode ID, Storage ID)}} 
> pairs {{(DN_A, S_A) (DN_B, S_B)}} for a given block B:
> * {{DN_A != DN_B && S_A != S_B}} - *different* access paths to *different* 
> physical replicas (i.e. the traditional HDFS case with local disks)
> ** → Block B has {{ReplicationCount == 2}}
> * {{DN_A != DN_B && S_A == S_B}} - *different* access paths to the *same* 
> physical replica (e.g. HDFS datanodes mounting the same NAS share)
> ** → Block B has {{ReplicationCount == 1}}
> For example, if block B has the following location tuples:
> * {{DN_1, STORAGE_A}}
> * {{DN_2, STORAGE_A}}
> * {{DN_3, STORAGE_B}}
> * {{DN_4, STORAGE_B}},
> the effect of this proposed change would be to calculate the replication 
> factor in the namenode as *2* instead of *4*.
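
The distinct-{{StorageID}} counting proposed in the description can be sketched as follows (a hypothetical helper, not the actual NameNode implementation):

```java
// Sketch of the proposed counting rule: the replication count is the number
// of distinct StorageIDs across a block's (DataNode ID, Storage ID) pairs.
// Names are illustrative; this is not NameNode code.
import java.util.*;

class DistinctStorageCount {
  /** Each location is a {dataNodeId, storageId} pair. */
  static int replicationCount(List<String[]> locations) {
    Set<String> distinct = new HashSet<>();
    for (String[] loc : locations) {
      distinct.add(loc[1]);                 // count each physical pool once
    }
    return distinct.size();
  }
}
```

With the four location tuples in the example above (two on {{STORAGE_A}}, two on {{STORAGE_B}}), this yields 2 instead of 4.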



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)
