[ 
https://issues.apache.org/jira/browse/HDFS-9806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15315169#comment-15315169
 ] 

Chris Douglas commented on HDFS-9806:
-------------------------------------

bq.  How should we update the over-replication logic to work with caching?
Without read-through caching, there should be no over-replication. To implement 
it for the PoC, the DN just drops a replica on a local storage and reports it 
to the NN, which has a hack to ignore this case. This isn't a long-term 
solution.

One proposal extends the replication policy to be elastic, so the NN may 
increase the target replication above the configured value in response to 
reads. This has been discussed for years. The target DN would be the one chosen 
for the client, likely the DN running on the same node. Whether the DN learns of it 
in a heartbeat or a client requests the block (with a payload including the 
destination storage), the DN can replicate the block and tee data to the 
client. This would work not only for caching provided blocks in local media, 
but also for cases where long-running services gradually want data to migrate 
to the responsible worker nodes, temporarily going above the configured 
replication for that file. The machinery effecting provided storage can be 
independent of elastic replication, but agreed: the NN should continue to 
control when and where blocks are replicated, not the DN.
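To make the elastic proposal concrete, a rough sketch of the policy the NN could 
apply (all names here are hypothetical, not existing HDFS code; the thresholds 
are placeholders):
{code:java}
// Sketch only: the NN raises the effective target replication above the
// configured value in response to read load; the configured value stays the
// floor it converges back to once demand drops.
class ElasticReplicationPolicy {
  private final short configuredReplication;  // replication set on the file
  private final short maxElasticReplication;  // admin-imposed ceiling
  private final int readsPerExtraReplica;     // heuristic threshold

  ElasticReplicationPolicy(short configured, short max, int readsPerExtraReplica) {
    this.configuredReplication = configured;
    this.maxElasticReplication = max;
    this.readsPerExtraReplica = readsPerExtraReplica;
  }

  /** Effective target the replication monitor would use for this block. */
  short effectiveReplication(int recentReads) {
    int target = configuredReplication + recentReads / readsPerExtraReplica;
    return (short) Math.min(target, maxElasticReplication);
  }

  /** Replicas above the effective target are the first candidates to reclaim. */
  boolean isOverReplicated(int liveReplicas, int recentReads) {
    return liveReplicas > effectiveReplication(recentReads);
  }
}
{code}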

Without read-through caching, one could still tune the contents of local media 
at a file level by setting the target replication and storage policy. When 
writes are supported, policies that end in a PROVIDED type may require a 
special case, since it's not clear what multiple replicas would imply. If both 
local and remote are HDFS, the policy could use that to set replication in the 
external store, but the semantics are not obvious.
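For the read-only case, the file-level knobs already exist in the FileSystem API; 
something like the following (the path is illustrative, and the "PROVIDED" policy 
name is part of this proposal, not a policy that exists today):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TuneLocalCopies {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/data/warehouse/part-00000");

    // Keep more local copies of a hot file by raising its target replication.
    fs.setReplication(file, (short) 3);

    // Assumed policy name: a storage policy ending in a PROVIDED type is part
    // of this proposal, not an existing policy.
    fs.setStoragePolicy(file, "PROVIDED");
  }
}
{code}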

bq. So does a DN still report connectivity to the PROVIDED store to the NN at each 
BR? I guess an alternative is for NN itself to periodically check the 
connectivity?
Good point. The NN will need to periodically contact the external store to 
implement refresh, but this may not catch cases where the DN can't connect 
(e.g., expired credentials). The DN should report the storage as failed, which 
should remove its entry from the composite DatanodeStorage at the NN.
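A rough sketch of the DN side (ProvidedVolume and probe() are made-up names, 
standing in for whatever wraps the external store on the DN):
{code:java}
/** Hypothetical interface for the DN-side wrapper around the external store. */
interface ProvidedVolume {
  /** A cheap call against the external store, e.g. a metadata lookup. */
  void probe() throws Exception;
}

class ProvidedVolumeMonitor {
  private final ProvidedVolume volume;
  private volatile boolean failed = false;

  ProvidedVolumeMonitor(ProvidedVolume volume) {
    this.volume = volume;
  }

  /** Invoked before each heartbeat/block report is assembled. */
  void checkConnectivity() {
    try {
      volume.probe();
      failed = false;
    } catch (Exception e) {
      // Expired credentials, network partition, etc. Mark the storage failed
      // so the NN can drop this DN's entry from the composite DatanodeStorage.
      failed = true;
    }
  }

  /** If true, the DN reports the PROVIDED storage as FAILED. */
  boolean isFailed() {
    return failed;
  }
}
{code}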

bq. This would be a tricky case to handle. How are directories persisted in the 
external store?
The NN needs a policy. Even if the directory exists in both places, the 
permissions/ownership may not match, or its successful creation may have been 
used by a framework to check exclusivity. The NN needs to know whether it or 
the remote is authoritative if both are applying updates. Similar to the 
"source of truth" 
[above|https://issues.apache.org/jira/browse/HDFS-9806?focusedCommentId=15295776&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15295776]:
 the design assumes that cycles are rare, and most tiered clusters will be 
either producers or consumers for a subtree/region. A truly shared namespace 
requires more sophisticated coordination between the two namesystems.

bq. I think more details can be added to Section 2 for clarification. In 
particular, per the above comment, is this work mainly intended for "using a 
big external store to back a single smaller HDFS"? Or the above "out-of-band 
update" use case is also important? Is it better to have a phase 1 for 
single-HDFS use case (no other updates to external store)?
The design doc needs to lay out the implementation plan more clearly. 
Initially, writes are not supported through HDFS (read-only). Refresh is an 
important case, so the NN needs some way to follow the external store so that new 
data eventually appears in the namespace. Write-through and write-back caching 
will take some time.
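As a rough illustration of what a refresh pass could look like (the store URI and 
the standalone scanner are assumptions; how the NN turns the listing into 
namespace updates is the open design question):
{code:java}
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class ProvidedStoreRefresh {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem remote = FileSystem.get(URI.create("s3a://warehouse-bucket/"), conf);
    long lastScanTime = 0L; // persisted between passes in a real implementation

    // Walk the external store and flag anything created or modified since the
    // last pass; these are the entries the NN would need to (re)ingest so new
    // data eventually appears in the namespace.
    RemoteIterator<LocatedFileStatus> it = remote.listFiles(new Path("/"), true);
    while (it.hasNext()) {
      FileStatus st = it.next();
      if (st.getModificationTime() > lastScanTime) {
        System.out.println("refresh: " + st.getPath());
      }
    }
  }
}
{code}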

> Allow HDFS block replicas to be provided by an external storage system
> ----------------------------------------------------------------------
>
>                 Key: HDFS-9806
>                 URL: https://issues.apache.org/jira/browse/HDFS-9806
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>            Reporter: Chris Douglas
>         Attachments: HDFS-9806-design.001.pdf
>
>
> In addition to heterogeneous media, many applications work with heterogeneous 
> storage systems. The guarantees and semantics provided by these systems are 
> often similar, but not identical to those of 
> [HDFS|https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/index.html].
>  Any client accessing multiple storage systems is responsible for reasoning 
> about each system independently, and must propagate and renew credentials for 
> each store.
> Remote stores could be mounted under HDFS. Block locations could be mapped to 
> immutable file regions, opaque IDs, or other tokens that represent a 
> consistent view of the data. While correctness for arbitrary operations 
> requires careful coordination between stores, in practice we can provide 
> workable semantics with weaker guarantees.


