[ https://issues.apache.org/jira/browse/HDFS-9806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15315169#comment-15315169 ]
Chris Douglas commented on HDFS-9806:
-------------------------------------

bq. How should we update the over-replication logic to work with caching?

Without read-through caching, there should be no over-replication. To implement it for the PoC, the DN just drops a replica on a local storage and reports it to the NN, which has a hack to ignore this case. This isn't a long-term solution.

One proposal extends the replication policy to be elastic, so the NN may increase the target replication above the configured value in response to reads. This has been discussed for years. The target DN would be the one chosen for the client, likely the DN running on the same node. Whether the DN learns of it in a heartbeat or a client requests the block (with a payload including the destination storage), the DN can replicate the block and tee data to the client. This would work not only for caching provided blocks on local media, but also for cases where long-running services gradually want data to migrate to the responsible worker nodes, temporarily going above the configured replication for that file. The machinery effecting provided storage can be independent of elastic replication, but agreed: the NN should continue to control when and where blocks are replicated, not the DN.

Without read-through caching, one could still tune the contents of local media at a file level by setting the target replication and storage policy (see the first sketch below). When writes are supported, policies that end in a PROVIDED type may require a special case, since it's not clear what multiple replicas would imply. If both local and remote are HDFS, the policy could use that to set replication in the external store, but the semantics are not obvious.

bq. So does a DN still report connectivity to the PROVIDED store to the NN at each BR? I guess an alternative is for the NN itself to periodically check the connectivity?

Good point. The NN will need to periodically contact the external store to implement refresh, but this may not catch cases where the DN can't connect (e.g., expired credentials). The DN should report the storage as failed, which should remove its entry from the composite DatanodeStorage at the NN (see the second sketch below).

bq. This would be a tricky case to handle. How are directories persisted in the external store?

The NN needs a policy. Even if the directory exists in both places, the permissions/ownership may not match, or its successful creation may have been used by a framework to check exclusivity. The NN needs to know whether it or the remote is authoritative if both are applying updates. Similar to the "source of truth" [above|https://issues.apache.org/jira/browse/HDFS-9806?focusedCommentId=15295776&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15295776]: the design assumes that cycles are rare, and most tiered clusters will be either producers or consumers for a subtree/region. A truly shared namespace requires more sophisticated coordination between the two namesystems.

bq. I think more details can be added to Section 2 for clarification. In particular, per the above comment, is this work mainly intended for "using a big external store to back a single smaller HDFS"? Or is the above "out-of-band update" use case also important? Is it better to have a phase 1 for the single-HDFS use case (no other updates to the external store)?

The design doc needs to lay out the implementation plan more clearly. Initially, writes are not supported through HDFS (read-only). Refresh is an important case, so the NN needs some way to follow the external store so new data eventually appears in the NN (see the third sketch below). Write-through and write-back caching will take some time.
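To make the file-level tuning mentioned above concrete, here is a minimal sketch against the existing replication and storage-policy APIs. The paths are hypothetical, and ALL_SSD stands in for a policy that ends in a PROVIDED type, which doesn't exist yet:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class TuneLocalMedia {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path dir = new Path("/datasets/hot");              // hypothetical directory
    Path file = new Path("/datasets/hot/part-00000");  // hypothetical file

    try (DistributedFileSystem dfs =
        (DistributedFileSystem) dir.getFileSystem(conf)) {
      // Storage policies apply to files and directories; ALL_SSD stands in
      // for a policy ending in a PROVIDED type once one exists.
      dfs.setStoragePolicy(dir, "ALL_SSD");
      // Target replication is a per-file setting.
      dfs.setReplication(file, (short) 3);
    }
  }
}
{code}

The same knobs are reachable from the shell via hdfs storagepolicies -setStoragePolicy and hdfs dfs -setrep.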
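For the connectivity question, this is roughly the shape of the entry a DN would mark FAILED in its next storage report when it cannot reach the external store. The storage ID is made up, and DISK stands in for the PROVIDED type proposed here so the sketch compiles against current releases:

{code:java}
import org.apache.hadoop.fs.StorageType;
import org.apache.hadoop.hdfs.server.protocol.DatanodeStorage;

public class ProvidedStorageFailure {
  // Sketch only: the DN reports the shared storage as FAILED so the NN can
  // drop it from the composite DatanodeStorage; the rest of the report
  // plumbing is unchanged.
  static DatanodeStorage failedProvidedStorage() {
    return new DatanodeStorage(
        "DS-PROVIDED-0000",            // hypothetical shared storage ID
        DatanodeStorage.State.FAILED,  // reported on connectivity loss
        StorageType.DISK);             // would be PROVIDED under this proposal
  }
}
{code}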
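Finally, a rough sketch of the refresh loop, assuming the external store is reachable through a Hadoop FileSystem client. The URI, mount path, and interval are placeholders, and the "apply to the namespace" step is elided; this is just the shape of the polling, not a design:

{code:java}
import java.net.URI;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ProvidedRefresh {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical external store and mounted region.
    FileSystem remote = FileSystem.get(URI.create("s3a://bucket/"), conf);
    Path mounted = new Path("/exports/data");

    ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();
    scheduler.scheduleWithFixedDelay(() -> {
      try {
        for (FileStatus st : remote.listStatus(mounted)) {
          // Placeholder: compare against the mounted region of the namespace
          // and schedule adds/updates so new data eventually appears in the NN.
          System.out.println("saw " + st.getPath() + " len=" + st.getLen());
        }
      } catch (Exception e) {
        // An unreachable store is logged here; DNs report their own
        // connectivity failures separately, as discussed above.
        e.printStackTrace();
      }
    }, 0, 10, TimeUnit.MINUTES);
  }
}
{code}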
> Allow HDFS block replicas to be provided by an external storage system
> -----------------------------------------------------------------------
>
>                 Key: HDFS-9806
>                 URL: https://issues.apache.org/jira/browse/HDFS-9806
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>            Reporter: Chris Douglas
>         Attachments: HDFS-9806-design.001.pdf
>
>
> In addition to heterogeneous media, many applications work with heterogeneous
> storage systems. The guarantees and semantics provided by these systems are
> often similar, but not identical to those of
> [HDFS|https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/index.html].
> Any client accessing multiple storage systems is responsible for reasoning
> about each system independently, and must propagate and renew credentials for
> each store.
> Remote stores could be mounted under HDFS. Block locations could be mapped to
> immutable file regions, opaque IDs, or other tokens that represent a
> consistent view of the data. While correctness for arbitrary operations
> requires careful coordination between stores, in practice we can provide
> workable semantics with weaker guarantees.