[ https://issues.apache.org/jira/browse/HDFS-9806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15295776#comment-15295776 ]
Chris Douglas commented on HDFS-9806:
-------------------------------------

Great questions.

bq. the way we interpreted the document, the external (provided) storage is the source of truth so any changes there should be updated in HDFS and any inconsistencies that arise would favour the external store

This is accurate, particularly for the first implementation. Once writes to provided storage are supported, HDFS could be declared the source of truth when it conflicts with the external store, but solving this generally for arbitrary external stores seems untenable. Similarly, rules for merging conflicts are rarely general; e.g., concurrent updates to Hive can't be resolved by the filesystem, but might be reconcilable by that system. The design assumes that cycles are rare, and that most tiered clusters will be either producers or consumers for a subtree/region.

bq. If the Namenode is accessing the PROVIDED storage to update its mapping shouldn’t it also update the nonce data at the same time and instruct the datanode to refresh too? Or is the intention for the Namenode to only update the directory information and not the actual nonce data for the files? (If so, how could the Namenode apply heuristics to detect “promoting output to a parent directory”?)

Yes, the NN is also responsible for maintaining the nonce. The DN will refuse to fetch a block from the external store when it doesn't match the recorded metadata. In general, the NN can only periodically scan the external namespace (if it has one); it can't move between fixed, meaningful snapshots that remain valid while it updates its view. A scan may capture inconsistent states (e.g., as a job promotes its output, the scan may find some of its data promoted, some still transient, yet the sentinel file already declaring it fully promoted). The heuristics are intended to avoid treating renames as a delete/create of the same data, but they're not sufficient to guarantee the view is meaningful.
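To make the nonce check concrete, here is a minimal standalone sketch of the DN-side refusal described above. None of these names are actual HDFS classes; {{BlockAlias}}, {{ProvidedReplicaReader}}, and the ETag-style nonce are hypothetical, and a real DN would obtain the current nonce via a HEAD request (or equivalent) against the external store:

```java
import java.util.Objects;

/**
 * Hypothetical record of what the NN's scan observed for one provided
 * block: where it lives in the external store, and the nonce (e.g., an
 * object ETag or mtime) captured at scan time.
 */
class BlockAlias {
    final String uri;    // location of the region in the external store
    final long offset;   // byte offset of the block within the remote object
    final long length;   // block length in bytes
    final String nonce;  // metadata recorded by the NN's scan

    BlockAlias(String uri, long offset, long length, String nonce) {
        this.uri = uri; this.offset = offset;
        this.length = length; this.nonce = nonce;
    }
}

public class ProvidedReplicaReader {
    /**
     * Before serving a provided replica, compare the nonce the NN recorded
     * with the store's current metadata. On mismatch, refuse the read: the
     * alias is stale (the external object changed under us), not "corrupt".
     */
    static boolean isServable(BlockAlias alias, String currentNonce) {
        return Objects.equals(alias.nonce, currentNonce);
    }

    public static void main(String[] args) {
        BlockAlias a = new BlockAlias("s3a://bucket/part-0000",
                                      0L, 128L << 20, "etag-v1");
        System.out.println(isServable(a, "etag-v1")); // unchanged: serve it
        System.out.println(isServable(a, "etag-v2")); // store changed: refuse
    }
}
```

The point of keeping the check this dumb at the DN is that the NN remains the only component updating the mapping; the DN never repairs the alias, it only declines to serve through it.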
The design should not prevent a tighter integration, and it's flawed if it does. For example, if the NN can move between snapshots and coordinate with the external store, then the references only become stale when the NN moves past them (an unusually dramatic, but legal, sequence even for a NN without provided storage). It's also interesting to consider what refresh looks like when the remote store doesn't have a namespace. A colleague used the prototype to fetch objects from an archival object store, using the NN as its (only) namespace. Here, the NN has no real "refresh" work to do; it only implements the mapping.

bq. How should this work in the face of Storage Policies? For example, if we have a StoragePolicy of {SSD, DISK, PROVIDED}

The design depends on storage policies and the mechanisms in the NN for managing heterogeneous storage. In the prototype, when the namespace is scanned, the default sets replication to 1 and the storage type to {PROVIDED, DISK}. When a DN reports that storage as accessible, all blocks in that store are reachable. Files with replication > 1 are replicated into local storage at startup, but all blocks are immediately accessible. On the write path, this could be used to implement write-through and write-back semantics: if the first replica is provided, then updates are durable in the external store before success is returned to the client. The block manager needs to record the mapping before the block is written by the DN, so it knows where to put the data in the external store.

bq. When you say “Periodically and/or when a particular directory or file is accessed on the Namenode” do you mean this is something to be configured, or just that it hasn’t been decided if both are required. We think periodically is required since this is the only way to clean up directory listings with files that have been removed from the PROVIDED storage. On access, it makes sense to always make a HEAD request (or equivalent) to make sure it isn’t stale.
Agreed, the NN needs to stay in sync. It's not strictly required to verify every operation in the NN, since that optimistic assumption is (and must be) checked at the DN anyway. In addition to loading new data, maintaining local replicas also requires the NN to do periodic scans, since datasets removed from the external store should eventually become inaccessible in the clusters caching them.

bq. Finally, do you anticipate changes to the wire protocol between the Namenode and Datanode?

Yes. To your earlier point, if the DN discovers an inconsistency, some component is responsible for fixing it up. We haven't prototyped this, but intuitively: if the NN is the only component updating the table, then it's easier to reason about the synchronization of its map with the external store. Since the NN may only periodically become aware of inconsistency, the DN should report stale aliases. Reporting these as "corrupt" is not accurate, since staleness may apply to _all_ replicas, including those on local media, if the block was deleted or is no longer accessible. That said, if we can avoid protocol changes without muddying the semantics, we will.

> Allow HDFS block replicas to be provided by an external storage system
> ----------------------------------------------------------------------
>
>                 Key: HDFS-9806
>                 URL: https://issues.apache.org/jira/browse/HDFS-9806
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>            Reporter: Chris Douglas
>        Attachments: HDFS-9806-design.001.pdf
>
> In addition to heterogeneous media, many applications work with heterogeneous
> storage systems. The guarantees and semantics provided by these systems are
> often similar, but not identical to those of
> [HDFS|https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/index.html].
> Any client accessing multiple storage systems is responsible for reasoning
> about each system independently, and must propagate and renew credentials for
> each store.
> Remote stores could be mounted under HDFS. Block locations could be mapped to
> immutable file regions, opaque IDs, or other tokens that represent a
> consistent view of the data. While correctness for arbitrary operations
> requires careful coordination between stores, in practice we can provide
> workable semantics with weaker guarantees.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)