[ https://issues.apache.org/jira/browse/HDFS-9806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15295776#comment-15295776 ]

Chris Douglas commented on HDFS-9806:
-------------------------------------

Great questions.

bq. the way we interpreted the document, the external (provided) storage is the 
source of truth so any changes there should be updated in HDFS and any 
inconsistencies that arise would favour the external store
This is accurate, particularly of the first implementation. When it supports 
writes to provided storage, HDFS may be declared as the source of truth when it 
conflicts with the external store, but solving this generally for generic 
external stores seems untenable. Similarly, rules for merging conflicts are 
rarely general; e.g., concurrent updates to Hive can't be resolved by the 
filesystem, but might be reconcilable by that system. The design assumes that 
cycles are rare, and most tiered clusters will be either producers or consumers 
for a subtree/region.

bq. If the Namenode is accessing the PROVIDED storage to update its mapping 
shouldn’t it also update the nonce data at the same time and instruct the 
datanode to refresh too? Or is the intention for the Namenode to only update 
the directory information and not the actual nonce data for the files? (If so, 
how could the Namenode apply heuristics to detect “promoting output to a parent 
directory”?).
Yes, the NN is also responsible for maintaining the nonce. The DN will refuse 
to fetch the block from the external store when it doesn't match the recorded 
metadata. In general, the NN can only periodically scan the external namespace 
(if it has one). It can't move between fixed, meaningful snapshots that remain 
valid while it updates its view. The scan may capture inconsistent states 
(e.g., as a job is promoting its data, the scan may find some of its output 
data promoted, some transient, and the sentinel file declaring it fully 
promoted). The heuristics are intended to avoid treating renames as 
delete/create of the same data, but they're not sufficient to guarantee the 
view is meaningful.
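
To make that concrete, here is a minimal, self-contained sketch of the check a DN 
could make before serving a PROVIDED replica. All names here are invented for 
illustration; they are not existing HDFS APIs.

{code:java}
// Hypothetical sketch: before streaming a PROVIDED replica, the DN compares the
// nonce the NN recorded when it scanned the external namespace against what the
// external store reports now, and refuses the read on a mismatch.
import java.util.Objects;

public class ProvidedReplicaCheck {

  /** Alias recorded by the NN: where the block lives externally, plus its nonce. */
  static final class BlockAlias {
    final String externalPath;
    final long offset;
    final long length;
    final String nonce;          // e.g., ETag, version id, or an (mtime, length) digest

    BlockAlias(String externalPath, long offset, long length, String nonce) {
      this.externalPath = externalPath;
      this.offset = offset;
      this.length = length;
      this.nonce = nonce;
    }
  }

  /** Attributes the DN reads back from the external store at access time. */
  interface ExternalStore {
    String currentNonce(String path);   // HEAD-equivalent lookup
  }

  /** Serve the block only if the remote object still matches the recorded nonce. */
  static boolean canServe(BlockAlias alias, ExternalStore store) {
    String remote = store.currentNonce(alias.externalPath);
    return Objects.equals(alias.nonce, remote);
  }

  public static void main(String[] args) {
    BlockAlias alias = new BlockAlias("bucket/data/part-00000", 0, 128 << 20, "etag-abc123");
    ExternalStore unchanged = path -> "etag-abc123";     // object untouched remotely
    System.out.println(canServe(alias, unchanged));      // true: safe to stream
    ExternalStore rewritten = path -> "etag-def456";     // object rewritten remotely
    System.out.println(canServe(alias, rewritten));      // false: refuse, report stale
  }
}
{code}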

The design should not prevent a tighter integration, and it's flawed if it 
does. For example, if the NN can move between snapshots and coordinate with the 
external store, then the references only become stale when the NN moves past 
them (an unusually dramatic, but legal sequence for the NN without provided 
storage). It's also interesting to consider what refresh looks like, when the 
remote store doesn't have a namespace. A colleague used the prototype to fetch 
objects from an archive object store, using the NN as its (only) namespace. 
Here, the NN has no real "refresh" work to do; it only implements the mapping.

bq. How should this work in the face of Storage Policies? For example, if we 
have a StoragePolicy of {SSD, DISK, PROVIDED} 
The design depends on storage policies, and the mechanisms in the NN for 
managing heterogeneous storage. In the prototype, when the namespace is scanned, 
each file defaults to replication 1 with storage types {PROVIDED, DISK}. When a DN 
reports that storage as accessible, all blocks in that store become reachable. Files 
with replication > 1 are copied into local storage at startup, but all blocks are 
immediately accessible.
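
As a rough illustration of that default mapping (the PROVIDED storage type and the 
placement class below are assumptions for this sketch, not the current 
BlockStoragePolicy API):

{code:java}
// Illustrative only: how a file scanned from the external namespace might be
// described, defaulting to a single PROVIDED replica with DISK as the local tier.
import java.util.Arrays;
import java.util.List;

public class ProvidedPolicySketch {

  enum StorageType { PROVIDED, DISK }

  static final class FilePlacement {
    final short replication;
    final List<StorageType> preferredTypes;

    FilePlacement(short replication, List<StorageType> preferredTypes) {
      this.replication = replication;
      this.preferredTypes = preferredTypes;
    }

    /** Extra local replicas to create beyond the single PROVIDED copy. */
    int localCopiesNeeded() {
      return Math.max(0, replication - 1);
    }
  }

  /** Default applied when the external namespace is scanned into the NN. */
  static FilePlacement defaultForScannedFile() {
    return new FilePlacement((short) 1, Arrays.asList(StorageType.PROVIDED, StorageType.DISK));
  }

  public static void main(String[] args) {
    FilePlacement scanned = defaultForScannedFile();
    System.out.println(scanned.preferredTypes + ", local copies: " + scanned.localCopiesNeeded());

    // A file configured with replication 3 stays readable via PROVIDED immediately,
    // while 2 additional local replicas are scheduled in the background.
    FilePlacement replicated = new FilePlacement((short) 3, scanned.preferredTypes);
    System.out.println("local copies to schedule: " + replicated.localCopiesNeeded());
  }
}
{code}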

On the write path, this could be used to implement write-through and write-back 
semantics. If the first replica is provided, then updates are durable in the 
external store before the write returns to the client. The block manager needs to 
record the mapping before the block is written by the DN, so it knows where to 
put the data in the external store.
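
A hedged sketch of that write-through ordering, with invented interfaces standing 
in for the block manager and the external store:

{code:java}
// Sketch only: the alias is recorded before the data is written, so the DN knows
// where the bytes belong externally, and the client is acknowledged only after the
// external (first) replica is durable. Names here are not HDFS APIs.
import java.io.IOException;

public class WriteThroughSketch {

  interface ExternalStore {
    void put(String externalPath, byte[] data) throws IOException;  // durable on return
  }

  interface BlockMap {
    void recordAlias(long blockId, String externalPath);            // NN-side mapping
  }

  static void writeBlock(long blockId, byte[] data, BlockMap map, ExternalStore store)
      throws IOException {
    String externalPath = "provided/block_" + blockId;  // placement chosen up front
    map.recordAlias(blockId, externalPath);              // mapping exists before the write
    store.put(externalPath, data);                       // write-through: durable here
    // Remaining local replicas (the write-back portion of the replication factor)
    // can follow asynchronously without weakening the durability already promised.
  }

  public static void main(String[] args) throws IOException {
    BlockMap map = (blockId, path) -> System.out.println("alias recorded: " + blockId + " -> " + path);
    ExternalStore store = (path, data) -> System.out.println("durable write of " + data.length + " bytes to " + path);
    writeBlock(42L, new byte[]{1, 2, 3}, map, store);
  }
}
{code}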

bq. When you say “Periodically and/or when a particular directory or file is 
accessed on the Namenode” do you mean this is something to be configured, or 
just that it hasn’t been decided if both are required. We think periodically is 
required since this is the only way to clean up directory listings with files 
that have been removed from the PROVIDED storage. On access, it makes sense to 
always make a HEAD request (or equivalent) to make sure it isn’t stale.
Agreed, the NN needs to stay in sync. It's not strictly required to verify 
every operation in the NN, since that optimistic assumption is (and must be) 
checked at the DN anyway. In addition to loading new data, maintaining local 
replicas also requires the NN to do periodic scans, since datasets removed from 
the external store should eventually become inaccessible in the clusters 
caching them.
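
A minimal sketch of the two refresh paths discussed here, with invented names for 
the alias map and the HEAD-equivalent lookup:

{code:java}
// Sketch only: a periodic scan that drops mappings for data removed from the
// external store, and an on-access check that verifies a mapping before trusting it.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class NamespaceRefreshSketch {

  interface ExternalNamespace {
    boolean exists(String path);       // HEAD-equivalent existence check
    String nonce(String path);         // ETag / version / mtime digest
  }

  private final Map<String, String> aliasMap = new ConcurrentHashMap<>(); // path -> nonce
  private final ExternalNamespace remote;

  NamespaceRefreshSketch(ExternalNamespace remote) {
    this.remote = remote;
  }

  /** Periodic pass: datasets removed externally eventually become inaccessible here. */
  void scanOnce() {
    aliasMap.keySet().removeIf(path -> !remote.exists(path));
  }

  /** On-access pass: refresh the nonce so a stale mapping is caught before the DN does. */
  boolean verifyOnAccess(String path) {
    String recorded = aliasMap.get(path);
    if (recorded == null || !remote.exists(path)) {
      return false;
    }
    String current = remote.nonce(path);
    aliasMap.put(path, current);
    return recorded.equals(current);
  }

  void startPeriodicScan(long intervalMinutes) {
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    scheduler.scheduleAtFixedRate(this::scanOnce, intervalMinutes, intervalMinutes, TimeUnit.MINUTES);
  }

  public static void main(String[] args) {
    ExternalNamespace remote = new ExternalNamespace() {
      public boolean exists(String path) { return path.startsWith("live/"); }
      public String nonce(String path) { return "v1"; }
    };
    NamespaceRefreshSketch nn = new NamespaceRefreshSketch(remote);
    nn.aliasMap.put("live/a", "v1");
    nn.aliasMap.put("removed/b", "v0");
    nn.scanOnce();
    System.out.println(nn.aliasMap.keySet());          // [live/a]
    System.out.println(nn.verifyOnAccess("live/a"));   // true
  }
}
{code}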

bq. Finally, do you anticipate changes to the wire protocol between the 
Namenode and Datanode?
Yes. To your earlier point, if the DN discovers an inconsistency, some 
component is responsible for fixing that up. We haven't prototyped this, but 
intuitively: if the NN is the only component updating the table, then it's 
easier to reason about the synchronization of its map to the external store. 
Since the NN may only periodically become aware of inconsistency, the DN should 
report stale aliases. Reporting these as "corrupt" is not accurate, since staleness 
may apply to _all_ replicas, including those on local media, if the block was 
deleted or is no longer accessible. That said, if we can avoid protocol changes 
without muddying the semantics, we will.
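
Purely as a thought experiment (not a proposed wire format), the distinction the DN 
would need to express could look something like:

{code:java}
// Hypothetical classification, not HDFS protocol: separate a stale PROVIDED alias,
// which may affect every replica, from a locally corrupt replica, which does not.
public class ReplicaReportSketch {

  enum ReplicaState {
    HEALTHY,          // checksum and nonce both verified
    CORRUPT,          // local data fails checksum; other replicas may still be good
    STALE_PROVIDED    // external object changed or vanished; may affect all replicas
  }

  static ReplicaState classify(boolean checksumOk, boolean providedNonceMatches) {
    if (!checksumOk) {
      return ReplicaState.CORRUPT;
    }
    return providedNonceMatches ? ReplicaState.HEALTHY : ReplicaState.STALE_PROVIDED;
  }

  public static void main(String[] args) {
    System.out.println(classify(true, false));  // STALE_PROVIDED: NN should refresh the alias
    System.out.println(classify(false, true));  // CORRUPT: normal re-replication path
  }
}
{code}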

> Allow HDFS block replicas to be provided by an external storage system
> ----------------------------------------------------------------------
>
>                 Key: HDFS-9806
>                 URL: https://issues.apache.org/jira/browse/HDFS-9806
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>            Reporter: Chris Douglas
>         Attachments: HDFS-9806-design.001.pdf
>
>
> In addition to heterogeneous media, many applications work with heterogeneous 
> storage systems. The guarantees and semantics provided by these systems are 
> often similar, but not identical to those of 
> [HDFS|https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/index.html].
>  Any client accessing multiple storage systems is responsible for reasoning 
> about each system independently, and must propagate and renew credentials for 
> each store.
> Remote stores could be mounted under HDFS. Block locations could be mapped to 
> immutable file regions, opaque IDs, or other tokens that represent a 
> consistent view of the data. While correctness for arbitrary operations 
> requires careful coordination between stores, in practice we can provide 
> workable semantics with weaker guarantees.


