[ https://issues.apache.org/jira/browse/HDFS-4672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13627219#comment-13627219 ]
Andrew Purtell commented on HDFS-4672:
--------------------------------------

bq. One way to reduce complexity and RAM pressure would be to only support placement hints on directories and have them apply only to files in that immediate directory. That should limit meta-data cost and address HBase and other use cases.

On minimizing RAM pressure, the thing to do here might be to allow hints on a directory to apply to all descendants. Otherwise, if we have N directories under one parent, we would need N hints instead of 1.

If the proposal for storage device class hints is generalized/incorporated into an extended attributes facility, then this may be an interesting discussion. In the case of at least Linux, Windows NT+, and *BSD, xattrs are arbitrary name/value pairs associated only with a single file or directory object, and a query on a given file or directory returns only the xattrs found in its inode (or equivalent). However, since namespace storage in HDFS is at a premium, it may make sense to introduce a bit that signals the xattr should be inherited by all descendants.

> Support tiered storage policies
> -------------------------------
>
>                 Key: HDFS-4672
>                 URL: https://issues.apache.org/jira/browse/HDFS-4672
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: datanode, hdfs-client, libhdfs, namenode
>            Reporter: Andrew Purtell
>
> We would like to be able to create certain files on certain storage device
> classes (e.g. spinning media, solid state devices, RAM disk, non-volatile
> memory). HDFS-2832 enables heterogeneous storage at the DataNode, so the
> NameNode can gain awareness of what different storage options are available
> in the pool and where they are located, but no API is provided for clients or
> block placement plugins to perform device aware block placement.
> We would like to propose a set of extensions that also have broad
> applicability to use cases where storage device affinity is important:
>
> - Add an enum of generic storage device classes, borrowing from current
> taxonomy of the storage industry
>
> - Augment DataNode volume metadata in storage reports with this enum
>
> - Extend the namespace so pluggable block policies can be specified on a
> directory and storage device class can be tracked in the Inode. Perhaps this
> could be a larger discussion on adding support for extended attributes in the
> HDFS namespace. The Inode should track both the storage device class hint and
> the current actual storage device class. FileStatus should expose this
> information (or xattrs in general) to clients.
>
> - Extend the pluggable block policy framework so policies can also consider,
> and specify, affinity for a particular storage device class
>
> - Extend the file creation API to accept a storage device class affinity
> hint. Such a hint can be supplied directly as a parameter, or, if we are
> considering extended attribute support, then instead as one of a set of
> xattrs. The hint would be stored in the namespace and also used by the client
> to indicate to the NameNode/block placement policy/DataNode constraints on
> block placement. Furthermore, if xattrs or device storage class affinity
> hints are associated with directories, then the NameNode should provide the
> storage device affinity hint to the client in the create API response, so the
> client can provide the appropriate hint to DataNodes when writing new blocks.
>
> - The list of candidate DataNodes for new blocks supplied by the NameNode to
> clients should be weighted/sorted by availability of the desired storage
> device class.
>
> - Block replication should consider storage device affinity hints.
> If a client move()s a file from a location under a path with affinity hint X
> to under a path with affinity hint Y, then all blocks currently residing on
> media X should be eventually replicated onto media Y with the then excess
> replicas on media X deleted.
>
> - Introduce the concept of degraded path: a path can be degraded if a block
> placement policy is forced to abandon a constraint in order to persist the
> block, when there may not be available space on the desired device class, or
> to maintain the minimum necessary replication factor. This concept is
> distinct from the corrupt path, where one or more blocks are missing. Paths
> in degraded state should be periodically reevaluated for re-replication.
>
> - The FSShell should be extended with commands for changing the storage
> device class hint for a directory or file.
>
> - Clients like DistCP which compare metadata should be extended to be aware
> of the storage device class hint. For DistCP specifically, there should be an
> option to ignore the storage device class hints, enabled by default.
>
> Suggested semantics:
>
> - The default storage device class should be the null class, or simply the
> “default class”, for all cases where a hint is not available. This should be
> configurable. hdfs-defaults.xml could provide the default as spinning media.
>
> - A storage device class hint should be provided (and is necessary) only when
> the default is not sufficient.
>
> - For backwards compatibility, any FSImage or edit log entry lacking a
> storage device class hint is interpreted as having affinity for the null
> class.
>
> - All blocks for a given file share the same storage device class. If the
> replication factor for this file is increased the replicas should all be
> placed on the same storage device class.
>
> - If one or more blocks for a given file cannot be placed on the required
> device class, then the file is marked as degraded.
> Files in degraded state should be periodically reevaluated for
> re-replication.
>
> - A directory and path can only have one storage device affinity hint. If the
> file inode specifies a hint, this is used, otherwise we walk up the path
> until a hint is found and use that one, otherwise the default storage class
> is used.
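
The hint lookup suggested in the last item of the quoted description (use the file's own hint, otherwise walk up the path until a hint is found, otherwise fall back to the default class) could be sketched roughly as follows. This is an illustration only, not HDFS code: the function `resolve_storage_class`, the `hints` dictionary standing in for per-inode hints, and the class names `SSD`, `DISK`, `RAM_DISK` and `DEFAULT` are all hypothetical.

```python
import posixpath

# Hypothetical name for the "null" / default storage device class.
DEFAULT_CLASS = "DEFAULT"


def resolve_storage_class(path, hints, default=DEFAULT_CLASS):
    """Return the effective storage device class hint for an absolute path.

    `hints` is a stand-in for the namespace: it maps absolute paths of
    files or directories to an explicitly set hint.
    """
    if not path.startswith("/"):
        raise ValueError("path must be absolute: %r" % path)
    current = posixpath.normpath(path)
    while True:
        # A hint on the inode itself (file or directory) wins.
        if current in hints:
            return hints[current]
        parent = posixpath.dirname(current)
        # Reached the root without finding a hint: use the default class.
        if parent == current:
            return default
        # Otherwise keep walking up toward the root.
        current = parent


# Illustrative namespace: one hint on /hbase, overridden for a subdirectory.
hints = {"/hbase": "SSD", "/hbase/archive": "DISK"}

print(resolve_storage_class("/hbase/data/t1/hfile", hints))
print(resolve_storage_class("/hbase/archive/old", hints))
print(resolve_storage_class("/user/alice/file", hints))
```

With this kind of lookup, a single hint on a parent directory covers all of its descendants, which is the one-hint-instead-of-N behaviour argued for in the comment above. The cost is a walk up the path at resolution time.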