[ 
https://issues.apache.org/jira/browse/HDFS-6584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14131142#comment-14131142
 ] 

Jing Zhao commented on HDFS-6584:
---------------------------------

Thanks a lot for the great comments, [~andrew.wang]! Let me try to answer some 
of the questions here, and I believe [~szetszwo] will provide more details 
later.

bq. When does the Mover actually migrate data? When a block is finalized? When 
the file is closed? Some amount of time after? When the admin decides to run 
the Mover?

Currently the data is only migrated when admin runs the Mover.

bq. What is the load impact of scanning the namespace for files that need to be 
migrated? A naive ls -R / type operation could be bad.

Yeah, scanning the namespace is definitely a big burden here. HDFS-6875 adds 
support for specifying a list of paths for migration, and in the future we may 
want to support running multiple Movers concurrently for disjoint directories, 
or even utilizing MR.
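For reference, with the path option from HDFS-6875 a scoped Mover run might look like the sketch below (the directory names are placeholders, not paths from this issue):

```shell
# Migrate only the listed directories instead of scanning the whole
# namespace; -p is the per-path option added by HDFS-6875.
# /data/cold and /archive/logs are example paths.
hdfs mover -p /data/cold /archive/logs
```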

bq. Why are policies specified in XML files rather than in the fsimage / edit 
log? It seems very important to keep the policies consistent, and this is thus 
one more file that needs to be synchronized and backed up. Stashing it in the 
editlog would do this for you.

Agree. Actually Nicholas and I had a discussion about this before, and I have 
an unfinished preliminary patch, but I still need to think more about some 
details. We plan to finish this work after the merge.

bq. Can storage policies be set at a directory level? Testing to confirm this 
either way?

Yes, this has been done in HDFS-6847.
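As a sketch of how a directory-level policy is applied (this uses the `hdfs storagepolicies` admin command shipped in current Hadoop releases; the path and policy name are examples):

```shell
# Set a storage policy on a directory; files created under it inherit
# the policy unless they set their own. Path/policy are placeholders.
hdfs storagepolicies -setStoragePolicy -path /archive -policy COLD

# Verify the effective policy on the directory.
hdfs storagepolicies -getStoragePolicy -path /archive
```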

bq. How does this interact with snapshots? With replication factor, I believe 
we use the maximum replication factor across all snapshots. Here, would it be 
the union of all storage types across all snapshots? Not sure how the Mover 
accounts for this, or if a full-union is the right policy.

This has been addressed in HDFS-6969. Please see the discussion there.

bq. Do we have per-storage-type quotas? Are there APIs exposed to show, for 
instance, storage type usage by a snapshot, by a directory, etc?

This is a very good suggestion, especially considering we also have storage 
type SSD and in the future we may also have storage type MEMORY.
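To illustrate the suggestion, a per-storage-type quota CLI could take a shape like the following (hypothetical at the time of this discussion; this mirrors the `-storageType` flag that later Hadoop releases added to `dfsadmin` quota commands):

```shell
# Hypothetical per-storage-type quota: cap SSD usage under /hot at 10 GB.
hdfs dfsadmin -setSpaceQuota 10g -storageType SSD /hot

# And the matching clear operation.
hdfs dfsadmin -clrSpaceQuota -storageType SSD /hot
```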

bq. How does this interact with open files?

Actually we should ignore incomplete blocks, which can be inferred from 
LocatedBlocks. I will file a new jira for this. Thanks! 
In another scenario, if a block is appended to during the migration, the new 
replica will be marked as corrupt when it is reported to the NN because of the 
generation stamp inconsistency.

> Support Archival Storage
> ------------------------
>
>                 Key: HDFS-6584
>                 URL: https://issues.apache.org/jira/browse/HDFS-6584
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: balancer, namenode
>            Reporter: Tsz Wo Nicholas Sze
>            Assignee: Tsz Wo Nicholas Sze
>         Attachments: HDFS-6584.000.patch, 
> HDFSArchivalStorageDesign20140623.pdf, HDFSArchivalStorageDesign20140715.pdf, 
> archival-storage-testplan.pdf, h6584_20140907.patch, h6584_20140908.patch, 
> h6584_20140908b.patch, h6584_20140911.patch, h6584_20140911b.patch
>
>
> In most Hadoop clusters, as more and more data is stored for longer periods, 
> the demand for storage is outstripping the demand for compute. Hadoop needs a 
> cost-effective and easy-to-manage solution to meet this demand for storage. 
> Current solutions are:
> - Delete old unused data. This comes at the operational cost of identifying 
> unnecessary data and deleting it manually.
> - Add more nodes to the cluster. This adds unnecessary compute capacity to 
> the cluster along with the storage capacity.
> Hadoop needs a solution that decouples growing storage capacity from compute 
> capacity. Nodes with denser, less expensive storage and low compute power are 
> becoming available and can be used as cold storage in the clusters. Based on 
> policy, data can be moved from hot storage to cold storage. Adding more nodes 
> to the cold storage then grows the storage independently of the compute 
> capacity in the cluster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
