Ok, thanks. That means I need to decide which data is hot and when it has become cold, then change its storage policy and run the Mover tool to relocate it.
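In command form, that workflow might look like the sketch below (the /data/archive path is a hypothetical placeholder; the commands are the standard `hdfs storagepolicies` and `hdfs mover` invocations discussed in this thread, to be run against a live cluster):

```shell
# Sketch only; /data/archive is a hypothetical example path.
DIR=/data/archive

# 1. Mark the directory's data as cold (HOT -> COLD).
hdfs storagepolicies -setStoragePolicy -path "$DIR" -policy COLD

# 2. Run the Mover so existing replicas are migrated between storage
#    types to satisfy the new policy.
hdfs mover -p "$DIR"

# 3. Confirm which policy is now in effect on the directory.
hdfs storagepolicies -getStoragePolicy -path "$DIR"
```

Note that setting the policy only affects metadata; nothing moves until the Mover (step 2) is actually run.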
2016-07-20 14:29 GMT+08:00 Rakesh Radhakrishnan <rake...@apache.org>:

> Based on the storage policy, data is moved from hot storage to cold
> storage. A storage policy defines the number of replicas to be located on
> each storage type. It is possible to change the storage policy on a
> directory (for example, HOT to COLD) and then invoke the Mover tool on
> that directory to make the policy effective. One can set/change the
> storage policy via the HDFS command "hdfs storagepolicies
> -setStoragePolicy -path <path> -policy <policy>". After setting the new
> policy, you need to run the tool; it then identifies the replicas to be
> moved based on the storage policy information and schedules their
> movement between source and destination datanodes to satisfy the policy.
> Internally, the tool compares the storage type of each block against the
> storage policy requirement in order to decide what to move.
>
> You can probably refer to
> https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/ArchivalStorage.html
> to learn more about storage types, storage policies and the HDFS
> commands. Hope this helps.
>
> Rakesh
>
> On Wed, Jul 20, 2016 at 10:30 AM, kevin <kiss.kevin...@gmail.com> wrote:
>
>> Thanks again. By "automatically" I mean: does the hdfs mover know when
>> hot data has become cold, so that I don't need to tell it exactly which
>> files/dirs need to be moved? Of course I should tell it which files/dirs
>> to monitor.
>>
>> 2016-07-20 12:35 GMT+08:00 Rakesh Radhakrishnan <rake...@apache.org>:
>>
>>> >>>Does hdfs mover (a new data migration tool) know when to move data
>>> from hot to cold automatically?
>>>
>>> While running, the tool reads its arguments to get the list of HDFS
>>> files/dirs to migrate.
>>> It then periodically scans these files in HDFS to check whether the
>>> block placement satisfies the storage policy; if not, it moves replicas
>>> to a different storage type in order to fulfill the storage policy
>>> requirement. This cycle continues until it hits an error, there are no
>>> more blocks to move, etc. Could you please tell me what you mean by
>>> "automatically"? FYI, HDFS-10285 proposes introducing a daemon thread
>>> in the Namenode to track storage movements requested by clients through
>>> APIs. This daemon thread, named StoragePolicySatisfier (SPS), serves a
>>> purpose similar to the ReplicationMonitor. If interested, you can read
>>> the proposal at https://goo.gl/NA5EY0; feedback is welcome.
>>>
>>> The sleep time between each cycle is ('dfs.heartbeat.interval' * 2000)
>>> + ('dfs.namenode.replication.interval' * 1000) milliseconds.
>>>
>>> >>>Does it use an algorithm like LRU or LFU?
>>> It simply iterates over the files/dirs in the order they were given to
>>> the tool as arguments. AFAIK, it just maintains the order specified by
>>> the user.
>>>
>>> Regards,
>>> Rakesh
>>>
>>> On Wed, Jul 20, 2016 at 7:05 AM, kevin <kiss.kevin...@gmail.com> wrote:
>>>
>>>> Thanks a lot Rakesh.
>>>>
>>>> I have another question: does hdfs mover (a new data migration tool)
>>>> know when to move data from hot to cold automatically? Does it use an
>>>> algorithm like LRU or LFU?
>>>>
>>>> 2016-07-19 19:55 GMT+08:00 Rakesh Radhakrishnan <rake...@apache.org>:
>>>>
>>>>> >>>>Does that mean I should configure dfs.replication = 1? If more
>>>>> than one, should I not use Lazy_Persist policies?
>>>>>
>>>>> The idea of the Lazy_Persist policy is that, while writing blocks,
>>>>> one replica is placed in memory first and then lazily persisted to
>>>>> DISK. It doesn't mean that you are not allowed to configure
>>>>> dfs.replication > 1.
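As a quick sanity check of the cycle-interval formula quoted in the reply above, a minimal sketch (the 3-second values are the stock defaults of `dfs.heartbeat.interval` and `dfs.namenode.replication.interval`, used here purely for illustration):

```shell
# Defaults assumed for illustration: both intervals are 3 seconds.
HEARTBEAT_INTERVAL_SEC=3       # dfs.heartbeat.interval
REPLICATION_INTERVAL_SEC=3     # dfs.namenode.replication.interval

# Sleep between Mover scan cycles, per the formula in the reply above.
SLEEP_MS=$(( HEARTBEAT_INTERVAL_SEC * 2000 + REPLICATION_INTERVAL_SEC * 1000 ))
echo "mover sleeps ${SLEEP_MS} ms between cycles"  # prints: mover sleeps 9000 ms between cycles
```

So with default settings the tool re-scans roughly every 9 seconds; raising either interval in hdfs-site.xml stretches the cycle accordingly.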
>>>>> If 'dfs.replication' is configured > 1, the first replica is placed
>>>>> on RAM_DISK and the other (n-1) replicas are written to DISK. Those
>>>>> (n-1) replicas incur the overhead of pipeline replication over the
>>>>> network plus the DISK write latency on the hot write path, so you
>>>>> will not get better performance results.
>>>>>
>>>>> IIUC, to get the memory-latency benefits it is recommended to use
>>>>> replication=1. This way, applications can perform single-replica
>>>>> writes to a local DN with low latency. HDFS stores the block data in
>>>>> memory and lazily saves it to disk, avoiding disk write latency on
>>>>> the hot path. By writing to local memory we also avoid checksum
>>>>> computation on the hot path.
>>>>>
>>>>> Regards,
>>>>> Rakesh
>>>>>
>>>>> On Tue, Jul 19, 2016 at 3:25 PM, kevin <kiss.kevin...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I don't quite understand: "Note that the Lazy_Persist policy is
>>>>>> useful only for single replica blocks. For blocks with more than one
>>>>>> replicas, all the replicas will be written to DISK since writing
>>>>>> only one of the replicas to RAM_DISK does not improve the overall
>>>>>> performance."
>>>>>>
>>>>>> Does that mean I should configure dfs.replication = 1? If more than
>>>>>> one, should I not use Lazy_Persist policies?
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
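To try the single-replica Lazy_Persist setup discussed in this part of the thread, a hedged sketch (the paths are hypothetical, and it assumes the datanodes already expose a RAM_DISK tier via dfs.datanode.data.dir):

```shell
# Sketch only; assumes RAM_DISK is configured on the datanodes, e.g.
#   dfs.datanode.data.dir = [RAM_DISK]/mnt/dn-tmpfs,[DISK]/data/dn
DIR=/tmp/lazy-demo     # hypothetical example directory

hdfs dfs -mkdir -p "$DIR"
hdfs storagepolicies -setStoragePolicy -path "$DIR" -policy LAZY_PERSIST

# Single-replica write: the replica lands in memory first and is lazily
# persisted to DISK, avoiding disk latency and checksum work on the hot path.
hdfs dfs -D dfs.replication=1 -put localfile "$DIR"/
```

With dfs.replication left above 1, the same -put would pay the pipeline and DISK write cost for the extra replicas, which is exactly why replication=1 is recommended for this policy.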