Ok, thanks. That means I need to decide which data is hot and when it has become cold, then change its storage policy and run the Mover tool to relocate it.
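In command form, that workflow might look like the sketch below (the /data/archive path is a hypothetical placeholder; the commands are the standard `hdfs storagepolicies` and `hdfs mover` invocations discussed in this thread, to be run against a live cluster):

```shell
# Sketch only; /data/archive is a hypothetical example path.
DIR=/data/archive

# 1. Mark the directory's data as cold (HOT -> COLD).
hdfs storagepolicies -setStoragePolicy -path "$DIR" -policy COLD

# 2. Run the Mover so existing replicas are migrated between storage
#    types to satisfy the new policy.
hdfs mover -p "$DIR"

# 3. Confirm which policy is now in effect on the directory.
hdfs storagepolicies -getStoragePolicy -path "$DIR"
```

Note that setting the policy only affects metadata; nothing moves until the Mover (step 2) is actually run.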
2016-07-20 14:29 GMT+08:00 Rakesh Radhakrishnan <rake...@apache.org>:

> Based on the storage policy, data is moved from hot storage to cold
> storage. A storage policy defines the number of replicas to be located on
> each storage type. It is possible to change the storage policy on a
> directory (for example, HOT to COLD) and then invoke the Mover tool on
> that directory to make the policy effective. One can set/change the
> storage policy via the HDFS command "hdfs storagepolicies
> -setStoragePolicy -path <path> -policy <policy>". After setting the new
> policy, you need to run the tool; it then identifies the replicas to be
> moved based on the storage policy information and schedules their
> movement between source and destination datanodes to satisfy the policy.
> Internally, the tool compares the storage type of each block against the
> storage policy requirement in order to decide what to move.
>
> You can probably refer to
> https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/ArchivalStorage.html
> to learn more about storage types, storage policies and the HDFS
> commands. Hope this helps.
>
> Rakesh
>
> On Wed, Jul 20, 2016 at 10:30 AM, kevin <kiss.kevin...@gmail.com> wrote:
>
>> Thanks again. By "automatically" I mean: does the hdfs mover know when
>> hot data has become cold, so that I don't need to tell it exactly which
>> files/dirs need to be moved? Of course I should tell it which files/dirs
>> to monitor.
>>
>> 2016-07-20 12:35 GMT+08:00 Rakesh Radhakrishnan <rake...@apache.org>:
>>
>>> >>>Does hdfs mover (a new data migration tool) know when to move data
>>> from hot to cold automatically?
>>>
>>> While running, the tool reads its arguments to get the list of HDFS
>>> files/dirs to migrate.
>>> It then periodically scans these files in HDFS to check whether the
>>> block placement satisfies the storage policy; if not, it moves replicas
>>> to a different storage type in order to fulfill the storage policy
>>> requirement. This cycle continues until it hits an error, there are no
>>> more blocks to move, etc. Could you please tell me what you mean by
>>> "automatically"? FYI, HDFS-10285 proposes introducing a daemon thread
>>> in the Namenode to track storage movements requested by clients through
>>> APIs. This daemon thread, named StoragePolicySatisfier (SPS), serves a
>>> purpose similar to the ReplicationMonitor. If interested, you can read
>>> the proposal at https://goo.gl/NA5EY0; feedback is welcome.
>>>
>>> The sleep time between each cycle is ('dfs.heartbeat.interval' * 2000)
>>> + ('dfs.namenode.replication.interval' * 1000) milliseconds.
>>>
>>> >>>Does it use an algorithm like LRU or LFU?
>>> It simply iterates over the files/dirs in the order they were given to
>>> the tool as arguments. AFAIK, it just maintains the order specified by
>>> the user.
>>>
>>> Regards,
>>> Rakesh
>>>
>>> On Wed, Jul 20, 2016 at 7:05 AM, kevin <kiss.kevin...@gmail.com> wrote:
>>>
>>>> Thanks a lot Rakesh.
>>>>
>>>> I have another question: does hdfs mover (a new data migration tool)
>>>> know when to move data from hot to cold automatically? Does it use an
>>>> algorithm like LRU or LFU?
>>>>
>>>> 2016-07-19 19:55 GMT+08:00 Rakesh Radhakrishnan <rake...@apache.org>:
>>>>
>>>>> >>>>Does that mean I should configure dfs.replication = 1? If more
>>>>> than one, should I not use Lazy_Persist policies?
>>>>>
>>>>> The idea of the Lazy_Persist policy is that, while writing blocks,
>>>>> one replica is placed in memory first and then lazily persisted to
>>>>> DISK. It doesn't mean that you are not allowed to configure
>>>>> dfs.replication > 1.
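As a quick sanity check of the cycle-interval formula quoted in the reply above, a minimal sketch (the 3-second values are the stock defaults of `dfs.heartbeat.interval` and `dfs.namenode.replication.interval`, used here purely for illustration):

```shell
# Defaults assumed for illustration: both intervals are 3 seconds.
HEARTBEAT_INTERVAL_SEC=3       # dfs.heartbeat.interval
REPLICATION_INTERVAL_SEC=3     # dfs.namenode.replication.interval

# Sleep between Mover scan cycles, per the formula in the reply above.
SLEEP_MS=$(( HEARTBEAT_INTERVAL_SEC * 2000 + REPLICATION_INTERVAL_SEC * 1000 ))
echo "mover sleeps ${SLEEP_MS} ms between cycles"  # prints: mover sleeps 9000 ms between cycles
```

So with default settings the tool re-scans roughly every 9 seconds; raising either interval in hdfs-site.xml stretches the cycle accordingly.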
>>>>> If 'dfs.replication' is configured > 1, the first replica is placed
>>>>> on RAM_DISK and the other (n-1) replicas are written to DISK. Those
>>>>> (n-1) replicas incur the overhead of pipeline replication over the
>>>>> network plus the DISK write latency on the hot write path, so you
>>>>> will not get better performance results.
>>>>>
>>>>> IIUC, to get the memory-latency benefits it is recommended to use
>>>>> replication=1. This way, applications can perform single-replica
>>>>> writes to a local DN with low latency. HDFS stores the block data in
>>>>> memory and lazily saves it to disk, avoiding disk write latency on
>>>>> the hot path. By writing to local memory we also avoid checksum
>>>>> computation on the hot path.
>>>>>
>>>>> Regards,
>>>>> Rakesh
>>>>>
>>>>> On Tue, Jul 19, 2016 at 3:25 PM, kevin <kiss.kevin...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I don't quite understand: "Note that the Lazy_Persist policy is
>>>>>> useful only for single replica blocks. For blocks with more than one
>>>>>> replicas, all the replicas will be written to DISK since writing
>>>>>> only one of the replicas to RAM_DISK does not improve the overall
>>>>>> performance."
>>>>>>
>>>>>> Does that mean I should configure dfs.replication = 1? If more than
>>>>>> one, should I not use Lazy_Persist policies?
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
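To try the single-replica Lazy_Persist setup discussed in this part of the thread, a hedged sketch (the paths are hypothetical, and it assumes the datanodes already expose a RAM_DISK tier via dfs.datanode.data.dir):

```shell
# Sketch only; assumes RAM_DISK is configured on the datanodes, e.g.
#   dfs.datanode.data.dir = [RAM_DISK]/mnt/dn-tmpfs,[DISK]/data/dn
DIR=/tmp/lazy-demo     # hypothetical example directory

hdfs dfs -mkdir -p "$DIR"
hdfs storagepolicies -setStoragePolicy -path "$DIR" -policy LAZY_PERSIST

# Single-replica write: the replica lands in memory first and is lazily
# persisted to DISK, avoiding disk latency and checksum work on the hot path.
hdfs dfs -D dfs.replication=1 -put localfile "$DIR"/
```

With dfs.replication left above 1, the same -put would pay the pipeline and DISK write cost for the extra replicas, which is exactly why replication=1 is recommended for this policy.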