[ https://issues.apache.org/jira/browse/HDFS-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16281503#comment-16281503 ]

Rakesh R edited comment on HDFS-10285 at 12/7/17 11:26 AM:
-----------------------------------------------------------

Thanks a lot [~anu] for your time and comments.

bq. This is the most critical concern that I have. In one of the discussions 
with SPS developers, they pointed out to me that they want to make sure an SPS 
move happens within a reasonable time. Apparently, I was told that this is a 
requirement from HBase. If you have such a need, then the first thing an admin 
will do is to increase this queue size. Slowly, but steadily SPS will eat into 
more and more memory of Namenode
Increasing the Namenode queue will not help speed up the block movements. It is 
the Datanode that does the actual block movement, so Datanode bandwidth is what 
needs to be tuned to speed things up. Hence there is no point in increasing the 
Namenode queue; in fact, that would simply pile up pending tasks on the Namenode 
side.

Let me try to put down the memory usage of the Namenode queue:
Assume there are 1 million directories and users invoked the 
{{dfs#satisfyStoragePolicy(path)}} API on all of them, which is a huge data 
movement and not a regular case. Further assume that, without understanding 
that a larger queue brings no benefit, some careless user set the queue size to 
a high value such as 1,000,000. Each API call adds an {{Xattr}} to mark the 
pending movement, and the NN maintains a list of pending directory InodeIds 
(each a {{Long}} value) to satisfy the policy. Each Xattr takes 15 chars, 
{{"system.hdfs.sps"}}, for the marking (note: the branch code currently uses 
{{system.hdfs.satisfy.storage.policy}}; we will shorten it to 
{{system.hdfs.sps}}). With that, the total space occupied is the (xattr + 
inodeId) size.
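
For reference, a minimal client-side sketch of one such call (assuming the 
{{DistributedFileSystem#satisfyStoragePolicy}} API from this branch; the path 
and policy below are only illustrative):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class SatisfySpsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // fs.defaultFS must point to an HDFS cluster for this cast to be valid.
    DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);
    Path dir = new Path("/user/hbase/archive");  // illustrative path only
    dfs.setStoragePolicy(dir, "COLD");           // desired policy must already be set
    dfs.satisfyStoragePolicy(dir);               // adds the sps xattr and queues the InodeId
  }
}
{code}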

*(1) Xattr entry*
Xattr object: 12 bytes (object header) + 4 bytes (String reference) + 4 bytes (byte[] reference) = 20 bytes, aligned to 24 bytes
String {{"system.hdfs.sps"}}: 40 bytes (String object) + 15 bytes (chars) = 55 bytes, aligned to 56 bytes. A new String object is created every time; ideally these 56 bytes need not be counted per entry, but I am still including them.
byte[] value: 4 bytes
--------------------------------------------------------------------------------
84 bytes, aligned to 88 bytes; 88 bytes * 1,000,000 = 83.92MB
--------------------------------------------------------------------------------
Whether we keep SPS outside or inside the Namenode, this much memory will be 
occupied either way, since the xattr is used to mark each pending item.

*(2) Namenode queue*
LinkedList entry: 24 bytes
Long object: 12 bytes (object header) + 8 bytes (long value) = 20 bytes, aligned to 24 bytes
----------------------------------------------------------------------------------
48 bytes * 1,000,000 = 45.78MB
----------------------------------------------------------------------------------

That is approx. 46MB, which I feel is a small fraction of the Namenode heap, and 
it would occur only in a misconfigured scenario where that many {{InodeIds}} are 
queued up.

The recommended default queue size is 10,000, i.e. 48 bytes * 10,000 = 468.75KB 
≈ 469KB.
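
To make the arithmetic above easy to re-check, here is a small self-contained 
sketch of the same back-of-the-envelope estimate (the per-object sizes are the 
assumptions listed above for a 64-bit JVM with compressed oops, not measured 
values):

{code:java}
public class SpsMemoryEstimate {
  // JVM objects are padded to 8-byte boundaries.
  static long align8(long bytes) { return (bytes + 7) / 8 * 8; }

  public static void main(String[] args) {
    long entries = 1_000_000L;

    // (1) Xattr marker per pending directory
    long xattrObj = align8(12 + 4 + 4);  // header + String ref + byte[] ref -> 24
    long name     = align8(40 + 15);     // String object + "system.hdfs.sps" chars -> 56
    long value    = 4;                   // (near-)empty byte[] value
    long perXattr = align8(xattrObj + name + value);            // 84 -> 88 bytes

    // (2) Namenode queue entry per pending InodeId
    long perQueued = 24 /* LinkedList node */ + align8(12 + 8) /* boxed Long */;  // 48 bytes

    System.out.printf("xattrs:        %.2f MB%n", entries * perXattr / (1024.0 * 1024));
    System.out.printf("queue:         %.2f MB%n", entries * perQueued / (1024.0 * 1024));
    System.out.printf("default queue: %.2f KB%n", 10_000 * perQueued / 1024.0);
    // Prints roughly 83.92 MB, 45.78 MB and 468.75 KB respectively.
  }
}
{code}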

Please feel free to correct me if I missed anything. Thanks!

bq. We have an existing pattern Balancer, Mover, DiskBalancer where we have the 
"scan and move tools" as an external feature to namenode. I am not able to see 
any convincing reason for breaking this pattern.
- {{Scanning}} - For scanning, CPU is the resource consumed most, and IIUC from 
your previous comments you agreed that CPU is not an issue, so scanning itself 
is not a concern. If we run SPS outside the Namenode, it has to make additional 
RPC calls for the SPS work, and on a failover of the SPS HA service the new 
instance would have to blindly scan the entire namespace to rediscover the 
pending xattrs. To handle such switchovers we would need some awkward 
workaround, such as persisting the xattr state in a file and having the new 
active SPS service read it back and continue from there. Because of this, I 
feel the scanning logic should stay in the NN.
FYI, the NN already has the EDEK re-encryption feature, which also scans the 
namespace, and SPS reuses the same code.
Also, I'm re-iterating the point that SPS does not scan files on its own; the 
user has to call the API to satisfy a particular file.

- {{Moving blocks}} - This essentially assigns the work to the Datanodes. The 
Namenode already has several pieces of logic that drive block movement - 
ReplicationMonitor, EC reconstruction, decommissioning, etc. We have also added 
a throttling mechanism for the SPS block movements so that they do not affect 
the existing data movements.

- AFAIK, the DiskBalancer runs entirely on the Datanode and is really a 
Datanode utility, so I don't think it should be compared with SPS. Coming to 
the Balancer: it doesn't need any input file paths and balances the HDFS 
cluster based on utilization. The Balancer can run independently precisely 
because it takes no input file path argument, and the user is typically not 
waiting for the balancing to finish, whereas SPS is exposed to the user via the 
HSM feature. HSM is completely bound to the Namenode, which today only allows 
users to set the storage policy, changing the state at the NN while the NN 
takes no action to actually satisfy the policy. For the HSM feature, having to 
start another service may be a real overhead, and HSM adoption may suffer as a 
result. In my personal opinion, the mere fact that the Balancer/DiskBalancer 
run outside is not a good reason to keep SPS outside.



> Storage Policy Satisfier in Namenode
> ------------------------------------
>
>                 Key: HDFS-10285
>                 URL: https://issues.apache.org/jira/browse/HDFS-10285
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: datanode, namenode
>    Affects Versions: HDFS-10285
>            Reporter: Uma Maheswara Rao G
>            Assignee: Uma Maheswara Rao G
>         Attachments: HDFS-10285-consolidated-merge-patch-00.patch, 
> HDFS-10285-consolidated-merge-patch-01.patch, 
> HDFS-10285-consolidated-merge-patch-02.patch, 
> HDFS-10285-consolidated-merge-patch-03.patch, 
> HDFS-SPS-TestReport-20170708.pdf, 
> Storage-Policy-Satisfier-in-HDFS-June-20-2017.pdf, 
> Storage-Policy-Satisfier-in-HDFS-May10.pdf, 
> Storage-Policy-Satisfier-in-HDFS-Oct-26-2017.pdf
>
>
> Heterogeneous storage in HDFS introduced the concept of storage policies. These 
> policies can be set on a directory/file to specify the user's preference for 
> where the physical blocks should be stored. When the user sets the storage 
> policy before writing data, the blocks can take advantage of the policy 
> preference and the physical blocks are stored accordingly. 
> If the user sets the storage policy after the file has been written and closed, 
> the blocks will already have been written with the default storage policy 
> (namely DISK). The user then has to run the 'Mover' tool explicitly, specifying 
> all such file names as a list. In some distributed-system scenarios (e.g. HBase) 
> it would be difficult to collect all the files and run the tool, as different 
> nodes can write files independently and the files can have different paths.
> Another scenario: when the user renames a file from a directory with one 
> effective storage policy (inherited from the parent directory) into a directory 
> with a different storage policy, the inherited policy is not copied from the 
> source, so the file takes its policy from the destination file/dir's parent. 
> This rename operation is just a metadata change in the Namenode; the physical 
> blocks still remain placed according to the source storage policy.
> So, tracking all such business-logic-driven file names across distributed 
> nodes (e.g. region servers) and running the Mover tool could be difficult for 
> admins. The proposal here is to provide an API in the Namenode itself to 
> trigger storage policy satisfaction. A daemon thread inside the Namenode would 
> track such calls and send movement commands to the DNs. 
> Will post the detailed design thoughts document soon. 


