[ https://issues.apache.org/jira/browse/HDFS-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16281503#comment-16281503 ]
Rakesh R edited comment on HDFS-10285 at 12/7/17 11:26 AM:
-----------------------------------------------------------
Thanks a lot [~anu] for your time and comments.

bq. This is the most critical concern that I have. In one of the discussions with SPS developers, they pointed out to me that they want to make sure an SPS move happens within a reasonable time. Apparently, I was told that this is a requirement from HBase. If you have such a need, then the first thing an admin will do is to increase this queue size. Slowly, but steadily SPS will eat into more and more memory of Namenode

Increasing the Namenode queue will not help speed up block movements. It is the Datanode that performs the actual block movement, and it is the Datanode bandwidth that must be tuned to speed it up. Hence there is no point in increasing the Namenode queue; in fact, that would simply add to the pending tasks on the Namenode side.

Let me try putting down the memory usage of the Namenode queue. Assume there are 1 million directories and users invoked the {{dfs#satisfyStoragePolicy(path)}} API on all of them; that would be a huge data movement and may not be a regular case. Again, assume that, without understanding the implications of increasing the queue size, some careless user sets the queue size to the much higher value of 1,000,000. Each API call adds an {{Xattr}} to mark the pending movement, and the NN maintains a list of pending directory InodeIds to satisfy the policy, each of which is a {{Long}} value. Each Xattr takes 15 chars, {{"system.hdfs.sps"}}, for the marking (note: the branch code currently uses {{system.hdfs.satisfy.storage.policy}}; we will shorten it to {{system.hdfs.sps}}). With that, the total space occupied is (xattr + inodeId) size.

*(1) Xattr entry*
Xattr: 12 bytes (object overhead) + 4 bytes (String reference) + 4 bytes (byte array reference) = 24 bytes
String "system.hdfs.sps": 40 bytes (String object) + 15 bytes (chars) = 56 bytes. Since new String objects need not actually be created every time, ideally the 56 bytes would not be counted per entry; still, I'm including it to be conservative.
byte[]: 4 bytes
--------------------------------------------------------------------------------
84 bytes, aligned to 88 bytes; 88 bytes * 1,000,000 = 83.92MB
--------------------------------------------------------------------------------
Whether we keep SPS inside or outside the Namenode, this much memory is occupied either way, since the xattr is what marks the pending item.

*(2) Namenode Q*
LinkedList entry = 24 bytes
Long object = 12 bytes (object overhead) + 8 bytes = aligned 24 bytes
----------------------------------------------------------------------------------
48 bytes * 1,000,000 = 45.78MB
----------------------------------------------------------------------------------
Approx. 45MB, which I feel is a small amount, and it would only occur in the misconfigured scenario where that many {{InodeIds}} are queued up. The recommended default queue size is 10,000: 48 bytes * 10,000 = 468.75KB ≈ 469KB. Please feel free to correct me if I missed anything. Thanks!

bq. We have an existing pattern Balancer, Mover, DiskBalancer where we have the "scan and move tools" as an external feature to namenode. I am not able to see any convincing reason for breaking this pattern.

- {{Scanning}} - For scanning, CPU is the most consumed resource. IIUC, from your previous comments, I'm glad you agreed that CPU is not an issue; hence scanning is not a concern. If we run SPS outside the Namenode, it has to make additional RPC calls for the SPS work, and on failover the new active SPS-HA service has to blindly scan the entire namespace to find the xattrs. To handle such switching scenarios we would have to come up with some awkward workaround, like writing the xattr scan progress somewhere in a file so that the new active SPS service can read it from there and continue. With this, I feel the scanning logic should stay at the NN. FYI, the NN already has the EDEK feature, which also does scanning, and SPS reuses the same code.
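For concreteness, the arithmetic above can be sketched in Java. The per-object sizes (12-byte object header, 4-byte references, 8-byte alignment) are the estimates quoted in this comment for a typical 64-bit JVM with compressed oops; they are illustrative assumptions, not measured values.

```java
// Back-of-envelope sketch of the Namenode memory figures quoted above.
public class SpsMemoryEstimate {
    static long align8(long n) { return (n + 7) / 8 * 8; }
    static double toMB(long bytes) { return bytes / (1024.0 * 1024.0); }
    static double toKB(long bytes) { return bytes / 1024.0; }

    // (1) Xattr entry used to mark a pending SPS item on an inode.
    static long xattrEntryBytes() {
        long xattrObject = 24;  // 12B header + 4B String ref + 4B byte[] ref, aligned
        long nameString  = 56;  // ~40B String object + 15 chars of "system.hdfs.sps"
        long valueArray  = 4;   // byte[] reference
        return align8(xattrObject + nameString + valueArray);  // 84 -> 88
    }

    // (2) One pending-InodeId entry in the Namenode queue.
    static long queueEntryBytes() {
        long listNode = 24;     // LinkedList entry
        long boxedId  = 24;     // Long: 12B header + 8B value, aligned
        return listNode + boxedId;  // 48
    }

    public static void main(String[] args) {
        long n = 1_000_000L;    // worst-case misconfigured queue size
        System.out.printf("xattr marking : %.2f MB%n", toMB(n * xattrEntryBytes()));
        System.out.printf("pending queue : %.2f MB%n", toMB(n * queueEntryBytes()));
        System.out.printf("default queue : %.2f KB%n", toKB(10_000L * queueEntryBytes()));
    }
}
```

Running this reproduces the ~83.92MB xattr figure, the ~45.78MB worst-case queue figure, and the ~468.75KB figure for the recommended default queue size of 10,000.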
Also, I'm reiterating the point that SPS does not scan files on its own; the user has to call the API to satisfy a particular file.

- {{Moving blocks}} - This responsibility is assigned to the Datanode. Presently, the Namenode already has several pieces of logic that drive block movement - ReplicationMonitor, EC reconstruction, decommissioning, etc. We have also added a throttling mechanism for the SPS block movements so as not to affect the existing data movements.

- AFAIK, DiskBalancer runs completely at the Datanode and is essentially a Datanode utility, so I don't think it compares with SPS. Coming to the Balancer: it doesn't need any input file paths and balances the HDFS cluster based on utilization. The Balancer can run independently since it takes no input file path argument, and the user is typically not waiting for the balancing work to finish, whereas SPS is exposed to the user via the HSM feature. HSM is completely bound to the Namenode, which today only lets users set the storage policy; that merely changes state at the NN, and the NN takes no action to satisfy the policy. Requiring another service to be started for the HSM feature may be a real overhead and could reduce HSM adoption. My personal opinion: the fact that Balancer/DiskBalancer run outside is not, by itself, a good reason for keeping SPS outside.
> Storage Policy Satisfier in Namenode
> ------------------------------------
>
>                 Key: HDFS-10285
>                 URL: https://issues.apache.org/jira/browse/HDFS-10285
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: datanode, namenode
>    Affects Versions: HDFS-10285
>            Reporter: Uma Maheswara Rao G
>            Assignee: Uma Maheswara Rao G
>         Attachments: HDFS-10285-consolidated-merge-patch-00.patch, HDFS-10285-consolidated-merge-patch-01.patch, HDFS-10285-consolidated-merge-patch-02.patch, HDFS-10285-consolidated-merge-patch-03.patch, HDFS-SPS-TestReport-20170708.pdf, Storage-Policy-Satisfier-in-HDFS-June-20-2017.pdf, Storage-Policy-Satisfier-in-HDFS-May10.pdf, Storage-Policy-Satisfier-in-HDFS-Oct-26-2017.pdf
>
> Heterogeneous storage in HDFS introduced the concept of storage policies. These policies can be set on a directory/file to specify the user's preference for where the physical blocks should be stored. When the user sets the storage policy before writing data, the blocks can take advantage of the storage policy preference and be stored accordingly.
> If the user sets the storage policy after the file has been written and completed, the blocks will already have been written with the default storage policy (namely DISK). The user then has to run the ‘Mover tool’ explicitly, specifying all such file names as a list. In some distributed-system scenarios (ex: HBase) it would be difficult to collect all the files and run the tool, as different nodes can write files separately and files can have different paths.
> Another scenario: when the user renames a file from a directory with one effective storage policy (inherited from the parent directory) into a directory with a different storage policy, the inherited policy is not copied from the source; the file takes its effective policy from the destination file/dir's parent. This rename operation is just a metadata change in the Namenode, and the physical blocks still remain under the source storage policy.
> So, tracking all such business-logic-based file names from distributed nodes (ex: region servers) and running the Mover tool could be difficult for admins. The proposal here is to provide an API from the Namenode itself to trigger storage policy satisfaction. A daemon thread inside the Namenode should track such calls and dispatch them to the DNs as movement commands.
> Will post the detailed design thoughts document soon.
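The proposed per-path trigger can be sketched from the client side as follows. This is a minimal, hypothetical sketch: it assumes the {{HdfsAdmin#satisfyStoragePolicy(Path)}} client API that this feature introduces, and a reachable Namenode at the placeholder URI {{hdfs://namenode:8020}}; it needs a running HDFS cluster with this feature, so it is illustrative rather than runnable as-is.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.client.HdfsAdmin;

public class SatisfyPolicyExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder Namenode address; replace with the real cluster URI.
        HdfsAdmin admin = new HdfsAdmin(URI.create("hdfs://namenode:8020"), conf);

        // After changing the storage policy on an already-written path
        // (e.g. via setStoragePolicy), ask the Namenode to schedule the
        // block movements needed to satisfy the new policy. The path here
        // is a made-up example.
        admin.satisfyStoragePolicy(new Path("/hbase/data/table1"));
    }
}
```

This matches the HBase-style use case in the description: each region server can call the API for its own files, and the Namenode-side daemon tracks the pending items and hands movement commands to the Datanodes.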