[ 
https://issues.apache.org/jira/browse/HDFS-1312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15088688#comment-15088688
 ] 

Anu Engineer commented on HDFS-1312:
------------------------------------

Hi [~andrew.wang], thanks for the quick response. I was thinking about what you 
said, and I think the disconnect we are having is because you are assuming 
that HDFS-1804 is always available on HDFS clusters, but for many customers 
that is not true.

bq. Have these customers tried out HDFS-1804? We've had users complain about 
imbalance before too, and after enabling HDFS-1804, no further issues.

Generally, administrators are wary of enabling a feature like HDFS-1804 in a 
production cluster. For new clusters it is easier, but for existing 
production clusters assuming the existence of HDFS-1804 is not realistic.


bq. If you're interested, feel free to take HDFS-8538 from me. I really think 
it'll fix the majority of imbalance issues outside of hotswap.

Thank you for the offer. I can certainly work on HDFS-8538 after HDFS-1312. I 
think of HDFS-1804, HDFS-8538 and HDFS-1312 as parts of the solution to the same 
problem, just attacking it from different angles; without all three, some 
HDFS users will always be left out.

bq. When I mentioned removing the discover phase, I meant the NN communication. 
Here, the DN just probes its own volume information. Does it need to talk to 
the NN for anything else?


# I am not able to see any particular advantage in doing the planning inside the 
datanode. With that approach we do lose one of the critical features of the 
tool: the ability to report what we did to the machine. Capturing the 
"before state" becomes much more complex. I agree that users can manually capture 
this info before and after, but that is an extra administrative burden. With 
the current approach, we record the information of a datanode before we start, and 
we make it easy to compare the state once we are done.
# Also, we wanted to merge the mover into this engine later. With the current 
approach, what we build inside the datanode is a simple block mover, or to be 
precise an RPC interface to the existing mover's block interface. You can feed 
it any move commands you like, which provides better composability (see the 
sketch after this list). With what you are suggesting, we would lose that flexibility.
# Less complexity inside the datanode. The planner code never needs to run inside 
the datanode; it is a piece of code that plans a set of moves, so why would we want 
to run it inside the datanode?
# Since none of our tools currently report the disk-level data distribution, it is 
not possible to find which nodes are imbalanced without talking to the Namenode. I 
know that you are arguing that imbalance will never happen if all customers use 
HDFS-1804. There are two issues with that: first, there are lots of customers 
without HDFS-1804, and second, HDFS-1804 is just an option the user can choose. 
Since it is configurable, it is arguable that we will always have customers without 
HDFS-1804. The current architecture addresses the needs of both groups of users; 
in that sense it is a better, more encompassing architecture.
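
To make the composability point concrete, here is a minimal sketch of what such a 
datanode-side move interface could look like. This is purely illustrative; the 
interface name, class names and signatures below are assumptions for discussion, 
not the actual HDFS-1312 interface.

{code:java}
// Hypothetical sketch only: illustrative names, not the HDFS-1312 API.
// The idea: the datanode exposes a dumb "move these blocks between my own
// volumes" RPC, and any planner (disk balancer CLI, mover, admin script)
// can drive it by submitting a list of move steps.
import java.io.IOException;
import java.util.List;

public interface IntraDatanodeMoveProtocol {

  /** One move step produced by an external planner. */
  final class MoveStep {
    private final long blockId;
    private final String sourceVolume;  // e.g. /data/1
    private final String targetVolume;  // e.g. /data/4

    public MoveStep(long blockId, String sourceVolume, String targetVolume) {
      this.blockId = blockId;
      this.sourceVolume = sourceVolume;
      this.targetVolume = targetVolume;
    }

    public long getBlockId()        { return blockId; }
    public String getSourceVolume() { return sourceVolume; }
    public String getTargetVolume() { return targetVolume; }
  }

  /**
   * Submit a plan (an ordered list of moves). The datanode executes the
   * moves with its existing copy-then-delete block machinery and throttles
   * disk bandwidth itself.
   */
  void submitMoves(List<MoveStep> steps) throws IOException;

  /** Progress of the currently running plan, in bytes moved so far. */
  long getBytesMoved() throws IOException;
}
{code}

Because the planner stays outside the datanode, the same RPC could later be driven 
by the mover or by an admin script without changing anything on the datanode side.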

bq. Cluster-wide disk information is already handled by monitoring tools, no? 
The admin gets the Ganglia alert saying some node is imbalanced, admin triggers 
intranode balancer, admin keeps looking at Ganglia to see if it's fixed. I 
don't think adding our own monitoring of the same information helps, when 
Ganglia etc. are already available, in-use, and understood by admins.

Please correct me if I am missing something here. Getting an alert due to low 
space on a disk from a datanode is very reactive. With the current disk balancer 
design, we are assuming that the disk balancer tool can be run to address and fix 
any such issue in the cluster. You could argue that admins can write some 
scripts to monitor this issue using the Ganglia command line, but it is a common 
enough problem that I think it should be solved at the HDFS level.

Here are two use cases that the disk balancer addresses. The first is discovering 
nodes with potential issues, and the second is automatically fixing those issues. 
This is very similar to the current balancer.

1. Scenario 1: The admin can run {noformat} hdfs diskbalancer -top 100 {noformat}, 
and voila! we print out the top 100 nodes that have a problem. Let us say that the 
admin now wants to look more closely at a node and find the distribution on 
individual disks; he can do that via the disk balancer, ssh or Ganglia.

2. Scenario 2: The admin does not want to be bothered with this balancing act at 
all; in reality he is thinking, why doesn't HDFS just take care of this? (I know 
HDFS-1804 is addressing that, but again we are talking about a cluster which does 
not have it enabled.) In that case we let the admin run {noformat} 
hdfs diskbalancer -top 10 -balance {noformat}, which allows the admin to run the 
disk balancer just like the current balancer, without having to worry about what is 
happening or measuring each node. With Ganglia, a bunch of nodes will fire 
alerts, and the admin needs to copy the address of each datanode and give it to the 
disk balancer. I think the current flow of the disk balancer makes it easier to use.
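
For illustration only, the output of the -top command could look something like the 
following; the hostnames, columns and metric are hypothetical and not part of the 
design doc.

{noformat}
# Hypothetical output, for illustration only -- not the final format.
$ hdfs diskbalancer -top 3
DataNode                    Volumes   Max density gap
dn101.example.com:50010        6           41%
dn087.example.com:50010        6           37%
dn045.example.com:50010        4           29%
{noformat}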

bq. I don't think this conflicts with the debuggability goal. The DN can dump 
the Info object (and even the Plan object) if requested, to the log or 
somewhere in a data dir.

Well, it is debuggable, but assuming that I am the one who will be called on to 
debug this, I prefer to debug by looking at my local directory instead of 
ssh-ing into a datanode. I think of writing to a local directory as a gift I am 
making to my future self :). Plus, as mentioned earlier, for the other use case 
where we want to report to the user what our tool did, fetching this data out of 
the datanode's log directory is hard (maybe another RPC to fetch it?).
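
As a rough illustration of the kind of before/after record I have in mind, the tool 
could snapshot per-volume usage into a local file when a plan is created and again 
when it completes, so the two files can simply be diffed. The class, field names 
and file layout below are assumptions for discussion, not the final report format.

{code:java}
// Hypothetical sketch only: captures a "before" or "after" snapshot of a
// datanode's volumes on the machine running the tool, so the operator can
// compare the two files without ssh-ing into the datanode. Names are made up.
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class VolumeSnapshotWriter {

  /** Per-volume usage as reported by the datanode. */
  public static class VolumeUsage {
    public final String path;     // e.g. /data/1
    public final long capacity;   // bytes
    public final long used;       // bytes

    public VolumeUsage(String path, long capacity, long used) {
      this.path = path;
      this.capacity = capacity;
      this.used = used;
    }
  }

  /**
   * Write a snapshot to a local report directory, e.g.
   * ./diskbalancer-reports/dn101_before.csv or dn101_after.csv.
   */
  public static Path writeSnapshot(String datanode, String phase,
                                   List<VolumeUsage> volumes) throws IOException {
    Path dir = Paths.get("diskbalancer-reports");
    Files.createDirectories(dir);
    StringBuilder sb = new StringBuilder("volume,capacity,used,percentUsed\n");
    for (VolumeUsage v : volumes) {
      double pct = v.capacity == 0 ? 0.0 : 100.0 * v.used / v.capacity;
      sb.append(String.format("%s,%d,%d,%.1f%n", v.path, v.capacity, v.used, pct));
    }
    Path out = dir.resolve(datanode + "_" + phase + ".csv");
    Files.write(out, sb.toString().getBytes(StandardCharsets.UTF_8));
    return out;
  }
}
{code}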


bq. Adding a note that says "we use the existing moveBlockAcrossStorage method" 
is a great answer. 

I will update the design doc with this info. Thanks for your suggestions.

> Re-balance disks within a Datanode
> ----------------------------------
>
>                 Key: HDFS-1312
>                 URL: https://issues.apache.org/jira/browse/HDFS-1312
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: datanode
>            Reporter: Travis Crawford
>            Assignee: Anu Engineer
>         Attachments: Architecture_and_testplan.pdf, disk-balancer-proposal.pdf
>
>
> Filing this issue in response to ``full disk woes`` on hdfs-user.
> Datanodes fill their storage directories unevenly, leading to situations 
> where certain disks are full while others are significantly less used. Users 
> at many different sites have experienced this issue, and HDFS administrators 
> are taking steps like:
> - Manually rebalancing blocks in storage directories
> - Decommissioning nodes & later re-adding them
> There's a tradeoff between making use of all available spindles, and filling 
> disks at roughly the same rate. Possible solutions include:
> - Weighting less-used disks heavier when placing new blocks on the datanode. 
> In write-heavy environments this will still make use of all spindles, 
> equalizing disk use over time.
> - Rebalancing blocks locally. This would help equalize disk use as disks are 
> added/replaced in older cluster nodes.
> Datanodes should actively manage their local disk so operator intervention is 
> not needed.



