[ 
https://issues.apache.org/jira/browse/HDFS-1312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15088431#comment-15088431
 ] 

Anu Engineer commented on HDFS-1312:
------------------------------------

[~andrew.wang] I really appreciate you taking time out to read the proposal and 
provide such detailed feedback. 

bq. I really like separating planning from execution, since it'll make the unit 
tests actual unit tests (no minicluster!). This is a real issue with the 
existing balancer, the tests take forever to run and don't converge reliably.

Thank you, that was the intent.

bq. Section 2: HDFS-1804 has been around for a while and used successfully by 
many of our users, so "lack of real world data or adoption" is not entirely 
correct. We've even considered making it the default, see HDFS-8538 where the 
consensus was that we could do this if we add some additional throttling.

Thanks for correcting me, and for the reference to HDFS-8538. I know folks at Cloudera 
write some excellent engineering blog posts and papers; I would love to read about 
your experiences with HDFS-1804.

bq. IMO the usecase to focus on is the addition of fresh drives, particularly 
in the context of hotswap. I'm unconvinced that intra-node imbalance happens 
naturally when HDFS-1804 is enabled, and enabling HDFS-1804 is essential if a 
cluster is commonly suffering from intra-DN imbalance (e.g. from differently 
sized disks on a node). This means we should only see intra-node imbalance on 
admin action like adding a new drive; a singular, administrator-triggered 
operation.

Our own experience from the field is that many customers routinely run into 
this issue, with and without new drives being added, so we are tackling both 
use cases.

bq. However, I don't understand the need for cluster-wide reporting and 
orchestration. 

Just to make sure we are on the same page: there is no cluster-wide orchestration. 
The cluster-wide reporting allows admins to see which nodes need disk balancing. 
Not all clusters run HDFS-1804, so the ability to discover which machines need 
disk balancing is useful for a large set of customers.
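
To make the "which nodes need balancing" report concrete, here is a minimal Java 
sketch of the kind of sort-and-print that cluster-wide data enables. The 
{{VolumeInfo}} and {{NodeSnapshot}} classes and the skew metric are assumptions 
made for this sketch, not the actual HDFS-1312 data model.

{noformat}
// Illustrative sketch only: given a snapshot of per-volume usage for each
// datanode, compute an intra-node "skew" score and print the nodes that most
// need disk balancing. VolumeInfo/NodeSnapshot and the skew metric are
// assumptions for this sketch, not the actual HDFS-1312 data model.
import java.util.*;

class VolumeInfo {
  final long capacity;  // bytes
  final long dfsUsed;   // bytes
  VolumeInfo(long capacity, long dfsUsed) {
    this.capacity = capacity;
    this.dfsUsed = dfsUsed;
  }
  double density() { return capacity == 0 ? 0.0 : (double) dfsUsed / capacity; }
}

class NodeSnapshot {
  final String host;
  final List<VolumeInfo> volumes;
  NodeSnapshot(String host, List<VolumeInfo> volumes) {
    this.host = host;
    this.volumes = volumes;
  }
  // Skew = most-full volume density minus least-full volume density.
  double skew() {
    if (volumes.isEmpty()) return 0.0;
    double min = 1.0, max = 0.0;
    for (VolumeInfo v : volumes) {
      min = Math.min(min, v.density());
      max = Math.max(max, v.density());
    }
    return max - min;
  }
}

public class ReportWhoNeedsBalancing {
  // "Sort and print": list nodes whose intra-node skew exceeds a threshold.
  static void report(List<NodeSnapshot> cluster, double threshold) {
    cluster.stream()
        .filter(n -> n.skew() > threshold)
        .sorted(Comparator.comparingDouble(NodeSnapshot::skew).reversed())
        .forEach(n -> System.out.printf("%s skew=%.2f%n", n.host, n.skew()));
  }

  public static void main(String[] args) {
    List<NodeSnapshot> cluster = Arrays.asList(
        new NodeSnapshot("dn1", Arrays.asList(new VolumeInfo(100L << 30, 90L << 30),
                                              new VolumeInfo(100L << 30, 10L << 30))),
        new NodeSnapshot("dn2", Arrays.asList(new VolumeInfo(100L << 30, 50L << 30),
                                              new VolumeInfo(100L << 30, 52L << 30))));
    report(cluster, 0.2);  // prints dn1 only
  }
}
{noformat}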

bq. I think most of this functionality should live in the DN since it's better 
equipped to do IO throttling and mutual exclusion.

That is how it is. HDFS-1312 is not complete yet, but you will see that all the 
data movement does indeed happen inside the datanode.

bq. Namely, this avoids the Discover step, simplifying things. There's also no 
global planning step. What I'm envisioning is a user experience where admin 
just points it at a DN, like:
{noformat}
hdfs balancer -volumes -datanode 1.2.3.4:50070
{noformat}

[I took the liberty of re-ordering some of your comments, since these two are best 
answered together.]

I completely agree with the proposed command; in fact, it is one of the commands 
that will be part of the tool. As you point out, that is the simplest use case.

The discover phase exists for several reasons:

# The discover phase is needed even if we only have to process a single node; we 
still have to read the information about that node. I understand that you are 
pointing out that we may not need the info for the whole cluster; please see my 
next point.
# The discover phase makes coding simpler, since we are able to rely on the 
current balancer code in {{NameNodeConnector}} and avoid writing any new code. 
The approach you are suggesting would require adding more RPCs to the datanode, 
which seemed like overkill for a rarely used administrative tool when the 
balancer already provides a way for us to do it. If you look at the current code 
in HDFS-1312, you will see that discover is merely a pass-thru to 
{{Balancer#NameNodeConnector}}.
# With the discover approach, we can take a snapshot of the nodes and the 
cluster, which allows us to report to the admin what changes we made after the 
move. This is pretty useful. Since we have cluster-wide data, it also allows us 
to report to an administrator which nodes need his/her attention (this is just a 
sort and print). As I mentioned earlier, there is unfortunately a large number of 
customers who still do not use HDFS-1804, and this is a beneficial feature for 
them.
# Last but, from my personal point of view, most important: this makes testing 
and error reporting for the disk balancer much easier. Say we find a bug in the 
disk balancer; a customer could just provide the cluster JSON discovered by the 
disk balancer, and we could debug the issue off-line. During the development 
phase, that is how I have been testing the disk balancer: by taking a snapshot of 
real clusters and then feeding that data back into the disk balancer via a 
connector called {{JsonNodeConnector}} (a rough sketch of this connector idea 
follows this list).
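
To illustrate the connector idea from points 2 and 4, here is a rough sketch of a 
pluggable data source; it reuses the {{NodeSnapshot}} type from the sketch above. 
The interface and class names are assumptions for illustration only, not the 
actual HDFS-1312 classes built on {{NameNodeConnector}} and {{JsonNodeConnector}}.

{noformat}
// Illustrative only: a pluggable "connector" so the planner consumes the same
// NodeSnapshot objects (from the earlier sketch) whether they come from a live
// cluster or from a saved JSON snapshot attached to a bug report.
import java.io.IOException;
import java.util.List;

interface ClusterConnector {
  List<NodeSnapshot> getNodes() throws IOException;
}

// Off-line source: wraps a snapshot built in a unit test or loaded from the
// cluster JSON a customer provides; no running cluster is needed to debug.
class SnapshotConnector implements ClusterConnector {
  private final List<NodeSnapshot> nodes;
  SnapshotConnector(List<NodeSnapshot> nodes) { this.nodes = nodes; }
  @Override
  public List<NodeSnapshot> getNodes() { return nodes; }
}

// A live implementation would translate the storage reports obtained through
// the existing balancer's NameNodeConnector into the same NodeSnapshot form;
// that translation is the "pass-thru" mentioned in point 2 and is omitted here.
{noformat}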

As I said earlier, the most generic use case I have in mind is the one you 
already described,  where the user wants to just point the balancer at a 
datanode, and we will support that use case.

bq. There's also no global planning step, I think most of this functionality 
should live in the DN since it's better equipped to do IO throttling and mutual 
exclusion. Basically we'd send an RPC to tell the DN to balance itself (with 
parameters), and then poll another RPC to watch the status and wait until it's 
done.

We have to have a plan -- either we build it inside the datanode, or we build it 
outside and submit the plan to the datanode. By doing it outside, we are able to 
test the planning phase independently. Please look at the test cases in 
{{TestPlanner.java}}: we are able to test both the scale and the correctness of 
the planner easily. If we moved planning into the datanode, we would only be able 
to test a plan with MiniDFSCluster, which we both agree is painful.
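
As an illustration of why external planning is easy to unit test, here is a toy 
planner and a self-contained check that needs no MiniDFSCluster. It reuses the 
{{NodeSnapshot}}/{{VolumeInfo}} types from the earlier sketch; the planner logic 
and names are my own simplification, not the code in {{TestPlanner.java}}.

{noformat}
// Illustrative only: a toy planner plus a plain-Java check. This is not
// TestPlanner.java; it only shows that planning is pure computation and
// therefore needs no MiniDFSCluster to test.
import java.util.*;

class MoveStep {
  final int sourceVolume;
  final int destVolume;
  final long bytesToMove;
  MoveStep(int sourceVolume, int destVolume, long bytesToMove) {
    this.sourceVolume = sourceVolume;
    this.destVolume = destVolume;
    this.bytesToMove = bytesToMove;
  }
}

class GreedyPlanner {
  // Plan one move from the fullest volume to the emptiest one, shifting enough
  // bytes to bring the fullest volume down to the node's average density.
  // A real planner would iterate until the skew drops below the threshold.
  static Optional<MoveStep> planOneStep(NodeSnapshot node, double threshold) {
    if (node.skew() <= threshold) {
      return Optional.empty();
    }
    List<VolumeInfo> vols = node.volumes;
    int src = 0, dst = 0;
    for (int i = 0; i < vols.size(); i++) {
      if (vols.get(i).density() > vols.get(src).density()) src = i;
      if (vols.get(i).density() < vols.get(dst).density()) dst = i;
    }
    double avg = vols.stream().mapToDouble(VolumeInfo::density).average().orElse(0.0);
    long bytes = (long) ((vols.get(src).density() - avg) * vols.get(src).capacity);
    return Optional.of(new MoveStep(src, dst, bytes));
  }
}

public class PlannerSelfTest {
  public static void main(String[] args) {
    // One node with a 90%-full and a 10%-full volume: a move must be planned.
    NodeSnapshot lopsided = new NodeSnapshot("dn1", Arrays.asList(
        new VolumeInfo(100L << 30, 90L << 30), new VolumeInfo(100L << 30, 10L << 30)));
    MoveStep step = GreedyPlanner.planOneStep(lopsided, 0.2)
        .orElseThrow(IllegalStateException::new);
    if (step.sourceVolume != 0 || step.destVolume != 1) {
      throw new AssertionError("planner picked the wrong volumes");
    }
    System.out.println("plan: move " + step.bytesToMove + " bytes from volume "
        + step.sourceVolume + " to volume " + step.destVolume);
  }
}
{noformat}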

This also has the added benefit that an admin gets an opportunity to review the 
plan if needed, and it supports the other use case where you can use this tool as 
a block mover. Eventually we hope that this code will merge with the mover.

bq. On the topic of the actual balancing, how do we atomically move the block 
in the presence of failures? Right now the NN expects only one replica per DN, 
so if the same replica is on multiple volumes of the DN, we could run into 
issues. See related (quite serious) issues like HDFS-7443 and HDFS-7960. I 
think we can do some tricks with a temp filename and rename, but this procedure 
should be carefully explained.

Thanks for the pointers. We have very little new code here; we rely on the mover 
logic -- in fact, what we have is a thin wrapper over 
{{FsDatasetSpi#moveBlockAcrossStorage}}. But the concerns you raise are quite 
valid, and I will test to make sure that we don't regress on the datanode side. 
We will cover all these scenarios and build some test cases specifically to 
replicate the error conditions that you are pointing out. Right now, all we do is 
take a lock at the {{FsDatasetSpi}} level (since we want mutual exclusion with 
{{DirectoryScanner}}) and rely on {{moveBlockAcrossStorage}}. I will also add 
more details to Architecture_and_testplan.pdf if it is not well explained in that 
document.
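
For clarity, here is a hedged sketch of the execution pattern described above: 
take a dataset-level lock for mutual exclusion with {{DirectoryScanner}}, then 
delegate the move. The {{BlockMover}} interface and the lock object below are 
placeholders, not the real {{FsDatasetSpi}} API.

{noformat}
// Illustrative only: hold a dataset-level lock so the copy cannot race with
// DirectoryScanner, then delegate the actual block move. BlockMover and the
// lock object are placeholders, not the real FsDatasetSpi API; the real code
// wraps FsDatasetSpi#moveBlockAcrossStorage.
import java.io.IOException;
import java.util.concurrent.locks.ReentrantLock;

interface BlockMover {
  // Stand-in for "move this replica from one volume to another".
  void moveBlock(long blockId, String sourceVolume, String destVolume) throws IOException;
}

class MoveStepExecutor {
  private final ReentrantLock datasetLock = new ReentrantLock();
  private final BlockMover mover;

  MoveStepExecutor(BlockMover mover) { this.mover = mover; }

  // Execute one planned move while excluding concurrent scanners and movers.
  void execute(long blockId, String sourceVolume, String destVolume) throws IOException {
    datasetLock.lock();
    try {
      // The underlying move is expected to copy to the destination volume first
      // and only switch the replica record once the copy is complete, so a
      // crash mid-move leaves the original replica intact (the
      // temp-file-and-rename approach discussed above).
      mover.moveBlock(blockId, sourceVolume, destVolume);
    } finally {
      datasetLock.unlock();
    }
  }
}
{noformat}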

> Re-balance disks within a Datanode
> ----------------------------------
>
>                 Key: HDFS-1312
>                 URL: https://issues.apache.org/jira/browse/HDFS-1312
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: datanode
>            Reporter: Travis Crawford
>            Assignee: Anu Engineer
>         Attachments: Architecture_and_testplan.pdf, disk-balancer-proposal.pdf
>
>
> Filing this issue in response to ``full disk woes`` on hdfs-user.
> Datanodes fill their storage directories unevenly, leading to situations 
> where certain disks are full while others are significantly less used. Users 
> at many different sites have experienced this issue, and HDFS administrators 
> are taking steps like:
> - Manually rebalancing blocks in storage directories
> - Decommissioning nodes & later re-adding them
> There's a tradeoff between making use of all available spindles, and filling 
> disks at the sameish rate. Possible solutions include:
> - Weighting less-used disks heavier when placing new blocks on the datanode. 
> In write-heavy environments this will still make use of all spindles, 
> equalizing disk use over time.
> - Rebalancing blocks locally. This would help equalize disk use as disks are 
> added/replaced in older cluster nodes.
> Datanodes should actively manage their local disk so operator intervention is 
> not needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
