[ 
https://issues.apache.org/jira/browse/ACCUMULO-4353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15347329#comment-15347329
 ] 

Dave Marion commented on ACCUMULO-4353:
---------------------------------------

bq. Can you expand on this some more? Given that assignment is arguably the 
most important thing for the Master to do, why are we concerned about letting 
the master do that as fast as it can (for the aforementioned reason)? Do we 
need to come up with a more efficient way for the master to handle the 
reassignment of many tablets?

Reading through this, and bringing some first-hand experience, I don't think 
the issue is the Master assigning tablets. The issue is tablet servers that 
are down for only a short period of time. When a tserver goes down, the Master 
re-assigns its tablets. When the tserver comes back up, the cluster goes through 
several rounds of balancing, which could take a long time and cause a lot of churn.

bq. I'm a little worried about this as a configuration knob – I feel like it 
kind of goes against the highly-available distributed database which we expect 
Accumulo to be. When we don't reassign tablets fast, that is a direct lack of 
availability for clients to read data.

I don't see any harm done here as long as the default behavior is what happens 
today. Allowing an administrator to choose to delay tablet reassignment may not 
fit most use cases, but it could fit some.
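
To make the "default is what happens today" point concrete, here is a minimal 
sketch of such a knob. Everything in it is a hypothetical illustration: the 
class, the method names, and the suspend duration parameter are assumptions, 
not Accumulo's actual API or property names. A zero duration reproduces 
immediate reassignment; a positive duration holds a downed tserver's tablets 
until the window expires or the tserver returns.

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch only: not Accumulo code and not its configuration API.
final class ReassignmentDelaySketch {
    // How long to wait for a downed tserver before reassigning its tablets.
    // Duration.ZERO models the current behavior (reassign immediately).
    private final Duration suspendDuration;

    // Time (in milliseconds) at which each tserver was first seen down.
    private final Map<String, Long> downSinceMillis = new HashMap<>();

    ReassignmentDelaySketch(Duration suspendDuration) {
        this.suspendDuration = suspendDuration;
    }

    // Record that a tserver was observed down; keeps the earliest timestamp.
    void serverDown(String server, long nowMillis) {
        downSinceMillis.putIfAbsent(server, nowMillis);
    }

    // The tserver came back, so its tablets can stay where they were.
    void serverUp(String server) {
        downSinceMillis.remove(server);
    }

    // True once the tserver has been down for at least the configured window.
    boolean shouldReassign(String server, long nowMillis) {
        Long since = downSinceMillis.get(server);
        if (since == null) {
            return false; // tserver is not known to be down
        }
        return nowMillis - since >= suspendDuration.toMillis();
    }
}
```

With a zero duration, shouldReassign returns true as soon as the tserver is 
marked down, so an administrator who never touches the setting sees no change.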

My 2 cents.

> Stabilize tablet assignment during transient failure
> ----------------------------------------------------
>
>                 Key: ACCUMULO-4353
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-4353
>             Project: Accumulo
>          Issue Type: Improvement
>            Reporter: Shawn Walker
>            Assignee: Shawn Walker
>            Priority: Minor
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> When a tablet server dies, Accumulo attempts to reassign the tablets it was 
> hosting as quickly as possible to maintain availability.  If multiple tablet 
> servers die in quick succession, such as from a rolling restart of the 
> Accumulo cluster or a network partition, this behavior can cause a storm of 
> reassignment and rebalancing, placing significant load on the master.
> To avert such load, Accumulo should be capable of maintaining a steady tablet 
> assignment state in the face of transient tablet server loss.  Instead of 
> reassigning tablets as quickly as possible, Accumulo should await the 
> return of a temporarily downed tablet server (for some configurable duration) 
> before assigning its tablets to other tablet servers.


