Hi Jason, answers inline:

On Wed, Apr 12, 2017 at 5:53 AM, Jason Heo <jason.heo....@gmail.com> wrote:

>
> Q1. Can I disable redistributing tablets on failure of a tserver? The
> reason why I need this is described in Background.
>

We don't have any kind of built-in maintenance mode that would prevent
this, but it can be achieved by setting a flag on each of the tablet
servers.  The goal is not to disable re-replicating tablets, but instead to
avoid kicking the failed replica out of the tablet groups to begin with.
There is a config flag to control exactly that: 'evict_failed_followers'.
This isn't considered a stable or supported flag, but it should have the
effect you are looking for, if you set it to false on each of the tablet
servers, by running:

    kudu tserver set_flag <tserver-addr> evict_failed_followers false --force

for each tablet server.  When you are done, set it back to the default
'true' value.  This isn't something we routinely test (especially setting
it without restarting the server), so please test before trying this on a
production cluster.
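
To avoid typing that once per server, the loop can be sketched in shell. This
is only a minimal sketch: the function name set_flag_on_all and the KUDU_CMD
variable are made up for illustration, and as far as I know the subcommand is
spelled 'set_flag' in the Kudu CLI. By default it only prints the commands
(dry run); nothing is changed until KUDU_CMD=kudu is set explicitly.

```shell
#!/bin/sh
# Hypothetical helper (not part of Kudu): apply one flag value to every
# tablet server given on the command line.
# KUDU_CMD defaults to "echo kudu", i.e. a dry run that only prints commands.
KUDU_CMD="${KUDU_CMD:-echo kudu}"

set_flag_on_all() {
    flag="$1"
    value="$2"
    shift 2
    for addr in "$@"; do
        # $KUDU_CMD is left unquoted on purpose so "echo kudu" splits
        # into a command and its first argument.
        $KUDU_CMD tserver set_flag "$addr" "$flag" "$value" --force || return 1
    done
}
```

For example, with placeholder host names (7050 is the default tserver RPC
port):

    set_flag_on_all evict_failed_followers false ts1.example.com:7050 ts2.example.com:7050

Run it again with 'true' as the value when the maintenance is finished.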

> Q2. redistribution goes on even if the failed tserver reconnected to
> cluster. In my test cluster, it took 2 hours to distribute when a tserver
> which has 3TB data was killed.
>

This seems slow.  What's the speed of your network?  How many nodes?  How
many tablet replicas were on the failed tserver, and were the replica sizes
evenly balanced?  Next time this happens, you might try monitoring with
'kudu ksck' to ensure there aren't additional problems in the cluster
(admin guide
on the ksck tool
<https://github.com/apache/kudu/blob/master/docs/administration.adoc#ksck>).
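
If you want to watch recovery progress without re-running the check by hand,
a polling loop along these lines might help. It is a rough sketch, assuming
ksck exits non-zero while it still finds problems; the function name
wait_for_healthy and the KSCK_CMD override are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of Kudu): poll ksck until it reports a
# healthy cluster or the retry budget is exhausted.
# Arguments: master address list, seconds between checks, maximum checks.
wait_for_healthy() {
    masters="$1"
    interval="${2:-60}"
    max_tries="${3:-120}"
    tries=0
    while [ "$tries" -lt "$max_tries" ]; do
        # KSCK_CMD can be overridden for testing; unquoted on purpose so
        # the default splits into "kudu", "cluster", "ksck".
        if ${KSCK_CMD:-kudu cluster ksck} "$masters" >/dev/null 2>&1; then
            echo "healthy after $tries retries"
            return 0
        fi
        tries=$((tries + 1))
        sleep "$interval"
    done
    echo "still unhealthy after $max_tries checks" >&2
    return 1
}
```

For example, with a placeholder master address (7051 is the default master
RPC port):

    wait_for_healthy master1.example.com:7051 60 120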


> Q3. `--follower_unavailable_considered_failed_sec` can be changed without
> restarting cluster?
>

The flag can be changed, but it comes with the same caveats as above:

    kudu tserver set_flag <tserver-addr> follower_unavailable_considered_failed_sec 900 --force


- Dan
