Re: Question about redistributing tablets on failure of a tserver.

Jason Heo Fri, 19 May 2017 22:08:11 -0700

Thanks, @dan @Todd

This issue has been resolved via https://gerrit.cloudera.org/#/c/6925/


Regards,

Jason

2017-05-09 4:55 GMT+09:00 Todd Lipcon <t...@cloudera.com>:

> Hey Jason
>
> Sorry for the delayed response here. It looks from your ksck like copying
> is ongoing but hasn't yet finished.
>
> FWIW Will B is working on adding more informative output to ksck to help
> diagnose cases like this:
> https://gerrit.cloudera.org/#/c/6772/
>
> -Todd
>
> On Thu, Apr 13, 2017 at 11:35 PM, Jason Heo <jason.heo....@gmail.com>
> wrote:
>
>> @Dan
>>
>> I monitored with `kudu ksck` while re-replication was in progress, but I'm
>> not sure whether this output means my cluster has a problem. (It seems to
>> indicate only that one tserver stopped.)
>>
>> Would you please check it?
>>
>> Thanks,
>>
>> Jason
>>
>> ```
>> ...
>> ...
>> Tablet 0e29XXXXXXXXXXXXXXX1e1e3168a4d81 of table 'impala::tbl1' is
>> under-replicated: 1 replica(s) not RUNNING
>>   a7ca07f9bXXXXXXXXXXXXXXXbbb21cfb (hostname.com:7050): RUNNING
>>   a97644XXXXXXXXXXXXXXXdb074d4380f (hostname.com:7050): RUNNING [LEADER]
>>   401b6XXXXXXXXXXXXXXX5feda1de212b (hostname.com:7050): missing
>>
>> Tablet 550XXXXXXXXXXXXXXX08f5fc94126927 of table 'impala::tbl1' is
>> under-replicated: 1 replica(s) not RUNNING
>>   aec55b4XXXXXXXXXXXXXXXdb469427cf (hostname.com:7050): RUNNING [LEADER]
>>   a7ca07f9b3d94XXXXXXXXXXXXXXX1cfb (hostname.com:7050): RUNNING
>>   31461XXXXXXXXXXXXXXX3dbe060807a6 (hostname.com:7050): bad state
>>     State:       NOT_STARTED
>>     Data state:  TABLET_DATA_READY
>>     Last status: Tablet initializing...
>>
>> Tablet 4a1490fcXXXXXXXXXXXXXXX7a2c637e3 of table 'impala::tbl1' is
>> under-replicated: 1 replica(s) not RUNNING
>>   a7ca07f9b3d94414XXXXXXXXXXXXXXXb (hostname.com:7050): RUNNING
>>   40XXXXXXXXXXXXXXXd5b5feda1de212b (hostname.com:7050): RUNNING [LEADER]
>>   aec55b4e2acXXXXXXXXXXXXXXX9427cf (hostname.com:7050): bad state
>>     State:       NOT_STARTED
>>     Data state:  TABLET_DATA_COPYING
>>     Last status: TabletCopy: Downloading block 0000000005162382 (277/581)
>> ...
>> ...
>> ==================
>> Errors:
>> ==================
>> table consistency check error: Corruption: 52 table(s) are bad
>>
>> FAILED
>> Runtime error: ksck discovered errors
>> ```
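
A report like the one above can be tallied rather than read tablet by tablet, which makes it easier to see whether copying is progressing between runs. A minimal sketch, assuming the report keeps the text layout shown above (it may change between Kudu versions); `summarize_ksck` is a hypothetical helper name, and the sample input is an abbreviated stand-in for real output:

```shell
# Tally a ksck report read from stdin.
# Assumes the text layout shown in this thread; not an official format.
summarize_ksck() {
  awk '
    /under-replicated/ { tablets++ }        # one match per affected tablet
    /Data state:/      { states[$NF]++ }    # last field is the data state
    END {
      printf "under-replicated tablets: %d\n", tablets
      for (s in states) printf "%s: %d\n", s, states[s]
    }'
}

# Abbreviated sample input; tablet IDs replaced by placeholders.
summarize_ksck <<'EOF'
Tablet <id> of table 'impala::tbl1' is under-replicated: 1 replica(s) not RUNNING
    State:       NOT_STARTED
    Data state:  TABLET_DATA_COPYING
EOF
```

Against a live cluster, the same function could read the tool's output directly, e.g. `kudu cluster ksck <master_addresses> 2>&1 | summarize_ksck`. A falling TABLET_DATA_COPYING count across runs suggests the copy is still moving.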
>>
>>
>>
>> 2017-04-13 3:47 GMT+09:00 Dan Burkert <danburk...@apache.org>:
>>
>>> Hi Jason, answers inline:
>>>
>>> On Wed, Apr 12, 2017 at 5:53 AM, Jason Heo <jason.heo....@gmail.com>
>>> wrote:
>>>
>>>>
>>>> Q1. Can I disable redistributing tablets on failure of a tserver? The
>>>> reason why I need this is described in Background.
>>>>
>>>
>>> We don't have any kind of built-in maintenance mode that would prevent
>>> this, but it can be achieved by setting a flag on each of the tablet
>>> servers.  The goal is not to disable re-replicating tablets, but instead to
>>> avoid kicking the failed replica out of the tablet groups to begin with.
>>> There is a config flag to control exactly that: 'evict_failed_followers'.
>>> This isn't considered a stable or supported flag, but it should have the
>>> effect you are looking for, if you set it to false on each of the tablet
>>> servers, by running:
>>>
>>>     kudu tserver set_flag <tserver-addr> evict_failed_followers false --force
>>>
>>> for each tablet server.  When you are done, set it back to the default
>>> 'true' value.  This isn't something we routinely test (especially setting
>>> it without restarting the server), so please test before trying this on a
>>> production cluster.
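
The per-server step described above can be scripted. A minimal sketch, assuming a hand-maintained list of tablet server addresses (the `example.com` hosts are placeholders) and a hypothetical `KUDU_BIN` variable that defaults to `echo kudu`, so the commands are only printed until it is overridden:

```shell
# Dry run by default: with KUDU_BIN="echo kudu" each command is only printed.
# KUDU_BIN is a hypothetical variable introduced for this sketch.
KUDU_BIN="${KUDU_BIN:-echo kudu}"

# Usage: set_flag_everywhere <flag> <value> <tserver-addr>...
set_flag_everywhere() {
  flag="$1"; value="$2"; shift 2
  for addr in "$@"; do
    # The Kudu CLI subcommand is spelled set_flag.
    # $KUDU_BIN is left unquoted on purpose so "echo kudu" splits into two words.
    $KUDU_BIN tserver set_flag "$addr" "$flag" "$value" --force
  done
}

# Placeholder addresses; replace with the real tablet servers.
set_flag_everywhere evict_failed_followers false \
  ts1.example.com:7050 ts2.example.com:7050
```

Setting `KUDU_BIN=kudu` runs the commands for real; calling the function again with `true` restores the default once maintenance is done. The same caveat applies: the flag is unsupported, so try it on a test cluster first.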
>>>
>>>> Q2. Redistribution continues even after the failed tserver reconnects to
>>>> the cluster. In my test cluster, it took 2 hours to redistribute when a
>>>> tserver holding 3TB of data was killed.
>>>>
>>>
>>> This seems slow.  What's the speed of your network?  How many nodes?
>>> How many tablet replicas were on the failed tserver, and were the replica
>>> sizes evenly balanced?  Next time this happens, you might try monitoring
>>> with 'kudu ksck' to ensure there aren't additional problems in the cluster
>>> (see the admin guide section on the ksck tool:
>>> <https://github.com/apache/kudu/blob/master/docs/administration.adoc#ksck>).
>>>
>>>
>>>> Q3. Can `--follower_unavailable_considered_failed_sec` be changed
>>>> without restarting the cluster?
>>>>
>>>
>>> The flag can be changed, but it comes with the same caveats as above:
>>>
>>>     kudu tserver set_flag <tserver-addr> follower_unavailable_considered_failed_sec 900 --force
>>>
>>>
>>> - Dan
>>>
>>>
>>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>
