Re: Question about redistributing tablets on failure of a tserver.

Jason Heo Sat, 20 May 2017 20:48:32 -0700

Hi.

I'm not sure how best to explain this, but here is what I observed.


1.
Re-replication time was reduced from 20 hours to 2 hours 40 minutes (roughly
7.5 times faster).

Here are some charts.

Before applying the patch:

    - Total Tablet Size: http://i.imgur.com/QtT2sH4.png
    - Network & Disk Usage: http://i.imgur.com/m4gj6p2.png (started at 10 am,
ended at 6 am the next day)

After applying the patch:

    - Total Tablet Size: http://i.imgur.com/7RmWQA4.png
    - Network & Disk Usage: http://i.imgur.com/Jd7q8iY.png

2.
BTW, before applying the patch, I got many "already in progress" messages in
the Kudu master log file:

    delete failed for tablet 'tablet_id' with error code
TABLET_NOT_RUNNING: Illegal state: State transition of tablet 'tablet_id'
already in progress: copying tablet

After applying the patch, there were no such messages.

3.
Before applying the patch, I was running Kudu 1.3.0; the build with the patch
applied is version 1.4.

Thanks.


2017-05-21 0:02 GMT+09:00 Dan Burkert <danburk...@apache.org>:

> Hey Jason,
>
> What effect did you see with that patch applied?  I've had mixed results
> with it in my failover tests - it hasn't resolved some of the issues that I
> expected it would, so I'm still looking into it.  Any feedback you have on
> it would be appreciated.
>
> - Dan
>
> On Fri, May 19, 2017 at 10:07 PM, Jason Heo <jason.heo....@gmail.com>
> wrote:
>
>> Thanks, @dan @Todd
>>
>> This issue has been resolved via https://gerrit.cloudera.org/#/c/6925/
>>
>> Regards,
>>
>> Jason
>>
>> 2017-05-09 4:55 GMT+09:00 Todd Lipcon <t...@cloudera.com>:
>>
>>> Hey Jason
>>>
>>> Sorry for the delayed response here. It looks from your ksck like
>>> copying is ongoing but hasn't yet finished.
>>>
>>> FWIW Will B is working on adding more informative output to ksck to help
>>> diagnose cases like this:
>>> https://gerrit.cloudera.org/#/c/6772/
>>>
>>> -Todd
>>>
>>> On Thu, Apr 13, 2017 at 11:35 PM, Jason Heo <jason.heo....@gmail.com>
>>> wrote:
>>>
>>>> @Dan
>>>>
>>>> I monitored with `kudu ksck` while re-replication was occurring, but I'm
>>>> not sure whether this output means my cluster has a problem. (It seems to
>>>> just indicate that one tserver stopped.)
>>>>
>>>> Would you please check it?
>>>>
>>>> Thanks,
>>>>
>>>> Jason
>>>>
>>>> ```
>>>> ...
>>>> ...
>>>> Tablet 0e29XXXXXXXXXXXXXXX1e1e3168a4d81 of table 'impala::tbl1' is under-replicated: 1 replica(s) not RUNNING
>>>>   a7ca07f9bXXXXXXXXXXXXXXXbbb21cfb (hostname.com:7050): RUNNING
>>>>   a97644XXXXXXXXXXXXXXXdb074d4380f (hostname.com:7050): RUNNING [LEADER]
>>>>   401b6XXXXXXXXXXXXXXX5feda1de212b (hostname.com:7050): missing
>>>>
>>>> Tablet 550XXXXXXXXXXXXXXX08f5fc94126927 of table 'impala::tbl1' is under-replicated: 1 replica(s) not RUNNING
>>>>   aec55b4XXXXXXXXXXXXXXXdb469427cf (hostname.com:7050): RUNNING [LEADER]
>>>>   a7ca07f9b3d94XXXXXXXXXXXXXXX1cfb (hostname.com:7050): RUNNING
>>>>   31461XXXXXXXXXXXXXXX3dbe060807a6 (hostname.com:7050): bad state
>>>>     State:       NOT_STARTED
>>>>     Data state:  TABLET_DATA_READY
>>>>     Last status: Tablet initializing...
>>>>
>>>> Tablet 4a1490fcXXXXXXXXXXXXXXX7a2c637e3 of table 'impala::tbl1' is under-replicated: 1 replica(s) not RUNNING
>>>>   a7ca07f9b3d94414XXXXXXXXXXXXXXXb (hostname.com:7050): RUNNING
>>>>   40XXXXXXXXXXXXXXXd5b5feda1de212b (hostname.com:7050): RUNNING [LEADER]
>>>>   aec55b4e2acXXXXXXXXXXXXXXX9427cf (hostname.com:7050): bad state
>>>>     State:       NOT_STARTED
>>>>     Data state:  TABLET_DATA_COPYING
>>>>     Last status: TabletCopy: Downloading block 0000000005162382 (277/581)
>>>> ...
>>>> ...
>>>> ==================
>>>> Errors:
>>>> ==================
>>>> table consistency check error: Corruption: 52 table(s) are bad
>>>>
>>>> FAILED
>>>> Runtime error: ksck discovered errors
>>>> ```
>>>>
>>>>
>>>>
>>>> 2017-04-13 3:47 GMT+09:00 Dan Burkert <danburk...@apache.org>:
>>>>
>>>>> Hi Jason, answers inline:
>>>>>
>>>>> On Wed, Apr 12, 2017 at 5:53 AM, Jason Heo <jason.heo....@gmail.com>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>> Q1. Can I disable redistributing tablets on failure of a tserver? The
>>>>>> reason why I need this is described in Background.
>>>>>>
>>>>>
>>>>> We don't have any kind of built-in maintenance mode that would prevent
>>>>> this, but it can be achieved by setting a flag on each of the tablet
>>>>> servers.  The goal is not to disable re-replicating tablets, but instead to
>>>>> avoid kicking the failed replica out of the tablet groups to begin with.
>>>>> There is a config flag to control exactly that: 'evict_failed_followers'.
>>>>> This isn't considered a stable or supported flag, but it should have the
>>>>> effect you are looking for, if you set it to false on each of the tablet
>>>>> servers, by running:
>>>>>
>>>>>     kudu tserver set-flag <tserver-addr> evict_failed_followers false --force
>>>>>
>>>>> for each tablet server.  When you are done, set it back to the default
>>>>> 'true' value.  This isn't something we routinely test (especially setting
>>>>> it without restarting the server), so please test before trying this on a
>>>>> production cluster.
>>>>>
>>>>>> Q2. Redistribution continues even after the failed tserver reconnects to
>>>>>> the cluster. In my test cluster, it took 2 hours to redistribute when a
>>>>>> tserver holding 3 TB of data was killed.
>>>>>>
>>>>>
>>>>> This seems slow.  What's the speed of your network?  How many nodes?
>>>>> How many tablet replicas were on the failed tserver, and were the replica
>>>>> sizes evenly balanced?  Next time this happens, you might try monitoring
>>>>> with 'kudu ksck' to ensure there aren't additional problems in the cluster
>>>>> (admin guide on the ksck tool
>>>>> <https://github.com/apache/kudu/blob/master/docs/administration.adoc#ksck>).
>>>>>
>>>>>
>>>>>> Q3. Can `--follower_unavailable_considered_failed_sec` be changed
>>>>>> without restarting the cluster?
>>>>>>
>>>>>
>>>>> The flag can be changed, but it comes with the same caveats as above:
>>>>>
>>>>>     kudu tserver set-flag <tserver-addr> follower_unavailable_considered_failed_sec 900 --force
>>>>>
>>>>>
>>>>> - Dan
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> Todd Lipcon
>>> Software Engineer, Cloudera
>>>
>>
>>
>
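
A minimal sketch of the "for each tablet server" step from Dan's reply above,
assuming the tablet server addresses are known in advance. The function name
`set_flag_on_all`, the `DRY_RUN` switch, and the example.com addresses are
illustrative assumptions, not part of Kudu; only the `kudu tserver set-flag
... --force` command line is taken from the thread. By default it only prints
the commands so they can be reviewed first.

```shell
#!/bin/sh
# Sketch: apply one flag to every tablet server passed as an argument.
# DRY_RUN and the host list below are hypothetical helpers, not Kudu features.

# Usage: set_flag_on_all <flag> <value> <tserver-addr>...
# With DRY_RUN=1 (the default) the commands are printed, not executed.
set_flag_on_all() {
  flag="$1"
  value="$2"
  shift 2
  for addr in "$@"; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "kudu tserver set-flag $addr $flag $value --force"
    else
      kudu tserver set-flag "$addr" "$flag" "$value" --force
    fi
  done
}

# Hypothetical addresses; replace with the real tablet servers.
# Before maintenance: keep failed followers in their tablet groups.
set_flag_on_all evict_failed_followers false \
  tserver1.example.com:7050 tserver2.example.com:7050
```

Running it with `DRY_RUN=0` would execute the commands instead of printing
them. The same function could be called afterwards with `true` to restore the
default, or with `follower_unavailable_considered_failed_sec 900` as in the
answer to Q3. As Dan notes, these flags are not considered stable, so this is
something to try on a test cluster first.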
