Hey Jason,

What effect did you see with that patch applied? I've had mixed results with it in my failover tests - it hasn't resolved some of the issues I expected it would, so I'm still looking into it. Any feedback you have on it would be appreciated.
- Dan

On Fri, May 19, 2017 at 10:07 PM, Jason Heo <jason.heo....@gmail.com> wrote:

> Thanks, @dan @Todd
>
> This issue has been resolved via https://gerrit.cloudera.org/#/c/6925/
>
> Regards,
>
> Jason
>
> 2017-05-09 4:55 GMT+09:00 Todd Lipcon <t...@cloudera.com>:
>
>> Hey Jason
>>
>> Sorry for the delayed response here. It looks from your ksck like
>> copying is ongoing but hasn't yet finished.
>>
>> FWIW Will B is working on adding more informative output to ksck to
>> help diagnose cases like this: https://gerrit.cloudera.org/#/c/6772/
>>
>> -Todd
>>
>> On Thu, Apr 13, 2017 at 11:35 PM, Jason Heo <jason.heo....@gmail.com> wrote:
>>
>>> @Dan
>>>
>>> I monitored with `kudu ksck` while re-replication was occurring, but
>>> I'm not sure whether this output means my cluster has a problem. (It
>>> seems to just indicate that one tserver stopped.)
>>>
>>> Would you please check it?
>>>
>>> Thanks,
>>>
>>> Jason
>>>
>>> ```
>>> ...
>>> ...
>>> Tablet 0e29XXXXXXXXXXXXXXX1e1e3168a4d81 of table 'impala::tbl1' is
>>> under-replicated: 1 replica(s) not RUNNING
>>>   a7ca07f9bXXXXXXXXXXXXXXXbbb21cfb (hostname.com:7050): RUNNING
>>>   a97644XXXXXXXXXXXXXXXdb074d4380f (hostname.com:7050): RUNNING [LEADER]
>>>   401b6XXXXXXXXXXXXXXX5feda1de212b (hostname.com:7050): missing
>>>
>>> Tablet 550XXXXXXXXXXXXXXX08f5fc94126927 of table 'impala::tbl1' is
>>> under-replicated: 1 replica(s) not RUNNING
>>>   aec55b4XXXXXXXXXXXXXXXdb469427cf (hostname.com:7050): RUNNING [LEADER]
>>>   a7ca07f9b3d94XXXXXXXXXXXXXXX1cfb (hostname.com:7050): RUNNING
>>>   31461XXXXXXXXXXXXXXX3dbe060807a6 (hostname.com:7050): bad state
>>>     State:       NOT_STARTED
>>>     Data state:  TABLET_DATA_READY
>>>     Last status: Tablet initializing...
>>>
>>> Tablet 4a1490fcXXXXXXXXXXXXXXX7a2c637e3 of table 'impala::tbl1' is
>>> under-replicated: 1 replica(s) not RUNNING
>>>   a7ca07f9b3d94414XXXXXXXXXXXXXXXb (hostname.com:7050): RUNNING
>>>   40XXXXXXXXXXXXXXXd5b5feda1de212b (hostname.com:7050): RUNNING [LEADER]
>>>   aec55b4e2acXXXXXXXXXXXXXXX9427cf (hostname.com:7050): bad state
>>>     State:       NOT_STARTED
>>>     Data state:  TABLET_DATA_COPYING
>>>     Last status: TabletCopy: Downloading block 0000000005162382 (277/581)
>>> ...
>>> ...
>>> ==================
>>> Errors:
>>> ==================
>>> table consistency check error: Corruption: 52 table(s) are bad
>>>
>>> FAILED
>>> Runtime error: ksck discovered errors
>>> ```
>>>
>>> 2017-04-13 3:47 GMT+09:00 Dan Burkert <danburk...@apache.org>:
>>>
>>>> Hi Jason, answers inline:
>>>>
>>>> On Wed, Apr 12, 2017 at 5:53 AM, Jason Heo <jason.heo....@gmail.com> wrote:
>>>>
>>>>> Q1. Can I disable redistributing tablets on failure of a tserver?
>>>>> The reason why I need this is described in Background.
>>>>
>>>> We don't have any kind of built-in maintenance mode that would
>>>> prevent this, but it can be achieved by setting a flag on each of the
>>>> tablet servers. The goal is not to disable re-replicating tablets,
>>>> but instead to avoid kicking the failed replica out of the tablet
>>>> groups to begin with. There is a config flag to control exactly that:
>>>> 'evict_failed_followers'. This isn't considered a stable or supported
>>>> flag, but it should have the effect you are looking for if you run
>>>>
>>>>     kudu tserver set-flag <tserver-addr> evict_failed_followers false --force
>>>>
>>>> for each tablet server. When you are done, set it back to the default
>>>> 'true' value. This isn't something we routinely test (especially
>>>> setting it without restarting the server), so please test before
>>>> trying this on a production cluster.
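A rough shell sketch of that loop, with placeholder tserver addresses and subject to the same caveats given above:

```
# Placeholder addresses -- substitute the actual tservers in your cluster.
TSERVERS="ts1.example.com:7050 ts2.example.com:7050 ts3.example.com:7050"

# Before taking a tserver down: keep failed followers in their Raft
# configs so no re-replication is kicked off while it is offline.
for ts in $TSERVERS; do
  kudu tserver set-flag "$ts" evict_failed_followers false --force
done

# After maintenance is finished: restore the default behavior.
for ts in $TSERVERS; do
  kudu tserver set-flag "$ts" evict_failed_followers true --force
done
```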
>>>>
>>>>> Q2. Redistribution goes on even if the failed tserver reconnects to
>>>>> the cluster. In my test cluster, it took 2 hours to redistribute
>>>>> when a tserver holding 3TB of data was killed.
>>>>
>>>> This seems slow. What's the speed of your network? How many nodes?
>>>> How many tablet replicas were on the failed tserver, and were the
>>>> replica sizes evenly balanced? Next time this happens, you might try
>>>> monitoring with 'kudu ksck' to ensure there aren't additional
>>>> problems in the cluster (see the admin guide on the ksck tool:
>>>> https://github.com/apache/kudu/blob/master/docs/administration.adoc#ksck).
>>>>
>>>>> Q3. Can `--follower_unavailable_considered_failed_sec` be changed
>>>>> without restarting the cluster?
>>>>
>>>> The flag can be changed, but it comes with the same caveats as above:
>>>>
>>>>     kudu tserver set-flag <tserver-addr> follower_unavailable_considered_failed_sec 900 --force
>>>>
>>>> - Dan

>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
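Likewise for the Q2/Q3 answers above, a minimal sketch with placeholder addresses; `kudu cluster ksck` is the full invocation of the ksck tool mentioned in the thread:

```
# Check cluster health and watch re-replication progress
# (placeholder master addresses).
kudu cluster ksck master-1.example.com:7051,master-2.example.com:7051,master-3.example.com:7051

# Widen the failure-detection window to 15 minutes on one tserver
# (placeholder address); repeat for each tserver, same caveats as above.
kudu tserver set-flag ts1.example.com:7050 \
    follower_unavailable_considered_failed_sec 900 --force
```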