Thanks, @Dan and @Todd. This issue has been resolved via https://gerrit.cloudera.org/#/c/6925/

Regards,

Jason

2017-05-09 4:55 GMT+09:00 Todd Lipcon <t...@cloudera.com>:

> Hey Jason
>
> Sorry for the delayed response here. It looks from your ksck like copying
> is ongoing but hasn't yet finished.
>
> FWIW Will B is working on adding more informative output to ksck to help
> diagnose cases like this:
> https://gerrit.cloudera.org/#/c/6772/
>
> -Todd
>
> On Thu, Apr 13, 2017 at 11:35 PM, Jason Heo <jason.heo....@gmail.com>
> wrote:
>
>> @Dan
>>
>> I monitored with `kudu ksck` while re-replication is occurring, but I'm
>> not sure if this output means my cluster has a problem. (It seems just
>> indicating one tserver stopped)
>>
>> Would you please check it?
>>
>> Thanks,
>>
>> Jason
>>
>> ```
>> ...
>> ...
>> Tablet 0e29XXXXXXXXXXXXXXX1e1e3168a4d81 of table 'impala::tbl1' is under-replicated: 1 replica(s) not RUNNING
>>   a7ca07f9bXXXXXXXXXXXXXXXbbb21cfb (hostname.com:7050): RUNNING
>>   a97644XXXXXXXXXXXXXXXdb074d4380f (hostname.com:7050): RUNNING [LEADER]
>>   401b6XXXXXXXXXXXXXXX5feda1de212b (hostname.com:7050): missing
>>
>> Tablet 550XXXXXXXXXXXXXXX08f5fc94126927 of table 'impala::tbl1' is under-replicated: 1 replica(s) not RUNNING
>>   aec55b4XXXXXXXXXXXXXXXdb469427cf (hostname.com:7050): RUNNING [LEADER]
>>   a7ca07f9b3d94XXXXXXXXXXXXXXX1cfb (hostname.com:7050): RUNNING
>>   31461XXXXXXXXXXXXXXX3dbe060807a6 (hostname.com:7050): bad state
>>     State: NOT_STARTED
>>     Data state: TABLET_DATA_READY
>>     Last status: Tablet initializing...
>>
>> Tablet 4a1490fcXXXXXXXXXXXXXXX7a2c637e3 of table 'impala::tbl1' is under-replicated: 1 replica(s) not RUNNING
>>   a7ca07f9b3d94414XXXXXXXXXXXXXXXb (hostname.com:7050): RUNNING
>>   40XXXXXXXXXXXXXXXd5b5feda1de212b (hostname.com:7050): RUNNING [LEADER]
>>   aec55b4e2acXXXXXXXXXXXXXXX9427cf (hostname.com:7050): bad state
>>     State: NOT_STARTED
>>     Data state: TABLET_DATA_COPYING
>>     Last status: TabletCopy: Downloading block 0000000005162382 (277/581)
>> ...
>> ...
>> ==================
>> Errors:
>> ==================
>> table consistency check error: Corruption: 52 table(s) are bad
>>
>> FAILED
>> Runtime error: ksck discovered errors
>> ```
>>
>> 2017-04-13 3:47 GMT+09:00 Dan Burkert <danburk...@apache.org>:
>>
>>> Hi Jason, answers inline:
>>>
>>> On Wed, Apr 12, 2017 at 5:53 AM, Jason Heo <jason.heo....@gmail.com>
>>> wrote:
>>>
>>>> Q1. Can I disable redistributing tablets on failure of a tserver? The
>>>> reason why I need this is described in Background.
>>>
>>> We don't have any kind of built-in maintenance mode that would prevent
>>> this, but it can be achieved by setting a flag on each of the tablet
>>> servers. The goal is not to disable re-replicating tablets, but instead to
>>> avoid kicking the failed replica out of the tablet groups to begin with.
>>> There is a config flag to control exactly that: 'evict_failed_followers'.
>>> This isn't considered a stable or supported flag, but it should have the
>>> effect you are looking for, if you set it to false on each of the tablet
>>> servers, by running:
>>>
>>>     kudu tserver set-flag <tserver-addr> evict_failed_followers false --force
>>>
>>> for each tablet server. When you are done, set it back to the default
>>> 'true' value. This isn't something we routinely test (especially setting
>>> it without restarting the server), so please test before trying this on a
>>> production cluster.
>>>
>>>> Q2. redistribution goes on even if the failed tserver reconnected to
>>>> cluster. In my test cluster, it took 2 hours to distribute when a tserver
>>>> which has 3TB data was killed.
>>>
>>> This seems slow. What's the speed of your network? How many nodes?
>>> How many tablet replicas were on the failed tserver, and were the replica
>>> sizes evenly balanced?
>>> Next time this happens, you might try monitoring
>>> with 'kudu ksck' to ensure there aren't additional problems in the cluster
>>> (admin guide on the ksck tool
>>> <https://github.com/apache/kudu/blob/master/docs/administration.adoc#ksck>).
>>>
>>>> Q3. `--follower_unavailable_considered_failed_sec` can be changed
>>>> without restarting cluster?
>>>
>>> The flag can be changed, but it comes with the same caveats as above:
>>>
>>>     kudu tserver set-flag <tserver-addr> follower_unavailable_considered_failed_sec 900 --force
>>>
>>> - Dan
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
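
For anyone finding this thread later: Dan's advice is to run the `set-flag` command once per tablet server, so a small loop can save some typing. The following is only a rough sketch, not a tested procedure. The `TSERVERS` list, the `DRY_RUN` switch, and the `set_flag_on_all` helper are made-up names for illustration, and the addresses are placeholders. The subcommand is spelled as it appears in the thread (`set-flag`); if your Kudu release rejects it, check `kudu tserver --help` for the spelling it expects. By default the sketch only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: apply one flag to every tablet server, one command per server.
# TSERVERS, DRY_RUN and set_flag_on_all are illustrative names, not part of Kudu.

# Placeholder tablet server addresses; replace with your own.
TSERVERS="${TSERVERS:-tserver1.example.com:7050 tserver2.example.com:7050}"

# DRY_RUN=1 (the default here) prints commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

set_flag_on_all() {
  flag="$1"
  value="$2"
  for addr in $TSERVERS; do
    if [ "$DRY_RUN" = "1" ]; then
      echo "kudu tserver set-flag $addr $flag $value --force"
    else
      # Command form as quoted in the thread; stop at the first failure.
      kudu tserver set-flag "$addr" "$flag" "$value" --force || return 1
    fi
  done
}

# Before maintenance: keep failed followers in their tablet groups.
set_flag_on_all evict_failed_followers false

# After maintenance, restore the default:
# set_flag_on_all evict_failed_followers true
```

Run it with `DRY_RUN=0` only after checking the printed commands, and keep Dan's caveat in mind: these are unsupported flags, so try this on a test cluster first. The same helper could be used for `follower_unavailable_considered_failed_sec` with a value such as 900.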