Hey Jason,

Sorry for the delayed response here. It looks from your ksck output like copying is ongoing but hasn't finished yet.
FWIW, Will B is working on adding more informative output to ksck to help diagnose cases like this: https://gerrit.cloudera.org/#/c/6772/

-Todd

On Thu, Apr 13, 2017 at 11:35 PM, Jason Heo <jason.heo....@gmail.com> wrote:

> @Dan
>
> I monitored with `kudu ksck` while re-replication was occurring, but I'm
> not sure whether this output means my cluster has a problem. (It seems to
> just indicate that one tserver stopped.)
>
> Would you please check it?
>
> Thanks,
>
> Jason
>
> ```
> ...
> ...
> Tablet 0e29XXXXXXXXXXXXXXX1e1e3168a4d81 of table 'impala::tbl1' is
> under-replicated: 1 replica(s) not RUNNING
>   a7ca07f9bXXXXXXXXXXXXXXXbbb21cfb (hostname.com:7050): RUNNING
>   a97644XXXXXXXXXXXXXXXdb074d4380f (hostname.com:7050): RUNNING [LEADER]
>   401b6XXXXXXXXXXXXXXX5feda1de212b (hostname.com:7050): missing
>
> Tablet 550XXXXXXXXXXXXXXX08f5fc94126927 of table 'impala::tbl1' is
> under-replicated: 1 replica(s) not RUNNING
>   aec55b4XXXXXXXXXXXXXXXdb469427cf (hostname.com:7050): RUNNING [LEADER]
>   a7ca07f9b3d94XXXXXXXXXXXXXXX1cfb (hostname.com:7050): RUNNING
>   31461XXXXXXXXXXXXXXX3dbe060807a6 (hostname.com:7050): bad state
>     State:       NOT_STARTED
>     Data state:  TABLET_DATA_READY
>     Last status: Tablet initializing...
>
> Tablet 4a1490fcXXXXXXXXXXXXXXX7a2c637e3 of table 'impala::tbl1' is
> under-replicated: 1 replica(s) not RUNNING
>   a7ca07f9b3d94414XXXXXXXXXXXXXXXb (hostname.com:7050): RUNNING
>   40XXXXXXXXXXXXXXXd5b5feda1de212b (hostname.com:7050): RUNNING [LEADER]
>   aec55b4e2acXXXXXXXXXXXXXXX9427cf (hostname.com:7050): bad state
>     State:       NOT_STARTED
>     Data state:  TABLET_DATA_COPYING
>     Last status: TabletCopy: Downloading block 0000000005162382 (277/581)
> ...
> ...
> ==================
> Errors:
> ==================
> table consistency check error: Corruption: 52 table(s) are bad
>
> FAILED
> Runtime error: ksck discovered errors
> ```
>
> 2017-04-13 3:47 GMT+09:00 Dan Burkert <danburk...@apache.org>:
>
>> Hi Jason, answers inline:
>>
>> On Wed, Apr 12, 2017 at 5:53 AM, Jason Heo <jason.heo....@gmail.com>
>> wrote:
>>
>>> Q1. Can I disable redistributing tablets on failure of a tserver? The
>>> reason why I need this is described in Background.
>>
>> We don't have any kind of built-in maintenance mode that would prevent
>> this, but it can be achieved by setting a flag on each of the tablet
>> servers. The goal is not to disable re-replicating tablets, but instead
>> to avoid kicking the failed replica out of the tablet groups to begin
>> with. There is a config flag to control exactly that:
>> 'evict_failed_followers'. This isn't considered a stable or supported
>> flag, but it should have the effect you are looking for if you set it to
>> false on each of the tablet servers, by running:
>>
>>     kudu tserver set-flag <tserver-addr> evict_failed_followers false --force
>>
>> for each tablet server. When you are done, set it back to the default
>> 'true' value. This isn't something we routinely test (especially setting
>> it without restarting the server), so please test before trying this on
>> a production cluster.
>>
>>> Q2. Redistribution goes on even if the failed tserver reconnects to the
>>> cluster. In my test cluster, it took 2 hours to redistribute when a
>>> tserver holding 3TB of data was killed.
>>
>> This seems slow. What's the speed of your network? How many nodes? How
>> many tablet replicas were on the failed tserver, and were the replica
>> sizes evenly balanced?
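[Editor's note: the per-tserver flag change Dan describes above has to be repeated for every tablet server. A minimal sketch of scripting that step is below; the hostnames are placeholders, and the script only *prints* the commands so they can be reviewed before anything is actually run against a cluster.]

```shell
#!/bin/sh
# Sketch: generate the (unsafe, unsupported) flag-change command for each
# tablet server. Hostnames are placeholders -- substitute your own.
# Printing instead of executing lets you review before running anything.
print_set_flag_cmds() {
  flag=$1; value=$2; shift 2
  for ts in "$@"; do
    echo "kudu tserver set-flag $ts $flag $value --force"
  done
}

# Before maintenance: stop evicting failed followers.
print_set_flag_cmds evict_failed_followers false \
  ts1.example.com:7050 ts2.example.com:7050 ts3.example.com:7050

# After maintenance: restore the default.
print_set_flag_cmds evict_failed_followers true \
  ts1.example.com:7050 ts2.example.com:7050 ts3.example.com:7050
```

Piping the output into `sh` (or copy-pasting individual lines) would then apply the change, per Dan's caveat that this path is not routinely tested.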
>> Next time this happens, you might try monitoring with 'kudu ksck' to
>> ensure there aren't additional problems in the cluster (see the admin
>> guide on the ksck tool:
>> <https://github.com/apache/kudu/blob/master/docs/administration.adoc#ksck>).
>>
>>> Q3. Can `--follower_unavailable_considered_failed_sec` be changed
>>> without restarting the cluster?
>>
>> The flag can be changed, but it comes with the same caveats as above:
>>
>>     kudu tserver set-flag <tserver-addr> follower_unavailable_considered_failed_sec 900 --force
>>
>> - Dan

-- 
Todd Lipcon
Software Engineer, Cloudera