Hi. I'm not sure how best to explain this.
1. Re-replication time dropped from 20 hours to 2 hours 40 minutes. Here are some charts.

Before applying the patch:
- Total Tablet Size: http://i.imgur.com/QtT2sH4.png
- Network & Disk Usage: http://i.imgur.com/m4gj6p2.png (started at 10 am, ended at 6 am the next day)

After applying the patch:
- Total Tablet Size: http://i.imgur.com/7RmWQA4.png
- Network & Disk Usage: http://i.imgur.com/Jd7q8iY.png

2. BTW, before applying the patch, I got many "already in progress" messages in the Kudu master log file:

delete failed for tablet 'tablet_id' with error code TABLET_NOT_RUNNING: Illegal state: State transition of tablet 'tablet_id' already in progress: copying tablet

After applying it, there were no such messages.

3. Before applying the patch I was running Kudu 1.3.0; the version was upgraded to 1.4 along with the patch.

Thanks.

2017-05-21 0:02 GMT+09:00 Dan Burkert <danburk...@apache.org>:

> Hey Jason,
>
> What effect did you see with that patch applied? I've had mixed results
> with it in my failover tests - it hasn't resolved some of the issues that I
> expected it would, so I'm still looking in to it. Any feedback you have on
> it would be appreciated.
>
> - Dan
>
> On Fri, May 19, 2017 at 10:07 PM, Jason Heo <jason.heo....@gmail.com>
> wrote:
>
>> Thanks, @dan @Todd
>>
>> This issue has been resolved via https://gerrit.cloudera.org/#/c/6925/
>>
>> Regards,
>>
>> Jason
>>
>> 2017-05-09 4:55 GMT+09:00 Todd Lipcon <t...@cloudera.com>:
>>
>>> Hey Jason
>>>
>>> Sorry for the delayed response here. It looks from your ksck like
>>> copying is ongoing but hasn't yet finished.
>>>
>>> FWIW Will B is working on adding more informative output to ksck to help
>>> diagnose cases like this:
>>> https://gerrit.cloudera.org/#/c/6772/
>>>
>>> -Todd
>>>
>>> On Thu, Apr 13, 2017 at 11:35 PM, Jason Heo <jason.heo....@gmail.com>
>>> wrote:
>>>
>>>> @Dan
>>>>
>>>> I monitored with `kudu ksck` while re-replication is occurring, but I'm
>>>> not sure if this output means my cluster has a problem.
>>>> (It seems just indicating one tserver stopped)
>>>>
>>>> Would you please check it?
>>>>
>>>> Thank,
>>>>
>>>> Jason
>>>>
>>>> ```
>>>> ...
>>>> ...
>>>> Tablet 0e29XXXXXXXXXXXXXXX1e1e3168a4d81 of table 'impala::tbl1' is
>>>> under-replicated: 1 replica(s) not RUNNING
>>>> a7ca07f9bXXXXXXXXXXXXXXXbbb21cfb (hostname.com:7050): RUNNING
>>>> a97644XXXXXXXXXXXXXXXdb074d4380f (hostname.com:7050): RUNNING
>>>> [LEADER]
>>>> 401b6XXXXXXXXXXXXXXX5feda1de212b (hostname.com:7050): missing
>>>>
>>>> Tablet 550XXXXXXXXXXXXXXX08f5fc94126927 of table 'impala::tbl1' is
>>>> under-replicated: 1 replica(s) not RUNNING
>>>> aec55b4XXXXXXXXXXXXXXXdb469427cf (hostname.com:7050): RUNNING
>>>> [LEADER]
>>>> a7ca07f9b3d94XXXXXXXXXXXXXXX1cfb (hostname.com:7050): RUNNING
>>>> 31461XXXXXXXXXXXXXXX3dbe060807a6 (hostname.com:7050): bad state
>>>> State: NOT_STARTED
>>>> Data state: TABLET_DATA_READY
>>>> Last status: Tablet initializing...
>>>>
>>>> Tablet 4a1490fcXXXXXXXXXXXXXXX7a2c637e3 of table 'impala::tbl1' is
>>>> under-replicated: 1 replica(s) not RUNNING
>>>> a7ca07f9b3d94414XXXXXXXXXXXXXXXb (hostname.com:7050): RUNNING
>>>> 40XXXXXXXXXXXXXXXd5b5feda1de212b (hostname.com:7050): RUNNING
>>>> [LEADER]
>>>> aec55b4e2acXXXXXXXXXXXXXXX9427cf (hostname.com:7050): bad state
>>>> State: NOT_STARTED
>>>> Data state: TABLET_DATA_COPYING
>>>> Last status: TabletCopy: Downloading block 0000000005162382
>>>> (277/581)
>>>> ...
>>>> ...
>>>> ==================
>>>> Errors:
>>>> ==================
>>>> table consistency check error: Corruption: 52 table(s) are bad
>>>>
>>>> FAILED
>>>> Runtime error: ksck discovered errors
>>>> ```
>>>>
>>>>
>>>>
>>>> 2017-04-13 3:47 GMT+09:00 Dan Burkert <danburk...@apache.org>:
>>>>
>>>>> Hi Jason, answers inline:
>>>>>
>>>>> On Wed, Apr 12, 2017 at 5:53 AM, Jason Heo <jason.heo....@gmail.com>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>> Q1. Can I disable redistributing tablets on failure of a tserver? The
>>>>>> reason why I need this is described in Background.
>>>>>>
>>>>>
>>>>> We don't have any kind of built-in maintenance mode that would prevent
>>>>> this, but it can be achieved by setting a flag on each of the tablet
>>>>> servers. The goal is not to disable re-replicating tablets, but instead to
>>>>> avoid kicking the failed replica out of the tablet groups to begin with.
>>>>> There is a config flag to control exactly that: 'evict_failed_followers'.
>>>>> This isn't considered a stable or supported flag, but it should have the
>>>>> effect you are looking for, if you set it to false on each of the tablet
>>>>> servers, by running:
>>>>>
>>>>>     kudu tserver set-flag <tserver-addr> evict_failed_followers false --force
>>>>>
>>>>> for each tablet server. When you are done, set it back to the default
>>>>> 'true' value. This isn't something we routinely test (especially setting
>>>>> it without restarting the server), so please test before trying this on a
>>>>> production cluster.
>>>>>
>>>>>> Q2. redistribution goes on even if the failed tserver reconnected to
>>>>>> cluster. In my test cluster, it took 2 hours to distribute when a tserver
>>>>>> which has 3TB data was killed.
>>>>>>
>>>>>
>>>>> This seems slow. What's the speed of your network? How many nodes?
>>>>> How many tablet replicas were on the failed tserver, and were the replica
>>>>> sizes evenly balanced? Next time this happens, you might try monitoring
>>>>> with 'kudu ksck' to ensure there aren't additional problems in the
>>>>> cluster (admin guide on the ksck tool
>>>>> <https://github.com/apache/kudu/blob/master/docs/administration.adoc#ksck>
>>>>> ).
>>>>>
>>>>>
>>>>>> Q3. `--follower_unavailable_considered_failed_sec` can be changed
>>>>>> without restarting cluster?
>>>>>>
>>>>>
>>>>> The flag can be changed, but it comes with the same caveats as above:
>>>>>
>>>>>     kudu tserver set-flag <tserver-addr> follower_unavailable_considered_failed_sec 900 --force
>>>>>
>>>>>
>>>>> - Dan
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> Todd Lipcon
>>> Software Engineer, Cloudera
>>>
>>
>>
>
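
For anyone applying Dan's suggestion to many tablet servers, here is a minimal Python sketch of the "run it for each tablet server" step. It only wraps the `kudu tserver set-flag <tserver-addr> <flag> <value> --force` command quoted above. The host names in `TSERVERS` and the helper names `set_flag_command` and `set_flag_everywhere` are made up for illustration, and the script only prints the commands unless `execute=True` is passed. As Dan notes, the flag is not considered stable or supported, so try it on a test cluster first.

```python
import shlex
import subprocess

# Hypothetical tablet server addresses (assumption); replace with your own.
TSERVERS = ["tserver1.example.com:7050", "tserver2.example.com:7050"]


def set_flag_command(tserver_addr, flag, value):
    """Build the `kudu tserver set-flag` invocation quoted in the thread."""
    return ["kudu", "tserver", "set-flag", tserver_addr, flag, str(value), "--force"]


def set_flag_everywhere(tservers, flag, value, execute=False):
    """Print the command for each tablet server; run it only if execute=True."""
    commands = [set_flag_command(addr, flag, value) for addr in tservers]
    for cmd in commands:
        print(shlex.join(cmd))
        if execute:
            # check=True raises CalledProcessError if the kudu CLI fails.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # Before maintenance: stop evicting failed followers (dry run by default).
    set_flag_everywhere(TSERVERS, "evict_failed_followers", "false")
    # After maintenance: restore the default value.
    set_flag_everywhere(TSERVERS, "evict_failed_followers", "true")
```

The same helper covers the Q3 case, for example `set_flag_everywhere(TSERVERS, "follower_unavailable_considered_failed_sec", 900)`. Keeping the dry run as the default makes it easy to review the exact commands before touching a running cluster.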