Whoops, I meant it should land in time for 1.4.

- Dan
On Mon, May 22, 2017 at 12:32 PM, Dan Burkert <danburk...@apache.org> wrote:

Thanks for the info, Jason. I spent some more time looking at this today and confirmed that the patch is working as intended. I've updated the commit message with more information about the failure that was occurring, in case you were interested. I expect this fix will land in time for 1.5.

- Dan

On Sat, May 20, 2017 at 8:47 PM, Jason Heo <jason.heo....@gmail.com> wrote:

Hi.

I'm not sure how best to explain this.

1. Re-replication time was reduced from 20 hours to 2 hours 40 minutes.

Here are some charts.

Before applying the patch:

- Total Tablet Size: http://i.imgur.com/QtT2sH4.png
- Network & Disk Usage: http://i.imgur.com/m4gj6p2.png (started at 10 am, ended at 6 am the next day)

After applying the patch:

- Total Tablet Size: http://i.imgur.com/7RmWQA4.png
- Network & Disk Usage: http://i.imgur.com/Jd7q8iY.png

2. By the way, before applying the patch I saw many "already in progress" messages in the Kudu master log file:

    delete failed for tablet 'tablet_id' with error code TABLET_NOT_RUNNING: Illegal state: State transition of tablet 'tablet_id' already in progress: copying tablet

After applying it, there were no such messages.

3. Before applying the patch I was using Kudu 1.3.0, and I upgraded to 1.4 with the patch applied.

Thanks.

2017-05-21 0:02 GMT+09:00 Dan Burkert <danburk...@apache.org>:

Hey Jason,

What effect did you see with that patch applied? I've had mixed results with it in my failover tests - it hasn't resolved some of the issues that I expected it would, so I'm still looking into it. Any feedback you have on it would be appreciated.
- Dan

On Fri, May 19, 2017 at 10:07 PM, Jason Heo <jason.heo....@gmail.com> wrote:

Thanks, @dan @Todd

This issue has been resolved via https://gerrit.cloudera.org/#/c/6925/

Regards,

Jason

2017-05-09 4:55 GMT+09:00 Todd Lipcon <t...@cloudera.com>:

Hey Jason,

Sorry for the delayed response here. From your ksck output it looks like copying is ongoing but hasn't yet finished.

FWIW, Will B is working on adding more informative output to ksck to help diagnose cases like this: https://gerrit.cloudera.org/#/c/6772/

-Todd

On Thu, Apr 13, 2017 at 11:35 PM, Jason Heo <jason.heo....@gmail.com> wrote:

@Dan

I monitored with `kudu ksck` while re-replication was occurring, but I'm not sure whether this output means my cluster has a problem. (It seems to just indicate that one tserver stopped.)

Would you please check it?

Thanks,

Jason

```
...
...
Tablet 0e29XXXXXXXXXXXXXXX1e1e3168a4d81 of table 'impala::tbl1' is under-replicated: 1 replica(s) not RUNNING
  a7ca07f9bXXXXXXXXXXXXXXXbbb21cfb (hostname.com:7050): RUNNING
  a97644XXXXXXXXXXXXXXXdb074d4380f (hostname.com:7050): RUNNING [LEADER]
  401b6XXXXXXXXXXXXXXX5feda1de212b (hostname.com:7050): missing

Tablet 550XXXXXXXXXXXXXXX08f5fc94126927 of table 'impala::tbl1' is under-replicated: 1 replica(s) not RUNNING
  aec55b4XXXXXXXXXXXXXXXdb469427cf (hostname.com:7050): RUNNING [LEADER]
  a7ca07f9b3d94XXXXXXXXXXXXXXX1cfb (hostname.com:7050): RUNNING
  31461XXXXXXXXXXXXXXX3dbe060807a6 (hostname.com:7050): bad state
    State:       NOT_STARTED
    Data state:  TABLET_DATA_READY
    Last status: Tablet initializing...

Tablet 4a1490fcXXXXXXXXXXXXXXX7a2c637e3 of table 'impala::tbl1' is under-replicated: 1 replica(s) not RUNNING
  a7ca07f9b3d94414XXXXXXXXXXXXXXXb (hostname.com:7050): RUNNING
  40XXXXXXXXXXXXXXXd5b5feda1de212b (hostname.com:7050): RUNNING [LEADER]
  aec55b4e2acXXXXXXXXXXXXXXX9427cf (hostname.com:7050): bad state
    State:       NOT_STARTED
    Data state:  TABLET_DATA_COPYING
    Last status: TabletCopy: Downloading block 0000000005162382 (277/581)
...
...
==================
Errors:
==================
table consistency check error: Corruption: 52 table(s) are bad

FAILED
Runtime error: ksck discovered errors
```

2017-04-13 3:47 GMT+09:00 Dan Burkert <danburk...@apache.org>:

Hi Jason, answers inline:

On Wed, Apr 12, 2017 at 5:53 AM, Jason Heo <jason.heo....@gmail.com> wrote:

> Q1. Can I disable redistributing tablets on failure of a tserver? The reason why I need this is described in Background.

We don't have any kind of built-in maintenance mode that would prevent this, but it can be achieved by setting a flag on each of the tablet servers. The goal is not to disable re-replicating tablets, but instead to avoid kicking the failed replica out of the tablet groups to begin with. There is a config flag to control exactly that: 'evict_failed_followers'. This isn't considered a stable or supported flag, but it should have the effect you are looking for if you set it to false on each of the tablet servers, by running:

    kudu tserver set-flag <tserver-addr> evict_failed_followers false --force

for each tablet server. When you are done, set it back to the default 'true' value.
This isn't something we routinely test (especially setting it without restarting the server), so please test before trying this on a production cluster.

> Q2. Redistribution goes on even if the failed tserver reconnects to the cluster. In my test cluster, it took 2 hours to redistribute when a tserver holding 3TB of data was killed.

This seems slow. What's the speed of your network? How many nodes? How many tablet replicas were on the failed tserver, and were the replica sizes evenly balanced? Next time this happens, you might try monitoring with 'kudu ksck' to ensure there aren't additional problems in the cluster (see the admin guide on the ksck tool: https://github.com/apache/kudu/blob/master/docs/administration.adoc#ksck).

> Q3. Can `--follower_unavailable_considered_failed_sec` be changed without restarting the cluster?

The flag can be changed, but it comes with the same caveats as above:

    kudu tserver set-flag <tserver-addr> follower_unavailable_considered_failed_sec 900 --force

- Dan

--
Todd Lipcon
Software Engineer, Cloudera
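The maintenance-mode recipe discussed in the thread (flip `evict_failed_followers` off on every tablet server, do the maintenance, flip it back, then verify with ksck) can be scripted. The sketch below is not from the thread: the tserver and master addresses are placeholders, and the `DRY_RUN` wrapper is an added safety measure so the commands can be reviewed before touching a real cluster (set `DRY_RUN=0` to actually execute them). Dan's caveat applies: this flag is unstable and unsupported, so test outside production first.

```shell
#!/usr/bin/env bash
# Sketch: temporarily prevent Kudu from evicting replicas of a down tserver
# during planned maintenance. Hostnames below are placeholders.
set -euo pipefail

TSERVERS=("tserver1.example.com:7050" "tserver2.example.com:7050" "tserver3.example.com:7050")
MASTERS="master1.example.com,master2.example.com,master3.example.com"
DRY_RUN="${DRY_RUN:-1}"   # default to printing commands instead of running them

run() {
  if [[ "$DRY_RUN" == "1" ]]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

# Before maintenance: keep failed followers in their Raft configs.
for ts in "${TSERVERS[@]}"; do
  run kudu tserver set-flag "$ts" evict_failed_followers false --force
  # A milder alternative (Q3 in the thread) is to raise the failure
  # timeout instead of disabling eviction entirely:
  # run kudu tserver set-flag "$ts" follower_unavailable_considered_failed_sec 900 --force
done

# ... perform the maintenance on the tserver here ...

# After maintenance: restore the default behaviour.
for ts in "${TSERVERS[@]}"; do
  run kudu tserver set-flag "$ts" evict_failed_followers true --force
done

# Finally, verify cluster health as suggested in the thread.
run kudu cluster ksck "$MASTERS"
```

Note that `set-flag` changes are not persistent: a tserver restart reverts the flag to whatever its config file or command line specifies, which is one more reason to re-check with ksck once maintenance is done.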