RE: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart

Ligade, Shailesh [USA] Thu, 02 Dec 2021 05:00:24 -0800

Thanks for the detailed steps! Really appreciated.

Just curious: since that setting (table.suspend.duration) is not working for me 
in Accumulo 1.10.0, can I just stop both masters and then restart the tservers 
one at a time (or all at once)? Will that speed up the restart without getting 
into this offline-tablet situation or risking data loss? I can stop the ingest, 
flush the tables, and then bring down the masters…


We can take a short downtime, and my understanding is that the master is the 
one keeping track of the tservers and the offline tablets. So just curious…

Thanks again

-S

From: dev1 <[email protected]>
Sent: Monday, November 29, 2021 2:56 PM
To: '[email protected]' <[email protected]>
Subject: [External] RE: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart

I believe the property is table.suspend.duration (not tablet.suspended.duration 
as you have in this email) – but the shell should have thrown an error saying 
the property cannot be set in zookeeper if you had it wrong.

What do you mean by:

but when i issued restart tserver (one at a time without waiting for first to 
come up)

I’m assuming the requirement is to keep the cluster up and serving users 
without major disruption – not to rip through the restart as fast as possible.  
With 6 – 8 nodes you should still be able to do this in under an hour.  If you 
had a much larger cluster then the concept is the same but you would want to 
use some number of tservers that is a fraction of the total available that 
would be cycled at any given point in time.
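
One way to pick those groups on a larger cluster, as a minimal sketch: split the host list into batches no larger than a chosen fraction of the cluster. The function name, the 10% default, and the host names below are illustrative assumptions, not anything provided by Accumulo.

```python
import math
from typing import List


def restart_batches(hosts: List[str], fraction: float = 0.1) -> List[List[str]]:
    """Split hosts into groups of at most `fraction` of the cluster (minimum 1 host)."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    size = max(1, math.floor(len(hosts) * fraction))
    return [hosts[i:i + size] for i in range(0, len(hosts), size)]


# Hypothetical host names: with 8 tservers and the 10% default,
# each batch ends up holding a single tserver.
small = [f"tserver{n}" for n in range(1, 9)]
print(restart_batches(small))
```

On a 6 to 8 node cluster this degenerates to one tserver at a time, which matches the conservative approach; the fraction only starts to matter once the cluster is large enough that one-at-a-time would take too long.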

In general the way that I would do a conservative, rolling restart:


  1.  [optional] pause ingest – or be prepared for recovering any failed 
ingests if they occur.
  2.  [optional] Flush tables that have continuous ingest using the wait option 
– this should help minimize recovery.
  3.  Set the table.suspend.duration
  4.  For each tserver – one (or a small group for large cluster) at a time
     *   Stop the tserver
     *   Pause long enough that ZooKeeper recognizes the lost connection
     *   Restart the tserver
     *   Pause to allow for any recovery
  5.  Reset the table.suspend.duration back to 0s (the default)
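
The steps above could be scripted roughly as follows. This is only a sketch that builds the list of commands without running anything. The systemd unit name `accumulo-tserver`, the `ssh`/`sudo` invocation, the shell user, the 5m suspend value, and both pause lengths are assumptions to adjust for your environment; pausing ingest is site-specific and is left out. The shell commands themselves (`config -s` and `flush -t <table> -w`) are the ones discussed in this thread.

```python
from typing import List, Sequence


def rolling_restart_plan(
    batches: Sequence[Sequence[str]],
    flush_tables: Sequence[str] = (),
    suspend: str = "5m",
    unit: str = "accumulo-tserver",  # assumed systemd unit name
    zk_pause_s: int = 30,            # assumed: time for ZooKeeper to notice the loss
    recovery_pause_s: int = 60,      # assumed: tune while watching the master log
) -> List[str]:
    """Return the ordered commands for a conservative rolling restart (dry run)."""
    # Assumed invocation; the shell prompts for a password when -p is omitted.
    shell = "accumulo shell -u root -e"
    # Step 2: flush the continuously ingested tables and wait for completion.
    plan = [f'{shell} "flush -t {table} -w"' for table in flush_tables]
    # Step 3: delay tablet reassignment while tservers are briefly down.
    plan.append(f'{shell} "config -s table.suspend.duration={suspend}"')
    # Step 4: cycle one tserver (or one small group) at a time.
    for batch in batches:
        plan += [f"ssh {host} sudo systemctl stop {unit}" for host in batch]
        plan.append(f"sleep {zk_pause_s}")
        plan += [f"ssh {host} sudo systemctl start {unit}" for host in batch]
        plan.append(f"sleep {recovery_pause_s}")
    # Step 5: always finish by restoring the default.
    plan.append(f'{shell} "config -s table.suspend.duration=0s"')
    return plan


# Hypothetical hosts and table name, one tserver per batch.
for command in rolling_restart_plan([["tserver1"], ["tserver2"]], flush_tables=["mytable"]):
    print(command)
```

Printing the plan first makes it easy to review before running each line by hand or feeding it to a shell.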

If you tail the master / manager debug log you should get a good idea of what 
is going on – there should be messages showing the tserver leaving and then 
rejoining and any other activity related to recovery.  With a rolling restart 
the idea is to keep the cluster up and serving tables – only one (or a few) 
tservers go offline and for a short duration (generally less than a minute) and 
between each tserver restart, time is allowed for things to stabilize.


From: Shailesh Ligade <[email protected]<mailto:[email protected]>>
Sent: Monday, November 29, 2021 11:17 AM
To: [email protected]<mailto:[email protected]>
Subject: Re: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart


Uhmm, I updated the setting tablet.suspended.duration to 5m

config -s tablet.suspended.duration=5m

but when i issued restart tserver (one at a time without waiting for first to 
come up), I still get all tablets unassigned 🙁 Maybe I need to bring the 
masters down first?

btw this is for accumulo 1.10.0

am I missing anything?

-S
________________________________
From: Shailesh Ligade <[email protected]<mailto:[email protected]>>
Sent: Monday, November 29, 2021 10:35 AM
To: [email protected]<mailto:[email protected]> 
<[email protected]<mailto:[email protected]>>
Subject: Re: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart

Thanks Michael,

Stop the cluster using admin stop? The issue is that, since we are using 
systemd with Restart=always, it interferes with any of those stop 
commands/scripts (stop-all, stop-here, etc.). So either we have to modify the 
systemd settings or maybe just do a shut-down-the-VM type of operation (I think 
that is a little brutal).

-S
________________________________
From: Michael Wall <[email protected]<mailto:[email protected]>>
Sent: Monday, November 29, 2021 9:54 AM
To: [email protected]<mailto:[email protected]> 
<[email protected]<mailto:[email protected]>>
Subject: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart

Is there a reason to not just stop the cluster, reset the heap and restart the 
cluster?  That is simpler.

On Mon, Nov 29, 2021 at 9:37 AM dev1 
<[email protected]<mailto:[email protected]>> wrote:

Yes – and don’t forget to reset it back when you are done.



From: Ligade, Shailesh [USA] 
<[email protected]<mailto:[email protected]>>
Sent: Monday, November 29, 2021 9:36 AM
To: [email protected]<mailto:[email protected]>
Subject: RE: accumulo tserver rolling restart



Thanks,



I am assuming I can set that property using shell and it will take effect 
immediately?



Thanks



-S



From: dev1 <[email protected]<mailto:[email protected]>>
Sent: Monday, November 29, 2021 9:25 AM
To: '[email protected]<mailto:[email protected]>' 
<[email protected]<mailto:[email protected]>>
Subject: [External] RE: accumulo tserver rolling restart



See 
https://accumulo.apache.org/1.10/accumulo_user_manual.html#_restarting_process_on_a_node
 – A note on rolling restarts.



There is a property that can be set (table.suspend.duration) that will delay 
the reassignment while a tserver is restarting – the trade-off is that the data 
is not available in the meantime, so try to minimize the time the tserver is 
off-line.



From: Ligade, Shailesh [USA] 
<[email protected]<mailto:[email protected]>>
Sent: Monday, November 29, 2021 9:19 AM
To: [email protected]<mailto:[email protected]>
Subject: accumulo tserver rolling restart



Hello,



I want to restart all the tservers, say after I updated the tserver heap size. 
Since we are using systemd, I can issue a restart command on a tserver. This 
causes all sorts of tablet movement even though Accumulo is down for maybe a 
second. If I wait for the unassigned tablets to drop to 0 before restarting the 
next tserver, then completely restarting a small cluster (6-8 nodes) takes 
hours (roughly 4k+ tablets per tserver).



What would be the right way to perform such a routine maintenance operation? Is 
there a delay setting we can change so that it will not move tablets around? 
What would be a safe delay value?



-S
