I believe the property is table.suspend.duration (not tablet.suspended.duration 
as you have in this email) – but the shell should have thrown an error saying 
the property cannot be set in zookeeper if you had it wrong.

What do you mean by:

but when i issued restart tserver (one at a time without waiting for first to 
come up)

I’m assuming the requirement is to keep the cluster up and serving users 
without major disruption – not to rip through the restart as fast as possible.  
With 6 – 8 nodes you should still be able to do this in under an hour.  If you 
had a much larger cluster then the concept is the same but you would want to 
use some number of tservers that is a fraction of the total available that 
would be cycled at any given point in time.

In general the way that I would do a conservative, rolling restart:


  1.  [optional] pause ingest – or be prepared for recovering any failed 
ingests if they occur.
  2.  [optional] Flush tables that have continuous ingest using the wait option 
– this should help minimize recovery.
  3.  Set the table.suspend.duration
  4.  For each tserver – one (or a small group for large cluster) at a time
     *   Stop the tserver
     *   Pause long enough that ZooKeeper recognizes the lost connection
     *   Restart the tserver
     *   Pause to allow for any recovery
  5.  Reset the table.suspend.duration back to 0s (the default)

If you tail the master / manager debug log you should get a good idea of what 
is going on – there should be messages showing the tserver leaving and then 
rejoining and any other activity related to recovery.  With a rolling restart 
the idea is to keep the cluster up and serving tables – only one (or a few) 
tservers go offline and for a short duration (general less than a minute) and 
between each tserver restart, time is allowed for things to stabilize.


From: Shailesh Ligade <slig...@fbi.gov>
Sent: Monday, November 29, 2021 11:17 AM
To: user@accumulo.apache.org
Subject: Re: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart


Uhmm updated the setting tablet.suspended.duration to 5m

config -s tablet.suspended.duration=5m

but when i issued restart tserver (one at a time without waiting for first to 
come up), i still get all tablets unassigned 🙁 may be, I need to bring masters 
down first?

btw this is for accumulo 1.10.0

am I missing anything?

-S
________________________________
From: Shailesh Ligade <slig...@fbi.gov<mailto:slig...@fbi.gov>>
Sent: Monday, November 29, 2021 10:35 AM
To: user@accumulo.apache.org<mailto:user@accumulo.apache.org> 
<user@accumulo.apache.org<mailto:user@accumulo.apache.org>>
Subject: Re: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart

Thanks Michael,

stop cluster using admin stop? The issue is that, since we are using systemd 
with restart=always, it interferes with any of those stop (stop-all, stop-here 
etc) commands/scripts. So either we have to modify systemd settings or may be 
just shutdown vm type of operation (i think that is little brutal)

-S
________________________________
From: Michael Wall <mjw...@gmail.com<mailto:mjw...@gmail.com>>
Sent: Monday, November 29, 2021 9:54 AM
To: user@accumulo.apache.org<mailto:user@accumulo.apache.org> 
<user@accumulo.apache.org<mailto:user@accumulo.apache.org>>
Subject: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart

Is there a reason to not just stop the cluster, reset the heap and restart the 
cluster?  That is simpler.

On Mon, Nov 29, 2021 at 9:37 AM dev1 
<d...@etcoleman.com<mailto:d...@etcoleman.com>> wrote:

Yes – and don’t forget to reset it back when you are done.



From: Ligade, Shailesh [USA] 
<ligade_shail...@bah.com<mailto:ligade_shail...@bah.com>>
Sent: Monday, November 29, 2021 9:36 AM
To: user@accumulo.apache.org<mailto:user@accumulo.apache.org>
Subject: RE: accumulo tserver rolling restart



Thanks,



I am assuming I can set that property using shell and it will take effect 
immediately?



Thanks



-S



From: dev1 <d...@etcoleman.com<mailto:d...@etcoleman.com>>
Sent: Monday, November 29, 2021 9:25 AM
To: 'user@accumulo.apache.org<mailto:user@accumulo.apache.org>' 
<user@accumulo.apache.org<mailto:user@accumulo.apache.org>>
Subject: [External] RE: accumulo tserver rolling restart



See 
https://accumulo.apache.org/1.10/accumulo_user_manual.html#_restarting_process_on_a_node<https://usg02.safelinks.protection.office365.us/?url=https%3A%2F%2Furldefense.com%2Fv3%2F__https%3A%2Faccumulo.apache.org%2F1.10%2Faccumulo_user_manual.html*_restarting_process_on_a_node__%3BIw!!May37g!evyseDphy3PM_d8-tSlk89Sw1fFlSXHtH7vhiQedtcADc_P7OLEHw2kVZjlQ4Q8G_Q%24&data=04%7C01%7CSLIGADE%40FBI.GOV%7C363899b757914815738508d9b34de39b%7C022914a9b95f4b7bbace551ce1a04071%7C0%7C0%7C637737969389540183%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=p%2FOeqj%2BgzX5PV4H%2Bd3TluGSvACs2CERSRhwEnifXX1c%3D&reserved=0>
 – A note on rolling restarts.



There is property that can be set (table.suspend.duration) that will delay the 
reassignment while a tserver is restarting – there is a trade-off on the data 
not being available so try to minimize the time the tserver is off-line.



From: Ligade, Shailesh [USA] 
<ligade_shail...@bah.com<mailto:ligade_shail...@bah.com>>
Sent: Monday, November 29, 2021 9:19 AM
To: user@accumulo.apache.org<mailto:user@accumulo.apache.org>
Subject: accumulo tserver rolling restart



Hello,



I want to restart al the tservers, say I updated the tserver heap size. Since 
we ar eusing system, I can issue restart command on a tserver. This causes all 
sorts of tablet movements even though accumulo is down for may be a second. If 
I wait for all unassigned tables to become 0, then to restart next tserver, 
then to completely restart a small cluster (6-8 nodes) take hours (roughly 4k+ 
tablets per tserver)



What may be right way to perform such routine maintenance operation? Is there a 
delay setting we can change so that it will not move tablets around? What may 
be a safe delay value?



-S

Reply via email to