Why are you specifically calling out pausing migrations?

Other than suspending migrations, what you have outlined is essentially the 
procedure that can be followed today.  I'm not sure pausing migrations would 
have much impact, as long as the suspend duration is increased.

Overall, I think we should be striving to have services that are designed to 
be crash-only. That way a rolling restart is not a special case - just a series 
of planned crashes.  The suspend duration may be a special provision to limit 
churn, but it does slow recovery - so there is a trade-off there. Another 
trade-off would be flushing tablets of a tserver before the restart - this goes 
against the crash-only philosophy, but flushing can minimize the recovery 
necessary.  Working towards minimizing recovery would have benefits for the 
system in general, not just to support rolling restarts.

Potential issues that may need to be addressed:

- What do you do about ingest? You'd need to account for both bulk and 
continuous ingest.  Stopping ingest for the entirety of the procedure might not 
be desired, but allowing it to continue would likely have similar impacts to 
allowing migrations to continue.  Systems that perform a lot of continuous 
ingest would also likely benefit from flushing if ingest is not paused.
- What about compactions? 
- Restarting the tservers likely needs to be handled outside of Accumulo - 
services are managed in too many ways to account for every variation.  We could 
provide examples, but ultimately cluster users would need to tailor systemd, or 
whatever they happen to use, to their needs.
- The duration of the restart is very user dependent. Some could decide that a 
very slow walk would be "best" to minimize possible impacts to user scans, 
while others could opt to just rip off the band-aid - user scans would be more 
likely to be impacted, but over a smaller, defined window.  Some may decide 
that it should be completed within an hour, others might decide that completion 
within a single shift is acceptable, and others may want to stretch it out even 
further.
- Do you want to make special provisions for tservers that are hosting the root 
and metadata tablet(s)? If you identify those servers, you can elect to do them 
first so that they are out of the way - or do them last, or maybe it does not 
matter?  These tablets are the ones most likely to benefit from flushing before 
the restart, to keep recovery to a minimum. Depending on settings, flushing the 
metadata table may really help - for example, on a very active system with long 
periods between gc runs, and depending on the gc flush / compaction settings.  
The metadata should recover without any special provisions, but there are 
opportunities to speed up the process.
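
To make the pacing and ordering questions above concrete, here is a minimal 
sketch of how a restart plan might be laid out. It is only an illustration and 
makes no Accumulo calls: the function name, the RestartStep type, and the 
flush_first hint are hypothetical, and it assumes you already know which hosts 
carry the root and metadata tablets. Actually stopping, flushing, and starting 
a tserver is left to whatever service manager the cluster uses.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List, Set


@dataclass(frozen=True)
class RestartStep:
    """One planned tserver restart (hypothetical type, for illustration only)."""
    host: str
    start_offset: timedelta  # when to begin, relative to the start of the window
    flush_first: bool        # hint: flush hosted tablets before stopping this host


def plan_rolling_restart(
    hosts: List[str],
    metadata_hosts: Set[str],
    window: timedelta,
    metadata_first: bool = True,
) -> List[RestartStep]:
    """Spread restarts evenly over `window`, with root/metadata hosts first or last."""
    special = [h for h in hosts if h in metadata_hosts]
    regular = [h for h in hosts if h not in metadata_hosts]
    ordered = special + regular if metadata_first else regular + special
    if not ordered:
        return []

    # Even spacing: a short window means "rip off the band-aid",
    # a long window means a slow walk with less churn at any one time.
    interval = window / len(ordered)
    return [
        RestartStep(
            host=h,
            start_offset=interval * i,
            flush_first=h in metadata_hosts,
        )
        for i, h in enumerate(ordered)
    ]


if __name__ == "__main__":
    plan = plan_rolling_restart(
        hosts=["ts1", "ts2", "ts3", "ts4"],
        metadata_hosts={"ts3"},
        window=timedelta(hours=1),
    )
    for step in plan:
        print(step.host, step.start_offset, "flush" if step.flush_first else "")
```

Changing the window from an hour to a full shift only changes the spacing 
between hosts, and metadata_first=False moves the root / metadata hosts to the 
end of the plan.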

Ed Coleman

-----Original Message-----
From: Adam Lerman <aler...@gmail.com> 
Sent: Monday, May 9, 2022 6:37 AM
To: dev@accumulo.apache.org
Subject: Rolling Update for patches

Accumulo Devs --

Wanted to put a feeler out if there was interest in adding a method for rolling 
updates to accumulo, especially for patch updates. I would love to see this 
adopted in the future so that patch updates could be applied with no downtime 
for the cluster.

My general thought would be:

1) put the system in upgrade mode (shell command)
           - suspend migrations, increase tserver.suspend.duration
2) Manually update and roll tservers -- slowly so as not to cause too much churn
3) Bounce the common services (manager, gc, etc)
4) Verify all looks good
5) take system out of upgrade mode

Does anyone have any thoughts about adding something like this?

Thanks!

Adam
