This is what we do: announce the deployment well ahead of time, allowing users to plan a switch to another cluster if desired.
Deploy:
- Leave topologies/workers running.
- Shut down Nimbus, upgrade it, and restart it. Validate that Nimbus is stable.
- In parallel, for subsets of N supervisors:
  - Shut down the N supervisors, upgrade them, kill the workers on each supervisor, and restart the supervisors.
  - Validate that the supervisors are back up.
  - Repeat until all supervisors are fully upgraded.

Usually Storm releases are compatible across versions and this works. At times, things can break. To mitigate that, you can do this pretesting:

1) Test in dev by deploying the old production release and running through the steps above while a topology runs on multiple supervisors and the supervisors are only partially upgraded. Validate that all workers stay up while only some supervisors are upgraded.
2) Have a staging cluster where users run their topologies on test data, so they can validate that their topologies run fine during the upgrade.

If all of this is done, you can have pretty decent confidence that the release won't break things. If the release will break things, announce full downtime and plan on users stopping all topologies right before the deploy and resubmitting once it is done.

Our deployments are all automated and run in parallel; the per-supervisor rolling step looks roughly like the sketch at the bottom of this mail.

On Fri, Sep 20, 2024 at 2:08 AM Karthick <ibmkarthickma...@gmail.com> wrote:

> Dear Apache Storm Community,
>
> I am currently managing an Apache Storm cluster with 38 nodes: 3 dedicated
> to ZooKeeper, 1 to Nimbus and the UI, and 34 nodes running Supervisor and
> Logviewer processes. Each node has 2 Workers.
>
> At present, our topology update process involves the following steps:
>
> 1. Killing the existing topology.
> 2. Changing dependency JARs under the external-lib dir and restarting
>    Nimbus.
> 3. Changing dependency JARs under the external-lib dir and restarting
>    Supervisors.
> 4. Submitting the new topology.
>
> Each operation takes about 2-3 minutes. As the number of Supervisor nodes
> increases, the overall time for topology updates is becoming a concern.
>
> I am reaching out to seek advice on how to optimize this process, as I
> believe there are more efficient ways to handle topology updates in
> large-scale Storm deployments. Specifically:
>
> - Is there a more efficient process to handle code changes without
>   having to manually restart Nimbus and Supervisors?
> - How can I reduce the overall time for topology updates, especially as
>   our cluster continues to grow?
> - Are there industry-standard practices for implementing rolling updates
>   or automating the deployment process?
>
> Any insights, recommendations, or best practices that could help streamline
> our update process would be greatly appreciated.
>
> Thank you for your time, and I look forward to your suggestions!
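
For what it's worth, here is a minimal Python sketch of how the per-supervisor rolling step could be scripted. It assumes SSH access to the supervisor hosts, a systemd unit named storm-supervisor, and an upgrade script staged at /opt/storm/upgrade.sh; all of those names are stand-ins for whatever your environment actually uses.

    #!/usr/bin/env python3
    # Rough sketch of the per-supervisor rolling upgrade described above.
    # Assumptions (all hypothetical, adjust to your environment): supervisors
    # are reachable over SSH, run as a systemd unit named "storm-supervisor",
    # and the actual upgrade is a script already staged at /opt/storm/upgrade.sh.
    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    SUPERVISORS = ["sup-%02d.example.com" % i for i in range(1, 35)]
    BATCH_SIZE = 5  # N supervisors taken down and upgraded in parallel

    def upgrade_supervisor(host):
        # Stop the supervisor, kill its worker JVMs, upgrade, restart.
        # The pkill pattern is illustrative; match however your workers are launched.
        steps = [
            "sudo systemctl stop storm-supervisor",
            "(sudo pkill -f 'org.apache.storm.daemon.worker' || true)",
            "sudo /opt/storm/upgrade.sh",
            "sudo systemctl start storm-supervisor",
        ]
        # check=True makes a failure on any host raise and stop the rollout.
        subprocess.run(["ssh", host, " && ".join(steps)], check=True)
        return host

    for start in range(0, len(SUPERVISORS), BATCH_SIZE):
        group = SUPERVISORS[start:start + BATCH_SIZE]
        with ThreadPoolExecutor(max_workers=BATCH_SIZE) as pool:
            for host in pool.map(upgrade_supervisor, group):
                print("upgraded", host)
        # Validate the batch here (e.g. poll the Storm UI REST endpoint
        # /api/v1/supervisor/summary) before touching the next subset, so a
        # bad release never takes out more than N supervisors at once.

Keeping N small limits how many workers get rescheduled at any one time, so topologies keep running on the supervisors that have not been touched yet.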