This is what we do: announce the deployment well ahead of time, allowing users to plan a switch to another cluster if desired.
Deploy:
- Leave topologies/workers running.
- Shut down Nimbus, upgrade it, and restart it. Validate that Nimbus is stable.
- In parallel, for subsets of N supervisors:
  - Shut down the N supervisors, upgrade them, kill the workers on each supervisor, and restart the supervisors.
  - Validate that the supervisors are back up.
  - Repeat until all supervisors are fully upgraded.

Usually Storm releases are compatible across versions and this works. At times, things can break. To mitigate that, you can do this pretesting:

1) Test in dev by deploying the old production release and running through the steps above while a topology runs on multiple supervisors and the supervisors are only partially upgraded. Validate that all workers stay up while only some supervisors are upgraded.
2) Have a staging cluster where users run their topologies on test data, so they can validate that their topologies run fine during the upgrade.

If all of this is done, you can have pretty decent confidence that the release won't break things. If the release will break things, announce full downtime and plan on users stopping all topologies right before the deploy and resubmitting once it is done.

Our deployments are all automated and run in parallel; the per-supervisor rolling step looks roughly like the sketch at the bottom of this mail.

On Fri, Sep 20, 2024 at 2:08 AM Karthick <ibmkarthickma...@gmail.com> wrote:

> Dear Apache Storm Community,
>
> I am currently managing an Apache Storm cluster with 38 nodes: 3 dedicated
> to ZooKeeper, 1 to Nimbus and the UI, and 34 nodes running Supervisor and
> Logviewer processes. Each node has 2 Workers.
>
> At present, our topology update process involves the following steps:
>
> 1. Killing the existing topology.
> 2. Changing dependency JARs under the external-lib dir and restarting
>    Nimbus.
> 3. Changing dependency JARs under the external-lib dir and restarting
>    Supervisors.
> 4. Submitting the new topology.
>
> Each operation takes about 2-3 minutes. As the number of Supervisor nodes
> increases, the overall time for topology updates is becoming a concern.
>
> I am reaching out to seek advice on how to optimize this process, as I
> believe there are more efficient ways to handle topology updates in
> large-scale Storm deployments. Specifically:
>
> - Is there a more efficient process to handle code changes without
>   having to manually restart Nimbus and Supervisors?
> - How can I reduce the overall time for topology updates, especially as
>   our cluster continues to grow?
> - Are there industry-standard practices for implementing rolling updates
>   or automating the deployment process?
>
> Any insights, recommendations, or best practices that could help streamline
> our update process would be greatly appreciated.
>
> Thank you for your time, and I look forward to your suggestions!
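
For what it's worth, here is a minimal Python sketch of how the per-supervisor rolling step could be scripted. It assumes SSH access to the supervisor hosts, a systemd unit named storm-supervisor, and an upgrade script staged at /opt/storm/upgrade.sh; all of those names are stand-ins for whatever your environment actually uses.

    #!/usr/bin/env python3
    # Rough sketch of the per-supervisor rolling upgrade described above.
    # Assumptions (all hypothetical, adjust to your environment): supervisors
    # are reachable over SSH, run as a systemd unit named "storm-supervisor",
    # and the actual upgrade is a script already staged at /opt/storm/upgrade.sh.
    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    SUPERVISORS = ["sup-%02d.example.com" % i for i in range(1, 35)]
    BATCH_SIZE = 5  # N supervisors taken down and upgraded in parallel

    def upgrade_supervisor(host):
        # Stop the supervisor, kill its worker JVMs, upgrade, restart.
        # The pkill pattern is illustrative; match however your workers are launched.
        steps = [
            "sudo systemctl stop storm-supervisor",
            "(sudo pkill -f 'org.apache.storm.daemon.worker' || true)",
            "sudo /opt/storm/upgrade.sh",
            "sudo systemctl start storm-supervisor",
        ]
        # check=True makes a failure on any host raise and stop the rollout.
        subprocess.run(["ssh", host, " && ".join(steps)], check=True)
        return host

    for start in range(0, len(SUPERVISORS), BATCH_SIZE):
        group = SUPERVISORS[start:start + BATCH_SIZE]
        with ThreadPoolExecutor(max_workers=BATCH_SIZE) as pool:
            for host in pool.map(upgrade_supervisor, group):
                print("upgraded", host)
        # Validate the batch here (e.g. poll the Storm UI REST endpoint
        # /api/v1/supervisor/summary) before touching the next subset, so a
        # bad release never takes out more than N supervisors at once.

Keeping N small limits how many workers get rescheduled at any one time, so topologies keep running on the supervisors that have not been touched yet.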