Re: Handling stale jobs in multi-node OFBiz environment with auto-scaling enabled

Ankit Joshi Fri, 24 Apr 2026 07:54:30 -0700

Hi All,

A quick update: I've added use-case details, an implementation plan, a
sample case, and a possible workflow diagram to OFBIZ-13383
<https://issues.apache.org/jira/browse/OFBIZ-13383>. I will provide a
patch soon for further community review.

Looking forward to any suggestions/thoughts.

Thanks & Regards,
Ankit Joshi

On Mon, Apr 13, 2026 at 6:36 PM Ankit Joshi <[email protected]>
wrote:

> Thanks, Gil, for sharing your experience with this issue. It confirms how
> relevant the problem is and how often it is encountered.
>
> I've created OFBIZ-13383
> <https://issues.apache.org/jira/browse/OFBIZ-13383> to further share the
> implementation plan and related details there.
>
> Thanks & Regards,
> Ankit Joshi
>
> On Fri, Apr 10, 2026 at 7:18 PM gil.portenseigne <
> [email protected]> wrote:
>
>> Hello Ankit,
>>
>> We also ran into this issue and currently work around it with the
>> Kubernetes pre-stop hook, which lets a pod clean up everything it is
>> running before stopping, using a shell script and SQL.
>>
>> It is effective for the most part, but sometimes an instance picks up a
>> new job just as the pre-stop script finishes.
>>
>> We added a way for our instances to ask which pod ids are currently
>> running, so that they can clean up those remaining jobs. Now everything
>> works, but it is not very clean.
>>
>> I'm pleased to read your ideas for solving this issue, and I think they
>> go in the right direction.
>>
>> Nice one, thanks !
>>
>> [...]
>>
>> > As a *next* step, I think the out-of-the-box Job Poller should *itself*
>> > be able to validate and handle such stale jobs and re-assign them to
>> > another active node for further processing. For this, I propose
>> > implementing a *Lease + Heartbeat based job ownership* approach. This
>> > validation method will include 3 steps:
>> >
>> > *#1 Assigning the node as the Job owner*
>> > -- Assign the individual node identifier (instance-id) as the owner of
>> > all jobs it is running (*runByInstanceId*), along with a new custom field
>> > (*JobSandbox.leaseUpdatedStamp*) that will help the Job Poller track the
>> > last time the lease was updated by the node, confirming the node was
>> > still active at that time.
>> >
>> > *#2 Heartbeat / Lease Renewal*
>> > -- At a configured interval, the Job Poller running on each node will
>> > update the lease timestamp for the open/in-progress jobs that the node
>> > currently owns.
>> >
>> > *#3 Lease Expiry Validation*
>> > -- The Job Poller running on each active node will also periodically
>> > validate whether all the jobs owned by that node itself are actively
>> > updating their heartbeat within the specified threshold. Any job that
>> > fails to update its heartbeat within the given threshold will be
>> > considered owned by a stale node and will be eligible for recovery. The
>> > Job Poller will release the stale jobs it identifies, making them
>> > available for other active nodes to pick up.
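
A minimal in-memory sketch of the three steps above, only to make the
flow concrete. It is not OFBiz code: the Job class stands in for a
JobSandbox row, the "PENDING"/"RUNNING" statuses are simplified
placeholders, and the function names are hypothetical. Only
runByInstanceId and the proposed leaseUpdatedStamp come from the
proposal itself.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

LEASE_EXPIRY = timedelta(minutes=10)  # proposed lease expiry threshold


@dataclass
class Job:
    """Hypothetical stand-in for a JobSandbox row."""
    job_id: str
    status: str = "PENDING"                          # simplified placeholder status
    run_by_instance_id: Optional[str] = None         # mirrors runByInstanceId
    lease_updated_stamp: Optional[datetime] = None   # mirrors the proposed field


def claim(job, instance_id, now):
    """Step 1: record the node as owner and start the lease."""
    job.status = "RUNNING"
    job.run_by_instance_id = instance_id
    job.lease_updated_stamp = now


def renew_leases(jobs, instance_id, now):
    """Step 2: heartbeat, refreshing the lease on jobs this node owns."""
    for job in jobs:
        if job.status == "RUNNING" and job.run_by_instance_id == instance_id:
            job.lease_updated_stamp = now


def release_stale(jobs, now, expiry=LEASE_EXPIRY):
    """Step 3: release running jobs whose lease is older than the threshold."""
    released = []
    for job in jobs:
        stamp = job.lease_updated_stamp
        if job.status == "RUNNING" and stamp is not None and now - stamp > expiry:
            job.status = "PENDING"
            job.run_by_instance_id = None
            job.lease_updated_stamp = None
            released.append(job.job_id)
    return released


t0 = datetime(2026, 4, 10, 12, 0)
jobs = [Job("J1"), Job("J2")]
claim(jobs[0], "node-a", t0)
claim(jobs[1], "node-b", t0)

# node-b is scaled down right after t0; only node-a keeps sending heartbeats.
renew_leases(jobs, "node-a", t0 + timedelta(minutes=5))
renew_leases(jobs, "node-a", t0 + timedelta(minutes=10))

released = release_stale(jobs, t0 + timedelta(minutes=11))
print(released)  # ['J2']
```

In a real multi-node setup the release in step 3 would presumably need
to be a conditional database update, so that two pollers cannot release
or re-claim the same job at once. The sketch does not model that.
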
>> >
>> > *Proposed time frequency/intervals:*
>> > -- *Lease update interval*: every *5 minutes*
>> > -- *Lease expiry threshold*: *10 minutes*
>> > -- *Lease expiry validation*: every *8 minutes*
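
To see what these three values imply together, here is a small worked
calculation. It assumes a lease counts as expired only when its age is
strictly greater than the threshold, and that validation runs on a fixed
8-minute cycle. Both are my assumptions; the function name and the
"phase" parameter are hypothetical.

```python
from datetime import timedelta

LEASE_UPDATE_INTERVAL = timedelta(minutes=5)
LEASE_EXPIRY_THRESHOLD = timedelta(minutes=10)
LEASE_VALIDATION_INTERVAL = timedelta(minutes=8)


def release_delay(phase,
                  expiry=LEASE_EXPIRY_THRESHOLD,
                  validation=LEASE_VALIDATION_INTERVAL):
    """Delay between a dead node's last heartbeat and the first validation
    run that sees its lease as expired.

    phase: time from the last heartbeat to the next validation run,
    with 0 <= phase < validation.
    """
    run = phase
    while run <= expiry:  # lease not yet strictly older than the threshold
        run += validation
    return run


# Try every whole-minute offset of the validation cycle.
delays = [release_delay(timedelta(minutes=m)) for m in range(8)]
print(min(delays), max(delays))  # 0:11:00 0:18:00

# A live node renews every 5 minutes, well inside the 10-minute threshold.
print(LEASE_UPDATE_INTERVAL < LEASE_EXPIRY_THRESHOLD)  # True
```

Under these assumptions a stale job is released somewhere between just
over 10 minutes and 18 minutes (threshold plus one validation interval)
after the node's last heartbeat.
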
>> >
>> > *Points to consider*
>> > -- Each node should have a unique node identifier (runByInstanceId) that
>> > will help to track/validate the liveness of each individual node.
>> > -- The time intervals suggested above could also be made configurable
>> > via data.
>> >
>> > Looking forward to valuable thoughts on it. I'll create a Jira ticket
>> > for this and will update the details there according to the inputs.
>> >
>> > Thanks & Regards,
>> > Ankit Joshi
>>
>
