Hi All,

Just to update on this, I've updated OFBIZ-13383
<https://issues.apache.org/jira/browse/OFBIZ-13383> with use-case details,
an implementation plan, a sample case, and a possible workflow diagram.
I will work to provide the patch soon for further community review.

Looking forward to any suggestions/thoughts.

Thanks & Regards,
Ankit Joshi

On Mon, Apr 13, 2026 at 6:36 PM Ankit Joshi <[email protected]> wrote:

> Thanks Gil for sharing your experience with this issue, which shows how
> relevant it is and how often it is encountered.
>
> I've created OFBIZ-13383
> <https://issues.apache.org/jira/browse/OFBIZ-13383> to share the
> implementation plan and related details there.
>
> Thanks & Regards,
> Ankit Joshi
>
> On Fri, Apr 10, 2026 at 7:18 PM gil.portenseigne <[email protected]> wrote:
>
>> Hello Ankit,
>>
>> We also met the issue and currently solve it using the Kubernetes
>> pre-stop feature, so that a pod cleans up everything that is running
>> before stopping, using a shell script and SQL.
>>
>> It is effective for the most part, but sometimes an instance picks up
>> a new job at the end of the pre-stop script.
>>
>> We added a way for our instances to ask which pod ids are currently
>> running, in order to clean up those remaining jobs. Now everything is
>> ok, but not so clean.
>>
>> I'm pleased to read your ideas to solve this issue, and I think it
>> goes the right way.
>>
>> Nice one, thanks!
>>
>> [...]
>>
>> > As a *next* step, I think the out-of-the-box Job Poller should
>> > *itself* be able to validate and handle such stale jobs and
>> > re-assign them to another active node for further processing. For
>> > this, I propose implementing a *Lease + Heartbeat based job
>> > ownership* approach. This validation method will include 3 steps:
>> >
>> > *#1 Assigning the node as the Job owner*
>> > -- Assign the individual node identifier (instance-id) as the owner
>> > for all jobs it is running (*runByInstanceId*), along with a new
>> > custom field (*JobSandbox.leaseUpdatedStamp*) that will help the Job
>> > Poller track the last time the lease was updated by the node,
>> > confirming the node was still active at that time.
>> >
>> > *#2 Heartbeat / Lease Renewal*
>> > -- At a configured interval, the Job Poller running on each node
>> > will update the lease timestamp for the open/in-progress jobs that
>> > the node currently owns.
>> >
>> > *#3 Lease Expiry Validation*
>> > -- The Job Poller running on each active node will also periodically
>> > validate whether all the jobs owned by that node itself are actively
>> > updating their heartbeat within the specified threshold. Any job
>> > that fails to update its heartbeat within the given threshold will
>> > be considered owned by a stale node and will be eligible for
>> > recovery. The Job Poller will release the stale jobs it identifies,
>> > making them available for other active nodes to pick up.
>> >
>> > *Proposed time frequency/intervals:*
>> > -- *Lease update interval:* every 5 minutes
>> > -- *Lease expiry threshold:* 10 minutes
>> > -- *Lease expiry validation:* every 8 minutes
>> >
>> > *Points to consider*
>> > -- Each node should have a unique node identifier (runByInstanceId)
>> > that will help to track/validate liveness for each individual node.
>> > -- The time intervals suggested above could also be made
>> > configurable via data.
>> >
>> > Looking forward to valuable thoughts on it. I'll create a Jira
>> > ticket for this and will update the details there according to the
>> > inputs.
>> >
>> > Thanks & Regards,
>> > Ankit Joshi
>> >
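
For readers following the thread, the three steps quoted above (assign owner, renew lease, validate expiry) can be sketched roughly as below. This is only an illustrative in-memory model, not the OFBiz implementation and not the patch planned for OFBIZ-13383: the JobStore class, its method names, and the PENDING/RUNNING status strings are hypothetical, a plain dict stands in for the JobSandbox table, and the expiry check is assumed to scan every running job regardless of owner, which is one possible reading of step #3. The intervals are the ones proposed in the thread.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Optional

# Intervals proposed in the thread (assumed configurable in a real setup).
LEASE_UPDATE_INTERVAL = timedelta(minutes=5)      # how often a node renews
LEASE_EXPIRY_THRESHOLD = timedelta(minutes=10)    # lease is stale after this
LEASE_VALIDATION_INTERVAL = timedelta(minutes=8)  # how often expiry is checked


@dataclass
class Job:
    """Hypothetical stand-in for a JobSandbox row."""
    job_id: str
    status: str = "PENDING"                       # hypothetical status values
    run_by_instance_id: Optional[str] = None      # owning node, if any
    lease_updated_stamp: Optional[datetime] = None


class JobStore:
    """In-memory model of lease + heartbeat ownership (illustration only)."""

    def __init__(self) -> None:
        self.jobs: Dict[str, Job] = {}

    def add(self, job_id: str) -> None:
        self.jobs[job_id] = Job(job_id)

    def claim(self, job_id: str, instance_id: str, now: datetime) -> bool:
        """Step #1: record the node as owner and start its lease."""
        job = self.jobs[job_id]
        if job.run_by_instance_id is not None:
            return False                          # already owned by some node
        job.status = "RUNNING"
        job.run_by_instance_id = instance_id
        job.lease_updated_stamp = now
        return True

    def renew_leases(self, instance_id: str, now: datetime) -> int:
        """Step #2: heartbeat for every running job this node owns."""
        renewed = 0
        for job in self.jobs.values():
            if job.status == "RUNNING" and job.run_by_instance_id == instance_id:
                job.lease_updated_stamp = now
                renewed += 1
        return renewed

    def release_stale(self, now: datetime) -> List[str]:
        """Step #3: release running jobs whose lease is past the threshold."""
        released = []
        for job in self.jobs.values():
            if job.status != "RUNNING" or job.lease_updated_stamp is None:
                continue
            if now - job.lease_updated_stamp > LEASE_EXPIRY_THRESHOLD:
                job.status = "PENDING"            # available to other nodes
                job.run_by_instance_id = None
                job.lease_updated_stamp = None
                released.append(job.job_id)
        return released
```

In a database-backed version, the claim and release steps would presumably need to be single conditional updates so that two nodes cannot take the same job at once. Note also that with a 10 minute threshold checked every 8 minutes, a job left behind by a dead node could wait up to roughly 18 minutes after its last heartbeat before it is released.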
