Daniel Iancu created SLING-13169:
------------------------------------
Summary: Fix Sling JobManager readiness condition to preserve
topology on transient probe failures
Key: SLING-13169
URL: https://issues.apache.org/jira/browse/SLING-13169
Project: Sling
Issue Type: Bug
Reporter: Daniel Iancu
h2. Background
Sling jobs are scheduled on pods that are entering shutdown, causing "Process
implementation not found" errors ({{WorkflowException}}) because the OSGi
services backing them have already been deregistered.
h2. Root Cause
The original fix in {{org.apache.sling.event}} (v4.3.18) had a design flaw:
when the readiness condition was removed,
{{unbindJobProcessingEnabledCondition()}} called {{stopProcessing()}}, which:
# Called {{caps.deactivate()}} — marked the topology as inactive
# Set {{this.topologyCapabilities = null}} — *destroyed the topology state*
# Triggered {{QueueManager.configurationChanged(false)}} → {{stopAllJobs()}} +
{{restart()}}, outdating all queues
When the readiness condition returned (probe recovered),
{{bindJobProcessingEnabledCondition()}} called {{notifyListeners()}} but
{{topologyCapabilities}} was still {{null}}. So QueueManager received
{{configurationChanged(false)}} again — *jobs never resumed* until an unrelated
topology event recreated the capabilities.
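For clarity, here is a paraphrased sketch of the flawed v4.3.18 flow; the class
skeleton and type names are reconstructed from this description, not copied
from the upstream source:

{code:java}
// Paraphrased sketch of the flawed flow; simplified stand-ins, not the real Sling sources.
interface Condition { /* OSGi readiness condition, simplified */ }
interface TopologyCapabilities { void deactivate(); }

class FlawedJobManagerConfiguration {

    private volatile Condition jobProcessingEnabledCondition;
    private volatile TopologyCapabilities topologyCapabilities;

    protected void unbindJobProcessingEnabledCondition(final Condition condition) {
        this.jobProcessingEnabledCondition = null;
        this.stopProcessing(); // flaw: a transient probe blip tears down topology state
    }

    private void stopProcessing() {
        final TopologyCapabilities caps = this.topologyCapabilities;
        if (caps != null) {
            caps.deactivate();            // 1. marks the topology inactive
        }
        this.topologyCapabilities = null; // 2. destroys the topology state
        this.notifyListeners();           // 3. listeners receive configurationChanged(false)
    }

    protected void bindJobProcessingEnabledCondition(final Condition condition) {
        this.jobProcessingEnabledCondition = condition;
        // topologyCapabilities is still null here, so listeners again receive
        // configurationChanged(false) and job processing never resumes.
        this.notifyListeners();
    }

    private void notifyListeners() { /* sends (caps != null) to all listeners */ }
}
{code}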
Additionally, {{JobSchedulerImpl}} uses only the listener boolean from
{{configurationChanged()}} without independently checking
{{isJobProcessingEnabled()}}. If topology were preserved but the listener
boolean still reflected only topology state, {{JobSchedulerImpl}} would
continue scheduling jobs on a pod that should be draining.
A transient 2-3 second readiness blip permanently killed job processing.
h2. Proposed Fix (in {{org.apache.sling.event}})
The readiness condition should control *job pickup and scheduling*, not
*topology lifecycle*. The listener contract must reflect both topology state
AND readiness state so that all listeners — including {{JobSchedulerImpl}} —
behave correctly.
*Change 1 — {{JobManagerConfiguration.unbindJobProcessingEnabledCondition()}}:*
Remove the {{stopProcessing()}} call. When the condition is removed, just null
the condition reference and notify listeners. *Do not destroy topology state.*
This is the core fix that enables recovery when the condition returns.
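A minimal sketch of Change 1, under the same simplified skeleton as above
(signature reconstructed, not the exact source):

{code:java}
// Sketch of Change 1: drop the stopProcessing() call, keep topology intact.
protected void unbindJobProcessingEnabledCondition(final Condition condition) {
    this.jobProcessingEnabledCondition = null;
    this.notifyListeners(); // pauses pickup/scheduling; topologyCapabilities survives
}
{code}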
*Change 2 — {{JobManagerConfiguration.notifyListeners()}}:*
Incorporate the readiness condition into the boolean sent to listeners. Instead
of sending {{caps != null}}, send {{caps != null && isJobProcessingEnabled()}}.
This ensures all listeners — including {{JobSchedulerImpl}} — see the correct
combined state and stop scheduling/processing when the readiness condition is
absent.
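Sketched below, assuming a listener list on the configuration class and
deriving {{isJobProcessingEnabled()}} from whether the condition is currently
bound:

{code:java}
// Sketch of Change 2 inside notifyListeners(); the listener interface mirrors
// the configurationChanged(boolean) contract described above.
private void notifyListeners() {
    final TopologyCapabilities caps = this.topologyCapabilities;
    final boolean active = caps != null && this.isJobProcessingEnabled();
    for (final ConfigurationChangeListener listener : this.listeners) {
        listener.configurationChanged(active);
    }
}

// Assumed helper body: processing is enabled while the readiness condition is bound.
public boolean isJobProcessingEnabled() {
    return this.jobProcessingEnabledCondition != null;
}
{code}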
*Change 3 — {{JobManagerConfiguration.addListener()}}:*
Apply the same combined-state logic to the initial callback when a listener
subscribes. Instead of sending {{this.topologyCapabilities != null}}, send
{{this.topologyCapabilities != null && isJobProcessingEnabled()}}. This ensures
a freshly registered {{JobSchedulerImpl}} does not become active when topology
exists but the readiness condition is absent.
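A corresponding sketch, under the same assumptions:

{code:java}
// Sketch of Change 3; listener registration details are assumptions.
public void addListener(final ConfigurationChangeListener listener) {
    this.listeners.add(listener);
    // The initial callback now reflects the combined state, so a freshly
    // registered JobSchedulerImpl stays inactive while the condition is absent.
    listener.configurationChanged(
            this.topologyCapabilities != null && this.isJobProcessingEnabled());
}
{code}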
h2. What is NOT changed
* {{QueueManager.configurationChanged()}} is *not modified*. The {{restart()}}
and {{stopAllJobs()}} calls remain. {{restart()}} only outdates queues and
reschedules the retry list; it does not touch topology. With topology preserved
(Change 1), recovery on rebind works through {{configurationChanged(true)}} →
{{fullTopicScan()}}, as sketched after this list.
* {{stopProcessing()}} from topology events and {{deactivate()}} are unchanged.
Actual topology changes and real shutdown still destroy topology state as
before.
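To make the recovery path concrete, a paraphrased sketch of the unchanged
{{QueueManager}} side (not the exact source):

{code:java}
// Paraphrased sketch of the unchanged QueueManager.configurationChanged();
// with Change 1 in place, the false branch no longer loses topology state.
public void configurationChanged(final boolean active) {
    if (active) {
        this.fullTopicScan();  // rebuilds queues when processing resumes
    } else {
        this.stopAllJobs();    // stops in-memory processing
        this.restart();        // outdates queues, reschedules the retry list;
                               // topology state itself is untouched here
    }
}
{code}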
h2. Expected Behavior After Fix
|| Scenario || Current (rolled back) || With fix ||
| Pod shutting down | Jobs start on the dying pod, fail with NPE, lost as ERROR | New jobs blocked; in-flight jobs on the dying pod still fail as ERROR and are *lost* (see limitation below) |
| Transient readiness blip (2-3 s) | N/A (fix is rolled back) | New job pickup and scheduling paused; queues outdated and rebuilt on recovery; processing resumes automatically when the probe recovers |
| Readiness probe permanently fails | N/A (fix is rolled back) | New job pickup stopped; topology preserved; jobs resume instantly if the probe eventually recovers |
| Normal topology change | Handled by {{doHandleTopologyEvent}} → {{stopProcessing()}} | Unchanged |
| Component deactivation (real shutdown) | {{deactivate()}} → {{stopProcessing()}} | Unchanged |
h2. Important Limitation
This fix prevents *new* jobs from starting on a dying pod, but in-flight jobs
that are already running when the condition is removed will continue to
completion or failure. If they fail with an exception, they are marked ERROR
and *lost* — they are *not* retried on another pod. This is existing behavior
({{JobQueueImpl.java:400-406}}: "we don't reschedule if an exception occurs")
and is not changed by this fix. Addressing in-flight job retry on failure is a
separate concern.
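For context, the referenced behavior amounts to roughly the following; this is
an illustrative paraphrase in which only the quoted comment comes from
{{JobQueueImpl}}, and the surrounding identifiers are hypothetical:

{code:java}
try {
    executeInFlightJob(handler); // hypothetical: run the consumer for an in-flight job
} catch (final Throwable t) {
    // "we don't reschedule if an exception occurs"
    // The job finishes in state ERROR and is not retried on another pod.
    finishJob(handler, Job.JobState.ERROR);
}
{code}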