[
https://issues.apache.org/jira/browse/SLING-13169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Daniel Iancu updated SLING-13169:
---------------------------------
Description:
h2. Background
Sling jobs are scheduled on pods that are entering shutdown, causing "Process
implementation not found" errors ({{{}WorkflowException{}}}) when OSGi services
have already deregistered.
h2. Root Cause
The [original fix|https://issues.apache.org/jira/browse/SLING-12743] in
{{org.apache.sling.event}} (v4.3.18) had a design flaw: when the readiness
condition was removed, {{unbindJobProcessingEnabledCondition()}} called
{{stopProcessing()}} which:
# Called {{caps.deactivate()}} — marked the topology as inactive
# Set {{this.topologyCapabilities = null}} — *destroyed the topology state*
# Triggered {{QueueManager.configurationChanged(false)}} → {{stopAllJobs()}} +
{{restart()}}, which marked all queues as outdated
When the readiness condition returned (probe recovered),
{{bindJobProcessingEnabledCondition()}} called {{notifyListeners()}}, but
{{topologyCapabilities}} was still {{{}null{}}}, so {{QueueManager}} received
{{configurationChanged(false)}} again — *jobs never resumed* until an unrelated
topology event recreated the capabilities.
Additionally, {{JobSchedulerImpl}} uses only the listener boolean from
{{configurationChanged()}} without independently checking
{{{}isJobProcessingEnabled(){}}}. If topology were preserved but the listener
boolean still reflected only topology state, {{JobSchedulerImpl}} would
continue scheduling jobs on a pod that should be draining.
A transient 2-3 second readiness blip permanently killed job processing.
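For illustration, a simplified reconstruction of the v4.3.18 flow described above. Field names and method bodies are assumptions drawn from this description, not copied from the actual source:
{code:java}
// Inside JobManagerConfiguration -- simplified reconstruction of the v4.3.18
// behaviour described above; field names are illustrative assumptions.
private volatile Condition jobProcessingEnabledCondition;
private volatile TopologyCapabilities topologyCapabilities;

protected void unbindJobProcessingEnabledCondition(final Condition condition) {
    this.jobProcessingEnabledCondition = null;
    this.stopProcessing();                 // design flaw: also tears down topology state
}

private void stopProcessing() {
    final TopologyCapabilities caps = this.topologyCapabilities;
    this.topologyCapabilities = null;      // topology state destroyed
    if (caps != null) {
        caps.deactivate();                 // topology marked inactive
    }
    this.notifyListeners();                // -> QueueManager.configurationChanged(false)
}

protected void bindJobProcessingEnabledCondition(final Condition condition) {
    this.jobProcessingEnabledCondition = condition;
    this.notifyListeners();                // topologyCapabilities is still null, so
                                           // listeners get "false" again: no recovery
}
{code}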
h2. Proposed Fix (in {{{}org.apache.sling.event{}}})
The readiness condition should control {*}job pickup and scheduling{*}, not
{*}topology lifecycle{*}. The listener contract must reflect both topology
state AND readiness state so that all listeners — including
{{JobSchedulerImpl}} — behave correctly.
*Change 1 —
{{{}JobManagerConfiguration.unbindJobProcessingEnabledCondition(){}}}:*
Remove the {{stopProcessing()}} call. When the condition is removed, just null
the condition reference and notify listeners. *Do not destroy topology state.*
This is the core fix that enables recovery when the condition returns.
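A minimal sketch of the intended change, assuming the field name {{jobProcessingEnabledCondition}} (illustrative only, not the actual patch):
{code:java}
// Sketch of Change 1: removing the condition no longer destroys topology state.
protected void unbindJobProcessingEnabledCondition(final Condition condition) {
    this.jobProcessingEnabledCondition = null;
    // no stopProcessing() here: this.topologyCapabilities stays intact
    this.notifyListeners();    // with Change 2, listeners now receive "false"
}
{code}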
*Change 2 — {{{}JobManagerConfiguration.notifyListeners(){}}}:*
Incorporate the readiness condition into the boolean sent to listeners. Instead
of sending {{{}caps != null{}}}, send {{{}caps != null &&
isJobProcessingEnabled(){}}}. This ensures all listeners — including
{{JobSchedulerImpl}} — see the correct combined state and stop
scheduling/processing when the readiness condition is absent.
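A minimal sketch of the combined-state notification, assuming the existing listener collection and the {{configurationChanged(boolean)}} callback used by {{QueueManager}} and {{JobSchedulerImpl}} (illustrative only):
{code:java}
// Sketch of Change 2: the boolean sent to listeners combines topology and readiness.
private void notifyListeners() {
    final TopologyCapabilities caps = this.topologyCapabilities;
    final boolean active = caps != null && this.isJobProcessingEnabled();
    for (final ConfigurationChangeListener listener : this.listeners) {
        listener.configurationChanged(active);   // QueueManager, JobSchedulerImpl, ...
    }
}
{code}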
*Change 3 — {{{}JobManagerConfiguration.addListener(){}}}:*
Apply the same combined-state logic to the initial callback when a listener
subscribes. Instead of sending {{{}this.topologyCapabilities != null{}}}, send
{{{}this.topologyCapabilities != null && isJobProcessingEnabled(){}}}. This
ensures a freshly registered {{JobSchedulerImpl}} does not become active when
topology exists but the readiness condition is absent.
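A matching sketch for the initial callback on registration (again illustrative; the listener collection is assumed):
{code:java}
// Sketch of Change 3: a newly registered listener sees the combined state as well.
public void addListener(final ConfigurationChangeListener listener) {
    this.listeners.add(listener);
    listener.configurationChanged(
            this.topologyCapabilities != null && this.isJobProcessingEnabled());
}
{code}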
h2. What is NOT changed
* {{QueueManager.configurationChanged()}} is {*}not modified{*}. The
{{restart()}} and {{stopAllJobs()}} calls remain. {{restart()}} only outdates
queues and reschedules the retry list — it does not touch topology. With
topology preserved (Change 1), recovery on rebind works through
{{configurationChanged(true)}} → {{{}fullTopicScan(){}}} (see the schematic
sketch after this list).
* {{stopProcessing()}} from topology events and {{deactivate()}} are
unchanged. Actual topology changes and real shutdown still destroy topology
state as before.
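For reference, a schematic view of the unchanged {{QueueManager}} side, showing why preserving topology is sufficient for recovery on rebind; the real method contains additional bookkeeping:
{code:java}
// Schematic only -- QueueManager.configurationChanged() itself is unchanged by this fix.
@Override
public void configurationChanged(final boolean active) {
    if (active) {
        this.fullTopicScan();   // readiness and topology back: rescan topics, refill queues
    } else {
        this.stopAllJobs();     // stop processing on this instance
        this.restart();         // outdate all queues, reschedule the retry list
    }
}
{code}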
h2. Expected Behavior After Fix
||Scenario||Current behavior (fix rolled back)||With this fix||
|Pod shutting down|Jobs start on the dying pod, fail with NPE, and are lost as ERROR|New jobs blocked; in-flight jobs on the dying pod still fail as ERROR and are *lost* (see limitation below)|
|Transient readiness blip (2-3 s)|N/A (fix is rolled back)|New job pickup and scheduling paused; queues outdated and rebuilt on recovery; processing resumes automatically when the probe recovers|
|Readiness probe permanently fails|N/A (fix is rolled back)|New job pickup stopped; topology preserved; jobs resume as soon as the probe recovers|
|Normal topology change|Handled by {{doHandleTopologyEvent}} → {{stopProcessing()}}|Unchanged|
|Component deactivation (real shutdown)|{{deactivate()}} → {{stopProcessing()}}|Unchanged|
h2. Important Limitation
This fix prevents *new* jobs from starting on a dying pod, but in-flight jobs
that are already running when the condition is removed will continue to
completion or failure. If they fail with an exception, they are marked ERROR
and *lost* — they are *not* retried on another pod. This is existing behavior
({{{}JobQueueImpl.java:400-406{}}}: "we don't reschedule if an exception
occurs") and is not changed by this fix. Addressing in-flight job retry on
failure is a separate concern.
> Fix Sling JobManager readiness condition to preserve topology on transient
> probe failures
> -----------------------------------------------------------------------------------------
>
> Key: SLING-13169
> URL: https://issues.apache.org/jira/browse/SLING-13169
> Project: Sling
> Issue Type: Bug
> Reporter: Daniel Iancu
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)