[ 
https://issues.apache.org/jira/browse/SLING-13169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Iancu updated SLING-13169:
---------------------------------
    Description: 
h2. Background 

Sling jobs are scheduled on pods that are entering shutdown, causing "Process 
implementation not found" errors ({{{}WorkflowException{}}}) when OSGi services 
have already deregistered.
h2. Root Cause

The original fix in {{org.apache.sling.event}} (v4.3.18) had a design flaw: 
when the readiness condition was removed, 
{{unbindJobProcessingEnabledCondition()}} called {{stopProcessing()}} which:
 # Called {{caps.deactivate()}} — marked the topology as inactive
 # Set {{this.topologyCapabilities = null}} — *destroyed the topology state*
 # Triggered {{QueueManager.configurationChanged(false)}} → {{stopAllJobs()}} + 
{{restart()}} (outdated all queues)

When the readiness condition returned (probe recovered), 
{{bindJobProcessingEnabledCondition()}} called {{notifyListeners()}} but 
{{topologyCapabilities}} was still {{{}null{}}}. So QueueManager received 
{{configurationChanged(false)}} again — *jobs never resumed* until an unrelated 
topology event recreated the capabilities.
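
To make the stuck state concrete, here is a minimal, self-contained model of the pre-fix interaction (illustrative class/field names only, *not* the actual {{org.apache.sling.event}} sources):
{code:java}
// Minimal illustrative model of the flawed flow -- NOT the real
// org.apache.sling.event code; all names here are made up.
public class ReadinessDeadlockModel {

    // Pretend a topology view exists and this instance is processing jobs.
    private Object topologyCapabilities = new Object();

    // Readiness probe fails -> the condition is unbound.
    void unbindJobProcessingEnabledCondition(final Object condition) {
        stopProcessing(); // the design flaw: tears down topology state
    }

    void stopProcessing() {
        this.topologyCapabilities = null; // topology state destroyed
        notifyListeners();                // listeners receive "false"
    }

    // Readiness probe recovers -> the condition is bound again.
    void bindJobProcessingEnabledCondition(final Object condition) {
        notifyListeners(); // but topologyCapabilities is still null ...
    }

    void notifyListeners() {
        boolean active = this.topologyCapabilities != null;
        System.out.println("configurationChanged(" + active + ")");
    }

    public static void main(String[] args) {
        ReadinessDeadlockModel m = new ReadinessDeadlockModel();
        m.unbindJobProcessingEnabledCondition(null);     // configurationChanged(false)
        m.bindJobProcessingEnabledCondition(new Object()); // configurationChanged(false) again -> jobs never resume
    }
}
{code}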

Additionally, {{JobSchedulerImpl}} uses only the listener boolean from 
{{configurationChanged()}} without independently checking 
{{{}isJobProcessingEnabled(){}}}. If topology were preserved but the listener 
boolean still reflected only topology state, {{JobSchedulerImpl}} would 
continue scheduling jobs on a pod that should be draining.

A transient 2-3 second readiness blip permanently killed job processing.
h2. Proposed Fix (in {{{}org.apache.sling.event{}}})

The readiness condition should control {*}job pickup and scheduling{*}, not 
{*}topology lifecycle{*}. The listener contract must reflect both topology 
state AND readiness state so that all listeners — including 
{{JobSchedulerImpl}} — behave correctly.

*Change 1 — 
{{{}JobManagerConfiguration.unbindJobProcessingEnabledCondition(){}}}:*
Remove the {{stopProcessing()}} call. When the condition is removed, just null 
the condition reference and notify listeners. *Do not destroy topology state.* 
This is the core fix that enables recovery when the condition returns.
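
A sketch of what the bind/unbind pair could look like with this change, continuing the illustrative model above (the field name {{jobProcessingEnabledCondition}} is an assumption, not the actual patch):
{code:java}
// Sketch only -- continues the illustrative model above, not the real patch.
private volatile Object jobProcessingEnabledCondition; // assumed field name

void bindJobProcessingEnabledCondition(final Object condition) {
    this.jobProcessingEnabledCondition = condition;
    notifyListeners(); // combined state flips back to true (see Change 2)
}

void unbindJobProcessingEnabledCondition(final Object condition) {
    this.jobProcessingEnabledCondition = null; // drop the condition reference only
    notifyListeners();                         // combined state becomes false
    // stopProcessing() is no longer called here: topologyCapabilities stays
    // intact, so the bind above can resume processing without a topology event.
}
{code}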

*Change 2 — {{{}JobManagerConfiguration.notifyListeners(){}}}:*
Incorporate the readiness condition into the boolean sent to listeners. Instead 
of sending {{{}caps != null{}}}, send {{{}caps != null && 
isJobProcessingEnabled(){}}}. This ensures all listeners — including 
{{JobSchedulerImpl}} — see the correct combined state and stop 
scheduling/processing when the readiness condition is absent.
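
A sketch of the combined boolean, again on the illustrative model (the listener shape and the real {{isJobProcessingEnabled()}} implementation are assumptions and may differ from the actual code):
{code:java}
// Sketch only -- continues the illustrative model; shapes are assumptions.
interface ConfigurationChangeListener {        // minimal stand-in for the real listener contract
    void configurationChanged(boolean active); // implemented by QueueManager, JobSchedulerImpl, ...
}

private final java.util.List<ConfigurationChangeListener> listeners =
        new java.util.concurrent.CopyOnWriteArrayList<>(); // assumed listener registry

boolean isJobProcessingEnabled() {
    return this.jobProcessingEnabledCondition != null; // plausible implementation, not confirmed
}

void notifyListeners() {
    // Before: final boolean active = caps != null;
    final boolean active = this.topologyCapabilities != null && isJobProcessingEnabled();
    for (final ConfigurationChangeListener l : this.listeners) {
        l.configurationChanged(active);
    }
}
{code}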

*Change 3 — {{{}JobManagerConfiguration.addListener(){}}}:*
Apply the same combined-state logic to the initial callback when a listener 
subscribes. Instead of sending {{{}this.topologyCapabilities != null{}}}, send 
{{{}this.topologyCapabilities != null && isJobProcessingEnabled(){}}}. This 
ensures a freshly registered {{JobSchedulerImpl}} does not become active when 
topology exists but the readiness condition is absent.
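
And the corresponding initial callback on registration, with the same caveats (illustrative, not the actual patch):
{code:java}
// Sketch only -- continues the illustrative model above.
void addListener(final ConfigurationChangeListener listener) {
    this.listeners.add(listener);
    // Before: listener.configurationChanged(this.topologyCapabilities != null);
    listener.configurationChanged(
            this.topologyCapabilities != null && isJobProcessingEnabled());
}
{code}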
h2. What is NOT changed
 * {{QueueManager.configurationChanged()}} is {*}not modified{*}. The 
{{restart()}} and {{stopAllJobs()}} calls remain. {{restart()}} only outdates 
queues and reschedules the retry list — it does not touch topology. With 
topology preserved (Change 1), recovery on rebind works through 
{{configurationChanged(true)}} → {{{}fullTopicScan(){}}} (see the sketch after this list).
 * {{stopProcessing()}} from topology events and {{deactivate()}} are 
unchanged. Actual topology changes and real shutdown still destroy topology 
state as before.
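
For the transient-blip case, the expected sequence with the fix is roughly as follows (illustrative only; {{RevisedConfigModel}} stands for a hypothetical assembly of the sketched methods above, not a real class in {{org.apache.sling.event}}):
{code:java}
// Illustrative sequence only -- RevisedConfigModel is a hypothetical
// assembly of the sketches above, not actual Sling code.
public static void main(String[] args) {
    RevisedConfigModel cfg = new RevisedConfigModel();

    // Readiness probe fails for 2-3 seconds:
    cfg.unbindJobProcessingEnabledCondition(null);       // configurationChanged(false): pickup stops, queues outdated

    // Probe recovers:
    cfg.bindJobProcessingEnabledCondition(new Object()); // configurationChanged(true): QueueManager rebuilds via fullTopicScan()
}
{code}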

h2. Expected Behavior After Fix
||Scenario||Current (rolled back)||With fix||
|Pod shutting down|Jobs start on dying pod, fail with NPE, lost as ERROR|New 
jobs blocked; in-flight jobs on the dying pod still fail as ERROR and are 
*lost* (see limitation below)|
|Transient readiness blip (2-3s)|N/A (fix is rolled back)|New job pickup and 
scheduling paused; queues outdated and rebuilt on recovery; processing resumes 
automatically when probe recovers|
|Readiness probe permanently fails|N/A (fix is rolled back)|New job pickup 
stopped; topology preserved; jobs resume instantly if probe eventually recovers|
|Normal topology change|Handled by {{doHandleTopologyEvent}} → 
{{stopProcessing()}}|Unchanged|
|Component deactivation (real shutdown)|{{deactivate()}} → 
{{stopProcessing()}}|Unchanged|
h2. Important Limitation

This fix prevents *new* jobs from starting on a dying pod, but in-flight jobs 
that are already running when the condition is removed will continue to 
completion or failure. If they fail with an exception, they are marked ERROR 
and *lost* — they are *not* retried on another pod. This is existing behavior 
({{{}JobQueueImpl.java:400-406{}}}: "we don't reschedule if an exception 
occurs") and is not changed by this fix. Addressing in-flight job retry on 
failure is a separate concern.
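
For reference, the existing no-retry-on-exception behavior looks roughly like this (a paraphrase of the cited lines, not the literal {{JobQueueImpl}} source; {{finishJob()}} is a hypothetical helper):
{code:java}
// Paraphrase of the behavior cited above (JobQueueImpl.java:400-406);
// simplified and NOT the literal source. finishJob() is a hypothetical helper.
try {
    consumer.process(job); // in-flight job keeps running on the draining pod
} catch (final Throwable t) {
    // "we don't reschedule if an exception occurs":
    // the job ends in ERROR state and is not retried on another instance.
    finishJob(job, Job.JobState.ERROR);
}
{code}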


> Fix Sling JobManager readiness condition to preserve topology on transient 
> probe failures
> -----------------------------------------------------------------------------------------
>
>                 Key: SLING-13169
>                 URL: https://issues.apache.org/jira/browse/SLING-13169
>             Project: Sling
>          Issue Type: Bug
>            Reporter: Daniel Iancu
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
