[ https://issues.apache.org/jira/browse/SLING-12078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Stefan Egli updated SLING-12078: -------------------------------- Description: Two regular cases where a job is stored as part of JobManager.addJob(): * when a topology is defined, it directly gets stored to the appropriate assigned/target slingId subtree. This is the most frequent case by far. * if no topology is defined (no TOPOLOGY_INIT received) yet, it gets put into the unassigned subtree. Later upon receiving TOPOLOGY_INIT CheckTopologyTask.fullRun() finds such unassigned jobs and moves them to the corresponding assigned subtree. There is a suspect race condition (test case to be provided), which happens between the thread doing JobManager.addJob() and the thread handling the TOPOLOGY_INIT: * JobManager.addJob determines the target slingId - which is not yet defined, as TOPOLOGY_INIT is just being handled concurrently * CheckTopologyTask.fullRun(), as part of TOPOLOGY_INIT handling, however does not yet find the above new job in unassigned, as the job is just being stored concurrently. The result is a job in the unassigned subtree, which waits until the next TopologyEvent happens - which then invokes CheckTopologyTask.fullRun() - which then finds the unassigned job and re/assigns it accordingly. So the job is never lost, but substantially delayed due to this. (the frequency of TopologyEvents depends on actual cluster/property changes happening in the topology and can thus vary). Tasks: * provide a test case to reproduce * fix the race-condition * undo [this|https://github.com/apache/sling-org-apache-sling-event/commit/d16686705908099b26d0a3233f61c4e209880f93] and [this|https://github.com/apache/sling-org-apache-sling-event/commit/dea04990b770a92f29c2504aa33d8158d68da58f] commit was: Two regular cases where a job is stored as part of JobManager.addJob(): * when a topology is defined, it directly gets stored to the appropriate assigned/target slingId subtree. This is the most frequent case by far. * if no topology is defined (no TOPOLOGY_INIT received) yet, it gets put into the unassigned subtree. Later upon receiving TOPOLOGY_INIT CheckTopologyTask.fullRun() finds such unassigned jobs and moves them to the corresponding assigned subtree. There is a suspect race condition (test case to be provided), which happens between the thread doing JobManager.addJob() and the thread handling the TOPOLOGY_INIT: * JobManager.addJob determines the target slingId - which is not yet defined, as TOPOLOGY_INIT is just being handled concurrently * CheckTopologyTask.fullRun(), as part of TOPOLOGY_INIT handling, however does not yet find the above new job in unassigned, as the job is just being stored concurrently. The result is a job in the unassigned subtree, which waits until the next TopologyEvent happens - which then invokes CheckTopologyTask.fullRun() - which then finds the unassigned job and re/assigns it accordingly. So the job is never lost, but substantially delayed due to this. (the frequency of TopologyEvents depends on actual cluster/property changes happening in the topology and can thus vary). Tasks: * provide a test case to reproduce * fix the race-condition * undo [this|https://github.com/apache/sling-org-apache-sling-event/commit/d16686705908099b26d0a3233f61c4e209880f93| and [this|https://github.com/apache/sling-org-apache-sling-event/commit/dea04990b770a92f29c2504aa33d8158d68da58f] commit > Suspected race condition between TOPOLOGY_INIT and JobManager.addJob > -------------------------------------------------------------------- > > Key: SLING-12078 > URL: https://issues.apache.org/jira/browse/SLING-12078 > Project: Sling > Issue Type: Bug > Components: Event > Affects Versions: Event 4.3.12 > Reporter: Stefan Egli > Priority: Major > > Two regular cases where a job is stored as part of JobManager.addJob(): > * when a topology is defined, it directly gets stored to the appropriate > assigned/target slingId subtree. This is the most frequent case by far. > * if no topology is defined (no TOPOLOGY_INIT received) yet, it gets put > into the unassigned subtree. Later upon receiving TOPOLOGY_INIT > CheckTopologyTask.fullRun() finds such unassigned jobs and moves them to the > corresponding assigned subtree. > There is a suspect race condition (test case to be provided), which happens > between the thread doing JobManager.addJob() and the thread handling the > TOPOLOGY_INIT: > * JobManager.addJob determines the target slingId - which is not yet > defined, as TOPOLOGY_INIT is just being handled concurrently > * CheckTopologyTask.fullRun(), as part of TOPOLOGY_INIT handling, however > does not yet find the above new job in unassigned, as the job is just being > stored concurrently. > The result is a job in the unassigned subtree, which waits until the next > TopologyEvent happens - which then invokes CheckTopologyTask.fullRun() - > which then finds the unassigned job and re/assigns it accordingly. So the job > is never lost, but substantially delayed due to this. (the frequency of > TopologyEvents depends on actual cluster/property changes happening in the > topology and can thus vary). > Tasks: > * provide a test case to reproduce > * fix the race-condition > * undo > [this|https://github.com/apache/sling-org-apache-sling-event/commit/d16686705908099b26d0a3233f61c4e209880f93] > and > [this|https://github.com/apache/sling-org-apache-sling-event/commit/dea04990b770a92f29c2504aa33d8158d68da58f] > commit -- This message was sent by Atlassian Jira (v8.20.10#820010)