https://bugs.kde.org/show_bug.cgi?id=520433

            Bug ID: 520433
           Summary: Deadlock in `IdentityProvider::cancel` on shutdown
    Classification: Applications
           Product: digikam
Version First Reported In: unspecified
          Platform: Other
                OS: Linux
            Status: REPORTED
          Severity: normal
          Priority: NOR
         Component: AdvancedRename-album
          Assignee: [email protected]
          Reporter: [email protected]
  Target Milestone: ---

Hi folks, 

as part of working on https://bugs.kde.org/show_bug.cgi?id=514751, I ran into
this deadlock. I haven't fully analysed it, but it's possible that users would
run into this on upgrade - which would be a major problem.

Briefly:
* I built digikam from current git and launched it. My previous system version
was `digikam 9.0.0-1` (archlinux).
* The face recognition dialog pops up. I don't use FR so I click cancel.
* I use digikam normally, then quit.
* The UI window exits, but the process deadlocks and never quits, never
persists settings.

Visible symptoms: changed settings are not persisted, and changed UI state (e.g.
switching from the Search pane to the Albums pane, or clearing the Search
filter) is not persisted. Every future launch will leave a hung process behind
and reuse the same stale settings and UI state.

Note: I selected a random Component for this report, since it's impossible to
find the correct one in that list.

The AI-generated bug report follows (take with a grain of salt):

## Affected version / environment

- digiKam: git master, commit `447081e` ("Cleanup pass after code review",
  2026-05-16). `DIGIKAM_VERSION = 9.1.0`.
- Build type: Debug, x86_64. Built locally; system Qt and KF packages.
- Qt: 6.11.1 (`qt6-base 6.11.1-1` on Arch).
- KF6: distro defaults.
- OS: Arch Linux, kernel 7.0.8.
- DB backend: SQLite (`Database Type=QSQLITE`).
- Display: Wayland (KDE Plasma).

## Reproduction

1. Have an existing digiKam install with face data trained on an older
   recognition model (in my case, an upgrade from packaged 9.0.0 to a
   local 9.1.0 build).
2. Launch digiKam. After ~1 s, `slotCheckFaceTrainingVersion()` opens the
   `FaceTrainingUpgradeDlg` because
   `IdentityProvider::checkRetrainingRequired()` returns true.
3. Click **Cancel** on the dialog.
4. Use digiKam normally (no requirement to push any items; see "Bug is
   independent of the dialog choice" below).
5. Hit **Ctrl-Q** (or close the main window).

**Observed:** Main window disappears, but the process keeps running
indefinitely. `kill` is needed to terminate.

**Side effect:** Because the destructor never finishes,
`ApplicationSettings::saveSettings()` (`core/app/main/digikamapp.cpp:353`)
does not run, so the next launch restores the state from the previous
*successful* shutdown.

## Diagnosis

### Main thread (Thread 1) — blocked in `IdentityProvider::cancel()`

```
#3  pthread_cond_wait () from /usr/lib/libc.so.6
#4  QWaitCondition::wait(QMutex*, QDeadlineTimer) from /usr/lib/libQt6Core.so.6
#5  QFutureInterfaceBase::waitForFinished()             from /usr/lib/libQt6Core.so.6
#6  QFuture<bool>::waitForFinished                      qfuture.h:104
#7  Digikam::IdentityProvider::cancel                   identityprovider.cpp:191
#8  Digikam::DigikamApp::~DigikamApp                    digikamapp.cpp:287 (= line 281 call to IdentityProvider::cancel)
... QCoreApplication::exec ...
#22 main                                                main.cpp:496
```

`identityprovider.cpp:179-201`:

```cpp
void IdentityProvider::cancel()
{
    if (d->removeThreadResult.isRunning())
    {
        d->removeQueue->push(d->removeQueue->endSignal());   // line 187
        d->removeThreadResult.waitForFinished();             // line 191  <-- blocked here
    }

    if (d->removeQueue)
    {
        delete d->removeQueue;
        d->removeQueue = nullptr;
    }
}
```
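
`waitForFinished()` takes no timeout, so when the worker never completes, shutdown hangs silently. Purely to illustrate what a bounded wait would report in that situation, here is a minimal sketch in standard C++ rather than Qt. `ShutdownResult`, `waitBounded` and `demoBoundedShutdown` are hypothetical names, `std::future` stands in for `QFuture`, and none of this is digiKam code or a proposed patch.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <future>
#include <thread>

// Hypothetical result type for a shutdown wait that is allowed to give up.
enum class ShutdownResult { Finished, TimedOut };

// Wait for the worker for at most `limit`, then report what happened.
ShutdownResult waitBounded(std::future<bool>& worker, std::chrono::milliseconds limit)
{
    return (worker.wait_for(limit) == std::future_status::ready)
               ? ShutdownResult::Finished
               : ShutdownResult::TimedOut;
}

bool demoBoundedShutdown()
{
    using namespace std::chrono_literals;

    std::atomic<bool> stop{false};

    // Stand-in worker: runs until it is told to stop.
    std::future<bool> worker = std::async(std::launch::async, [&stop] {
        while (!stop.load())
            std::this_thread::sleep_for(1ms);
        return true;
    });

    // Nobody has signalled the worker yet, so the bounded wait times out
    // instead of blocking forever.
    const bool timedOut = (waitBounded(worker, 20ms) == ShutdownResult::TimedOut);

    // Once the worker is released, the same wait reports completion.
    stop.store(true);
    const bool finished = (waitBounded(worker, 5s) == ShutdownResult::Finished);
    assert(worker.valid());

    return timedOut && finished && worker.get();
}
```

A timed-out wait only makes the hang visible; it does not resolve it. In this sketch the worker is always released before the future goes out of scope, because a future returned by `std::async` blocks in its destructor until the task ends.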

### Worker thread — blocked in `SharedQueue::pop_front()`

```
#3  pthread_cond_wait () from /usr/lib/libc.so.6
#4  QWaitCondition::wait(QMutex*, QDeadlineTimer) from /usr/lib/libQt6Core.so.6
#5  Digikam::SharedQueue<QString>::pop_front (this=<Digikam::RecognitionTrainingUpdateQueue::queue>)
                                                        sharedqueue.h:62  (i.e. front_.wait(&mutex_))
#6  Digikam::RecognitionTrainingUpdateQueue::pop_front  recognitiontrainingupdatequeue.cpp:50
#7  Digikam::IdentityProvider::trainingRemoveConcurrent identityprovider.cpp:709
#8...#17 QtConcurrent::RunFunctionTaskBase<bool>::run ...
```

This is the worker started at `identityprovider.cpp:117-128` via
`QtConcurrent::run(d->removeThreadPool,
&IdentityProvider::trainingRemoveConcurrent, this)`.

### Queue state at deadlock

From gdb, with the process paused in the above state:

```
(gdb) print Digikam::RecognitionTrainingUpdateQueue::ref
$1 = 1                                 # exactly one consumer

(gdb) print Digikam::RecognitionTrainingUpdateQueue::queue.queue_.d.size
$2 = 1                                 # queue NOT empty

(gdb) print Digikam::RecognitionTrainingUpdateQueue::queue.queue_
$3 = {<QList<QString>> = ..., d = {d = 0x55925ad07000, ptr = 0x55925ad07010, size = 1}}
```

So `cancel()`'s `push(endSignal)` at line 187 *did* land — the queue has
the TERMINATE sentinel — but `pop_front()` is still parked in
`front_.wait(&mutex_)`.

### `wakeAll()` from gdb does not unblock the worker

Confirmed that the worker's `front_` condvar and the queue's `front_` are
the same object (addresses verified equal via `&this->front_` in the
worker's frame vs `&((SharedQueue<QString>*)&queue)->front_`).

Manually broadcasting:

```
(gdb) print ((Digikam::SharedQueue<QString>*)&Digikam::RecognitionTrainingUpdateQueue::queue)->front_.wakeAll()
$4 = void
(gdb) cont
```

…leaves the worker still in `pthread_cond_wait`. The next backtrace is
identical to the previous one. This is the surprising part: a direct
`wakeAll()` on the same `QWaitCondition` that the worker is parked on
does not release it.

`QWaitCondition::wakeAll()` in Qt 6.11.1 acquires the wait condition's
internal mutex, sets `d->wakeups = d->waiters`, calls
`pthread_cond_broadcast(&d->cond)`, then unlocks
(`qwaitcondition_unix.cpp`). For the wakeup to be lost in this state, the
worker would need to not be visible in `d->waiters` — i.e. the worker is
parked on a *different* `QWaitCondition` instance than the one we
broadcast to, despite their addresses matching. The mechanism for that
is not understood from the static code alone.
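
The bookkeeping described above can be modelled in a few lines of standard C++. This is a simplified model written from the description in this report, not Qt's source: `ModelWaitCondition` and `demoWakeAllReleasesWaiter` are hypothetical names, and the `waiters_` and `wakeups_` counters mirror `d->waiters` and `d->wakeups` only by assumption.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Simplified model of a wait condition that counts waiters and wakeups.
class ModelWaitCondition
{
public:
    // The caller must hold `external` locked, as with QWaitCondition::wait(QMutex*).
    void wait(std::mutex& external)
    {
        std::unique_lock<std::mutex> internal(mutex_);
        ++waiters_;            // register before releasing the caller's mutex
        external.unlock();
        // The predicate filters spurious wakeups: only a granted wakeup releases us.
        cond_.wait(internal, [this] { return wakeups_ > 0; });
        --waiters_;
        --wakeups_;
        internal.unlock();
        external.lock();       // hand the caller's mutex back, locked
    }

    void wakeAll()
    {
        std::lock_guard<std::mutex> internal(mutex_);
        wakeups_ = waiters_;   // grant one wakeup per registered waiter
        cond_.notify_all();
    }

    int waiters()
    {
        std::lock_guard<std::mutex> internal(mutex_);
        return waiters_;
    }

private:
    std::mutex mutex_;
    std::condition_variable cond_;
    int waiters_ = 0;
    int wakeups_ = 0;
};

bool demoWakeAllReleasesWaiter()
{
    ModelWaitCondition condition;
    std::mutex guard;
    bool released = false;
    assert(condition.waiters() == 0);

    std::thread waiter([&] {
        guard.lock();
        condition.wait(guard);
        released = true;
        guard.unlock();
    });

    // Broadcast only once the waiter is registered.
    while (condition.waiters() == 0)
        std::this_thread::yield();

    condition.wakeAll();
    waiter.join();
    return released && condition.waiters() == 0;
}
```

In this model, a `wakeAll()` issued while `waiters_` is zero sets `wakeups_` to zero and releases nobody. That is the only way the model loses a wakeup, which is why a registered waiter that stays parked after a broadcast is so hard to explain.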

### Releasing the deadlock from gdb

Forcing the worker future to complete works:

```
(gdb) thread 7         # the trainingRemoveConcurrent thread
(gdb) frame 7          # IdentityProvider::trainingRemoveConcurrent
(gdb) return true
(gdb) cont
```

After this, `cancel()` returns, `~DigikamApp` completes,
`digikamrc` and `digikamstaterc` get written, and the process exits.
(There is a separate, unrelated static-destructor `Q_ASSERT` failure in
`Marble::GeoTagWriter::unregisterWriter` at
`core/utilities/geolocation/engine/geodata/writer/GeoTagWriter.cpp:62`
during `__cxa_finalize`, after `main()` has returned — destruction-order
fiasco between `Marble::s_writerUpdate` and the function-local static
`s_tagWriterHash`. Cosmetic only.)

### Bug is independent of the dialog choice

The worker is launched unconditionally at first use of
`IdentityProvider::instance()` (called from
`core/app/main/digikamapp.cpp:97`). The dialog cancel only means no
training-removal hashes are pushed during the session. The worker parks
in `pop_front` from startup and stays there. `cancel()` is the first and
only producer for the entire lifetime of the session in this scenario.

## Code under suspicion

- `core/libs/facesengine/recognition/identityprovider.cpp`:
  - `cancel()` at line 179-201 — uses `wakeOne` semantics via `push()`.
  - `trainingRemoveConcurrent()` at line 701-731 — sole consumer,
    relies on `pop_front` blocking until TERMINATE arrives.
- `core/libs/mlfoundation/sharedqueue.h`:
  - `pop_front()` lines 56-69 — standard `while (isEmpty) wait()` with
    `QMutexLocker`.
  - `push_back()` lines 71-82 — `QMutexLocker` + `wakeOne()`.
  - The `SharedQueue::cancel(T const&)` helper at lines 132-139
    (`push_front` + `wakeOne(front_)` + `wakeAll(back_)`) is **not** used
    by `IdentityProvider::cancel`.

The SharedQueue primitives look textbook-correct
(`QMutexLocker` held across `enqueue`/`wakeOne`,
`while (queue_.isEmpty()) front_.wait(&mutex_)` in pop). I have no clean
explanation for why the `wakeOne` from `push_back` — or even an explicit
`wakeAll` from gdb — fails to release the worker's wait, given the
address equality.
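
For reference, the pattern these primitives follow can be reproduced with the standard library. The sketch below is an analogue, not a copy of `sharedqueue.h`: `BlockingQueue`, `kEndSignal` and `demoSentinelShutdown` are hypothetical names, `std::condition_variable` stands in for `QWaitCondition`, and the sentinel value is an assumption.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <mutex>
#include <string>
#include <thread>
#include <utility>

// Hypothetical stand-in for the queue pattern described in this report.
class BlockingQueue
{
public:
    void push_back(std::string item)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        queue_.push_back(std::move(item));
        front_.notify_one();   // wake one consumer while still holding the lock
    }

    std::string pop_front()
    {
        std::unique_lock<std::mutex> lock(mutex_);
        while (queue_.empty())  // re-check after every wakeup
            front_.wait(lock);
        std::string item = std::move(queue_.front());
        queue_.pop_front();
        return item;
    }

private:
    std::mutex mutex_;
    std::condition_variable front_;
    std::deque<std::string> queue_;
};

// Assumed sentinel value; the real end signal is whatever endSignal() returns.
const std::string kEndSignal = "TERMINATE";

// Starts one consumer, pushes `items` work entries and then the sentinel.
// Returns the number of work entries the consumer processed before exiting.
int demoSentinelShutdown(int items)
{
    assert(items >= 0);
    BlockingQueue queue;
    int processed = 0;

    std::thread worker([&] {
        for (;;)
        {
            const std::string item = queue.pop_front();
            if (item == kEndSignal)
                break;
            ++processed;
        }
    });

    for (int i = 0; i < items; ++i)
        queue.push_back("hash-" + std::to_string(i));

    queue.push_back(kEndSignal);   // with items == 0 this is the only push
    worker.join();                 // returns once the sentinel is consumed
    return processed;
}
```

Calling `demoSentinelShutdown(0)` mirrors the session described here, where the shutdown push is the first and only producer. With this pattern the consumer exits whether the sentinel arrives before or after it starts waiting, because the emptiness check and the wait happen under the same mutex.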
