Heiko, it would still be good to get a test of this patch from you.  I
tested this here at Red Hat on some System Z machines.  Without the
modification made here in v2, the systems failed to boot ~10% of the time.
After the modification I do not see any boot failures.  I was also
able to reproduce the boot issue with the acpi_cpufreq driver on a very
large & fast x86 system, which had a failure rate closer to 100% without
the changes in v2.  After the modification in v2 the system has rebooted
all weekend without any issues.

P.

---8<---

Microsoft Hyper-V disables the X86_FEATURE_SMCA bit on AMD systems, and
Linux guests boot with repeated errors:

amd64_edac_mod: Unknown symbol amd_unregister_ecc_decoder (err -2)
amd64_edac_mod: Unknown symbol amd_register_ecc_decoder (err -2)
amd64_edac_mod: Unknown symbol amd_report_gart_errors (err -2)
amd64_edac_mod: Unknown symbol amd_unregister_ecc_decoder (err -2)
amd64_edac_mod: Unknown symbol amd_register_ecc_decoder (err -2)
amd64_edac_mod: Unknown symbol amd_report_gart_errors (err -2)

The warnings occur because the module code erroneously returns -EEXIST
for modules that have failed to load and are in the process of being
removed from the module list.

Module amd64_edac_mod has a dependency on module edac_mce_amd.  Using
modules.dep, systemd will load edac_mce_amd for every request of
amd64_edac_mod.  When the edac_mce_amd module loads, the module has
state MODULE_STATE_UNFORMED, and once the module load fails the state
becomes MODULE_STATE_GOING.  If another request for the edac_mce_amd
module executes at that point, add_unformed_module() will erroneously
return -EEXIST even though the previous instance of edac_mce_amd has
MODULE_STATE_GOING.  Upon receiving -EEXIST, systemd attempts to load
amd64_edac_mod, which fails because of unknown symbols from edac_mce_amd.

add_unformed_module() must wait, rather than return immediately,
whenever the existing module is in any state other than
MODULE_STATE_LIVE, to prevent a race between multiple loads of
dependent modules.
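
As an illustration only, the difference between the old and new state
checks can be modelled in userspace.  This is a sketch, not kernel code:
the enumerator names follow the kernel's enum module_state, while
old_check_waits(), new_check_waits() and returns_eexist() are
hypothetical helpers invented for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Names follow the kernel's enum module_state; this is a userspace model. */
enum module_state {
	MODULE_STATE_LIVE,
	MODULE_STATE_COMING,
	MODULE_STATE_GOING,
	MODULE_STATE_UNFORMED,
};

/* Old check: wait only while the existing entry is still loading. */
bool old_check_waits(enum module_state s)
{
	return s == MODULE_STATE_COMING || s == MODULE_STATE_UNFORMED;
}

/* New check: wait for any entry that is not fully loaded. */
bool new_check_waits(enum module_state s)
{
	return s != MODULE_STATE_LIVE;
}

/* Hypothetical helper: true when the check would report -EEXIST at once. */
bool returns_eexist(bool (*waits)(enum module_state), enum module_state s)
{
	assert(waits != NULL);
	return !waits(s);
}
```

With the old check a MODULE_STATE_GOING entry is reported as already
existing; with the new check only a MODULE_STATE_LIVE entry is.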

v2: The initial (old->state != MODULE_STATE_LIVE) change exposed an
additional issue in the code.  wait_event_interruptible() puts each thread
to sleep until a module finishes loading and wakes the module_wq
wait queue.  The result is a long delay during the boot.  Switching to
wait_event_interruptible_timeout() resolves the sleep problem.
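
For reference, wait_event_interruptible_timeout() returns 0 if the
timeout elapsed with the condition still false, a positive number of
remaining jiffies (at least 1) if the condition became true, and
-ERESTARTSYS if the wait was interrupted by a signal.  The userspace
sketch below only models how a caller might tell these cases apart;
enum wait_result and classify_wait() are hypothetical names, and
ERESTARTSYS is redefined locally because it is a kernel-internal value.

```c
#include <assert.h>

/* Kernel-internal errno value, repeated here for the userspace model. */
#define ERESTARTSYS 512

/* Hypothetical result type for this sketch. */
enum wait_result {
	WAIT_TIMED_OUT,      /* return value 0 */
	WAIT_CONDITION_MET,  /* return value > 0 (remaining jiffies) */
	WAIT_INTERRUPTED,    /* return value -ERESTARTSYS */
};

/* Map a wait_event_interruptible_timeout()-style return value to a case. */
enum wait_result classify_wait(long ret)
{
	assert(ret >= 0 || ret == -ERESTARTSYS);
	if (ret == 0)
		return WAIT_TIMED_OUT;
	if (ret > 0)
		return WAIT_CONDITION_MET;
	return WAIT_INTERRUPTED;
}
```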

Signed-off-by: Prarit Bhargava <pra...@redhat.com>
Cc: Jessica Yu <j...@kernel.org>
Cc: Heiko Carstens <heiko.carst...@de.ibm.com>
Cc: David Arcari <darc...@redhat.com>
---
 kernel/module.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/module.c b/kernel/module.c
index 1c429d8d2d74..6c868aabaf37 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -3568,12 +3568,12 @@ static int add_unformed_module(struct module *mod)
        mutex_lock(&module_mutex);
        old = find_module_all(mod->name, strlen(mod->name), true);
        if (old != NULL) {
-               if (old->state == MODULE_STATE_COMING
-                   || old->state == MODULE_STATE_UNFORMED) {
+               if (old->state != MODULE_STATE_LIVE) {
                        /* Wait in case it fails to load. */
                        mutex_unlock(&module_mutex);
-                       err = wait_event_interruptible(module_wq,
-                                              finished_loading(mod->name));
+                       err = wait_event_interruptible_timeout(module_wq,
+                                              finished_loading(mod->name),
+                                              HZ/1000);
                        if (err)
                                goto out_unlocked;
                        goto again;
-- 
2.18.1
