On Wed, Oct 22, 2025 at 11:37:37AM -0700, Marc Herbert wrote:
> On 2025-10-21 14:26, Alison Schofield wrote:
> > The pmem_ns unit test frequently fails when run as part of the full
> > suite, yet passes when executed alone.
> > 
> > [...]
> > > Replace the NULL context parameter when calling ndctl_test_init()
> > with the available ndctl_ctx to ensure pmem_ns can find usable PMEM
> > regions.
> > 
> > Reported-by: Marc Herbert <[email protected]>
> > Closes: https://github.com/pmem/ndctl/issues/290
> > Signed-off-by: Alison Schofield <[email protected]>
> > ---
> >  test/pmem_namespaces.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/test/pmem_namespaces.c b/test/pmem_namespaces.c
> > index 4bafff5164c8..7b8de9dcb61d 100644
> > --- a/test/pmem_namespaces.c
> > +++ b/test/pmem_namespaces.c
> > @@ -191,7 +191,7 @@ int test_pmem_namespaces(int log_level, struct 
> > ndctl_test *test,
> >  
> >     if (!bus) {
> >             fprintf(stderr, "ACPI.NFIT unavailable falling back to 
> > nfit_test\n");
> > -           rc = ndctl_test_init(&kmod_ctx, &mod, NULL, log_level, test);
> > +           rc = ndctl_test_init(&kmod_ctx, &mod, ctx, log_level, test);
> >             ndctl_invalidate(ctx);
> >             bus = ndctl_bus_get_by_provider(ctx, "nfit_test.0");
> >             if (rc < 0 || !bus) {
> 
> Thanks Alison! This does fix the crash, so you can also add my Tested-By:!
> 
> But to test, I had to combine this fix with this temporary hack from
> https://github.com/pmem/ndctl/issues/290

Ah, yes I did similar to debug and test.

> 
> --- a/test/pmem_namespaces.c
> +++ b/test/pmem_namespaces.c
> @@ -189,7 +189,7 @@ int test_pmem_namespaces(int log_level, struct ndctl_test 
> *test,
>                       bus = NULL;
>       }
>  
> -     if (!bus) {
> +     if (!bus || true) {
>               fprintf(stderr, "ACPI.NFIT unavailable falling back to 
> nfit_test\n");
>               rc = ndctl_test_init(&kmod_ctx, &mod, NULL, log_level, test);
>               ndctl_invalidate(ctx);
> 
> 
> 
> ... which explains why I disagree with... the commit message! I don't think
> this necessary fix "closes" https://github.com/pmem/ndctl/issues/290 entirely.

Marc,

Thanks for the review!

Ah, you disagree with the Closes tag? I added the close tag expecting
the test case will now pass. pmem-ns will successfully fallback to
nfit_test region if ACPI.NFIT is not present or does not have the pmem
capable region. 

wrt the reason why ACPI.NFIT fails to find a suitable region, I haven't
given up on it. In my setup, it fails because the region type is 
ND_DEVICE_NAMESPACE_IO (4) rather than ND_DEVICE_NAMESPACE_PMEM (5)

wrt why it fails in your case, a full test run after boot, and with
my reproducer (simply run pmem-ns alone). I don't have the soln yet.

If you have time to check that your failure is same as with my
reproducer, you can collect and share this:

ND_DEVICE_NAMESPACE_IO is 4
ND_DEVICE_NAMESPACE_PMEM is 5


diff --git a/test/pmem_namespaces.c b/test/pmem_namespaces.c
index 4bafff5164c8..c2f25bb02025 100644
--- a/test/pmem_namespaces.c
+++ b/test/pmem_namespaces.c
@@ -180,11 +180,15 @@ int test_pmem_namespaces(int log_level, struct ndctl_test 
*test,

        bus = ndctl_bus_get_by_provider(ctx, "ACPI.NFIT");
        if (bus) {
+               int nstype;
+
                /* skip this bus if no label-enabled PMEM regions */
-               ndctl_region_foreach(bus, region)
-                       if (ndctl_region_get_nstype(region)
-                                       == ND_DEVICE_NAMESPACE_PMEM)
+               ndctl_region_foreach(bus, region) {
+                       nstype = ndctl_region_get_nstype(region);
+                       fprintf(stderr, "ALISON nstype %d\n", nstype);
+                       if (nstype == ND_DEVICE_NAMESPACE_PMEM)
                                break;
+               }
                if (!region)
                        bus = NULL;
        }

> 
> This fix does stop  the test from failing which is great and it lowers 
> dramatically
> the severity of 290. But we still don't know why ACPI.NFIT is "available" 
> most of
> the time and... sometimes not. In other words, we still don't know why this 
> test is
> non-deterministic. Of course, there will always be some non-determinism 
> because
> the kernel and QEMU are too complex to be deterministic but I don't think
> non-determism should extend to test fixtures and test code themselves like 
> this.
> Why 290 should stay open IMHO.
> 
> Also, this feels like a (missed?) opportunity to add better logging of this
> non-determinism, I mean stuff like:
> https://github.com/pmem/ndctl/issues/290#issuecomment-3260168362
> This is test code, it should not be mean with logging. All bash scripts run
> with "set -x" already so this would not make much difference to the total
> volume.
> 
> 
> Generally speaking, tests should follow a CLEAN - TEST - CLEAN logic to
> minimize interferences; as much as time allows[*]. Bug 290 demonstrates that:
> 1. Some unknown test running before pmem-ns does not clean properly after 
> itself, and
> 2. The pmem-ns test is not capable of creating a deterministic setup for 
> itself.
> 
> We still have no clue about 1. and 2. is not mitigated with logs
> and source comments. So there's still an open bug there.
> 
> Marc
> 
> 
> 
> 
> [*] there are practical limits: rebooting QEMU for each test would be too 
> slow.

Reply via email to