Re: occasional ECPG failures on dikkop (FreeBSD)

Tomas Vondra Mon, 29 Jun 2026 15:46:34 -0700

On 5/6/26 16:15, Tom Lane wrote:
> Tomas Vondra <[email protected]> writes:
>> Ah, I haven't noticed the last couple messagess, mentioning dikkop. Not
>> sure if it's the same issue - I guess we'll have to wait if the failures
>> go away once it gets fixed.
> 
>> Still, it seems like a long-standing issue. It's curious dikkop started
>> failing only very recently, after the OS upgrade.
> 
> Yeah.  Given that these are threading tests, I'm suspecting some
> change in thread scheduling behavior in this latest FBSD release,
> which somehow made it easier to hit a pre-existing issue.
>


The failures on dikkop continued, and I got annoyed by that. So I spent
a bit of time investigating this today. I managed to reproduce the
crashes in a way that allowed me to grab backtraces from the core files
(which we don't seem to get in the buildfarm). And all of them generally
looked like this:

--------------------------------------------------------------------
Program terminated with signal SIGSEGV, Segmentation fault.
Address not mapped to object.
#0  __vfprintf (fp=fp@entry=0x889dab00, locale=locale@entry=0x841dfcf0
<__xlocale_global_locale>, serrno=0, fmt0=fmt0@entry=0x20087a
"Connection: %d", ap=...) at /usr/src/lib/libc/stdio/vfprintf.c:477

warning: 477    /usr/src/lib/libc/stdio/vfprintf.c: No such file or directory
[Current thread is 1 (LWP 933707)]
(gdb) bt
#0  __vfprintf (fp=fp@entry=0x889dab00, locale=locale@entry=0x841dfcf0
<__xlocale_global_locale>, serrno=0, fmt0=fmt0@entry=0x20087a
"Connection: %d", ap=...) at /usr/src/lib/libc/stdio/vfprintf.c:477
#1  0x000000008411d434 in vsprintf_l (locale=0x841dfcf0
<__xlocale_global_locale>, ap=..., str=<optimized out>, fmt=<optimized
out>) at /usr/src/lib/libc/stdio/vsprintf.c:61
#2  vsprintf (str=0x889daefc "", fmt=<optimized out>, ap=...) at
/usr/src/lib/libc/stdio/vsprintf.c:68
#3  0x00000000841123f4 in sprintf (str=0x0, str@entry=0x889daefc "",
fmt=0x841dfcf0 <__xlocale_global_locale> "") at
/usr/src/lib/libc/stdio/sprintf.c:56
#4  0x0000000000210ed8 in fn (arg=0x0) at prep.pgc:39
#5  0x00000000847844fc in thread_start (curthread=0x63044546d010) at
/usr/src/lib/libthr/thread/thr_create.c:291
#6  0x0000000000000000 in ?? ()
--------------------------------------------------------------------

Which is a crash in snprintf, or rather vfprintf on this line:

    decpt_len = (decimal_point[1] == '\0' ? 1 : strlen(decimal_point));

That's a bit bizarre, because I thought snprintf is meant to be thread
safe. After staring at this for a while and getting nowhere, I decided
to make sure that's actually true, by doing this:

--------------------------------------------------------------------
#include <stdio.h>
#include <pthread.h>

void *fn(void *arg) {
    char buf[128];
    for(int i=0; i<10000; i++)
        sprintf(buf, "%d", 1);
    return NULL;
}

int main() {
    pthread_t th[100];

    for(int i=0; i<100; i++)
        pthread_create(&th[i], NULL, fn, NULL);

    for(int i=0; i<100; i++)
        pthread_join(th[i], NULL);

    return 0;
}
--------------------------------------------------------------------

And after running that a in a loop for a couple seconds ...

$ while /usr/bin/true; do ./test; done;
Segmentation fault (core dumped)

The backtrace is exactly the one I posted earlier. Turns out snprintf()
is not as thread-safe as I thought.

AFAIK the culprit it the line right before the crash:

    decimal_point = localeconv_l(locale)->decimal_point;
    /* The overwhelmingly common case is decpt_len == 1. */
    decpt_len = (decimal_point[1] == '\0' ? 1 : strlen(decimal_point));

The localeconv_t() rebuilds some sort of cache, protected by an atomic
flag. But it uses atomic_store_int(), which is a "relaxed" variant:

>From "atomic" manpage:

    ... The first, or relaxed variant, performs the operation without
    imposing any ordering constraints on accesses to other memory
    locations. ...

So localeconv_t() updates some fields (e.g. "decimal_point" pointer),
and sets the flag. But there's no guarantee in which order will other
threads see these updates. A thread may see the new "flag" value, but
not the decimal_point value. That seems ... not great?

I suppose this only affects architectures like ARM/AArch64, which is
what dikkop runs on (rpi5). Which would explain why we don't see this on
the x86_64 animals. opaleye didn't hit it either, but it has not run for
months, and it didn't dikkop only started hitting it ~3 months ago.

As for a fix, it seems enough to populate the cache before starting the
threads, e.g. by doing a sprintf().


regards

-- 
Tomas Vondra

Re: occasional ECPG failures on dikkop (FreeBSD)

Reply via email to