Re: clock_gettime64 vdso bug on 32-bit arm, rpi-4
On Wed, May 20, 2020 at 7:09 PM Rich Felker wrote:
>
> On Wed, May 20, 2020 at 12:08:10PM -0400, Rich Felker wrote:
> > On Wed, May 20, 2020 at 04:41:29PM +0100, Szabolcs Nagy wrote:
> > > The 05/19/2020 22:31, Arnd Bergmann wrote:
> > > > On Tue, May 19, 2020 at 10:24 PM Adhemerval Zanella wrote:
> > > > > On 19/05/2020 16:54, Arnd Bergmann wrote:
> > > note: i could not reproduce it in qemu-system with these configs:
> > >
> > > qemu-system-aarch64 + arm64 kernel + compat vdso
> > > qemu-system-aarch64 + kvm accel (on cortex-a72) + 32bit arm kernel
> > > qemu-system-arm + cpu max + 32bit arm kernel
> > >
> > > so i think it's something specific to that user's setup
> > > (maybe rpi hw bug or gcc miscompiled the vdso or something
> > > with that particular linux, i built my own linux 5.6 because
> > > i did not know the exact kernel version where the bug was seen)
> > >
> > > i don't have access to rpi (or other cortex-a53 where i
> > > can install my own kernel) so this is as far as i got.
> >
> > If we have a binary of the kernel that's known to be failing on the
> > hardware, it would be useful to dump its vdso and examine the
> > disassembly to see if it was miscompiled.
>
> OK, OP posted it and I think we've solved this. See
> https://github.com/richfelker/musl-cross-make/issues/96#issuecomment-631604410

Thanks a lot everyone for figuring this out.

> And my analysis:
>
> <@dalias> see what i just found on the tracker
> <@dalias> patch_vdso/vdso_nullpatch_one in arch/arm/kernel/vdso.c patches out
>           the time32 functions in this case
> <@dalias> but not the time64 one
> <@dalias> this looks like a real kernel bug that's not hw-specific except
>           breaking on all hardware where the patching-out is needed
> <@dalias> we could possibly work around it by refusing to use the time64 vdso
>           unless the time32 one is also present
> <@dalias> yep
> <@dalias> so i think we've solved this. the kernel thought it wasnt using
>           vdso anymore because it patched it out
> <@dalias> but it forgot to patch out the time64 one
> <@dalias> so it stopped updating the data needed for vdso to work

As you mentioned in the issue tracker, the patching was meant as
an optimization, and missing it for clock_gettime64 was a mistake, but
that by itself should not have caused incorrect data to be returned.
I would assume that there is another bug that leads to clock_gettime64
not entering the syscall fallback path as it should but instead
returning bogus data.

Here are some more things I found:

- From reading the linux-5.6 code that was tested, I see that a condition
  that leads to patching out the clock_gettime() vdso should also lead to
  clock_gettime64() falling back to the syscall after
  __arch_get_hw_counter() returns an error, but for some reason that
  does not happen. Presumably the presence of the patching meant that
  this code path was never much exercised.
  A missing 45939ce292b4 ("ARM: 8957/1: VDSO: Match ARMv8 timer in
  cntvct_functional()") would explain the problem, if it happened
  on linux-5.6-rc7 or earlier. The fix was merged in the final v5.6 though.

- The patching may actually be counterproductive because it means that
  clock_gettime(CLOCK_*COARSE, ...) has to go through the system call
  when it could just return the time of the last timer tick regardless of
  the clocksource.

- We may get bitten by errata handling on 32-bit kernels running on 64-bit
  hardware that has errata workarounds in arch/arm64 for compat mode but
  not in native arm kernels. ARM64_ERRATUM_1418040, ARM64_ERRATUM_858921
  or SUN50I_ERRATUM_UNKNOWN1 are examples of workarounds that are not
  used on 32-bit kernels running on 64-bit hardware.

Arnd
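The workaround floated in the quoted IRC log (only trust the time64 vdso entry when the time32 one is still exported) can be modelled in a few lines. This is a hypothetical sketch of the decision only, not musl's actual code; the symbol names `__vdso_clock_gettime` and `__vdso_clock_gettime64` and the function `choose_clock_path` are assumptions made for illustration.

```python
# Hypothetical model of the proposed userspace workaround; not musl's code.
TIME32_SYM = "__vdso_clock_gettime"     # assumed name of the time32 vdso entry
TIME64_SYM = "__vdso_clock_gettime64"   # assumed name of the time64 vdso entry


def choose_clock_path(vdso_symbols):
    """Pick "vdso64" only if both entries are exported, else "syscall".

    If the kernel patched out the time32 entry, it considers the vdso
    unusable, so a surviving time64 entry is not trusted either.
    """
    symbols = set(vdso_symbols)
    if TIME64_SYM in symbols and TIME32_SYM in symbols:
        return "vdso64"
    return "syscall"


if __name__ == "__main__":
    # Healthy vdso: both entries present.
    print(choose_clock_path({TIME32_SYM, TIME64_SYM}))
    # Affected kernel: time32 patched out, time64 left behind.
    print(choose_clock_path({TIME64_SYM}))
```

As noted in the reply above, the missed patching alone should not produce wrong data, so a check like this would only sidestep the symptom from user space.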
Re: clock_gettime64 vdso bug on 32-bit arm, rpi-4
On Wed, May 20, 2020 at 12:08:10PM -0400, Rich Felker wrote: > On Wed, May 20, 2020 at 04:41:29PM +0100, Szabolcs Nagy wrote: > > The 05/19/2020 22:31, Arnd Bergmann wrote: > > > On Tue, May 19, 2020 at 10:24 PM Adhemerval Zanella > > > wrote: > > > > On 19/05/2020 16:54, Arnd Bergmann wrote: > > > > > Jack Schmidt reported a bug for the arm32 clock_gettimeofday64 vdso > > > > > call last > > > > > month: https://github.com/richfelker/musl-cross-make/issues/96 and > > > > > https://github.com/raspberrypi/linux/issues/3579 > > > > > > > > > > As Will Deacon pointed out, this was never reported on the mailing > > > > > list, > > > > > so I'll try to summarize what we know, so this can hopefully be > > > > > resolved soon. > > > > > > > > > > - This happened reproducibly on Linux-5.6 on a 32-bit Raspberry Pi > > > > > patched > > > > >kernel running on a 64-bit Raspberry Pi 4b (bcm2711) when calling > > > > >clock_gettime64(CLOCK_REALTIME) > > > > > > > > Does it happen with other clocks as well? > > > > > > Unclear. > > > > > > > > - The kernel tree is at https://github.com/raspberrypi/linux/, but I > > > > > could > > > > > see no relevant changes compared to a mainline kernel. > > > > > > > > Is this bug reproducible with mainline kernel or mainline kernel can't > > > > be > > > > booted on bcm2711? > > > > > > Mainline linux-5.6 should boot on that machine but might not have > > > all the other features, so I think users tend to use the raspberry pi > > > kernel sources for now. > > > > > > > > - From the report, I see that the returned time value is larger than > > > > > the > > > > > expected time, by 3.4 to 14.5 million seconds in four samples, my > > > > > guess is that a random number gets added in at some point. > > > > > > > > What kind code are you using to reproduce it? It is threaded or issue > > > > clock_gettime from signal handlers? 
> > > > > > The reproducer is very simple without threads or signals, > > > see the start of https://github.com/richfelker/musl-cross-make/issues/96 > > > > > > It does rely on calling into the musl wrapper, not the direct vdso > > > call. > > > > > > > > - From other sources, I found that the Raspberry Pi clocksource runs > > > > > at 54 MHz, with a mask value of 0xff. From these numbers > > > > > I would expect that reading a completely random hardware register > > > > > value would result in an offset up to 1.33 billion seconds, which is > > > > > around factor 100 more than the error we see, though similar. > > > > > > > > > > - The test case calls the musl clock_gettime() function, which falls > > > > > back to > > > > > the clock_gettime64() syscall on kernels prior to 5.5, or to the > > > > > 32-bit > > > > > clock_gettime() prior to Linux-5.1. As reported in the bug, > > > > > Linux-4.19 does > > > > > not show the bug. > > > > > > > > > > - The behavior was not reproduced on the same user space in qemu, > > > > > though I cannot tell whether the exact same kernel binary was used. > > > > > > > > > > - glibc-2.31 calls the same clock_gettime64() vdso function on arm to > > > > > implement clock_gettime(), but earlier versions did not. I have not > > > > > seen any reports of this bug, which could be explained by users > > > > > generally being on older versions. > > > > > > > > > > - As far as I can tell, there are no reports of this bug from other > > > > > users, > > > > > and so far nobody could reproduce it. 
> >
> > note: i could not reproduce it in qemu-system with these configs:
> >
> > qemu-system-aarch64 + arm64 kernel + compat vdso
> > qemu-system-aarch64 + kvm accel (on cortex-a72) + 32bit arm kernel
> > qemu-system-arm + cpu max + 32bit arm kernel
> >
> > so i think it's something specific to that user's setup
> > (maybe rpi hw bug or gcc miscompiled the vdso or something
> > with that particular linux, i built my own linux 5.6 because
> > i did not know the exact kernel version where the bug was seen)
> >
> > i don't have access to rpi (or other cortex-a53 where i
> > can install my own kernel) so this is as far as i got.
>
> If we have a binary of the kernel that's known to be failing on the
> hardware, it would be useful to dump its vdso and examine the
> disassembly to see if it was miscompiled.

OK, OP posted it and I think we've solved this. See
https://github.com/richfelker/musl-cross-make/issues/96#issuecomment-631604410

And my analysis:

<@dalias> see what i just found on the tracker
<@dalias> patch_vdso/vdso_nullpatch_one in arch/arm/kernel/vdso.c patches out
          the time32 functions in this case
<@dalias> but not the time64 one
<@dalias> this looks like a real kernel bug that's not hw-specific except
          breaking on all hardware where the patching-out is needed
<@dalias> we could possibly work around it by refusing to use the time64 vdso
          unless the time32 one is also present
<@dalias> yep
<@dalias> so i think we've solved this. the kernel thought it wasnt using
          vdso anymore because it patched it out
<@dalias> but it forgot to patch out the time64 one
<@dalias> so it stopped updating the data needed for vdso to work
Re: clock_gettime64 vdso bug on 32-bit arm, rpi-4
On Wed, May 20, 2020 at 04:41:29PM +0100, Szabolcs Nagy wrote: > The 05/19/2020 22:31, Arnd Bergmann wrote: > > On Tue, May 19, 2020 at 10:24 PM Adhemerval Zanella > > wrote: > > > On 19/05/2020 16:54, Arnd Bergmann wrote: > > > > Jack Schmidt reported a bug for the arm32 clock_gettimeofday64 vdso > > > > call last > > > > month: https://github.com/richfelker/musl-cross-make/issues/96 and > > > > https://github.com/raspberrypi/linux/issues/3579 > > > > > > > > As Will Deacon pointed out, this was never reported on the mailing list, > > > > so I'll try to summarize what we know, so this can hopefully be > > > > resolved soon. > > > > > > > > - This happened reproducibly on Linux-5.6 on a 32-bit Raspberry Pi > > > > patched > > > >kernel running on a 64-bit Raspberry Pi 4b (bcm2711) when calling > > > >clock_gettime64(CLOCK_REALTIME) > > > > > > Does it happen with other clocks as well? > > > > Unclear. > > > > > > - The kernel tree is at https://github.com/raspberrypi/linux/, but I > > > > could > > > > see no relevant changes compared to a mainline kernel. > > > > > > Is this bug reproducible with mainline kernel or mainline kernel can't be > > > booted on bcm2711? > > > > Mainline linux-5.6 should boot on that machine but might not have > > all the other features, so I think users tend to use the raspberry pi > > kernel sources for now. > > > > > > - From the report, I see that the returned time value is larger than the > > > > expected time, by 3.4 to 14.5 million seconds in four samples, my > > > > guess is that a random number gets added in at some point. > > > > > > What kind code are you using to reproduce it? It is threaded or issue > > > clock_gettime from signal handlers? > > > > The reproducer is very simple without threads or signals, > > see the start of https://github.com/richfelker/musl-cross-make/issues/96 > > > > It does rely on calling into the musl wrapper, not the direct vdso > > call. 
> > > > > > - From other sources, I found that the Raspberry Pi clocksource runs > > > > at 54 MHz, with a mask value of 0xff. From these numbers > > > > I would expect that reading a completely random hardware register > > > > value would result in an offset up to 1.33 billion seconds, which is > > > > around factor 100 more than the error we see, though similar. > > > > > > > > - The test case calls the musl clock_gettime() function, which falls > > > > back to > > > > the clock_gettime64() syscall on kernels prior to 5.5, or to the > > > > 32-bit > > > > clock_gettime() prior to Linux-5.1. As reported in the bug, > > > > Linux-4.19 does > > > > not show the bug. > > > > > > > > - The behavior was not reproduced on the same user space in qemu, > > > > though I cannot tell whether the exact same kernel binary was used. > > > > > > > > - glibc-2.31 calls the same clock_gettime64() vdso function on arm to > > > > implement clock_gettime(), but earlier versions did not. I have not > > > > seen any reports of this bug, which could be explained by users > > > > generally being on older versions. > > > > > > > > - As far as I can tell, there are no reports of this bug from other > > > > users, > > > > and so far nobody could reproduce it. > > note: i could not reproduce it in qemu-system with these configs: > > qemu-system-aarch64 + arm64 kernel + compat vdso > qemu-system-aarch64 + kvm accel (on cortex-a72) + 32bit arm kernel > qemu-system-arm + cpu max + 32bit arm kernel > > so i think it's something specific to that user's setup > (maybe rpi hw bug or gcc miscompiled the vdso or something > with that particular linux, i built my own linux 5.6 because > i did not know the exact kernel version where the bug was seen) > > i don't have access to rpi (or other cortex-a53 where i > can install my own kernel) so this is as far as i got. 
If we have a binary of the kernel that's known to be failing on the
hardware, it would be useful to dump its vdso and examine the
disassembly to see if it was miscompiled.

Rich
Re: clock_gettime64 vdso bug on 32-bit arm, rpi-4
The 05/19/2020 22:31, Arnd Bergmann wrote: > On Tue, May 19, 2020 at 10:24 PM Adhemerval Zanella > wrote: > > On 19/05/2020 16:54, Arnd Bergmann wrote: > > > Jack Schmidt reported a bug for the arm32 clock_gettimeofday64 vdso call > > > last > > > month: https://github.com/richfelker/musl-cross-make/issues/96 and > > > https://github.com/raspberrypi/linux/issues/3579 > > > > > > As Will Deacon pointed out, this was never reported on the mailing list, > > > so I'll try to summarize what we know, so this can hopefully be resolved > > > soon. > > > > > > - This happened reproducibly on Linux-5.6 on a 32-bit Raspberry Pi patched > > >kernel running on a 64-bit Raspberry Pi 4b (bcm2711) when calling > > >clock_gettime64(CLOCK_REALTIME) > > > > Does it happen with other clocks as well? > > Unclear. > > > > - The kernel tree is at https://github.com/raspberrypi/linux/, but I could > > > see no relevant changes compared to a mainline kernel. > > > > Is this bug reproducible with mainline kernel or mainline kernel can't be > > booted on bcm2711? > > Mainline linux-5.6 should boot on that machine but might not have > all the other features, so I think users tend to use the raspberry pi > kernel sources for now. > > > > - From the report, I see that the returned time value is larger than the > > > expected time, by 3.4 to 14.5 million seconds in four samples, my > > > guess is that a random number gets added in at some point. > > > > What kind code are you using to reproduce it? It is threaded or issue > > clock_gettime from signal handlers? > > The reproducer is very simple without threads or signals, > see the start of https://github.com/richfelker/musl-cross-make/issues/96 > > It does rely on calling into the musl wrapper, not the direct vdso > call. > > > > - From other sources, I found that the Raspberry Pi clocksource runs > > > at 54 MHz, with a mask value of 0xff. 
From these numbers > > > I would expect that reading a completely random hardware register > > > value would result in an offset up to 1.33 billion seconds, which is > > > around factor 100 more than the error we see, though similar. > > > > > > - The test case calls the musl clock_gettime() function, which falls back > > > to > > > the clock_gettime64() syscall on kernels prior to 5.5, or to the 32-bit > > > clock_gettime() prior to Linux-5.1. As reported in the bug, Linux-4.19 > > > does > > > not show the bug. > > > > > > - The behavior was not reproduced on the same user space in qemu, > > > though I cannot tell whether the exact same kernel binary was used. > > > > > > - glibc-2.31 calls the same clock_gettime64() vdso function on arm to > > > implement clock_gettime(), but earlier versions did not. I have not > > > seen any reports of this bug, which could be explained by users > > > generally being on older versions. > > > > > > - As far as I can tell, there are no reports of this bug from other users, > > > and so far nobody could reproduce it. note: i could not reproduce it in qemu-system with these configs: qemu-system-aarch64 + arm64 kernel + compat vdso qemu-system-aarch64 + kvm accel (on cortex-a72) + 32bit arm kernel qemu-system-arm + cpu max + 32bit arm kernel so i think it's something specific to that user's setup (maybe rpi hw bug or gcc miscompiled the vdso or something with that particular linux, i built my own linux 5.6 because i did not know the exact kernel version where the bug was seen) i don't have access to rpi (or other cortex-a53 where i can install my own kernel) so this is as far as i got. > > > > > > - The current musl git tree has been patched to not call clock_gettime64 > > >on ARM because of this problem, so it cannot be used for reproducing > > > it. > > > > So should glibc follow musl and remove arm clock_gettime6y4 vDSO support > > or this bug is localized to an specific kernel version running on an > > specific hardware? 
>
> I hope we can figure out what is actually going on soon, there is probably
> no need to change glibc before we have.
>
> Arnd
Re: clock_gettime64 vdso bug on 32-bit arm, rpi-4
On Tue, May 19, 2020 at 05:24:18PM -0300, Adhemerval Zanella wrote:
>
> On 19/05/2020 16:54, Arnd Bergmann wrote:
> > Jack Schmidt reported a bug for the arm32 clock_gettimeofday64 vdso call last
> > month: https://github.com/richfelker/musl-cross-make/issues/96 and
> > https://github.com/raspberrypi/linux/issues/3579
> >
> > As Will Deacon pointed out, this was never reported on the mailing list,
> > so I'll try to summarize what we know, so this can hopefully be resolved soon.
> >
> > - This happened reproducibly on Linux-5.6 on a 32-bit Raspberry Pi patched
> >   kernel running on a 64-bit Raspberry Pi 4b (bcm2711) when calling
> >   clock_gettime64(CLOCK_REALTIME)
>
> Does it happen with other clocks as well?
>
> > - The kernel tree is at https://github.com/raspberrypi/linux/, but I could
> >   see no relevant changes compared to a mainline kernel.
>
> Is this bug reproducible with mainline kernel or mainline kernel can't be
> booted on bcm2711?
>
> > - From the report, I see that the returned time value is larger than the
> >   expected time, by 3.4 to 14.5 million seconds in four samples, my
> >   guess is that a random number gets added in at some point.
>
> What kind code are you using to reproduce it? It is threaded or issue
> clock_gettime from signal handlers?

Original report thread is here:

https://github.com/richfelker/musl-cross-make/issues/96

The reporter originally misunderstood the issue and wrongly attributed
it to a difference between gettimeofday and clock_gettime, but it was
just big jumps between successive vdso clock_gettime64 calls.

No transformation was being done on the output of the vdso function;
as long as it succeeds, musl just returns directly with the value it
stored in the timespec. No threads or anything fancy were involved.

Current musl will no longer call it, but you should be able to
dlopen("linux-gate.so.1", RTLD_NOW|RTLD_LOCAL), then use dlsym to get
its address and call it (not tested; I've never used it this way).

> > - The current musl git tree has been patched to not call clock_gettime64
> >   on ARM because of this problem, so it cannot be used for reproducing it.
>
> So should glibc follow musl and remove arm clock_gettime64 vDSO support
> or this bug is localized to an specific kernel version running on an
> specific hardware?

For musl it was important to disable it asap pending a fix, because
users are expected to generate static binaries, and these could make
it into the wild without anyone realizing they're broken until much
later when run on an affected kernel (especially since pre-5.6 kernels
would hide the issue entirely due to lacking vdso). Ideally a fix will
be something we can detect (e.g. new symbol version) so as not to risk
calling the broken one, but whether that's necessary may depend on
what's affected.

I'm not sure if glibc should do the same; it's not often used in
static linking, and replacing libc (shared lib, or re-static-linking,
which LGPL requires you to facilitate to distribute static binaries)
could solve the issue on affected systems.

Rich
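The dlopen/dlsym route suggested above could be tried along the following lines. Like the suggestion itself, this is an untested sketch with respect to the affected hardware. Only the name "linux-gate.so.1" comes from the message; the alternative name "linux-vdso.so.1", the symbol name `__vdso_clock_gettime64`, and the two-field 64-bit timespec layout are assumptions. On a system where the library or symbol cannot be found, the function simply returns None.

```python
import ctypes
import time


class KernelTimespec64(ctypes.Structure):
    # Assumed layout of the 64-bit timespec the vdso entry fills in:
    # two signed 64-bit fields, seconds then nanoseconds.
    _fields_ = [("tv_sec", ctypes.c_int64), ("tv_nsec", ctypes.c_int64)]


def call_vdso_clock_gettime64(clock_id=time.CLOCK_REALTIME,
                              names=("linux-gate.so.1", "linux-vdso.so.1"),
                              symbol="__vdso_clock_gettime64"):
    """Return (sec, nsec) from the vdso entry, or None if it is unreachable."""
    for name in names:
        try:
            vdso = ctypes.CDLL(name)        # dlopen() under the hood
        except OSError:
            continue                        # this name is not loadable here
        try:
            entry = getattr(vdso, symbol)   # dlsym() under the hood
        except AttributeError:
            continue                        # symbol absent or patched out
        entry.argtypes = [ctypes.c_int, ctypes.POINTER(KernelTimespec64)]
        entry.restype = ctypes.c_int
        ts = KernelTimespec64()
        if entry(clock_id, ctypes.byref(ts)) == 0:
            return ts.tv_sec, ts.tv_nsec
    return None


if __name__ == "__main__":
    # Successive samples; on an affected kernel the report describes
    # large jumps between consecutive values.
    for _ in range(5):
        print(call_vdso_clock_gettime64())
```

Calling the entry directly bypasses the libc wrapper, so any jump seen between samples would come from the vdso function itself.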
Re: clock_gettime64 vdso bug on 32-bit arm, rpi-4
On Tue, May 19, 2020 at 10:24 PM Adhemerval Zanella wrote:
> On 19/05/2020 16:54, Arnd Bergmann wrote:
> > Jack Schmidt reported a bug for the arm32 clock_gettimeofday64 vdso call last
> > month: https://github.com/richfelker/musl-cross-make/issues/96 and
> > https://github.com/raspberrypi/linux/issues/3579
> >
> > As Will Deacon pointed out, this was never reported on the mailing list,
> > so I'll try to summarize what we know, so this can hopefully be resolved soon.
> >
> > - This happened reproducibly on Linux-5.6 on a 32-bit Raspberry Pi patched
> >   kernel running on a 64-bit Raspberry Pi 4b (bcm2711) when calling
> >   clock_gettime64(CLOCK_REALTIME)
>
> Does it happen with other clocks as well?

Unclear.

> > - The kernel tree is at https://github.com/raspberrypi/linux/, but I could
> >   see no relevant changes compared to a mainline kernel.
>
> Is this bug reproducible with mainline kernel or mainline kernel can't be
> booted on bcm2711?

Mainline linux-5.6 should boot on that machine but might not have
all the other features, so I think users tend to use the raspberry pi
kernel sources for now.

> > - From the report, I see that the returned time value is larger than the
> >   expected time, by 3.4 to 14.5 million seconds in four samples, my
> >   guess is that a random number gets added in at some point.
>
> What kind code are you using to reproduce it? It is threaded or issue
> clock_gettime from signal handlers?

The reproducer is very simple without threads or signals,
see the start of https://github.com/richfelker/musl-cross-make/issues/96

It does rely on calling into the musl wrapper, not the direct vdso
call.

> > - From other sources, I found that the Raspberry Pi clocksource runs
> >   at 54 MHz, with a mask value of 0xff. From these numbers
> >   I would expect that reading a completely random hardware register
> >   value would result in an offset up to 1.33 billion seconds, which is
> >   around factor 100 more than the error we see, though similar.
> >
> > - The test case calls the musl clock_gettime() function, which falls back to
> >   the clock_gettime64() syscall on kernels prior to 5.5, or to the 32-bit
> >   clock_gettime() prior to Linux-5.1. As reported in the bug, Linux-4.19 does
> >   not show the bug.
> >
> > - The behavior was not reproduced on the same user space in qemu,
> >   though I cannot tell whether the exact same kernel binary was used.
> >
> > - glibc-2.31 calls the same clock_gettime64() vdso function on arm to
> >   implement clock_gettime(), but earlier versions did not. I have not
> >   seen any reports of this bug, which could be explained by users
> >   generally being on older versions.
> >
> > - As far as I can tell, there are no reports of this bug from other users,
> >   and so far nobody could reproduce it.
> >
> > - The current musl git tree has been patched to not call clock_gettime64
> >   on ARM because of this problem, so it cannot be used for reproducing it.
>
> So should glibc follow musl and remove arm clock_gettime64 vDSO support
> or this bug is localized to an specific kernel version running on an
> specific hardware?

I hope we can figure out what is actually going on soon; there is probably
no need to change glibc before we have.

Arnd
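The clocksource estimate quoted above can be checked with a little arithmetic. The mask is printed in the thread as 0xff; the sketch below instead assumes a 56-bit mask, because that is the width for which a 54 MHz counter yields the quoted 1.33 billion seconds. The names `CLOCK_HZ`, `MASK_BITS` and `max_offset_seconds` are illustrative only.

```python
CLOCK_HZ = 54_000_000   # 54 MHz clocksource rate, from the message
MASK_BITS = 56          # assumption: the width that reproduces ~1.33e9 s


def max_offset_seconds(mask_bits=MASK_BITS, hz=CLOCK_HZ):
    """Largest time offset a completely random counter value could cause."""
    return ((1 << mask_bits) - 1) / hz


if __name__ == "__main__":
    worst = max_offset_seconds()
    print(f"max offset: {worst / 1e9:.2f} billion seconds")
    # Error range from the report: 3.4 to 14.5 million seconds.
    for observed in (3.4e6, 14.5e6):
        print(f"{observed:.3g} s observed -> factor {worst / observed:.0f}")
```

Under that assumption, the "factor 100" in the message corresponds to the largest of the four reported samples; the smallest sample is further away from the maximum.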
Re: clock_gettime64 vdso bug on 32-bit arm, rpi-4
On 19/05/2020 16:54, Arnd Bergmann wrote:
> Jack Schmidt reported a bug for the arm32 clock_gettimeofday64 vdso call last
> month: https://github.com/richfelker/musl-cross-make/issues/96 and
> https://github.com/raspberrypi/linux/issues/3579
>
> As Will Deacon pointed out, this was never reported on the mailing list,
> so I'll try to summarize what we know, so this can hopefully be resolved soon.
>
> - This happened reproducibly on Linux-5.6 on a 32-bit Raspberry Pi patched
>   kernel running on a 64-bit Raspberry Pi 4b (bcm2711) when calling
>   clock_gettime64(CLOCK_REALTIME)

Does it happen with other clocks as well?

> - The kernel tree is at https://github.com/raspberrypi/linux/, but I could
>   see no relevant changes compared to a mainline kernel.

Is this bug reproducible with a mainline kernel, or can a mainline kernel
not be booted on bcm2711?

> - From the report, I see that the returned time value is larger than the
>   expected time, by 3.4 to 14.5 million seconds in four samples, my
>   guess is that a random number gets added in at some point.

What kind of code are you using to reproduce it? Is it threaded, or does it
issue clock_gettime from signal handlers?

> - From other sources, I found that the Raspberry Pi clocksource runs
>   at 54 MHz, with a mask value of 0xff. From these numbers
>   I would expect that reading a completely random hardware register
>   value would result in an offset up to 1.33 billion seconds, which is
>   around factor 100 more than the error we see, though similar.
>
> - The test case calls the musl clock_gettime() function, which falls back to
>   the clock_gettime64() syscall on kernels prior to 5.5, or to the 32-bit
>   clock_gettime() prior to Linux-5.1. As reported in the bug, Linux-4.19 does
>   not show the bug.
>
> - The behavior was not reproduced on the same user space in qemu,
>   though I cannot tell whether the exact same kernel binary was used.
>
> - glibc-2.31 calls the same clock_gettime64() vdso function on arm to
>   implement clock_gettime(), but earlier versions did not. I have not
>   seen any reports of this bug, which could be explained by users
>   generally being on older versions.
>
> - As far as I can tell, there are no reports of this bug from other users,
>   and so far nobody could reproduce it.
>
> - The current musl git tree has been patched to not call clock_gettime64
>   on ARM because of this problem, so it cannot be used for reproducing it.

So should glibc follow musl and remove arm clock_gettime64 vDSO support,
or is this bug localized to a specific kernel version running on
specific hardware?

> If anyone has other information that may help figure out what is going
> on, please share.
>
> Arnd