Re: Understanding PR kern/43997 (kernel timing problems / qemu)
Date: Sun, 30 Jul 2017 16:04:38 - (UTC)
From: mlel...@serpens.de (Michael van Elst)
Message-ID:

| There are slower emulated systems that don't have these issues. (*)

Yes, that it is not qemu's execution speed was (really, always) becoming
obvious.

| If the host misses interrupts, time in the guest just passes slower
| than real-time. But inside the guest it is consistent.

If we could achieve that (which changing the timecounter in qemu apparently
achieves) it would at least make the world become rational. Of course,
keeping the timing running faster would be better - if we were able to get
to a state where the client/guest were actually able to talk to the outside
world (that part is easy) and run NTP, and act as a time server that others
could trust, that would be ideal.

| This is not to be confused with the kernel idea of wall-clock time
| (i.e. what date reports). wall-clock time is usually maintained
| by hardware separated from the interrupt timers. The 'date; sleep 5; date'
| sequence therefore can show that 10 seconds passed.

But that is totally broken. While there is no guarantee that a sleep will
wake up after exactly the time requested, it should be as close as is
reasonably possible - and on an unloaded system, where there is sufficient
RAM, and nothing swapped out, and nothing competing for cpu cycles, that
sequence should (always) show that something between 5 and 5-and-a-bit
seconds has passed. If the cpu is busy, or things are getting swapped/paged
out, then we can expect slower (not only for processes waiting upon timer
signals, but for everything), and that's acceptable. But otherwise,
inconsistent timing is not acceptable. All kinds of applications (including
network protocols) require time to be kept in a way that is at least close
to what others observe, even if not identical.
One easy (poor) fix is simply to do as used to be done, and have kernel
wall clock time maintained by the tick interrupt - that makes things
consistent, but without any real expectation of accuracy. The alternative
is to make the tick counts depend upon the external wall clock time source,
so they keep in sync - much the same as the power companies do with
frequency: over any short period, the nominal 50/60 Hz frequency can drift
around a lot, but when measured over any reasonable period, those things
are highly accurate (which is why old AC-frequency-based tick systems used
to have very good long term time stability, provided they never lost clock
interrupts.)

| The problem with qemu is that it's running on a NetBSD host and
| therefore cannot issue interrupts based on host time unless the
| host has a larger HZ value.

In the system of most interest, the host and the guest are the exact same
system (the exact same binary kernel) - unless we alter the config of one
of them explicitly to avoid this issue, they cannot help but have the same
HZ value. As long as the emulated qemu client has access to a reasonably
accurate ToD value (which it obviously does, as the host's time is
available to qemu, and can be, and is it seems, made available to the
guest) there's no reason at all the guest cannot produce the correct number
of ticks. And doing so (since it is just a generic NetBSD) would solve the
similar, but less blatant, issue for any other system using ticks, where
the occasional clock interrupt might get lost, and where there is some
other ToD source available.

| With host and guest running at HZ=100, it's obvious that interrupts
| mostly come just too late and require two ticks on the host, thus
| slowing down guest time by a factor of two.
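The factor-of-two mechanism in the last quote can be modelled in a few
lines. This is a hypothetical model (not qemu or NetBSD source): the host
can only end a sleep on one of its own 10ms tick boundaries, and the guest
re-arms its 10ms timer just *after* each wakeup, so every guest tick spans
two host ticks:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative constants: both host and guest run HZ=100. */
#define HOST_TICK_NS	10000000ULL
#define GUEST_TICK_NS	10000000ULL

/* The host can only complete a sleep at the first tick boundary
 * at or after the requested deadline. */
static uint64_t
next_host_wakeup(uint64_t deadline_ns)
{
	return ((deadline_ns + HOST_TICK_NS - 1) / HOST_TICK_NS) *
	    HOST_TICK_NS;
}

/* Guest arms a 10ms timer, wakes, and re-arms 1ns after waking
 * (modelling the small amount of work done before re-arming).
 * Returns the absolute time after nticks guest ticks. */
static uint64_t
simulate_guest_ticks(unsigned nticks)
{
	uint64_t now = 1;	/* start just after a host tick */
	unsigned i;

	for (i = 0; i < nticks; i++)
		now = next_host_wakeup(now + GUEST_TICK_NS) + 1;
	return now;
}
```

Running the model, each requested 10ms interval takes two 10ms host ticks,
so 5 guest ticks consume 100ms of host time: guest time passes at exactly
half rate, matching the observed `sleep 10` taking 20 seconds.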
Yes, that is a very good explanation for the observed behaviour, and I
cannot help but be grateful that simply beginning to discuss this issue has
provided so many insights into what is happening, and what we can do to fix
things.

When there is no alternative to tick interrupts, we can, and do, use those
to measure time, and everything works - just, if the ticks are not received
at the expected rate, time keeping drifts away from real time (but
invisibly when considered only within the system.)

When there is some better measure of real time we can use, we can make
sure that keeps all time keeping synchronised better, regardless of whether
the system is "tickless" or still tick based - it isn't required that every
single tick be 1/HZ apart (they never are precisely anyway) just that over
the long term (which in computing is a half second or so) the correct
number of ticks have occurred. I think it should be possible to make that
happen, and that is what I am going to see if I can do.

Then we can see if we can find a (good enough) way to make nanosleep()
less ticky - whether by giving up on ticks altogether (which is probably
not the best solution - even if we don't use ticks for timing, we'd end up
emulating them for other things, if only to avoid needing to rewrite too
Re: kmem_alloc(0, f)
On Sun, Jul 30, 2017 at 03:23:50PM -, Michael van Elst wrote:
> So what does kmem_alloc(0, KM_SLEEP) do? fail where KM_SLEEP says it
> cannot fail? I don't think that it can return a zero sized allocation
> (i.e. ptr != NULL that cannot be dereferenced).

Sure it could: return a pointer inside some red zone unmapped (but
reserved kva) page. On typical setups, and modulo sysctl
vm.user_va0_disable, e.g. "return (void*)16;" just as a simple example.

Martin
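Martin's idea can be sketched in userland. This is a hypothetical
illustration (the names `xalloc`/`xfree` and the sentinel value are
invented for the sketch, not NetBSD API): zero-sized allocations get a
distinguished non-NULL pointer that is never dereferenceable, so callers
can still distinguish success from failure:

```c
#include <assert.h>
#include <stdlib.h>

/* Sentinel for zero-sized allocations: non-NULL, but pointing into
 * memory that is never mapped (in-kernel this would sit inside a
 * reserved-but-unmapped red zone page).  Illustrative value only. */
#define ZERO_SIZE_PTR ((void *)16)

static void *
xalloc(size_t size)
{
	if (size == 0)
		return ZERO_SIZE_PTR;	/* success, but never dereference */
	return malloc(size);
}

static void
xfree(void *p, size_t size)
{
	if (size == 0) {
		assert(p == ZERO_SIZE_PTR);	/* catch mismatched frees */
		return;
	}
	free(p);
}
```

The point of the sentinel being a small fixed address is that any
accidental dereference faults immediately, which preserves the
bug-catching value of the current panic without the panic.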
Re: kmem API to allocate arrays
On Sun, Jul 30, 2017 at 03:30:59PM -, Michael van Elst wrote:
> Reallocation is usually a reason for memory fragmentation. I would
> rather try to avoid it instead of making it easier.

Agreed. Also, for kernel drivers, resizing an array allocation is a very
rare operation and no good reason to overcomplicate the api.

Martin
Re: Understanding PR kern/43997 (kernel timing problems / qemu)
g...@gson.org (Andreas Gustafsson) writes:

>Frank Kardel wrote:
>> Fixing that requires some more work. But I am surprised that the qemu
>> interrupt rate is seemingly somewhat around 50Hz.

It shouldn't have a problem on Linux.

-- 
-- 
Michael van Elst
Internet: mlel...@serpens.de
"A potential Snark may lurk in every tree."
Re: kmem API to allocate arrays
On 30.07.2017 16:51, Taylor R Campbell wrote:
>> Date: Sun, 30 Jul 2017 16:24:07 +0200
>> From: Kamil Rytarowski
>>
>> I would allow size to be 0, like with the original reallocarr(3). It
>> might be less pretty, but more compatible with the original model and
>> less vulnerable to accidental panics for no good reason.
>
> Hard to imagine a legitimate use case for size = 0. Almost always,
> the parameter will be sizeof(struct foo), or some kind of blocksize
> which necessarily has to be nonzero.
>
> I started writing some example code, and I'm not too keen on having to
> write kmem_reallocarr for initial allocation and final freeing, so if
> we adopted this, I'd like to have
>
> int kmem_allocarr(void *ptrp, size_t size, size_t count, km_flag_t flags);
> int kmem_reallocarr(void *ptrp, size_t size, size_t ocnt, size_t ncnt,
>     km_flag_t flags);
> void kmem_freearr(void *ptrp, size_t size, size_t count);
>
> ...at which point it's actually not clear to me that we have much of a
> use for kmem_reallocarr. Maybe we do -- I haven't surveyed many
> users.
>
> This still doesn't address the question of whether or how we should
> express bounds on the allowed sizes of the arrays.

I see, perhaps it's legitimate to avoid realloc due to fragmentation.
Without this, reallocarr has little point.
Re: kmem API to allocate arrays
campbell+netbsd-tech-k...@mumble.net (Taylor R Campbell) writes:

>Initially I was reluctant to do that because (a) we don't even have a
>kmem_realloc, perhaps for some particular reason, and (b) it requires
>an extra parameter for the old size. But I don't know any particular
>reason in (a), and perhaps (b) not so bad after all. Here's a draft:

Reallocation is usually a reason for memory fragmentation. I would
rather try to avoid it instead of making it easier.

-- 
-- 
Michael van Elst
Internet: mlel...@serpens.de
"A potential Snark may lurk in every tree."
Re: kmem_alloc(0, f)
mar...@duskware.de (Martin Husemann) writes:

>On Sat, Jul 29, 2017 at 02:04:42PM +, Taylor R Campbell wrote:
>> This seems like a foot-oriented panic gun, and it's been a source of
>> problems in the past. Can we change it?

>I think it is a valuable tool to catch driver bugs early during
>development, but wouldn't mind to reduce it to a KASSERT.

So what does kmem_alloc(0, KM_SLEEP) do? fail where KM_SLEEP says it
cannot fail? I don't think that it can return a zero sized allocation
(i.e. ptr != NULL that cannot be dereferenced).

-- 
-- 
Michael van Elst
Internet: mlel...@serpens.de
"A potential Snark may lurk in every tree."
Re: Understanding PR kern/43997 (kernel timing problems / qemu)
>> # time sleep 10
>>        10.02 real         0.00 user         0.00 sys
>> This actually took 20 seconds of real time (manually timed with a
>> stopwatch).

> [...], but an error of a factor 2 looks suspicious.

This is tickling old memories. I think I ran into a case where requesting
timer ticks at 100Hz actually got them at 50Hz instead, even though the
kernel was running with 100Hz ticks. I've done some searching and
completely failed to find either the program exhibiting the symptom (I
_think_ it was userland) or the fix, but it might be worth looking into
the possibility that this is another manifestation of the same underlying
problem, whatever it was.

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTML		mo...@rodents-montreal.org
/ \ Email!	7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B
Re: kmem API to allocate arrays
> Date: Sun, 30 Jul 2017 16:24:07 +0200
> From: Kamil Rytarowski
>
> I would allow size to be 0, like with the original reallocarr(3). It
> might be less pretty, but more compatible with the original model and
> less vulnerable to accidental panics for no good reason.

Hard to imagine a legitimate use case for size = 0. Almost always,
the parameter will be sizeof(struct foo), or some kind of blocksize
which necessarily has to be nonzero.

I started writing some example code, and I'm not too keen on having to
write kmem_reallocarr for initial allocation and final freeing, so if
we adopted this, I'd like to have

int	kmem_allocarr(void *ptrp, size_t size, size_t count, km_flag_t flags);
int	kmem_reallocarr(void *ptrp, size_t size, size_t ocnt, size_t ncnt,
	    km_flag_t flags);
void	kmem_freearr(void *ptrp, size_t size, size_t count);

...at which point it's actually not clear to me that we have much of a
use for kmem_reallocarr. Maybe we do -- I haven't surveyed many
users.

This still doesn't address the question of whether or how we should
express bounds on the allowed sizes of the arrays.
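The "bounds on the allowed sizes" question is exactly the overflow guard
the draft elsewhere in this thread uses. As a userland sketch (the
`alloc_size_ok` name is invented here; the SQRT_SIZE_MAX trick itself is
the one used by reallocarr(3)/reallocarray(3)):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* If both operands are below sqrt(SIZE_MAX), size * count cannot
 * overflow, so the (slower) division is only done in the rare case
 * where at least one operand is large. */
#define SQRT_SIZE_MAX (((size_t)1) << (sizeof(size_t) * 8 / 2))

/* Precondition: size != 0 (the draft KASSERTs this), so the division
 * below is safe. */
static int
alloc_size_ok(size_t size, size_t count)
{
	if ((size | count) >= SQRT_SIZE_MAX && count > SIZE_MAX / size)
		return 0;	/* size * count would overflow */
	return 1;
}
```

The common case (small struct, modest count) takes only the cheap
bitwise-OR comparison and never divides.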
Re: kmem API to allocate arrays
On 30.07.2017 15:45, Taylor R Campbell wrote:
>> Date: Sun, 30 Jul 2017 10:22:11 +0200
>> From: Kamil Rytarowski
>>
>> I think we should go for kmem_reallocarr(). It has been designed for
>> overflows like reallocarray(3) with an option to be capable to resize a
>> table from 1 to N elements and back from N to 0 including freeing.
>
> Initially I was reluctant to do that because (a) we don't even have a
> kmem_realloc, perhaps for some particular reason, and (b) it requires
> an extra parameter for the old size. But I don't know any particular
> reason in (a), and perhaps (b) not so bad after all. Here's a draft:
>
> int
> kmem_reallocarr(void *ptrp, size_t size, size_t ocnt, size_t ncnt, int flags)
> {
> 	void *optr, *nptr;
>
> 	KASSERT(size != 0);
> 	if (__predict_false((size|ncnt) >= SQRT_SIZE_MAX &&
> 	    ncnt > SIZE_MAX/size))
> 		return ENOMEM;
>
> 	memcpy(&optr, ptrp, sizeof(void *));
> 	KASSERT((ocnt == 0) == (optr == NULL));
> 	if (ncnt == 0) {
> 		nptr = NULL;
> 	} else {
> 		nptr = kmem_alloc(size*ncnt, flags);
> 		KASSERT(nptr != NULL || flags == KM_NOSLEEP);
> 		if (nptr == NULL)
> 			return ENOMEM;
> 	}
> 	KASSERT((ncnt == 0) == (nptr == NULL));
> 	if (ocnt && ncnt)
> 		memcpy(nptr, optr, size*MIN(ocnt, ncnt));
> 	if (ocnt != 0)
> 		kmem_free(optr, size*ocnt);
> 	memcpy(ptrp, &nptr, sizeof(void *));
>
> 	return 0;
> }

I would allow size to be 0, like with the original reallocarr(3). It
might be less pretty, but more compatible with the original model and
less vulnerable to accidental panics for no good reason.
Re: Understanding PR kern/43997 (kernel timing problems / qemu)
Frank Kardel wrote:
> Fixing that requires some more work. But I am surprised that the qemu
> interrupt rate is seemingly somewhat around 50Hz.
> Could it be a bug in qemu getting the frequency not right? qemu should
> read the clock to get the frequencies right and possibly skip
> usleeps less than 1/HZ, possibly managing an error-budget. I haven't
> looked into qemu at all, but an error of a factor 2 looks suspicious.

I fully agree.
-- 
Andreas Gustafsson, g...@gson.org
Re: kmem API to allocate arrays
> Date: Sun, 30 Jul 2017 10:22:11 +0200
> From: Kamil Rytarowski
>
> I think we should go for kmem_reallocarr(). It has been designed for
> overflows like reallocarray(3) with an option to be capable to resize a
> table from 1 to N elements and back from N to 0 including freeing.

Initially I was reluctant to do that because (a) we don't even have a
kmem_realloc, perhaps for some particular reason, and (b) it requires
an extra parameter for the old size. But I don't know any particular
reason in (a), and perhaps (b) not so bad after all. Here's a draft:

int
kmem_reallocarr(void *ptrp, size_t size, size_t ocnt, size_t ncnt, int flags)
{
	void *optr, *nptr;

	KASSERT(size != 0);
	if (__predict_false((size|ncnt) >= SQRT_SIZE_MAX &&
	    ncnt > SIZE_MAX/size))
		return ENOMEM;

	memcpy(&optr, ptrp, sizeof(void *));
	KASSERT((ocnt == 0) == (optr == NULL));
	if (ncnt == 0) {
		nptr = NULL;
	} else {
		nptr = kmem_alloc(size*ncnt, flags);
		KASSERT(nptr != NULL || flags == KM_NOSLEEP);
		if (nptr == NULL)
			return ENOMEM;
	}
	KASSERT((ncnt == 0) == (nptr == NULL));
	if (ocnt && ncnt)
		memcpy(nptr, optr, size*MIN(ocnt, ncnt));
	if (ocnt != 0)
		kmem_free(optr, size*ocnt);
	memcpy(ptrp, &nptr, sizeof(void *));

	return 0;
}
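The draft's copy/free logic can be exercised outside the kernel with a
malloc-based analogue. This is a sketch only (the `reallocarr_sketch`
name and the -1 error convention are invented here; the contract follows
the draft above):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SQRT_SIZE_MAX (((size_t)1) << (sizeof(size_t) * 8 / 2))
#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Userland analogue of the kmem_reallocarr() draft: *ptrp is a pointer
 * to the array pointer; grow/shrink from ocnt to ncnt elements of the
 * given size, freeing entirely when ncnt == 0. */
static int
reallocarr_sketch(void *ptrp, size_t size, size_t ocnt, size_t ncnt)
{
	void *optr, *nptr;

	assert(size != 0);
	if ((size | ncnt) >= SQRT_SIZE_MAX && ncnt > SIZE_MAX / size)
		return -1;	/* size * ncnt would overflow */

	memcpy(&optr, ptrp, sizeof(void *));
	assert((ocnt == 0) == (optr == NULL));
	if (ncnt == 0) {
		nptr = NULL;
	} else {
		nptr = malloc(size * ncnt);
		if (nptr == NULL)
			return -1;
	}
	if (ocnt && ncnt)	/* preserve the surviving elements */
		memcpy(nptr, optr, size * MIN(ocnt, ncnt));
	free(optr);		/* free(NULL) is a no-op */
	memcpy(ptrp, &nptr, sizeof(void *));
	return 0;
}
```

The `memcpy(&optr, ptrp, ...)` dance mirrors the draft: it keeps the API
signature `void *ptrp` while avoiding strict-aliasing problems that a cast
to `void **` followed by a direct dereference could invite.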
Re: Understanding PR kern/43997 (kernel timing problems / qemu)
Hi Andreas !

On 07/30/17 15:20, Andreas Gustafsson wrote:
> Frank Kardel wrote:
>> Could you check which timecounter is used under qemu?
>>
>> sysctl kern.timecounter.hardware
>
> # sysctl kern.timecounter.hardware
> kern.timecounter.hardware = hpet0
>
>> Usually the timecounters are hardware-based and have no relation
>> to the clockinterrupt. In case of qemu you might get a good
>> emulated timecounter, but a suboptimal clockinterrupt.
>> If this is the case it helps to use the clockinterrupt
>> itself as timecounter for the wall clock time to avoid a discrepancy
>> between clockinterrupt-driven timeout handling and wall-clock time
>> tracking.
>>
>> sysctl -w kern.timecounter.hardware=clockinterrupt
>
> # sysctl -w kern.timecounter.hardware=clockinterrupt
> kern.timecounter.hardware: hpet0 -> clockinterrupt
> # time sleep 10
>        10.02 real         0.00 user         0.00 sys
>
> This actually took 20 seconds of real time (manually timed with a
> stopwatch).
>
>> This is the opposite of deducing the missed clock interrupts
>> from the wall clock time, and keeps timeout handling and the wall-time
>> observed in the emulation synchronized no matter how slow
>> the clock-interrupts are - the emulated wall clock time will be
>> at the same rate.
>
> Right, but I would still rather see the bug fixed than worked around
> this way.

Fixing that requires some more work. But I am surprised that the qemu
interrupt rate is seemingly somewhat around 50Hz.
Could it be a bug in qemu getting the frequency not right? qemu should
read the clock to get the frequencies right and possibly skip
usleeps less than 1/HZ, possibly managing an error-budget. I haven't
looked into qemu at all, but an error of a factor 2 looks suspicious.

Frank
Re: Understanding PR kern/43997 (kernel timing problems / qemu)
Could you check which timecounter is used under qemu?

sysctl kern.timecounter.hardware

Usually the timecounters are hardware-based and have no relation
to the clockinterrupt. In case of qemu you might get a good
emulated timecounter, but a suboptimal clockinterrupt.
If this is the case it helps to use the clockinterrupt
itself as timecounter for the wall clock time to avoid a discrepancy
between clockinterrupt-driven timeout handling and wall-clock time
tracking.

sysctl -w kern.timecounter.hardware=clockinterrupt

This is the opposite of deducing the missed clock interrupts
from the wall clock time, and keeps timeout handling and the wall-time
observed in the emulation synchronized no matter how slow
the clock-interrupts are - the emulated wall clock time will be
at the same rate.

This might be a workaround for the current qemu issue and does not
affect any discussion about improving sleep timing or migrating to a
tick-less kernel.

BTW: even a tick-less kernel will need to have a minimum interrupt
frequency in order to avoid undetected timecounter wrapping.

Frank

On 07/30/17 14:22, Robert Elz wrote:
> Date: Sun, 30 Jul 2017 13:01:50 +0300
> From: Andreas Gustafsson
> Message-ID: <22909.44686.188004.117...@guava.gson.org>
>
> | I don't think the slowness of qemu's emulation is the actual cause of
> | its inability to simulate clock interrupts at 100 Hz.
>
> Yes, I was wondering about that, as if it was, there'd often be no time
> left for anything else...
>
> | If my theory is correct, there are at least three ways the problem
> | could be fixed:
> |
> | - Improve the time resolution of sleeps on the host system,
> | - Make qemu deal better with hosts unable to sleep for short periods
>
> Either, or both, of those should be fixed, and I might get to take a
> look at the first one (the insides of qemu are not all that
> appealing...) but
>
> | - Make the guest system deal better with missed timer interrupts.
>
> This one needs to be fixed. An idle system that says it takes 13
> seconds to do a sleep 10 is simply broken.
> Fixing the other issues (or either one of them) would make it much
> harder to work on this one - that is, keeping the qemu/host relationship
> stable allows a platform where the timekeeping issues in the kernel are
> known to occur, so a good way to verify any fix, so I think this should
> be fixed first.
>
> kre
Re: Understanding PR kern/43997 (kernel timing problems / qemu)
Date: Sun, 30 Jul 2017 13:01:50 +0300
From: Andreas Gustafsson
Message-ID: <22909.44686.188004.117...@guava.gson.org>

| I don't think the slowness of qemu's emulation is the actual cause of
| its inability to simulate clock interrupts at 100 Hz.

Yes, I was wondering about that, as if it was, there'd often be no time
left for anything else...

| If my theory is correct, there are at least three ways the problem
| could be fixed:
|
| - Improve the time resolution of sleeps on the host system,
| - Make qemu deal better with hosts unable to sleep for short periods

Either, or both, of those should be fixed, and I might get to take a look
at the first one (the insides of qemu are not all that appealing...) but

| - Make the guest system deal better with missed timer interrupts.

This one needs to be fixed. An idle system that says it takes 13 seconds
to do a sleep 10 is simply broken.

Fixing the other issues (or either one of them) would make it much harder
to work on this one - that is, keeping the qemu/host relationship stable
allows a platform where the timekeeping issues in the kernel are known to
occur, so a good way to verify any fix, so I think this should be fixed
first.

kre
Re: kmem_alloc(0, f)
On Sat, Jul 29, 2017 at 02:04:42PM +, Taylor R Campbell wrote:
> This seems like a foot-oriented panic gun, and it's been a source of
> problems in the past. Can we change it?

I think it is a valuable tool to catch driver bugs early during
development, but wouldn't mind to reduce it to a KASSERT.

Martin
Re: Understanding PR kern/43997 (kernel timing problems / qemu)
Robert Elz wrote:
> I want to leave /bin/sh to percolate for a while, make sure there are
> no issues with it as it is, before starting on the next round of
> cleanups and bug fixes, so I was looking for something else to poke
> my nose into ...
>
> [Aside: the people I added to the cc of this message are those who have
> added text to PR kern/43997 and so I thought might be interested, if
> you're not, just say...]
>
> kern/43997 is the "qemu is too slow, clock interrupts get lost, timing
> gets all messed up" problem that plagues many of the ATF tests that kind
> of expect time to be maintained rationally.

Thank you for looking into this.

> Now there's no question that qemu is slow, for example, on my amd64 Xen
> DomU test system, the shell arithmetic test of ++x (etc) takes:
>     var_preinc: [0.077617s] Passed.
> whereas from the latest completed b5 (qemu) test run (as of this e-mail)
>     var_preinc    Passed    N/A    6.200489s
>
> That's about 80 times slower (and most of the other tests show similar
> factors). I don't think we can blame qemu for that, given what it is
> doing.
>
> So, it is hardly surprising that, to borrow Paul's words from the PR:
>     On (at least) amd64 architecture, qemu cannot simulate clock
>     interrupts at 100Hz.

I don't think the slowness of qemu's emulation is the actual cause of
its inability to simulate clock interrupts at 100 Hz. Rather, I think it
is more likely caused by the inability of qemu to sleep for periods
shorter than 10 ms due to limitations of the underlying host OS, such as
that documented in the BUGS section of nanosleep(2).
That this is at least partly a host system issue is supported by the
observation that when qemu is hosted on a Linux system, the timing in the
NetBSD guest is much more accurate than when qemu is hosted on NetBSD, on
similar hardware:

NetBSD-on-qemu-on-NetBSD# time sleep 10
       13.00 real         0.00 user         0.03 sys

NetBSD-on-qemu-on-Linux# time sleep 10
       10.13 real         0.02 user         0.02 sys

If my theory is correct, there are at least three ways the problem
could be fixed:

- Improve the time resolution of sleeps on the host system, as recently
  discussed on tech-kern in a thread starting with
  http://mail-index.netbsd.org/tech-kern/2017/07/02/msg022024.html
- Make qemu deal better with hosts unable to sleep for short periods of
  time, or
- Make the guest system deal better with missed timer interrupts.

-- 
Andreas Gustafsson, g...@gson.org
Understanding PR kern/43997 (kernel timing problems / qemu)
I want to leave /bin/sh to percolate for a while, make sure there are
no issues with it as it is, before starting on the next round of
cleanups and bug fixes, so I was looking for something else to poke
my nose into ...

[Aside: the people I added to the cc of this message are those who have
added text to PR kern/43997 and so I thought might be interested, if
you're not, just say...]

kern/43997 is the "qemu is too slow, clock interrupts get lost, timing
gets all messed up" problem that plagues many of the ATF tests that kind
of expect time to be maintained rationally.

Now there's no question that qemu is slow, for example, on my amd64 Xen
DomU test system, the shell arithmetic test of ++x (etc) takes:

    var_preinc: [0.077617s] Passed.

whereas from the latest completed b5 (qemu) test run (as of this e-mail)

    var_preinc    Passed    N/A    6.200489s

That's about 80 times slower (and most of the other tests show similar
factors). I don't think we can blame qemu for that, given what it is
doing.

So, it is hardly surprising that, to borrow Paul's words from the PR:

    On (at least) amd64 architecture, qemu cannot simulate clock
    interrupts at 100Hz.

nor that

    Therefore, a simple "date ; sleep 5; date" command actually
    requires 10 seconds to complete!

This (aside from the workload it creates on b5) shouldn't even really be
an issue, I don't think we have any ATF NTP tests, and if we did,
attempting those in a qemu emulated environment would be insane. The
problem is really (again from the PR)

    The routines sleep(3), usleep(3), and nanosleep(2) wake-up based on
    the occurrence of clock ticks. However, the timer interrupt routine
    determines the actual absolute time.

which means that the NetBSD kernel is getting itself out of sync - it is
not maintaining one consistent view of the time for the system it is
running.
Whether its time view internally matches the outside reality is not
really a big issue - obviously it is better if it does, at least as close
as possible (without external time sync mechanisms, nothing is perfect)
but internally it really should be consistent.

What's more, at least from the description of the problem, I see nothing
that would prevent the same issue arising (probably on a much smaller
scale) on any system that happened to suffer an interrupt storm (due to
either something broken, some kind of attack, or just a very heavy
workload) that happens to last more than 10ms (on a 100Hz based tick
system, 1ms on an alpha with 1024Hz) and causes a clock tick to be lost.

So, I think qemu is no more than a good environment for simulating the
underlying problem, and not itself in any material way related to the
problem, which is squarely a NetBSD kernel issue.

If there's no disagreement about this analysis, I plan on digging into
the clock/time handling parts of the kernel, and fixing this (whatever
it takes...)

My current guess of the "whatever it takes" is that something along the
lines of

    we know absolute time (the timer interrupt routine uses it)
    we know when the last clock tick happened (we made it happen, we can
        remember when that was)
    we can calculate how many clock ticks should have been generated in
        the intervening period
    tick tick tick...

is needed. But I am yet to delve into the code (this is mostly just from
the PR.) Note: this can be optimised so that there's very little (though
probably not zero) extra work in the common case where nothing is being
missed.

kre
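The owed-ticks arithmetic behind that sketch is small. This is a
hypothetical illustration (names and struct are invented for the sketch,
not NetBSD source): on each tick interrupt, compare the tick count
against what an independent ToD/timecounter source says it should be,
and account for the shortfall; when no ticks were missed, the common-case
cost is one multiply, one divide, and one compare:

```c
#include <assert.h>
#include <stdint.h>

/* State for steering a tick count by an independent time source. */
struct tick_steer {
	uint64_t base_ns;	/* ToD reading when steering started */
	uint64_t ticks;		/* ticks accounted for so far */
	uint32_t hz;		/* nominal tick rate */
};

/* Called from the (possibly late, possibly lossy) tick interrupt with a
 * fresh ToD reading; returns how many ticks to account for now so that
 * the long-term total matches what the ToD source says should have
 * happened.  Returns 1 in the common no-loss case, more after losses. */
static uint64_t
tick_steer_catchup(struct tick_steer *ts, uint64_t now_ns)
{
	uint64_t should = (now_ns - ts->base_ns) * ts->hz / 1000000000ULL;
	uint64_t owed = (should > ts->ticks) ? should - ts->ticks : 0;

	ts->ticks += owed;
	return owed;
}
```

Over the long term this forces the tick total into agreement with the
ToD source, which is exactly the AC-mains-style property discussed
earlier in the thread: individual ticks may be late, but the count over
any half second or so comes out right.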