Re: svn commit: r252032 - head/sys/amd64/include
On Wed, 26 Jun 2013, Gleb Smirnoff wrote: On Wed, Jun 26, 2013 at 11:42:39AM +1000, Bruce Evans wrote: B> > Anyway, as Gleb said, there is no point in B> > optimizing the i386 kernel. B> B> I said that there is every point in optimizing the i386 kernel. This B> applies even more to other 32-bit arches. Some CPUs are much slower B> than modern x86's. They shouldn't be slowed down more by inefficient B> KPIs. I didn't mean that i386 arch is a relic and should be ignored at all. What I actually meant, is that the problem of performance drop due to cache poisoning and loss of statistics with simple "+=" operation can be observed only at extremely high event rates, with multiple processors involved. I think you already fixed cache poisoning, and it has nothing to do with whether the access is a simple "+=' operation. amd64 still uses a simple '+=' operation (written in asm so that it uses the right instructions), so the slow cmpxch8b used on i386 can't be doing anything good for the cache. The counter(9) is solution for these conditions. Thus we are interested in optimising amd64, not i386. The latter isn't affected neither positively nor negatively with these changes, just because last i386 CPUs can't reach the event rates where need for counter(9) arises. Yes, you can tweak implementation and obtain better results with microbenchmarks, but I bet that any change in counter(9) implementation won't affect packet forwarding rate on any i386. What we claim for i386 (and all other arches) that counter(9) is lossless, and that's all. I second to Konstantin, that we don't have objections in any changes to i386 part of counter, including a daemon, but the changes shouldn't affect amd64. amd64 should be changed too, to use 32-bit pcpu counters to avoid ifdefs and to use less cache. You can't reach event rates that overflow 32-bit counters faster than a daemon can accumulate them. For example, with 10 Gbps ethernet the maximum packet rate is about 14 Mpps. Suppose that is all handled and counted on 1 CPU, which isn't possible yet. The daemon must run once every 286 seconds to keep up with that. Byte counts are more interesting. Counting 1 G/second of anything requires running the daemon every 4 seconds. I don't remember how you distributed the counters to avoid cache poisoning. Is it one pcpu counter per cache line, so that counters never poison nor benefit from caching for other counters for the same CPU? Or is it just separate cache lines for each CPU? I think the latter. So there can be several 64-bit counters per cache line, or twice as many 32-bit counters. Bruce ___ svn-src-head@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/svn-src-head To unsubscribe, send any mail to "svn-src-head-unsubscr...@freebsd.org"
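To make the proposal above concrete, here is an illustrative sketch (not the committed counter(9) code) of 32-bit per-CPU counters folded into a 64-bit total by a periodic housekeeping routine. The flat array stands in for the real per-CPU allocation and ignores the cache-line layout question raised above; counter32_add() and counter32_accumulate() are made-up names, and updates from interrupt context are assumed to be excluded, as stated elsewhere in the thread.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/pcpu.h>
#include <sys/smp.h>
#include <machine/atomic.h>

struct counter32 {
	uint32_t	c_pcpu[MAXCPU];	/* written only by the owning CPU */
	uint64_t	c_total;	/* written only by the accumulator */
};

static inline void
counter32_add(struct counter32 *c, uint32_t inc)
{

	critical_enter();		/* stay on this CPU for the update */
	c->c_pcpu[curcpu] += inc;	/* a single addl on i386/amd64 */
	critical_exit();
}

/*
 * Run every few minutes; at 14 Mpps a 32-bit slot takes roughly five
 * minutes to wrap, so a modest period is enough.  Strictly, the swap
 * below should run on (or be coordinated with) the owning CPU, or it can
 * race the unlocked "+=" above -- the same lost-update problem discussed
 * elsewhere in the thread.
 */
static void
counter32_accumulate(struct counter32 *c)
{
	int i;

	CPU_FOREACH(i)
		c->c_total += atomic_readandclear_32(&c->c_pcpu[i]);
}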
Re: svn commit: r252032 - head/sys/amd64/include
On Tue, 25 Jun 2013, Konstantin Belousov wrote: > > > Updates to the counter cannot be done from the interrupt context. > > > > This is fragile, however. It prevents using counters for things like > > counting interrupts. Most interrupt counting is now done directlyly > > and doesn't use PCPU_INC(). i386/i386 has just 1 use of PCPU_INC(). > > It is to count traps in machdep.c. Traps include nonmaskable > > interrupts. Even hard-disabling of interrupts doesn't prevent these. > > Otherwise, PCPU is used mainly for vm counters. E.g., for pagefaults. > > Now the trap is not an interrupt, so it shouldn't occur in the middle > > of the counter update and the PCPU_INC() can safely be replaced by > > a counter, but this is not clear. > Traps are not performance critical in the sense that there is no need to count > up to 1-10G traps per second. Anyway, as Gleb said, there is no point in > optimizing the i386 kernel. Hmm, don't we count semi-embedded routers? Or, do we think they are either amd64 or arm/mips? [snip all the rest] -- Sincerely, D.Marck [DM5020, MCK-RIPE, DM3-RIPN] [ FreeBSD committer: ma...@freebsd.org ] *** Dmitry Morozovsky --- D.Marck --- Wild Woozle --- ma...@rinet.ru *** ___ svn-src-head@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/svn-src-head To unsubscribe, send any mail to "svn-src-head-unsubscr...@freebsd.org"
Re: svn commit: r252032 - head/sys/amd64/include
On Wed, 26 Jun 2013, Gleb Smirnoff wrote: > On Wed, Jun 26, 2013 at 11:42:39AM +1000, Bruce Evans wrote: > B> > Anyway, as Gleb said, there is no point in > B> > optimizing the i386 kernel. > B> > B> I said that there is every point in optimizing the i386 kernel. This > B> applies even more to other 32-bit arches. Some CPUs are much slower > B> than modern x86's. They shouldn't be slowed down more by inefficient > B> KPIs. > > I didn't mean that i386 arch is a relic and should be ignored at all. > > What I actually meant, is that the problem of performance drop due to > cache poisoning and loss of statistics with simple "+=" operation can > be observed only at extremely high event rates, with multiple processors > involved. > > The counter(9) is solution for these conditions. Thus we are interested > in optimising amd64, not i386. The latter isn't affected neither positively > nor negatively with these changes, just because last i386 CPUs can't reach > the event rates where need for counter(9) arises. Yes, you can tweak > implementation and obtain better results with microbenchmarks, but I bet > that any change in counter(9) implementation won't affect packet forwarding > rate on any i386. What we claim for i386 (and all other arches) that > counter(9) is lossless, and that's all. > > I second to Konstantin, that we don't have objections in any changes to > i386 part of counter, including a daemon, but the changes shouldn't affect > amd64. Ah, apparently this mostly answers the question I've just asked ;) -- Sincerely, D.Marck [DM5020, MCK-RIPE, DM3-RIPN] [ FreeBSD committer: ma...@freebsd.org ] *** Dmitry Morozovsky --- D.Marck --- Wild Woozle --- ma...@rinet.ru *** ___ svn-src-head@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/svn-src-head To unsubscribe, send any mail to "svn-src-head-unsubscr...@freebsd.org"
Re: svn commit: r252032 - head/sys/amd64/include
Bruce, On Wed, Jun 26, 2013 at 11:42:39AM +1000, Bruce Evans wrote: B> > Anyway, as Gleb said, there is no point in B> > optimizing the i386 kernel. B> B> I said that there is every point in optimizing the i386 kernel. This B> applies even more to other 32-bit arches. Some CPUs are much slower B> than modern x86's. They shouldn't be slowed down more by inefficient B> KPIs. I didn't mean that i386 arch is a relic and should be ignored at all. What I actually meant, is that the problem of performance drop due to cache poisoning and loss of statistics with simple "+=" operation can be observed only at extremely high event rates, with multiple processors involved. The counter(9) is solution for these conditions. Thus we are interested in optimising amd64, not i386. The latter isn't affected neither positively nor negatively with these changes, just because last i386 CPUs can't reach the event rates where need for counter(9) arises. Yes, you can tweak implementation and obtain better results with microbenchmarks, but I bet that any change in counter(9) implementation won't affect packet forwarding rate on any i386. What we claim for i386 (and all other arches) that counter(9) is lossless, and that's all. I second to Konstantin, that we don't have objections in any changes to i386 part of counter, including a daemon, but the changes shouldn't affect amd64. -- Totus tuus, Glebius. ___ svn-src-head@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/svn-src-head To unsubscribe, send any mail to "svn-src-head-unsubscr...@freebsd.org"
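For readers following along, a typical consumer of the counter(9) KPI being discussed looks roughly like the sketch below (error handling and sysctl glue omitted; ipkts is a made-up example counter):

#include <sys/param.h>
#include <sys/counter.h>
#include <sys/malloc.h>

static counter_u64_t ipkts;		/* e.g. packets seen by a driver */

static void
stats_init(void)
{
	ipkts = counter_u64_alloc(M_WAITOK);
}

/* Hot path: cheap and lossless, no locking visible to the caller. */
static inline void
stats_input(void)
{
	counter_u64_add(ipkts, 1);
}

/* Slow path: an approximate snapshot is acceptable here. */
static uint64_t
stats_read(void)
{
	return (counter_u64_fetch(ipkts));
}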
Re: svn commit: r252032 - head/sys/amd64/include
On Tue, 25 Jun 2013, Konstantin Belousov wrote: On Tue, Jun 25, 2013 at 08:14:41PM +1000, Bruce Evans wrote: On Tue, 25 Jun 2013, Konstantin Belousov wrote: Updates to the counter cannot be done from the interrupt context. This is fragile, however. It prevents using counters for things like counting interrupts. Most interrupt counting is now done directlyly and doesn't use PCPU_INC(). i386/i386 has just 1 use of PCPU_INC(). It is to count traps in machdep.c. Traps include nonmaskable interrupts. Even hard-disabling of interrupts doesn't prevent these. ... Traps are not performance critical in the sense that there is no need to count up to 1-10G traps per second. The trap count increment is not a performance problem, but a correctness one. Disabling interrupts for a multi-word increment doesn't work for NMIs even for !SMP or pcpu accesses with SMP. Anyway, as Gleb said, there is no point in optimizing the i386 kernel. I said that there is every point in optimizing the i386 kernel. This applies even more to other 32-bit arches. Some CPUs are much slower than modern x86's. They shouldn't be slowed down more by inefficient KPIs. The fetch problem of sometime missing the carry bit is present on all 32bit arches now, I think. Also the zeroing problem. Since there are no atomic ops in sight, zeroing has several races. My transferring code uses atomic ops on only 1 CPU, so it has half of these races. See below. Right, zeroing should also use atomic 64bit writes on i386. Updated patch is at end. It's getting even more complicated and further from what I want. It also doesn't work. Disabling interrupts doesn't work for the SMP case. The zeroing needs to run in a per-CPU daemon. The arm implementation of atomic.h is also relevant and is especially interesting. I looked at it after mdf@ pointed out ARM_RAS_START. This seems to only be used by arm in userland. It is even larger and slower and more complicated than critical sections, so it doesn't seem to be useful for counters. In the kernel, arm has several implementations depending on machine capabilities. The ifdefs are hard to understand. On some machine, it seems to use the equivalent of a cmpxchg loop. On others, it just disables interrupts. Disabling interrupts is presumably better than a critical section, and userland has to use ARM_RAS_START for some arches because neither disabling interrupts not critical sections are available in userland. None of this care helps avoid the races in pcpu.h, since pcpu.h intentionally avoids using atomic ops. None of this care helps avoid the races in counter.h or make counter.h efficient, since counter.h uses its own methods. To reach the same quality as atomic.h, counter.h would have to copy half of the ifdefs and methods in atomic.h. BTW, why do you claim that disabling interrupts is faster and better method to disable preemption than critical section ? For arm, it is because if the critical section method were better, then arm should use it for atomic ops too. Of course not. Atomics must be atomic WRT the interrupts as well. Using critical sections only provides atomicity against preemption but not against interruption. Indeed, atomics would be fragile if they couldn't be used in interrupt handlers. So in a non-optimized implementation it is safe to disable interrupts in hardware. An optimized solution would accept the fragility like the counter increment does, and have use special atomic ops that do more for the few variables that are accessed by interrupt handlers. The arm looks quite deficient anyway. 
From the arm/include/atomic.h, it seems that the arch did not have (usable) atomics until v6. And ARM does not support SMP in the stock HEAD still. Disabling interrupts only gives atomicity for non-SMP. Thus the cases where this method is used indicate an arch that doesn't support SMP. More interestingly for the current discussion, arm and other several other arches apparently don't have interrupt-safe load-modify-store instructions even for register-sized memory variables. Thus they need to either disable interrupts or use a cmpxchg-type instruction even to add to register-sized variables. % static inline void % counter_u64_add(counter_u64_t c, int64_t inc) % { % % #if 1 % /* 20 cycles on A64. */ % /* 31 cycles on A64 with lock prefix. */ % /* 35 cycles on A64 with mfence prefix. */ % if ((cpu_feature & 1) == 1) { % counter_64_inc_8b(c, inc); % } Old tests said that this takes 18 cycles. It now takes 20. Adding a lock or mfence prefix to the cmpxchg8b gives the times stated above. This shows how bad using cmpxch8b is. We could use atomic ops for everything and only be 50% slower. But we shouldn't use either cmpxchg8b or atomic ops. We should use 32-bit addl, which takes about 4 cycles. Use of cmpxchg8b provides the soundness to the algorithm
Re: svn commit: r252032 - head/sys/amd64/include
On Tue, Jun 25, 2013 at 08:14:41PM +1000, Bruce Evans wrote: > On Tue, 25 Jun 2013, Konstantin Belousov wrote: > > > Updates to the counter cannot be done from the interrupt context. > > This is fragile, however. It prevents using counters for things like > counting interrupts. Most interrupt counting is now done directlyly > and doesn't use PCPU_INC(). i386/i386 has just 1 use of PCPU_INC(). > It is to count traps in machdep.c. Traps include nonmaskable > interrupts. Even hard-disabling of interrupts doesn't prevent these. > Otherwise, PCPU is used mainly for vm counters. E.g., for pagefaults. > Now the trap is not an interrupt, so it shouldn't occur in the middle > of the counter update and the PCPU_INC() can safely be replaced by > a counter, but this is not clear. Traps are not performance critical in the sense that there is no need to count up to 1-10G traps per second. Anyway, as Gleb said, there is no point in optimizing the i386 kernel. > > > The fetch problem of sometime missing the carry bit is present on all > > 32bit arches now, I think. > > Also the zeroing problem. Since there are no atomic ops in sight, zeroing > has several races. My transferring code uses atomic ops on only 1 CPU, > so it has half of these races. See below. Right, zeroing should also use atomic 64bit writes on i386. Updated patch is at end. > > >> The arm implementation of atomic.h is also relevant and is especially > >> interesting. I looked at it after mdf@ pointed out ARM_RAS_START. > >> This seems to only be used by arm in userland. It is even larger and > >> slower and more complicated than critical sections, so it doesn't seem > >> to be useful for counters. In the kernel, arm has several > >> implementations depending on machine capabilities. The ifdefs are > >> hard to understand. On some machine, it seems to use the equivalent > >> of a cmpxchg loop. On others, it just disables interrupts. Disabling > >> interrupts is presumably better than a critical section, and userland > >> has to use ARM_RAS_START for some arches because neither disabling > >> interrupts not critical sections are available in userland. None > >> of this care helps avoid the races in pcpu.h, since pcpu.h > >> intentionally > >> avoids using atomic ops. None of this care helps avoid the races in > >> counter.h or make counter.h efficient, since counter.h uses its own > >> methods. To reach the same quality as atomic.h, counter.h would have > >> to copy half of the ifdefs and methods in atomic.h. > > BTW, why do you claim that disabling interrupts is faster and better method > > to disable preemption than critical section ? > > For arm, it is because if the critical section method were better, then > arm should use it for atomic ops too. Of course not. Atomics must be atomic WRT the interrupts as well. Using critical sections only provides atomicity against preemption but not against interruption. The arm looks quite deficient anyway. From the arm/include/atomic.h, it seems that the arch did not have (usable) atomics until v6. And ARM does not support SMP in the stock HEAD still. > > For i386, it is because it reduces the window in which the value is invalid > (other CPUs can see the invalid value), and is easier to use and shorter > (2-3 bytes) and thus better suited to inline code. My examples of inline > code mostly used it in rarely-executed cases where its speed doesn't matter > and it is just easier to use. Actual testing shows that it is fairly > efficient, so can be used in all cases. 
On Athlon64, it is faster than > cmpxchg8b but slower than an inline optimized critical section. This > testing is in userland, where there may be an extra penalty for using > cli/sti/pushfl/popfl: > > % static inline void > % counter_u64_add(counter_u64_t c, int64_t inc) > % { > % > % #if 1 > % /* 20 cycles on A64. */ > % /* 31 cycles on A64 with lock prefix. */ > % /* 35 cycles on A64 with mfence prefix. */ > % if ((cpu_feature & 1) == 1) { > % counter_64_inc_8b(c, inc); > % } > > Old tests said that this takes 18 cycles. It now takes 20. > > Adding a lock or mfence prefix to the cmpxchg8b gives the times stated > above. This shows how bad using cmpxch8b is. We could use atomic ops > for everything and only be 50% slower. But we shouldn't use either > cmpxchg8b or atomic ops. We should use 32-bit addl, which takes about > 4 cycles. Use of cmpxchg8b provides the soundness to the algorithm, by using atomic 64 bit loads and stores. As I stated, I am not going to spent time microoptimizing for i386. Current implementation is good enough. > > % struct thread > % { > % uint32_t td_critnest; > % uint32_t td_owepreempt; > % } td0; > % > % volatile struct thread *td = &td0; > % > % void > % dummy_accum(void) > % { > % } > % > % void > % fake_mi_switch(void) > % { > % } > > Added more emulation for a compl
Re: svn commit: r252032 - head/sys/amd64/include
On Tue, 25 Jun 2013, Gleb Smirnoff wrote: On Mon, Jun 24, 2013 at 11:16:33PM +1000, Bruce Evans wrote: B> > K> This is quite interesting idea, but I still did not decided if it B> > K> acceptable. The issue is that we could add the carry to the other B> > K> processor counter, if the preemption kicks in at right time between B> > K> two instructions. I did not found any argument why would it be B> > K> wrong, the races for fetch operation seems to be the same with either B> > K> local update method. B> > B> > This would be wrong since update isn't locked. Thus, if we are put on B> > other CPU between two instructions, and in second instruction updating B> > another CPU counter simultaneously with the original CPU we were on, B> > then we are losing an update. B> B> Hmm, this is subtle. The update is never lost, but is applied to B> different counter, non-atomically at a later time. Non-atomicity B> only matters when there is a carry since the counter only goes B> transiently backards in this case. For example: initial state: B> B> CPU1 counter: B> CPU2 counter: fffe B> B> Start adding 1 to the first counter, doing it non-atomically by B> incrementing the low word first. B> B> CPU1 counter: (carry in CPU1 eflags) B> CPU2 counter: fffe B> B> Get preempted at this point: B> B> CPU1 counter: B> CPU2 counter: fffe (carry in CPU2 eflags) Nope, the non-atomicity isn't specific to the case when we carry bit from least significant part to most one. Sure, but the other parts already work correctly. The non-atomicity is much simplier: two CPUs write to same memory address w/o using lock prefix. No, they can't, since these are per-CPU counters. Except for broken per-CPU counters and invalid accesses by CPUs that don't own the counter in upper layer code. Thus, any update mechanism that first calculates absolute per-CPU address, then writes to it w/o lock, requires critical section. Because preemption between calculation and write, and later execution on other CPU leads to two CPUs writing to same address. Only broken per-CPU counters do that. I already pointed out that all arches except amd64 and i386 have buggy PCPU_ADD() and other PCPU_*() access functions because they do do that. ia64, powerpc and sparc64 even document that that they do this in an XXX comment. Actually, ia64 and sparc64 and perhaps some others are not quite this bad, since their pcpu pointer is in a special register that should get switched by context switches. It still takes a lot of magic for the compiler to use this register for the accesses and not copy it to another register that doesn't get context switched. Their pcpu access bugs may be limited to the compiler possibly generating non-atomic load-modify-store instructions. Their XXX comment has wording bugs either way. It says that "this" operation should be made atomic, but the scope of the comment is 4 operations. The first operation is PCPU_ADD(), and it used to be the only one that does load-modify-store. Now it is followed by the PCPU_INC() operation which also does load-modify-store. Aligned word-sized operations have a chance of being atomic enough for the other operations. Bruce ___ svn-src-head@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/svn-src-head To unsubscribe, send any mail to "svn-src-head-unsubscr...@freebsd.org"
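A sketch of the race being described: the generic PCPU_ADD() on the non-x86 arches is essentially *PCPU_PTR(member) += (value) (quoted from the thread), and once the compiler expands the "+=" there are two separate windows. The function below only illustrates that expansion; it is not code from any pcpu.h.

static void
pcpu_add_expanded(volatile u_int *p, u_int value)	/* p = PCPU_PTR(member) */
{
	u_int tmp;

	/*
	 * (1) If the thread is rescheduled after PCPU_PTR() was evaluated,
	 *     p already points into the previous CPU's area and the store
	 *     below races that CPU's owner.
	 */
	tmp = *p;	/* (2) load */
	tmp += value;	/*     an interrupt that bumps the same counter here */
	*p = tmp;	/*     is wiped out by this (3) store                */
}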
Re: svn commit: r252032 - head/sys/amd64/include
On Mon, Jun 24, 2013 at 11:16:33PM +1000, Bruce Evans wrote: B> > K> This is quite interesting idea, but I still did not decided if it B> > K> acceptable. The issue is that we could add the carry to the other B> > K> processor counter, if the preemption kicks in at right time between B> > K> two instructions. I did not found any argument why would it be B> > K> wrong, the races for fetch operation seems to be the same with either B> > K> local update method. B> > B> > This would be wrong since update isn't locked. Thus, if we are put on B> > other CPU between two instructions, and in second instruction updating B> > another CPU counter simultaneously with the original CPU we were on, B> > then we are losing an update. B> B> Hmm, this is subtle. The update is never lost, but is applied to B> different counter, non-atomically at a later time. Non-atomicity B> only matters when there is a carry since the counter only goes B> transiently backards in this case. For example: initial state: B> B> CPU1 counter: B> CPU2 counter: fffe B> B> Start adding 1 to the first counter, doing it non-atomically by B> incrementing the low word first. B> B> CPU1 counter: (carry in CPU1 eflags) B> CPU2 counter: fffe B> B> Get preempted at this point: B> B> CPU1 counter: B> CPU2 counter: fffe (carry in CPU2 eflags) Nope, the non-atomicity isn't specific to the case when we carry bit from least significant part to most one. The non-atomicity is much simplier: two CPUs write to same memory address w/o using lock prefix. Thus, any update mechanism that first calculates absolute per-CPU address, then writes to it w/o lock, requires critical section. Because preemption between calculation and write, and later execution on other CPU leads to two CPUs writing to same address. -- Totus tuus, Glebius. ___ svn-src-head@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/svn-src-head To unsubscribe, send any mail to "svn-src-head-unsubscr...@freebsd.org"
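In code, the rule stated above looks like the sketch below. The address arithmetic is the same (char *)c + sizeof(struct pcpu) * cpu form used throughout the thread; counter_add_protected() is a made-up name, not the in-tree helper.

static inline void
counter_add_protected(uint64_t *c, uint64_t inc)
{
	uint64_t *p;

	critical_enter();	/* no preemption between the two steps below */
	p = (uint64_t *)((char *)c + sizeof(struct pcpu) * curcpu);
	*p += inc;		/* unlocked, but only this CPU writes here    */
	critical_exit();
}

Without the critical_enter()/critical_exit() pair, a reschedule between computing p and the "+=" lets two CPUs do unlocked read-modify-writes on the same slot, which is exactly the lost update described above.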
Re: svn commit: r252032 - head/sys/amd64/include
On Tue, 25 Jun 2013, Konstantin Belousov wrote: On Tue, Jun 25, 2013 at 12:45:36PM +1000, Bruce Evans wrote: On Mon, 24 Jun 2013, Konstantin Belousov wrote: ... The following is the prototype for the x86. The other 64bit architectures are handled exactly like amd64. For 32bit !x86 arches, i.e. arm, mips32 and powerpc32, some more investigation is needed. That works of course, but is too complicated, machine-dependent and slow for me. ... The non-optimized avoids these problems using a critical section. I'm not sure that works. Anyway, counter.h shouldn't have to understand pcpu better than pcpu.h does in order to work. Updates to the counter cannot be done from the interrupt context. This is fragile, however. It prevents using counters for things like counting interrupts. Most interrupt counting is now done directlyly and doesn't use PCPU_INC(). i386/i386 has just 1 use of PCPU_INC(). It is to count traps in machdep.c. Traps include nonmaskable interrupts. Even hard-disabling of interrupts doesn't prevent these. Otherwise, PCPU is used mainly for vm counters. E.g., for pagefaults. Now the trap is not an interrupt, so it shouldn't occur in the middle of the counter update and the PCPU_INC() can safely be replaced by a counter, but this is not clear. The fetch problem of sometime missing the carry bit is present on all 32bit arches now, I think. Also the zeroing problem. Since there are no atomic ops in sight, zeroing has several races. My transferring code uses atomic ops on only 1 CPU, so it has half of these races. See below. The arm implementation of atomic.h is also relevant and is especially interesting. I looked at it after mdf@ pointed out ARM_RAS_START. This seems to only be used by arm in userland. It is even larger and slower and more complicated than critical sections, so it doesn't seem to be useful for counters. In the kernel, arm has several implementations depending on machine capabilities. The ifdefs are hard to understand. On some machine, it seems to use the equivalent of a cmpxchg loop. On others, it just disables interrupts. Disabling interrupts is presumably better than a critical section, and userland has to use ARM_RAS_START for some arches because neither disabling interrupts not critical sections are available in userland. None of this care helps avoid the races in pcpu.h, since pcpu.h intentionally avoids using atomic ops. None of this care helps avoid the races in counter.h or make counter.h efficient, since counter.h uses its own methods. To reach the same quality as atomic.h, counter.h would have to copy half of the ifdefs and methods in atomic.h. BTW, why do you claim that disabling interrupts is faster and better method to disable preemption than critical section ? For arm, it is because if the critical section method were better, then arm should use it for atomic ops too. For i386, it is because it reduces the window in which the value is invalid (other CPUs can see the invalid value), and is easier to use and shorter (2-3 bytes) and thus better suited to inline code. My examples of inline code mostly used it in rarely-executed cases where its speed doesn't matter and it is just easier to use. Actual testing shows that it is fairly efficient, so can be used in all cases. On Athlon64, it is faster than cmpxchg8b but slower than an inline optimized critical section. 
This testing is in userland, where there may be an extra penalty for using cli/sti/pushfl/popfl: % static inline void % counter_u64_add(counter_u64_t c, int64_t inc) % { % % #if 1 % /* 20 cycles on A64. */ % /* 31 cycles on A64 with lock prefix. */ % /* 35 cycles on A64 with mfence prefix. */ % if ((cpu_feature & 1) == 1) { % counter_64_inc_8b(c, inc); % } Old tests said that this takes 18 cycles. It now takes 20. Adding a lock or mfence prefix to the cmpxchg8b gives the times stated above. This shows how bad using cmpxch8b is. We could use atomic ops for everything and only be 50% slower. But we shouldn't use either cmpxchg8b or atomic ops. We should use 32-bit addl, which takes about 4 cycles. % struct thread % { % uint32_t td_critnest; % uint32_t td_owepreempt; % } td0; % % volatile struct thread *td = &td0; % % void % dummy_accum(void) % { % } % % void % fake_mi_switch(void) % { % } Added more emulation for a complete critical section emulation. % static inline void % alt_counter_u64_add(counter_u64_t c, int64_t inc) % { % #if 0 % /* 8.5 cycles on A64. */ % td->td_critnest++; % __asm __volatile("addl %1,%%ds:%0" : "=m,m" (*c) : "?i,r" (inc)); % td->td_critnest++; % #elif 0 % /* 7 cycles on A64. */ % uint32_t ocritnest; % % ocritnest = td->td_critnest; % td->td_critnest = ocritnest + 1; % __asm __volatile("addl %1,%%ds:%0" : "=m,m" (*c) : "?i,r" (inc)); % td->td_c
Re: svn commit: r252032 - head/sys/amd64/include
On Tue, Jun 25, 2013 at 12:45:36PM +1000, Bruce Evans wrote: > On Mon, 24 Jun 2013, Konstantin Belousov wrote: > > > On Sun, Jun 23, 2013 at 07:57:57PM +1000, Bruce Evans wrote: > >> The case that can't be fixed by rereading the counters is when fetching > >> code runs in between the stores. If the stores are on a another CPU > >> that is currently executing them, then we can keep checking that the > >> counters don't change for the maximum time that any pair of stores > >> take to complete. This time is quite long and difficult to determine > >> (but it can be bounded by hard-disabling interrupts in the storer, and > >> limited by prefetching). If the stores are on the same CPU, then we > >> have no good way to detect that another thread is in the middle of > >> them, or to switch back to that thread. So we currently prevent this > >> case using slow methods. > > > > We are already explicit about the fetched value potentially not > > reflecting any real moment value, but indeed, having the fetch result > > off by 2^32 is not acceptable. > > > > We need atomic 64bit loads to fix this, which is provided by the same > > cmpxchg8b instruction on the 8-byte aligned elements. Intel guarantees > > that 8-byte aligned loads and stores are atomic. > > > > The following is the prototype for the x86. The other 64bit > > architectures are handled exactly like amd64. For 32bit !x86 arches, > > i.e. arm, mips32 and powerpc32, some more investigation is needed. > > That works of course, but is too complicated, machine-dependent and slow > for me. > (I just checked all the pcpu.h implementations and found that > all of them except amd64 and i386 still use the quick and racy > PCPU_PTR(member) += (value) for PCPU_ADD(). This has 1 sure > race and 1 compiler/machine-dependent race: > - PCPU_PTR() is only usable if preemption (resulting in rescheduling >to run on a different CPU) is prevented in the caller. Otherwise, >the pointer may become invalid immediately iafter it is initialized. > - the compiler may generate a racy load-add-store operation for +=. >Using PCPU_PTR() makes this a smaller problem than using a thread- >local pointer. It prevents the store going to a different CPU >than the load. > PCPU_GET() and PCPU_SET() have the first of these problems on all > arches except amd64 and i386. Even curthread wouldn't work in the > SMP case if it has the default implementation of PCPU_GET(curthread). > However, it has a non-default implementation on all arches except > arm and mips. So even curthread seems to be broken for arm and mips.) Brokeness (or, from other PoV, a restriction on use of PCPU_GET/SET/INC) is well-known. For ARM, I believe the in-tree kernels do not support SMP at all, so curthread as is is not broken. For mips it seems as if the pcpup located on the wired page which mapping is context-switched, that should make the curthread functional. > > The non-optimized avoids these problems using a > critical section. I'm not sure that works. Anyway, counter.h shouldn't > have to understand pcpu better than pcpu.h does in order to work. Updates to the counter cannot be done from the interrupt context. The fetch problem of sometime missing the carry bit is present on all 32bit arches now, I think. > > The arm implementation of atomic.h is also relevant and is especially > interesting. I looked at it after mdf@ pointed out ARM_RAS_START. > This seems to only be used by arm in userland. 
It is even larger and > slower and more complicated than critical sections, so it doesn't seem > to be useful for counters. In the kernel, arm has several > implementations depending on machine capabilities. The ifdefs are > hard to understand. On some machine, it seems to use the equivalent > of a cmpxchg loop. On others, it just disables interrupts. Disabling > interrupts is presumably better than a critical section, and userland > has to use ARM_RAS_START for some arches because neither disabling > interrupts not critical sections are available in userland. None > of this care helps avoid the races in pcpu.h, since pcpu.h intentionally > avoids using atomic ops. None of this care helps avoid the races in > counter.h or make counter.h efficient, since counter.h uses its own > methods. To reach the same quality as atomic.h, counter.h would have > to copy half of the ifdefs and methods in atomic.h. BTW, why do you claim that disabling interrupts is faster and better method to disable preemption than critical section ? > > The slowness that I care about is only in the counter update. Anything > complicated there will be slow. > > My current best design: > - use ordinary mutexes to protect counter fetches in non-per-CPU contexts. > - use native-sized or always 32-bit counters. Counter updates are done >by a single addl on i386.
Re: svn commit: r252032 - head/sys/amd64/include
On 25/06/2013, at 12:54, Bruce Evans wrote: >> - run a daemon every few minutes to fetch all the counters, so that >> the native-sized counters are in no danger of overflowing on systems >> that don't run statistics programs often enough to fetch the counters >> to actually use. > > There is at least 1 counter decrement (add -1) in tcp, so the native counters > need to be signed. You could convert decrements into an increment of a separate counter and then subtract that value from the others when collecting them all. -- Daniel O'Connor software and network engineer for Genesis Software - http://www.gsoft.com.au "The nice thing about standards is that there are so many of them to choose from." -- Andrew Tanenbaum GPG Fingerprint - 5596 B766 97C0 0E94 4347 295E E593 DC20 7B3F CE8C ___ svn-src-head@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/svn-src-head To unsubscribe, send any mail to "svn-src-head-unsubscr...@freebsd.org"
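Sketched with the counter(9) names for concreteness (foo_incs and foo_decs are hypothetical), the suggestion is to keep decrements in their own counter so every per-CPU counter only moves forward, and to subtract at collection time:

static counter_u64_t foo_incs, foo_decs;

static inline void
foo_inc(void)
{
	counter_u64_add(foo_incs, 1);
}

static inline void
foo_dec(void)
{
	counter_u64_add(foo_decs, 1);	/* never a negative increment */
}

static uint64_t
foo_count(void)
{
	/* Snapshot; may be transiently off, like any counter(9) fetch. */
	return (counter_u64_fetch(foo_incs) - counter_u64_fetch(foo_decs));
}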
Re: svn commit: r252032 - head/sys/amd64/include
On Tue, 25 Jun 2013, I wrote:

  My current best design:
  - use ordinary mutexes to protect counter fetches in non-per-CPU contexts.
  - use native-sized or always 32-bit counters.  Counter updates are done
    by a single addl on i386.  Fix pcpu.h on arches other than amd64 and
    i386 and use the same method as there.
  - counter fetches add the native-sized pcpu counters to 64-bit non-pcpu
    counters, when the native-size counters are in danger of overflowing
    or always, under the mutex.  Transferring data uses an ordinary
    atomic_cmpset.  To avoid ifdefs, always use u_int pcpu counters.  The
    64-bit non-pcpu counters can easily be changed to pcpu counters if the
    API is extended to support pcpu data.
  - run a daemon every few minutes to fetch all the counters, so that
    the native-sized counters are in no danger of overflowing on systems
    that don't run statistics programs often enough to fetch the counters
    to actually use.

There is at least 1 counter decrement (add -1) in tcp, so the native
counters need to be signed.

...

  With my design:

	extern struct mtx counter_locks[];
	extern uint64_t counters[];

This is pseudo-code.  The extra structure must be dynamically allocated
with each counter.  I'm not sure how to do that.  uint64_t_pcpu_zone is
specialized for pcpu counters, and a non-pcpu part is needed.

	uint64_t r;
	volatile u_int *p;
	u_int v;

Change to int.

	int cnum;

	cnum = ctonum(c);
	mtx_lock(&counter_locks[cnum]);	/* might not need 1 per counter */
	r = counters[cnum];
	for (i = 0; i < mp_ncpus; i++) {
		p = (u_int *)((char *)c + sizeof(struct pcpu) * i);

Change to int *.

		v = *p;		/* don't care if it is stale */
		if (v >= 0x80000000) {

Change the critical level to 2 critical levels, 0x40000000 for positive
values and -0x40000000 for negative values.

			/* Transfer only when above critical level. */
			while (atomic_cmpset_rel_int(p, v, 0) == 0)
				v = *p;	/* still don't care if it is stale */
			counters[cnum] += v;

Even though full counter values are not expected to become negative, the
native counters can easily become negative when a decrement occurs after
the transfer resets them to 0.

		}
		r += v;
	}
	mtx_unlock(&counter_locks[cnum]);
	return (r);

Mutexes give some slowness in the fetching code, but fetches are expected
to be relatively rare.

I added the complication to usually avoid atomic ops at the last minute.
The complication might not be worth it.  The complication is to test v so
as to normally avoid doing the transfer and its atomic ops (and to not use
atomic ops for loading v).  The complication is larger with 2 thresholds.

If we always transferred, then *p would usually be small and often 0, so
that decrementing it would often make it -1.  This -1 must be transferred
by adding -1, not by adding 0xffffffff.  Changing the type of the native
counter to int gives this.

Bruce
___ svn-src-head@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/svn-src-head To unsubscribe, send any mail to "svn-src-head-unsubscr...@freebsd.org"
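For reference, a compilable rendering of the pseudo-code above. counter_locks[], counters[] and ctonum() are the hypothetical per-counter state mentioned in the message, and the thresholds are the two critical levels described there; this is a sketch of the design, not proposed committable code.

#include <sys/param.h>
#include <sys/counter.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/pcpu.h>
#include <sys/smp.h>
#include <machine/atomic.h>

extern struct mtx counter_locks[];	/* hypothetical extra state */
extern uint64_t counters[];
int ctonum(counter_u64_t);		/* hypothetical counter -> index map */

static uint64_t
counter_fetch_transfer(counter_u64_t c)
{
	struct mtx *lk;
	volatile int *p;
	uint64_t r;
	int cnum, i, v;

	cnum = ctonum(c);
	lk = &counter_locks[cnum];
	mtx_lock(lk);
	r = counters[cnum];
	for (i = 0; i < mp_ncpus; i++) {
		p = (int *)((char *)c + sizeof(struct pcpu) * i);
		v = *p;				/* stale is fine */
		if (v >= 0x40000000 || v <= -0x40000000) {
			/* Transfer only when near overflow/underflow. */
			while (atomic_cmpset_rel_int((volatile u_int *)p,
			    v, 0) == 0)
				v = *p;		/* still don't care */
			counters[cnum] += v;
		}
		r += v;
	}
	mtx_unlock(lk);
	return (r);
}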
Re: svn commit: r252032 - head/sys/amd64/include
On Mon, 24 Jun 2013, Konstantin Belousov wrote: On Sun, Jun 23, 2013 at 07:57:57PM +1000, Bruce Evans wrote: The case that can't be fixed by rereading the counters is when fetching code runs in between the stores. If the stores are on a another CPU that is currently executing them, then we can keep checking that the counters don't change for the maximum time that any pair of stores take to complete. This time is quite long and difficult to determine (but it can be bounded by hard-disabling interrupts in the storer, and limited by prefetching). If the stores are on the same CPU, then we have no good way to detect that another thread is in the middle of them, or to switch back to that thread. So we currently prevent this case using slow methods. We are already explicit about the fetched value potentially not reflecting any real moment value, but indeed, having the fetch result off by 2^32 is not acceptable. We need atomic 64bit loads to fix this, which is provided by the same cmpxchg8b instruction on the 8-byte aligned elements. Intel guarantees that 8-byte aligned loads and stores are atomic. The following is the prototype for the x86. The other 64bit architectures are handled exactly like amd64. For 32bit !x86 arches, i.e. arm, mips32 and powerpc32, some more investigation is needed. That works of course, but is too complicated, machine-dependent and slow for me. (I just checked all the pcpu.h implementations and found that all of them except amd64 and i386 still use the quick and racy PCPU_PTR(member) += (value) for PCPU_ADD(). This has 1 sure race and 1 compiler/machine-dependent race: - PCPU_PTR() is only usable if preemption (resulting in rescheduling to run on a different CPU) is prevented in the caller. Otherwise, the pointer may become invalid immediately iafter it is initialized. - the compiler may generate a racy load-add-store operation for +=. Using PCPU_PTR() makes this a smaller problem than using a thread- local pointer. It prevents the store going to a different CPU than the load. PCPU_GET() and PCPU_SET() have the first of these problems on all arches except amd64 and i386. Even curthread wouldn't work in the SMP case if it has the default implementation of PCPU_GET(curthread). However, it has a non-default implementation on all arches except arm and mips. So even curthread seems to be broken for arm and mips.) The non-optimized avoids these problems using a critical section. I'm not sure that works. Anyway, counter.h shouldn't have to understand pcpu better than pcpu.h does in order to work. The arm implementation of atomic.h is also relevant and is especially interesting. I looked at it after mdf@ pointed out ARM_RAS_START. This seems to only be used by arm in userland. It is even larger and slower and more complicated than critical sections, so it doesn't seem to be useful for counters. In the kernel, arm has several implementations depending on machine capabilities. The ifdefs are hard to understand. On some machine, it seems to use the equivalent of a cmpxchg loop. On others, it just disables interrupts. Disabling interrupts is presumably better than a critical section, and userland has to use ARM_RAS_START for some arches because neither disabling interrupts not critical sections are available in userland. None of this care helps avoid the races in pcpu.h, since pcpu.h intentionally avoids using atomic ops. None of this care helps avoid the races in counter.h or make counter.h efficient, since counter.h uses its own methods. 
To reach the same quality as atomic.h, counter.h would have to copy half of the ifdefs and methods in atomic.h. The slowness that I care about is only in the counter update. Anything complicated there will be slow. My current best design: - use ordinary mutexes to protect counter fetches in non-per-CPU contexts. - use native-sized or always 32-bit counters. Counter updates are done by a single addl on i386. Fix pcpu.h on arches other than amd64 and i386 and use the same method as there. - counter fetches add the native-sized pcpu counters to 64-bit non-pcpu counters, when the native-size counters are in danger of overflowing or always, under the mutex. Transferring data uses an ordinary atomic_cmpset. To avoid ifdefs, always use u_int pcpu counters. The 64-bit non-pcpu counters can easily be changed to pcpu counters if the API is extended to support pcpu data. - run a daemon every few minutes to fetch all the counters, so that the native-sized counters are in no danger of overflowing on systems that don't run statistics programs often enough to fetch the counters to actually use. ... diff --git a/sys/kern/subr_counter.c b/sys/kern/subr_counter.c index a98ed40..3222881 100644 --- a/sys/kern/subr_counter.c +++ b/sys/kern/subr_counter.c ... @@ -49,14 +51,8 @@ counter_u64_zero(
Re: svn commit: r252032 - head/sys/amd64/include
On Sun, Jun 23, 2013 at 07:57:57PM +1000, Bruce Evans wrote: > The case that can't be fixed by rereading the counters is when fetching > code runs in between the stores. If the stores are on a another CPU > that is currently executing them, then we can keep checking that the > counters don't change for the maximum time that any pair of stores > take to complete. This time is quite long and difficult to determine > (but it can be bounded by hard-disabling interrupts in the storer, and > limited by prefetching). If the stores are on the same CPU, then we > have no good way to detect that another thread is in the middle of > them, or to switch back to that thread. So we currently prevent this > case using slow methods. We are already explicit about the fetched value potentially not reflecting any real moment value, but indeed, having the fetch result off by 2^32 is not acceptable. We need atomic 64bit loads to fix this, which is provided by the same cmpxchg8b instruction on the 8-byte aligned elements. Intel guarantees that 8-byte aligned loads and stores are atomic. The following is the prototype for the x86. The other 64bit architectures are handled exactly like amd64. For 32bit !x86 arches, i.e. arm, mips32 and powerpc32, some more investigation is needed. diff --git a/sys/amd64/include/counter.h b/sys/amd64/include/counter.h index b37a4b8..ba302a3 100644 --- a/sys/amd64/include/counter.h +++ b/sys/amd64/include/counter.h @@ -36,6 +36,28 @@ extern struct pcpu __pcpu[1]; #definecounter_enter() do {} while (0) #definecounter_exit() do {} while (0) +#ifdef IN_SUBR_COUNTER_C +static inline uint64_t +counter_u64_fetch_inline(uint64_t *p) +{ + uint64_t r; + int i; + + r = 0; + for (i = 0; i < mp_ncpus; i++) + r += counter_u64_read_one((uint64_t *)c, i); + + return (r); +} +#endif + +static inline uint64_t +counter_u64_read_one(uint64_t *p, int cpu) +{ + + return (*(uint64_t *)((char *)p + sizeof(struct pcpu) * cpu)); +} + #definecounter_u64_add_protected(c, i) counter_u64_add(c, i) static inline void diff --git a/sys/i386/include/counter.h b/sys/i386/include/counter.h index 3e93b36..94dbee3 100644 --- a/sys/i386/include/counter.h +++ b/sys/i386/include/counter.h @@ -67,6 +67,50 @@ counter_64_inc_8b(uint64_t *p, int64_t inc) : "memory", "cc", "eax", "edx", "ebx", "ecx"); } +static inline uint64_t +counter_u64_read_one_8b(uint64_t *p, int cpu) +{ + uint32_t res_lo, res_high; + + __asm __volatile( + "movl %%eax,%%ebx\n\t" + "movl %%edx,%%ecx\n\t" + "cmpxchg8b (%%esi)" + : "=a" (res_lo), "=d"(res_high) + : "S" ((char *)p + sizeof(struct pcpu) * cpu) + : "cc", "ebx", "ecx"); + return (res_lo + ((uint64_t)res_high << 32)); +} + +#ifdef IN_SUBR_COUNTER_C +static inline uint64_t +counter_u64_fetch_inline(uint64_t *p) +{ + uint64_t res; + int i; + + res = 0; + if ((cpu_feature & CPUID_CX8) == 0) { + /* +* The machines without cmpxchg8b are not SMP. +* Disabling the preemption provides atomicity of the +* counter reading, since update is done in the +* critical section as well. 
+*/ + critical_enter(); + for (i = 0; i < mp_ncpus; i++) { + res += *(uint64_t *)((char *)p + + sizeof(struct pcpu) * i); + } + critical_exit(); + } else { + for (i = 0; i < mp_ncpus; i++) + res += counter_u64_read_one_8b(p, i); + } + return (res); +} +#endif + #definecounter_u64_add_protected(c, inc) do {\ if ((cpu_feature & CPUID_CX8) == 0) { \ CRITICAL_ASSERT(curthread); \ diff --git a/sys/kern/subr_counter.c b/sys/kern/subr_counter.c index a98ed40..3222881 100644 --- a/sys/kern/subr_counter.c +++ b/sys/kern/subr_counter.c @@ -29,11 +29,13 @@ __FBSDID("$FreeBSD$"); #include #include -#include #include #include #include #include + +#define IN_SUBR_COUNTER_C +#include static uma_zone_t uint64_pcpu_zone; @@ -49,14 +51,8 @@ counter_u64_zero(counter_u64_t c) uint64_t counter_u64_fetch(counter_u64_t c) { - uint64_t r; - int i; - r = 0; - for (i = 0; i < mp_ncpus; i++) - r += *(uint64_t *)((char *)c + sizeof(struct pcpu) * i); - - return (r); + return (counter_u64_fetch_inline(c)); } counter_u64_t pgpYUibcqooBj.pgp Description: PGP signature
Re: svn commit: r252032 - head/sys/amd64/include
[snipping everything about counter64, atomic ops, cycles, etc.] I wonder if the idea explained in this paper: http://static.usenix.org/event/usenix03/tech/freenix03/full_papers/mcgarry/mcgarry_html/ Which seems to be used in FreeBSD for some ARM atomics: http://svnweb.freebsd.org/base/head/sys/arm/include/atomic.h?view=annotate , look for ARM_RAS_START would be more efficient. To summarize: one marks a section of code such that if a thread is interrupted during the code it restarts at the beginning instead of where it was interrupted. This has been used to implement atomic increment on some hardware without the necessary instructions. Here it could be used to implement atomic increment on the per-cpu counter without the overhead of an atomic instruction. It's multiple stores to mark the section of code doing the increment, but they're all per-cpu or per thread. That may be cheaper than an atomic increment, at least on 32-bit platforms that are doing an atomic 64-bit increment. I haven't benchmarked this (ENOTIME, plus I'm on vacation right now), but using restartable sections requires three stores (add an item to a linked list, 64-bit increment for the counter, remove an item from a linked list). Some of the penalty is payed in the context switch code, which has to check if the instruction pointer is in one of these critical sections. I haven't checked to see if this code is enabled on all architectures or just ARM. But if context switches are less frequent than counter increments in the networking code, it's still a win for most uses. Thanks, matthew ___ svn-src-head@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/svn-src-head To unsubscribe, send any mail to "svn-src-head-unsubscr...@freebsd.org"
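The restartable-sequence idea, reduced to hypothetical pseudo-C for illustration. The real ARM_RAS_START/ARM_RAS_END machinery in sys/arm/include/atomic.h is written in assembly precisely because the compiler must not reorder or split the marked region; the sketch below only shows the shape of the scheme, with made-up names and GCC's labels-as-values extension standing in for the real markers.

#include <stdint.h>

struct ras_region {
	uintptr_t	start;
	uintptr_t	end;
};

/* One per thread (or per CPU) in a real implementation. */
static struct ras_region curthread_ras;

static void
ras_counter_add(uint64_t *counter, uint64_t inc)
{
	/* 1. Advertise the region: two cheap, thread-local stores. */
	curthread_ras.start = (uintptr_t)&&restart;
	curthread_ras.end = (uintptr_t)&&done;
	__asm__ __volatile__("" ::: "memory");	/* compiler barrier */
restart:
	/* 2. The plain read-modify-write we want to make restartable. */
	*counter += inc;
done:
	__asm__ __volatile__("" ::: "memory");	/* compiler barrier */
	/* 3. Retire the region. */
	curthread_ras.start = 0;
}

/*
 * The context-switch path then does, conceptually:
 *
 *	if (td->ras.start != 0 && interrupted_pc >= td->ras.start &&
 *	    interrupted_pc < td->ras.end)
 *		interrupted_pc = td->ras.start;	 (restart, don't resume)
 *
 * so a preempted increment is re-executed from the top instead of
 * completing with a stale value.
 */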
Re: svn commit: r252032 - head/sys/amd64/include
On Mon, 24 Jun 2013, Gleb Smirnoff wrote:

  On Sun, Jun 23, 2013 at 10:33:43AM +0300, Konstantin Belousov wrote:
  K> On Sat, Jun 22, 2013 at 06:58:15PM +1000, Bruce Evans wrote:
  K> > So the i386 version be simply "addl; adcl" to memory.  Each store in
  K> > this is atomic at the per-CPU level.  If there is no carry, then the
  K> > separate stores are equivalent to adding separate nonnegative values and
  K> > the counter value is valid at all times.  If there is carry, then the
  K> > separate stores are equivalent to adding a negative value followed by
  K> > a larger positive value.  The counter transiently goes backwards, but
  K> > no information is lost and the counter is eventually set to the correct
  K> > forward-going value.
  K>
  K> This is quite interesting idea, but I still did not decided if it
  K> acceptable.  The issue is that we could add the carry to the other
  K> processor counter, if the preemption kicks in at right time between
  K> two instructions.  I did not found any argument why would it be
  K> wrong, the races for fetch operation seems to be the same with either
  K> local update method.

  This would be wrong since update isn't locked.  Thus, if we are put on
  other CPU between two instructions, and in second instruction updating
  another CPU counter simultaneously with the original CPU we were on,
  then we are losing an update.

Hmm, this is subtle.  The update is never lost, but is applied to a
different counter, non-atomically at a later time.  Non-atomicity only
matters when there is a carry since the counter only goes transiently
backwards in this case.  For example: initial state:

	CPU1 counter:  00000000 ffffffff
	CPU2 counter:  00000000 fffffffe

Start adding 1 to the first counter, doing it non-atomically by
incrementing the low word first.

	CPU1 counter:  00000000 00000000	(carry in CPU1 eflags)
	CPU2 counter:  00000000 fffffffe

Get preempted at this point:

	CPU1 counter:  00000000 00000000
	CPU2 counter:  00000000 fffffffe	(carry in CPU2 eflags)

The fetching code running at this point will see a wrong value, but has
more serious bugs.  Once it is fixed, it can try using a heuristic to
detect this case, or the fix might involve a heuristic that detects this
case too.  I was thinking of keeping a global count of the previous value
for each counter, and adding 0x100000000 when the count goes backwards.
However, in the worst case each component of a counter may be making the
transition from 0xffffffff to 0x100000000, with the carry bit in eflags
for all CPUs.  Detecting this requires watching the low word of all
components of all counters.  Increments must be limited to considerably
less than 32 bits, and negative increments should be disallowed, so that
the low word doesn't wrap faster than it can be watched.

CPU2 runs and completes the increment by adding the carry to the high
word:

	CPU1 counter:  00000000 00000000
	CPU2 counter:  00000001 fffffffe

The total count becomes correct once the second word is written.
Statistics for each CPU should be correct, but losing this race is so
rare that maybe we especially don't care about this subcase.  There is
currently no API for exporting the counter statistics for each CPU.

Similarly if the preemption is to another thread on the same CPU, or to
another thread that doesn't touch this counter and then back to the
original thread (much later) on the same CPU.  There may be any number of
updates to the low word before the corresponding updates to the high
word, mixed in any order except for the low word being updated before the
high word for each increment.  State for the half-done updates is saved
in carry flags and registers in any number of threads.  If large
increments are allowed, the value may appear to have had many negative
increments.

  Yes, the probability of such event is extremely low, but still
  possible.

  The idea on counter(9) is that although fetching might be not precise,
  the internal value of a counter is consistent and doesn't lose even a
  single update.  This would allow later to use counter(9) as a cheap
  reference counter.

The fetching code has essentially the same race (in the 32-bit case) that
is laboriously worked around in the update code, irrespective of how
atomic the update code is, even with 1 CPU:

	CPU1 counter:  00000000 ffffffff

Start reading it on CPU1, using 32-bit load of the low word (the compiler
could make this worse or just different by loading the high word first):

	ffffffff	(read this value)

Get preempted; counter update occurs and adds 1 perfectly atomically:

	CPU1 counter:  00000001 00000000

Back to fetching code on CPU1; read high word:

	00000001	(read this value; it is garbage)

If the compiler loads the high word first, it will see the value off by
~2**32 in the opposite direction (00000000 00000000).  Like I said
before, this can probably be handled by rereading each pcpu counter until
it doesn't change.  Or if large
Re: svn commit: r252032 - head/sys/amd64/include
On Mon, 24 Jun 2013, Gleb Smirnoff wrote: did you run your benchmarks in userland or in kernel? How many parallel threads were updating the same counter? Can you please share your benchmarks? Only userland, with 1 thread. I don't have any more benchmarks than the test program in previous mail. I don't see how threads have anything to do with efficiency of counter incrementation, unless slow locking is used. With threads for the same process on the same CPU, the accesses are not really different from accesses by a single thread. With threads for different processes on the same CPU, switching the address space will thrash the cache for user threads, but the pcpu area in kernel memory shouldn't be switched. It seems difficult to simulate pcpu in user address space. With threads for the same or different processes on different CPUs, there is no contention for pcpu counters. Please remind me of your old tests that did show some efficiency differences. IIRC, direct increment was unexpectedly slower. Was that on a 64-bit system? I guess it wasn't, since the committed version just uses a direct increment on amd64. On i386 with cmpxch86b, using cmpxchg8b might be more efficient because it is a 64-bit access. I don't see how that can be, since 4 32-bit accesses are needed to set up the cmpxchg8b. In fact, 1 of these accesses can be extremely slow since it has a store-to-load penalty on some arches (I have considerable experience with store-to-load penalties in FP code and large uncommitted asms in libm to avoid them.). Here is the access which is likely to have the penalty: % static inline void % counter_64_inc_8b(uint64_t *p, int64_t inc) % { % % __asm __volatile( % "movl %%fs:(%%esi),%%eax\n\t" The previous store was a cmpchg8b. Presumably that was 64 bits. No problem for this load, since it is at the same address as the store and its size mismatch doesn't have the penalty on any CPU that I know of. % "movl %%fs:4(%%esi),%%edx\n" Store to load mismatch penalty on at least AthlonXP and Athlon64. The load is from the middle of a 64-bit store, and at least these CPUs don't have hardware to forward it from the write buffer. Costs 10-20 cycles. Phenom is documented to have extra hardware to make this case as fast as the previous case. I haven't tested Pheonom. Acccording to FP benchmarks, store-to-load penalties are large on core2 and corei7 too. % "1:\n\t" % "movl %%eax,%%ebx\n\t" % "movl %%edx,%%ecx\n\t" % "addl (%%edi),%%ebx\n\t" % "adcl 4(%%edi),%%ecx\n\t" These extra memory accesses are unfortunately necessary because there aren't enough registers and the asm is a bit too simple (e.g., to add 1, more complicated asm could just add $1 with carry here, but the current asm has to store 1 to a 64-bit temporary memory variable so that it can be loaded here). These are all 32-bit accesses so they don't have penalties. There is just a lot of memory traffic for them. % "cmpxchg8b %%fs:(%%esi)\n\t" This presumably does a 64-bit load followed by a 64-bit store (when it succeeds). The load matches the previous store, so there is no penalty. % "jnz 1b" % : % : "S" ((char *)p - (char *)&__pcpu[0]), "D" (&inc) % : "memory", "cc", "eax", "edx", "ebx", "ecx"); % } The penalty may be unimportant in normal use because loads are normally separated from stores by long enough to give the write buffers a chance to flush to the cache. But loop benchmarks will always see it unless loop does enough things between the store and the load to give the large separation. 
Note that the penalty affects loads, so its latency is normally not hidden. I forgot about this when I ran tests on Athlon64. Athlon64 was only about 6 cycles slower than core2, for about 20 cycles per iteration altogther. Not much more, but 20 is about the penalty time, so maybe the loop is ending up testing just the penalty time, with all the other latencies in parallel with the penalty. For a quick test of this, I replaced the load that has the penalty by a load of immediate 0. This reduced the time to 14.5 cycles. So the penalty is at least 5.5 cycles. (Note that in the benchmark, the counter only goes up to about 2 billion, so the high 32 bits always has value 0, so loading immediate 0 gives the same result.) On core2 (ref10-i386) and corei7 (freefall), the same change has no effect on the time. This shows that the penalty doesn't apply on core2 or corei7, and the FP penalties that I see there have a different source. ... Testing shows that they are for loads of 64-bit values that are mismatched since the value was built up using 2 32-bit stores. Test program: % #include % % uint64_t foo; % % int % main(void) % { % unsigned i; % % for (i = 0; i < 2666813872; i++) /* sysctl -n machdep.tsc_freq */ % asm volatile( % #ifdef NO_PENALTY % "movq %%rax,foo; movq foo,%%rax" %
Re: svn commit: r252032 - head/sys/amd64/include
Bruce, did you run your benchmarks in userland or in kernel? How many parallel threads were updating the same counter? Can you please share your benchmarks? -- Totus tuus, Glebius.

Re: svn commit: r252032 - head/sys/amd64/include
On Sun, Jun 23, 2013 at 10:33:43AM +0300, Konstantin Belousov wrote: K> On Sat, Jun 22, 2013 at 06:58:15PM +1000, Bruce Evans wrote: K> > So the i386 version be simply "addl; adcl" to memory. Each store in K> > this is atomic at the per-CPU level. If there is no carry, then the K> > separate stores are equivalent to adding separate nonnegative values and K> > the counter value is valid at all times. If there is carry, then the K> > separate stores are equivalent to adding a negative value followed by K> > a larger positive value. The counter transiently goes backwards, but K> > no information is lost and the counter is eventually set to the correct K> > forward-going value. K> K> This is quite interesting idea, but I still did not decided if it K> acceptable. The issue is that we could add the carry to the other K> processor counter, if the preemption kicks in at right time between K> two instructions. I did not found any argument why would it be K> wrong, the races for fetch operation seems to be the same with either K> local update method. This would be wrong since update isn't locked. Thus, if we are put on other CPU between two instructions, and in second instruction updating another CPU counter simultaneously with the original CPU we were on, then we are losing an update. Yes, the probability of such event is extremely low, but still possible. The idea on counter(9) is that although fetching might be not precise, the internal value of a counter is consistent and doesn't lose even a single update. This would allow later to use counter(9) as a cheap reference counter. -- Totus tuus, Glebius. ___ svn-src-head@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/svn-src-head To unsubscribe, send any mail to "svn-src-head-unsubscr...@freebsd.org"
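For concreteness, here is a minimal compilable model of the two-instruction update being debated (my sketch, not the committed counter(9) code): a plain pointer stands in for the %fs:-relative pcpu address, and the comment marks the window where preemption plus migration would send the carry to another CPU's slot.

#include <stdint.h>

/*
 * Model of the proposed "addl; adcl" per-CPU 64-bit increment.  The
 * real implementation addresses the counter through %fs: (i386 pcpu),
 * so preemption and migration between the two instructions would make
 * the adcl land in a different CPU's slot, the lost-update case above.
 * A plain pointer is used here so the sketch compiles in userland.
 */
static inline void
counter_add_addl_adcl(volatile uint64_t *c, uint64_t inc)
{
	volatile uint32_t *p = (volatile uint32_t *)c;	/* x86: p[0] is the low word */

	__asm __volatile(
	    "addl	%2,%0\n\t"
	    /* <-- preemption + migration here is the contested window */
	    "adcl	%3,%1"
	    : "+m" (p[0]), "+m" (p[1])
	    : "r" ((uint32_t)inc), "r" ((uint32_t)(inc >> 32))
	    : "cc");
}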
Re: svn commit: r252032 - head/sys/amd64/include
On Sun, 23 Jun 2013, I wrote: I thought of lots of variations, but couldn't find one that works perfectly. One idea (that goes with the sign check on the low 32 bits) is to use a misaligned add to memory to copy the 31st bit as a carry bit to the the high word. The value of the counter is supposed to be [unsigned value of low 32 bits] + [unsigned value of next 24 bits] << 31 (high 8 bits not used) at all times, with the 31st bit zero at most times so that the the carry operation is rarely executed. This format is only slightly confusing for debugging (it basically has 2 31st bits, with the one in the low 32 bits rarely used). This format can be updated using something like: The above doesn't work if it is preempted -- then it might add do the carry operation more than once. But it is easy to lock using cmpxchg or disabling interrupts. Disabling interrupts requires only small code: ... Another idea is to use the high bits of the high word to encode state. They can be set atomically enough using addl/andl/orl/xorl since they are per-CPU. I couldn't quite get this to work. You could increment Here is a combined version. It uses the old shift trick instead of disabling interrupts for locking. The shift trick is not often used since it doesn't work for SMP. It works here since the counters are per-CPU: addl%1,%%fs:(%0)# only small 32-bit increments supported jns 8f sar $1,%%fs:7(%0) # hi 8 bits are for lock; init to 0xfe jc 8f # already locked addl$0x80,%%fs:3(%0)# misaligned memory access movb$0xfe,%%fs:7(%0)# unlock 8: ; Since the locked case is rarely executed and the code for it is only inline for convenience, the shortest code should be preferred and that is the simpler version that disables interrupts. Unless the lock byte can be exploited in the fetching and zeroing code. I don't see how it can be, since this type of locking only works for a single CPU. Bruce ___ svn-src-head@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/svn-src-head To unsubscribe, send any mail to "svn-src-head-unsubscr...@freebsd.org"
Re: svn commit: r252032 - head/sys/amd64/include
On Sun, 23 Jun 2013, Konstantin Belousov wrote: On Sat, Jun 22, 2013 at 06:58:15PM +1000, Bruce Evans wrote: So the i386 version be simply "addl; adcl" to memory. Each store in this is atomic at the per-CPU level. If there is no carry, then the separate stores are equivalent to adding separate nonnegative values and the counter value is valid at all times. If there is carry, then the separate stores are equivalent to adding a negative value followed by a larger positive value. The counter transiently goes backwards, but no information is lost and the counter is eventually set to the correct forward-going value. This is quite interesting idea, but I still did not decided if it acceptable. The issue is that we could add the carry to the other processor counter, if the preemption kicks in at right time between two instructions. I did not found any argument why would it be wrong, the races for fetch operation seems to be the same with either local update method. I thought about exploiting not-so-strong memory ordering to fix the fetches. On at least x86, stores are not reordered, so if the storer does a simple "addl, adcl", then the order is not much different from load low 32 bits store low 32 bits load high 32 bits store high 32 bits The loads can be in any order, but must be before the stores to the same location. I think the fetching code can depend on this (maybe only on x86, but the fetching code needs to be MD anyway). Thus it can detect most races by simply rereading the counters in almost any order until they don't change. Without this, the fetching code can easily read high and low bits from unrelated stores (when it is preempted, the race window is large), and the sophisticated cmpxchg8b code to make each store atomic is almost useless. The case that can't be fixed by rereading the counters is when fetching code runs in between the stores. If the stores are on a another CPU that is currently executing them, then we can keep checking that the counters don't change for the maximum time that any pair of stores take to complete. This time is quite long and difficult to determine (but it can be bounded by hard-disabling interrupts in the storer, and limited by prefetching). If the stores are on the same CPU, then we have no good way to detect that another thread is in the middle of them, or to switch back to that thread. So we currently prevent this case using slow methods. On the other hand, for debugging it would be quite confusing state of the counters array to observe. I thought of lots of variations, but couldn't find one that works perfectly. One idea (that goes with the sign check on the low 32 bits) is to use a misaligned add to memory to copy the 31st bit as a carry bit to the the high word. The value of the counter is supposed to be [unsigned value of low 32 bits] + [unsigned value of next 24 bits] << 31 (high 8 bits not used) at all times, with the 31st bit zero at most times so that the the carry operation is rarely executed. This format is only slightly confusing for debugging (it basically has 2 31st bits, with the one in the low 32 bits rarely used). This format can be updated using something like: addl%1,%%fs:(%0)# only small 32-bit increments supported jns 8f addl$0x80,%%fs:3(%0)# misaligned memory access 8: ; No problem if this is not preempted. The idea of the misaligned memory access is that the CPU needs to make this atomic enough even though it crosses a 32-bit boundary. BTW, is anything guaranteed if the access also crosses a cache line or page boundary? 
The OS would be involved for page boundaries if there is a page fault. Better not step on bugs in that. What about for misaligned 64-bit or 128-bit stores done by cmpxchg8b or SSE? Counters are 64-bit aligned, so I think that at least CPUs that are basically 64-bit must handle this like I want. The above doesn't work if it is preempted -- then it might add do the carry operation more than once. But it is easy to lock using cmpxchg or disabling interrupts. Disabling interrupts requires only small code: addl%1,%%fs:(%0)# only small 32-bit increments supported jns 8f pushfl cli testb $0x80,%%fs:3(%0) je 1f # carry already done by preemption addl$0x80,%%fs:3(%0)# misaligned memory access 1: popfl 8: ; Another idea is to use the high bits of the high word to encode state. They can be set atomically enough using addl/andl/orl/xorl since they are per-CPU. I couldn't quite get this to work. You could increment them to indicate an update in progress. This only requires 1 extra add to memory if the adcl is always done (e.g., first add 0x01 to the high word; then addl to the low word; then adcl of (inc >> 32) - 0x0100 to the high word. Increments must be limited to no
Re: svn commit: r252032 - head/sys/amd64/include
On Sun, 23 Jun 2013, Konstantin Belousov wrote: On Sat, Jun 22, 2013 at 01:37:58PM +1000, Bruce Evans wrote: On Sat, 22 Jun 2013, I wrote: ... Here are considerably expanded tests, with noninline tests dropped. Summary of times on Athlon64: simple increment: 4-7 cycles (1) simple increment preceded by feature test: 5-8 cycles (1) simple 32-bit increment:4-7 cycles (2) correct 32-bit increment (addl to mem): 5.5-7 cycles (3) inlined critical section: 8.5 cycles (4) better inlined critical section:7 cycles (5) correct unsigned 32-bit inc of 64-bit counter: 4-7 cycles (6) "improve" previous to allow immediate operand: 5+ cycles correct signed 32-bit inc of 64-bit counter:8.5-9 cycles (7) correct 64-bit inc of 64-bit counter: 8-9 cycles (8) -current method (cmpxchg8b): 18 cycles corei7 (freefall) has about the same timing as Athlon64, but core2 (ref10-i386) is 3-4 cycles slower for the tests that use cmpxchg. You only tested 32 bit, right ? Note that core2-class machines have at least one cycle penalty for decoding any instruction with REX prefix. Yes, since the 64-bit case works more or less correctly. I tested the 32-bit binary on 64-bit systems. (4) The critical section method is quite fast when inlined. (5) The critical section method is even faster when optimized. This is what should be used if you don't want the complications for the daemon. Oops, I forgot that critical sections are much slower in -current than in my version. They probably take 20-40 cycles for the best case, and can't easily be tested in userland since they disable interrupts in hardware. My versions disable interrupts in software. The critical sections do not disable the interrupts. Only the thread local counter is incremented. Leaving the section could be complicated though. Yes, as I noticed later, it was only an old version of FreeBSD that disabled interrupts. The critical section method (or disabling interrupts, which is probably faster) only works for the non-SMP case, but old CPUs that don't have cmpxchg8b probably don't support SMP. Further tests confirm that incl and incq are pipelined normally on at least corei7 and core2. In the loop test, freefall can do 4 independent addq's to memory faster than it can do 1 :-). It can do 6 independent addq's to memory in the same time that it can do 1. After that, the loop overhead prevents geting the complete bandwidth of the memory system. However, 6 addq's to the same memory location take a little more than 6 times longer than 1. Multiple increments of the same counter one after the other are probably rare, but the counter API makes it harder to coaelsce them if they occur, and the implementation using independent asms ensures that the compiler cannot coalesce them. I think that the naive looping on Core i7+ to measure the latencies of the instructions simply does not work. The Nehalem hardware started to provide the loop detector for the instruction queue inside the decoder, which essentially makes the short loops executed from the microopcode form. We never use counters in the tight loops. The test results show it working perfectly in the test environment. Any loop detection just does a better job of running all the loop instructions in parallel with the instructions being timed. But even old CPUs can run all the loop instructions in parallel with high-latency instructions like add-to-memory. Also, add-to-memory instructions have strict ordering requirements on x86. The order must be load-modify-store. That's for storing to a single location. 
This prevents the loop running in parallel with itself, so its latency timing works especially simply. However, the test environment isn't normal. It is important to understand that some of the time is latency that doesn't matter in the normal environment. But in the complicated versions with cmpxchg8b's or critical sections, there are lots of extra instructions and ordering requirements whose latencies can't be hidden. Bruce ___ svn-src-head@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/svn-src-head To unsubscribe, send any mail to "svn-src-head-unsubscr...@freebsd.org"
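For readers wondering where the cycle counts quoted in this thread come from: the test programs loop exactly machdep.tsc_freq times, so when the binary is run under time(1) the elapsed seconds read directly as cycles per iteration (assuming the CPU runs at its nominal TSC frequency and the roughly 1-cycle loop overhead is ignored). A stripped-down sketch of that harness, with a placeholder operation:

#include <stdint.h>
#include <stdio.h>

static uint32_t mycounter;

static inline void
op_under_test(void)
{
	/* Substitute the increment variant being measured. */
	__asm __volatile("addl	$1,%0" : "+m" (mycounter) : : "cc");
}

int
main(void)
{
	uint64_t i, tsc_freq = 1861955704;	/* sysctl -n machdep.tsc_freq */

	for (i = 0; i < tsc_freq; i++)
		op_under_test();
	printf("%u\n", (unsigned)mycounter);
	return (0);
}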
Re: svn commit: r252032 - head/sys/amd64/include
On Sat, Jun 22, 2013 at 06:58:15PM +1000, Bruce Evans wrote: > So the i386 version be simply "addl; adcl" to memory. Each store in > this is atomic at the per-CPU level. If there is no carry, then the > separate stores are equivalent to adding separate nonnegative values and > the counter value is valid at all times. If there is carry, then the > separate stores are equivalent to adding a negative value followed by > a larger positive value. The counter transiently goes backwards, but > no information is lost and the counter is eventually set to the correct > forward-going value. This is quite interesting idea, but I still did not decided if it acceptable. The issue is that we could add the carry to the other processor counter, if the preemption kicks in at right time between two instructions. I did not found any argument why would it be wrong, the races for fetch operation seems to be the same with either local update method. On the other hand, for debugging it would be quite confusing state of the counters array to observe. pgpBRbpz1zfRK.pgp Description: PGP signature
Re: svn commit: r252032 - head/sys/amd64/include
On Sat, Jun 22, 2013 at 01:37:58PM +1000, Bruce Evans wrote: > On Sat, 22 Jun 2013, I wrote: > > > ... > > Here are considerably expanded tests, with noninline tests dropped. > > Summary of times on Athlon64: > > > > simple increment: 4-7 cycles (1) > > simple increment preceded by feature test: 5-8 cycles (1) > > simple 32-bit increment:4-7 cycles (2) > > correct 32-bit increment (addl to mem): 5.5-7 cycles (3) > > inlined critical section: 8.5 cycles (4) > > better inlined critical section:7 cycles (5) > > correct unsigned 32-bit inc of 64-bit counter: 4-7 cycles (6) > > "improve" previous to allow immediate operand: 5+ cycles > > correct signed 32-bit inc of 64-bit counter:8.5-9 cycles (7) > > correct 64-bit inc of 64-bit counter: 8-9 cycles (8) > > -current method (cmpxchg8b): 18 cycles > > corei7 (freefall) has about the same timing as Athlon64, but core2 > (ref10-i386) is 3-4 cycles slower for the tests that use cmpxchg. You only tested 32 bit, right ? Note that core2-class machines have at least one cycle penalty for decoding any instruction with REX prefix. > > > (4) The critical section method is quite fast when inlined. > > (5) The critical section method is even faster when optimized. This is > >what should be used if you don't want the complications for the > >daemon. > > Oops, I forgot that critical sections are much slower in -current than > in my version. They probably take 20-40 cycles for the best case, and > can't easily be tested in userland since they disable interrupts in > hardware. My versions disable interrupts in software. The critical sections do not disable the interrupts. Only the thread local counter is incremented. Leaving the section could be complicated though. > > > ... > > % % static inline void > > % alt_counter_u64_add(counter_u64_t c, int64_t inc) > > % { > > % #if 1 > > % /* 8.5 cycles on A64. */ > > % td->td_critnest++; > > % __asm __volatile("addl %1,%%ds:%0" : "=m,m" (*c) : "?i,r" (inc)); > > % td->td_critnest++; > > Oops, one increment should be a decrement. > > > % #elif 1 > > % /* 7 cycles on A64. */ > > % uint32_t ocritnest; > > % > > % ocritnest = td->td_critnest; > > % td->td_critnest = ocritnest + 1; > > % __asm __volatile("addl %1,%%ds:%0" : "=m,m" (*c) : "?i,r" (inc)); > > % td->td_critnest = ocritnest; > > % #elif 0 > > Even in my version, I have to check for unmasked interrupts when td_critnest > is reduced to 0. At least the test for being reduced to 0 can be very fast, > since the reduced value is loaded early and can be tested early. > > Further tests confirm that incl and incq are pipelined normally on at > least corei7 and core2. In the loop test, freefall can do 4 independent > addq's to memory faster than it can do 1 :-). It can do 6 independent > addq's to memory in the same time that it can do 1. After that, the > loop overhead prevents geting the complete bandwidth of the memory > system. However, 6 addq's to the same memory location take a little > more than 6 times longer than 1. Multiple increments of the same counter > one after the other are probably rare, but the counter API makes it harder > to coaelsce them if they occur, and the implementation using independent > asms ensures that the compiler cannot coalesce them. I think that the naive looping on Core i7+ to measure the latencies of the instructions simply does not work. The Nehalem hardware started to provide the loop detector for the instruction queue inside the decoder, which essentially makes the short loops executed from the microopcode form. 
We never use counters in the tight loops.
Re: svn commit: r252032 - head/sys/amd64/include
On Sat, 22 Jun 2013, I wrote: On Sat, 22 Jun 2013, I wrote: ... Here are considerably expanded tests, with noninline tests dropped. Summary of times on Athlon64: simple increment: 4-7 cycles (1) simple increment preceded by feature test: 5-8 cycles (1) simple 32-bit increment:4-7 cycles (2) correct 32-bit increment (addl to mem): 5.5-7 cycles (3) inlined critical section: 8.5 cycles (4) better inlined critical section:7 cycles (5) correct unsigned 32-bit inc of 64-bit counter: 4-7 cycles (6) "improve" previous to allow immediate operand: 5+ cycles correct signed 32-bit inc of 64-bit counter:8.5-9 cycles (7) correct 64-bit inc of 64-bit counter: 8-9 cycles (8) -current method (cmpxchg8b): 18 cycles corei7 (freefall) has about the same timing as Athlon64, but core2 (ref10-i386) is 3-4 cycles slower for the tests that use cmpxchg. (4) The critical section method is quite fast when inlined. (5) The critical section method is even faster when optimized. This is what should be used if you don't want the complications for the daemon. Oops, I forgot that critical sections are much slower in -current than in my version. They probably take 20-40 cycles for the best case, and can't easily be tested in userland since they disable interrupts in hardware. My versions disable interrupts in software. Re-oops, I was looking at the wrong tree. -current is similar to my version here. Anyway, almost all this is moot. The counter fetch is inherently non-atomic, on i386 so it is pointless to do the counter update atomically at the per-CPU level. Losing the race from non-atomic counter updates is very unlikely since the race window is about 2 nsec wide and occurs only on approximately every 4 billionth update. Losing the race in the fetch is a little more likely. The latter might be handled by a heuristic, and anything for this works for the update too. It seems to be good enough to detect counters going backwards and ignore the new value then. That takes more space for the record of the previous values than I like. Non-atomic fetches from subr_counter.c: % void % counter_u64_zero(counter_u64_t c) % { % int i; % % for (i = 0; i < mp_ncpus; i++) % *(uint64_t *)((char *)c + sizeof(struct pcpu) * i) = 0; % } Non-atomic stores are more of a problem. Races in stores can create permanently invalid counter values. Perhaps i386 memory ordering is strong enough to help. But I think it doesn't help much. It would locked cmpxchg8b in the i386 update and maybe here too to avoid the races, and the lock would defeat the point of pcpu counters. % % uint64_t % counter_u64_fetch(counter_u64_t c) % { % uint64_t r; % int i; % % r = 0; % for (i = 0; i < mp_ncpus; i++) % r += *(uint64_t *)((char *)c + sizeof(struct pcpu) * i); % % return (r); % } This code depends on 64-bit loads being atomic. They just aren't on 32-bit arches. The stores for the updates may go out in any order, and might be stale. The reads here may be in any order, especially since this is written in C. We don't care much about stale values, but want monotonically increasing values. Negative counter increments really shouldn't be supported, since they cause complications everywhere. Here they break the heuristic that a negative-going fetched value means a lost race. So the i386 version be simply "addl; adcl" to memory. Each store in this is atomic at the per-CPU level. If there is no carry, then the separate stores are equivalent to adding separate nonnegative values and the counter value is valid at all times. 
If there is carry, then the separate stores are equivalent to adding a negative value followed by a larger positive value. The counter transiently goes backwards, but no information is lost and the counter is eventually set to the correct forward-going value. Bruce ___ svn-src-head@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/svn-src-head To unsubscribe, send any mail to "svn-src-head-unsubscr...@freebsd.org"
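A sketch of the "reread until stable" fetch heuristic floated in this thread (my code, not subr_counter.c): since a 64-bit load is not atomic on i386, reread each per-CPU slot until two consecutive reads agree before summing. `stride' stands in for sizeof(struct pcpu).

#include <stddef.h>
#include <stdint.h>

static inline uint64_t
read_stable_u64(volatile uint64_t *p)
{
	uint64_t prev, cur;

	cur = *p;
	do {
		prev = cur;
		cur = *p;		/* on i386 each read may be torn */
	} while (cur != prev);		/* accept only a repeated value */
	return (cur);
}

static uint64_t
counter_fetch_sketch(volatile char *base, int ncpus, size_t stride)
{
	uint64_t sum;
	int i;

	sum = 0;
	for (i = 0; i < ncpus; i++)
		sum += read_stable_u64((volatile uint64_t *)(base + stride * i));
	return (sum);
}

This filters torn reads from a writer that completes both 32-bit stores promptly; as noted elsewhere in the thread, it cannot help when the fetch lands between the two stores of a writer that is stalled or preempted mid-update.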
Re: svn commit: r252032 - head/sys/amd64/include
On Sat, 22 Jun 2013, I wrote: ... Here are considerably expanded tests, with noninline tests dropped. Summary of times on Athlon64: simple increment: 4-7 cycles (1) simple increment preceded by feature test: 5-8 cycles (1) simple 32-bit increment:4-7 cycles (2) correct 32-bit increment (addl to mem): 5.5-7 cycles (3) inlined critical section: 8.5 cycles (4) better inlined critical section:7 cycles (5) correct unsigned 32-bit inc of 64-bit counter: 4-7 cycles (6) "improve" previous to allow immediate operand: 5+ cycles correct signed 32-bit inc of 64-bit counter:8.5-9 cycles (7) correct 64-bit inc of 64-bit counter: 8-9 cycles (8) -current method (cmpxchg8b): 18 cycles corei7 (freefall) has about the same timing as Athlon64, but core2 (ref10-i386) is 3-4 cycles slower for the tests that use cmpxchg. (4) The critical section method is quite fast when inlined. (5) The critical section method is even faster when optimized. This is what should be used if you don't want the complications for the daemon. Oops, I forgot that critical sections are much slower in -current than in my version. They probably take 20-40 cycles for the best case, and can't easily be tested in userland since they disable interrupts in hardware. My versions disable interrupts in software. ... % % static inline void % alt_counter_u64_add(counter_u64_t c, int64_t inc) % { % #if 1 % /* 8.5 cycles on A64. */ % td->td_critnest++; % __asm __volatile("addl %1,%%ds:%0" : "=m,m" (*c) : "?i,r" (inc)); % td->td_critnest++; Oops, one increment should be a decrement. % #elif 1 % /* 7 cycles on A64. */ % uint32_t ocritnest; % % ocritnest = td->td_critnest; % td->td_critnest = ocritnest + 1; % __asm __volatile("addl %1,%%ds:%0" : "=m,m" (*c) : "?i,r" (inc)); % td->td_critnest = ocritnest; % #elif 0 Even in my version, I have to check for unmasked interrupts when td_critnest is reduced to 0. At least the test for being reduced to 0 can be very fast, since the reduced value is loaded early and can be tested early. Further tests confirm that incl and incq are pipelined normally on at least corei7 and core2. In the loop test, freefall can do 4 independent addq's to memory faster than it can do 1 :-). It can do 6 independent addq's to memory in the same time that it can do 1. After that, the loop overhead prevents geting the complete bandwidth of the memory system. However, 6 addq's to the same memory location take a little more than 6 times longer than 1. Multiple increments of the same counter one after the other are probably rare, but the counter API makes it harder to coaelsce them if they occur, and the implementation using independent asms ensures that the compiler cannot coalesce them. Bruce ___ svn-src-head@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/svn-src-head To unsubscribe, send any mail to "svn-src-head-unsubscr...@freebsd.org"
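A sketch of the experiment behind the independent-versus-dependent addq observation above (my reconstruction, amd64 only): four addq-to-memory instructions per iteration, either to four separate cache lines, where the latencies overlap, or all to one location, where the load-modify-store dependency serializes them. Compile with and without -DINDEPENDENT and time it as before.

#include <stdint.h>
#include <stdio.h>

/* Pad to separate cache lines so the "independent" case really is. */
static struct { uint64_t v; char pad[56]; } ctr[4];

int
main(void)
{
	uint64_t i;

	for (i = 0; i < 1861955704ULL; i++)	/* sysctl -n machdep.tsc_freq */
#ifdef INDEPENDENT
		__asm __volatile(
		    "addq	$1,%0\n\t"
		    "addq	$1,%1\n\t"
		    "addq	$1,%2\n\t"
		    "addq	$1,%3"
		    : "+m" (ctr[0].v), "+m" (ctr[1].v),
		      "+m" (ctr[2].v), "+m" (ctr[3].v) : : "cc");
#else
		__asm __volatile(
		    "addq	$1,%0\n\t"
		    "addq	$1,%0\n\t"
		    "addq	$1,%0\n\t"
		    "addq	$1,%0"
		    : "+m" (ctr[0].v) : : "cc");
#endif
	printf("%ju\n", (uintmax_t)ctr[0].v);
	return (0);
}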
Re: svn commit: r252032 - head/sys/amd64/include
On Fri, 21 Jun 2013, Gleb Smirnoff wrote: On Fri, Jun 21, 2013 at 09:02:36PM +1000, Bruce Evans wrote: B> Not if it is a 32-bit increment on 32-bit systems, as it should be. B> B> I said to use a daemon to convert small (16 or 32 bit) counters into B> larger (32 or 64 bit) ones. It is almost as efficient to call the B> accumulation code directly: Please trim quotes more. [... lots of technical details] May be you are right and an implementation with a "daemon" would be more efficient for i386. But daemon isn't needed at all on 64-bit platforms. So I doubt it is worth coding it. Of course it would be more efficient on i386. Probably it would be more efficient on amd64 too. The main counter array would be half the size, so takes less cache, less write buffers, and smaller instructions. Frankly speaking, we already do not consider the i386 as a platform to build high performance solutions. The need for counter(9) in TCP/IP stack was demonstrated on modern amd64 hardware. Thus, our motivation was to make: I need to complain about lost performance in i386 more then :-). i386 is faster for anything that doesn't need large memory. - a lossless and faster implementation for amd64 - a lossless implementation for !amd64 - if we have an option to avoid critical section on some arch, good let's avoid it. I already gave an implementation to avoid the critical section. But using the critical section seems to be the second best method. The best method is using 32-bit counters and a daemon. Here are considerably expanded tests, with noninline tests dropped. Summary of times on Athlon64: simple increment: 4-7 cycles (1) simple increment preceded by feature test: 5-8 cycles (1) simple 32-bit increment:4-7 cycles (2) correct 32-bit increment (addl to mem): 5.5-7 cycles (3) inlined critical section: 8.5 cycles (4) better inlined critical section:7 cycles (5) correct unsigned 32-bit inc of 64-bit counter: 4-7 cycles (6) "improve" previous to allow immediate operand: 5+ cycles correct signed 32-bit inc of 64-bit counter:8.5-9 cycles (7) correct 64-bit inc of 64-bit counter: 8-9 cycles (8) -current method (cmpxchg8b): 18 cycles (1) Unusable due to races. (2) It's no faster to do a simple increment of a 32-bit counter than a 64-bit one! This indicates delicate pipelining/memory access behaviour. The Athlon64 timing for addl to memory is 4 cycles latency. The tests show this being achieved. But since it is only latency and an Athlon64 can do something like 1 64-bit store and 1 64-bit load per cycle (or 2 64-bit loads but not 2 64-bit stores per cycle?), there is plenty of time to do the adcl to memory and also the loop overhead in not-quite-the-same 4 cycles, so that the loop throughput is 1 per 4 cycles. In practical/non-loop use, you would care even less about the store latancy. Compilers often mess this up to get 7 cycles instead of 4. -O2 generally gives worst results, by unrolling the loop. Since the rolled loop can excecute in 4 cycles, unrolling it can only pessimize it. (3) The single 32-bit addl, combined with a daemon, is what should be used. But compilers generated worse code for this than the similar but more branchy code in (6). (4) The critical section method is quite fast when inlined. (5) The critical section method is even faster when optimized. This is what should be used if you don't want the complications for the daemon. (6) Call the daemon directly ("daemon" code is now just the current code in an extern function. The extern function is a dummy that is never called in the tests. 
Just to simulate the branch prediction penalties). Some restrictions on the increment. (7) Same as (6), with fewer restrictions on the increment. This uses cmpxchg just like the -current method uses cmpxchg8b, provided a 32-bit increment is sufficient. Races are detected as the low bits changing. We don't care if the high bits change while we are changing the low bits. (8) Same as (7), with no restrictions on the increment. The full version somehow turned out to have a better fastest case than (7). This is without even an optimization for constant increments that would eliminate the runtime check in most cases. The extra code for the runtime check makes things faster :-). % #include % #include % % static inline void % counter_64_inc_8b(volatile uint64_t *p, int64_t inc) % { % % __asm __volatile( % "movl %%ds:(%%esi),%%eax\n\t" % "movl %%ds:4(%%esi),%%edx\n" % "1:\n\t" % "movl %%eax,%%ebx\n\t" % "movl %%edx,%%ecx\n\t" % "addl (%%edi),%%ebx\n\t" % "adcl 4(%%edi),%%ecx\n\t" % "cmpxchg8b %%ds:(%%esi)\n\t" % "jnz 1b" % : % : "S" (p), "D" (&inc)
Re: svn commit: r252032 - head/sys/amd64/include
Bruce, On Fri, Jun 21, 2013 at 09:02:36PM +1000, Bruce Evans wrote: B> Not if it is a 32-bit increment on 32-bit systems, as it should be. B> B> I said to use a daemon to convert small (16 or 32 bit) counters into B> larger (32 or 64 bit) ones. It is almost as efficient to call the B> accumulation code directly: B> B> addl%1,%fs:(%0) B> jns 1f B> callcounter_accum B> 1: B> B> This calls the accumulation code when the counter goes above 0x8000. B> The accumulation code then uses cmpxchg8b or a critical section to B> atomically transfer everything in the small counter to the large B> counter. Note that the small counter remains valid at all times unless B> it is incrememnted too fast. Code that reports the counters must also B> use cmpxch8b or a critical section to read them atomically before B> summing them. The transfer window between when the small counter B> reaches 0x8000 and when it overflows is very large provided large B> increments are never made (2 increments of 0x8000 would break it B> immediately, but an average increment of 1 takes about a minute to B> reach 0x8000 even if called from a loop doing nothing except the B> increment). Only 32-bit increments that are not too large are B> supported. Negative increments are handed by a heuristic in the B> accumulator. E.g., an increment of -1 just works unless it causes the B> small counter's value to go from 0 to 0xf. 0x is B> detected by the sign test. It must be interpreted as -1 to be added B> to the large counter, not 0x. The full heuristic would be B> something like: B> B> if (small_counter >= 0xf000) { B> /* Subtract from large counter. */ B> /* Actually do this atomically, of course. */ B> large_counter += (int32_t)small_counter; B> small_counter = 0; B> /* B> * Add here sanity checks that the large counter never goes B> * negative. This can only be afforded because this code B> * is rarely called (and is not inline). B> */ B> } else if (small_counter < 0x9000) { B> /* Add to large counter. B> large_counter += small_counter; B> small_counter = 0; B> } else { B> panic("transfer window too small"); B> } B> B> > Entering critical section means modifying curthread, which is again B> > a %gs based load & store. Exiting critical section has the same cost. B> > Thus, we assume that doing a direct %gs based update on the counter B> > is cheaper than critical_enter(); counter += foo; critical_exit(); B> B> Critical sections actually only modifiy td->td_critnest, so they do B> a single %fs-based load and 2 %ds-based modifications (it's %fs for B> i386 and %gs for amd64). So it's far from clear that the direct B> update is faster than the generic i386 code. Avoiding the generic B> i386 code also takes an extra load, test and branch. The generic B> code reduces to something like the following instructions after inlining B> the critical section: B> B> movl%gs:CURTHREAD,%ebx B> inclTD_CRITNEST(%ebx) B> movlcounter_arg,%edi B> movlinc_lo,%eax B> movlinc_hi,%edx B> addl%eax,%fs:(%edi) B> adcl%edx,%fs:4(%edi) B> 1: B> declTD_CRITNEST(%ebx) B> B> while the non-generic code for the best case when cmpxchg8b is supported is B> something like: B> B> testl $CPUID_CX8,cpu_feature B> je 2f B> movlcounter_arg,%esi B> movlptr_to_inc,%edi # might need more insns to init *ptr B> // 8 instructions here from the asm string. B> jmp 3f # actually rearrange to jump in other case B> 2: B> // Use critical section. B> 3: B> B> B> So using cmpxchg8b takes at least 4 more instructions than using a B> critical section. 
The incl/decl instructions on td_critnest are rather B> slow, however. According to my test, they take 6 cycles each. (I B> think they can be reduced to load; modify; store; ... store back B> original value which should take more like 9 cycles.) This probably B> applies to the addl and adcl that do the work too. 12 cycles for them, B> and just 2 more cycles (pipelined) for cmpxchg8b and all the other B> setup instructions. B> B> Now, how does the performance aspect work? Is it by cmpxch8b causing B> less contention than direct addl to memory? If so, then hopefully B> plain cmpxchg does the same. B> B> The non-cmpxchg version of the i386 code can probably be improved by B> adding more instructions to it, to avoid doing the adcl to memory B> in most cases. Even if you support 64-bit increments, they will rarely B> be used, so the branches for this will be predicted well: B> B> // Next 4 instructions as above: B> movlcounter_arg,%edi B> movlinc_lo,%eax B> movlinc_hi,%edx B> addl%eax,%fs:(%edi)
Re: svn commit: r252032 - head/sys/amd64/include
On Fri, 21 Jun 2013, Gleb Smirnoff wrote: On Fri, Jun 21, 2013 at 09:04:34AM +1000, Bruce Evans wrote: B> >> The i386 version of the counter asm doesn't support the immediate B> >> constraint for technical reasons. 64 bit counters are too large and B> >> slow to use on i386, especially when they are implemented as they are B> >> without races. B> > B> > Actual testing showed that it is only about twice as slow as a direct B> > increment. With the enclosed test program (a userland version hacked B> > on a bit to avoid pcpu), on ref10-i386 the times are: ... Yes, for a single threaded userland program using "+=" is faster than all the magic that counter(9) does. But when multiple threads need to access one counter "+=" fails both with precision and with performance. No. But I forgot about the performance aspect. Using "+=" upon a per-CPU counter is racy, since += is compiled into "load", "increment", "store" sequence and if we are not in a critical section, then this is racy. We might be removed from CPU between load and store. Not if it is a 32-bit increment on 32-bit systems, as it should be. I said to use a daemon to convert small (16 or 32 bit) counters into larger (32 or 64 bit) ones. It is almost as efficient to call the accumulation code directly: addl%1,%fs:(%0) jns 1f callcounter_accum 1: This calls the accumulation code when the counter goes above 0x8000. The accumulation code then uses cmpxchg8b or a critical section to atomically transfer everything in the small counter to the large counter. Note that the small counter remains valid at all times unless it is incrememnted too fast. Code that reports the counters must also use cmpxch8b or a critical section to read them atomically before summing them. The transfer window between when the small counter reaches 0x8000 and when it overflows is very large provided large increments are never made (2 increments of 0x8000 would break it immediately, but an average increment of 1 takes about a minute to reach 0x8000 even if called from a loop doing nothing except the increment). Only 32-bit increments that are not too large are supported. Negative increments are handed by a heuristic in the accumulator. E.g., an increment of -1 just works unless it causes the small counter's value to go from 0 to 0xf. 0x is detected by the sign test. It must be interpreted as -1 to be added to the large counter, not 0x. The full heuristic would be something like: if (small_counter >= 0xf000) { /* Subtract from large counter. */ /* Actually do this atomically, of course. */ large_counter += (int32_t)small_counter; small_counter = 0; /* * Add here sanity checks that the large counter never goes * negative. This can only be afforded because this code * is rarely called (and is not inline). */ } else if (small_counter < 0x9000) { /* Add to large counter. large_counter += small_counter; small_counter = 0; } else { panic("transfer window too small"); } Entering critical section means modifying curthread, which is again a %gs based load & store. Exiting critical section has the same cost. Thus, we assume that doing a direct %gs based update on the counter is cheaper than critical_enter(); counter += foo; critical_exit(); Critical sections actually only modifiy td->td_critnest, so they do a single %fs-based load and 2 %ds-based modifications (it's %fs for i386 and %gs for amd64). So it's far from clear that the direct update is faster than the generic i386 code. Avoiding the generic i386 code also takes an extra load, test and branch. 
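The hex constants in the archived text above have lost digits (e.g. "0x8000" and "0xf"); from the sign-test description they are presumably 0x80000000, 0x90000000, 0xf0000000 and 0xffffffff. Under that assumption, here is a compilable sketch of the transfer heuristic; abort() stands in for the kernel panic(), and in the kernel the body would run under cmpxchg or a critical section as described.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

static void
counter_accum(uint32_t *small, uint64_t *large)
{
	uint32_t v = *small;

	if (v >= 0xf0000000u) {
		*large += (int32_t)v;	/* e.g. 0xffffffff adds -1 */
		*small = 0;
	} else if (v < 0x90000000u) {
		*large += v;		/* normal case: fold and clear */
		*small = 0;
	} else {
		fprintf(stderr, "transfer window too small\n");
		abort();		/* the text panics here */
	}
}

int
main(void)
{
	uint32_t small = 0x80000001u;	/* just past the jns trigger point */
	uint64_t large = 0;

	counter_accum(&small, &large);
	printf("%ju\n", (uintmax_t)large);	/* 2147483649 */
	return (0);
}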
The generic code reduces to something like the following instructions after inlining the critical section: movl%gs:CURTHREAD,%ebx inclTD_CRITNEST(%ebx) movlcounter_arg,%edi movlinc_lo,%eax movlinc_hi,%edx addl%eax,%fs:(%edi) adcl%edx,%fs:4(%edi) 1: declTD_CRITNEST(%ebx) while the non-generic code for the best case when cmpxchg8b is supported is something like: testl $CPUID_CX8,cpu_feature je 2f movlcounter_arg,%esi movlptr_to_inc,%edi # might need more insns to init *ptr // 8 instructions here from the asm string. jmp 3f # actually rearrange to jump in other case 2: // Use critical section. 3: So using cmpxchg8b takes at least 4 more instructions than using a critical section. The incl/decl instructions on td_critnest are rather slow, however. According to my test, they take 6 cycles each. (I think they can be reduced to load; modify; store; ... store back original value which should take more like 9 cycle
Re: svn commit: r252032 - head/sys/amd64/include
Bruce, On Fri, Jun 21, 2013 at 09:04:34AM +1000, Bruce Evans wrote: B> >> The i386 version of the counter asm doesn't support the immediate B> >> constraint for technical reasons. 64 bit counters are too large and B> >> slow to use on i386, especially when they are implemented as they are B> >> without races. B> > B> > Actual testing showed that it is only about twice as slow as a direct B> > increment. With the enclosed test program (a userland version hacked B> > on a bit to avoid pcpu), on ref10-i386 the times are: B> > - loop overhead:1 cycle B> > - direct unlocked increment of a uint32_t: 6 cycles B> > - direct unlocked increment of a uint64_t: 7 cycles B> > - non-inline function unlocked increment of a uint64_t: 7.5 cycles B> > - counter_u64_add(): 14 cycles B> > - non-inline counter_u64_add():18 cycles B> > ... B> B> Actually enclosing the test program: B> B> % #include B> % #include B> % B> % static inline void B> % counter_64_inc_8b(volatile uint64_t *p, int64_t inc) B> % { B> % B> %__asm __volatile( B> %"movl %%ds:(%%esi),%%eax\n\t" B> %"movl %%ds:4(%%esi),%%edx\n" B> % "1:\n\t" B> %"movl %%eax,%%ebx\n\t" B> %"movl %%edx,%%ecx\n\t" B> %"addl (%%edi),%%ebx\n\t" B> %"adcl 4(%%edi),%%ecx\n\t" B> %"cmpxchg8b %%ds:(%%esi)\n\t" B> %"jnz1b" B> %: B> %: "S" (p), "D" (&inc) B> %: "memory", "cc", "eax", "edx", "ebx", "ecx"); B> % } B> % B> % uint32_t cpu_feature = 1; B> % B> % typedef volatile uint64_t *counter_u64_t; B> % B> % static void B> % #if 1 B> % inline B> % #else B> % __noinline B> % #endif B> % counter_u64_add(counter_u64_t c, int64_t inc) B> % { B> % B> % #if 1 B> %if ((cpu_feature & 1) == 1) { B> %counter_64_inc_8b(c, inc); B> %} B> % #elif 0 B> %if ((cpu_feature & 1) == 1) { B> %*c += inc; B> %} B> % #else B> %*c += inc; B> % #endif B> % } B> % B> % uint64_t mycounter[1]; B> % B> % int B> % main(void) B> % { B> %unsigned i; B> % B> %for (i = 0; i < 1861955704; i++)/* sysctl -n machdep.tsc_freq */ B> %counter_u64_add(mycounter, 1); B> %printf("%ju\n", (uintmax_t)mycounter[0]); B> % } Yes, for a single threaded userland program using "+=" is faster than all the magic that counter(9) does. But when multiple threads need to access one counter "+=" fails both with precision and with performance. Using "+=" upon a per-CPU counter is racy, since += is compiled into "load", "increment", "store" sequence and if we are not in a critical section, then this is racy. We might be removed from CPU between load and store. Entering critical section means modifying curthread, which is again a %gs based load & store. Exiting critical section has the same cost. Thus, we assume that doing a direct %gs based update on the counter is cheaper than critical_enter(); counter += foo; critical_exit(); -- Totus tuus, Glebius. ___ svn-src-head@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/svn-src-head To unsubscribe, send any mail to "svn-src-head-unsubscr...@freebsd.org"
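What "compiled into load, increment, store" means in practice, as a trivial model (illustration only, not counter(9) code): the marked window is where an untimely context switch loses an update.

#include <stdint.h>

static inline void
racy_pcpu_add(volatile uint64_t *c, uint64_t inc)
{
	uint64_t v;

	v = *c;			/* load the per-CPU slot */
	v += inc;		/* modify in a register */
	/*
	 * Window: if the thread is preempted here and the counter is
	 * touched before the store below, that update is overwritten.
	 */
	*c = v;			/* store a possibly stale value */
}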
Re: svn commit: r252032 - head/sys/amd64/include
On Fri, Jun 21, 2013 at 12:15:24PM +1000, Lawrence Stewart wrote: > Hi Kostik, > > On 06/21/13 00:30, Konstantin Belousov wrote: > > Author: kib > > Date: Thu Jun 20 14:30:04 2013 > > New Revision: 252032 > > URL: http://svnweb.freebsd.org/changeset/base/252032 > > > > Log: > > Allow immediate operand. > > > > Sponsored by: The FreeBSD Foundation > > > > Modified: > > head/sys/amd64/include/counter.h > > > > Modified: head/sys/amd64/include/counter.h > > == > > --- head/sys/amd64/include/counter.h Thu Jun 20 14:20:03 2013 > > (r252031) > > +++ head/sys/amd64/include/counter.h Thu Jun 20 14:30:04 2013 > > (r252032) > > @@ -44,7 +44,7 @@ counter_u64_add(counter_u64_t c, int64_t > > > > __asm __volatile("addq\t%1,%%gs:(%0)" > > : > > - : "r" ((char *)c - (char *)&__pcpu[0]), "r" (inc) > > + : "r" ((char *)c - (char *)&__pcpu[0]), "ri" (inc) > > : "memory", "cc"); > > } > > For mere mortals like myself, a verbose explanation of what this does, > why it's necessary and what problem(s) it solves (if any) would be most > helpful :) It does what was written in the commit message. The addq instruction is now allowed to take an immediate operand, besides a register, for the increment value. For the typical case of incrementing by the constant 1, before the commit the emitted code was like movl $1,%ecx; addq %rcx,%gs:(%rdx) and now it can be just addq $1,%gs:(%rdx). Mostly aesthetic, but it also slightly lowers register pressure.
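A userland illustration of what the constraint change buys (my sketch; the %gs-based pcpu addressing is dropped so it compiles anywhere on amd64): with "r" the compiler must first load a constant increment into a register, while with "ri" an inlined call with a constant becomes a single addq with an immediate. Note that addq only encodes 32-bit sign-extended immediates, so this relies on increments being small constants, as they are in practice. Compare the output of cc -O2 -S with the two constraint strings.

#include <stdint.h>

static inline void
ctr_add(uint64_t *c, int64_t inc)
{
	__asm __volatile("addq\t%1,%0"
	    : "+m" (*c)
	    : "ri" (inc)		/* immediate now allowed */
	    : "cc");
}

uint64_t counter;

void
bump(void)
{
	ctr_add(&counter, 1);		/* should become addq $1,counter(%rip) */
}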
Re: svn commit: r252032 - head/sys/amd64/include
Hi Kostik, On 06/21/13 00:30, Konstantin Belousov wrote: > Author: kib > Date: Thu Jun 20 14:30:04 2013 > New Revision: 252032 > URL: http://svnweb.freebsd.org/changeset/base/252032 > > Log: > Allow immediate operand. > > Sponsored by: The FreeBSD Foundation > > Modified: > head/sys/amd64/include/counter.h > > Modified: head/sys/amd64/include/counter.h > == > --- head/sys/amd64/include/counter.h Thu Jun 20 14:20:03 2013 > (r252031) > +++ head/sys/amd64/include/counter.h Thu Jun 20 14:30:04 2013 > (r252032) > @@ -44,7 +44,7 @@ counter_u64_add(counter_u64_t c, int64_t > > __asm __volatile("addq\t%1,%%gs:(%0)" > : > - : "r" ((char *)c - (char *)&__pcpu[0]), "r" (inc) > + : "r" ((char *)c - (char *)&__pcpu[0]), "ri" (inc) > : "memory", "cc"); > } For mere mortals like myself, a verbose explanation of what this does, why it's necessary and what problem(s) it solves (if any) would be most helpful :) Cheers, Lawrence ___ svn-src-head@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/svn-src-head To unsubscribe, send any mail to "svn-src-head-unsubscr...@freebsd.org"
Re: svn commit: r252032 - head/sys/amd64/include
On Fri, 21 Jun 2013, Bruce Evans wrote: On Fri, 21 Jun 2013, I wrote: On Thu, 20 Jun 2013, Konstantin Belousov wrote: ... @@ -44,7 +44,7 @@ counter_u64_add(counter_u64_t c, int64_t ... The i386 version of the counter asm doesn't support the immediate constraint for technical reasons. 64 bit counters are too large and slow to use on i386, especially when they are implemented as they are without races. Actual testing showed that it is only about twice as slow as a direct increment. With the enclosed test program (a userland version hacked on a bit to avoid pcpu), on ref10-i386 the times are: - loop overhead:1 cycle - direct unlocked increment of a uint32_t: 6 cycles - direct unlocked increment of a uint64_t: 7 cycles - non-inline function unlocked increment of a uint64_t: 7.5 cycles - counter_u64_add(): 14 cycles - non-inline counter_u64_add():18 cycles ... Actually enclosing the test program: % #include % #include % % static inline void % counter_64_inc_8b(volatile uint64_t *p, int64_t inc) % { % % __asm __volatile( % "movl %%ds:(%%esi),%%eax\n\t" % "movl %%ds:4(%%esi),%%edx\n" % "1:\n\t" % "movl %%eax,%%ebx\n\t" % "movl %%edx,%%ecx\n\t" % "addl (%%edi),%%ebx\n\t" % "adcl 4(%%edi),%%ecx\n\t" % "cmpxchg8b %%ds:(%%esi)\n\t" % "jnz 1b" % : % : "S" (p), "D" (&inc) % : "memory", "cc", "eax", "edx", "ebx", "ecx"); % } % % uint32_t cpu_feature = 1; % % typedef volatile uint64_t *counter_u64_t; % % static void % #if 1 % inline % #else % __noinline % #endif % counter_u64_add(counter_u64_t c, int64_t inc) % { % % #if 1 % if ((cpu_feature & 1) == 1) { % counter_64_inc_8b(c, inc); % } % #elif 0 % if ((cpu_feature & 1) == 1) { % *c += inc; % } % #else % *c += inc; % #endif % } % % uint64_t mycounter[1]; % % int % main(void) % { % unsigned i; % % for (i = 0; i < 1861955704; i++) /* sysctl -n machdep.tsc_freq */ % counter_u64_add(mycounter, 1); % printf("%ju\n", (uintmax_t)mycounter[0]); % } Bruce ___ svn-src-head@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/svn-src-head To unsubscribe, send any mail to "svn-src-head-unsubscr...@freebsd.org"
Re: svn commit: r252032 - head/sys/amd64/include
On Fri, 21 Jun 2013, I wrote: On Thu, 20 Jun 2013, Konstantin Belousov wrote: ... @@ -44,7 +44,7 @@ counter_u64_add(counter_u64_t c, int64_t ... The i386 version of the counter asm doesn't support the immediate constraint for technical reasons. 64 bit counters are too large and slow to use on i386, especially when they are implemented as they are without races. Actual testing showed that it is only about twice as slow as a direct increment. With the enclosed test program (a userland version hacked on a bit to avoid pcpu), on ref10-i386 the times are: - loop overhead:1 cycle - direct unlocked increment of a uint32_t: 6 cycles - direct unlocked increment of a uint64_t: 7 cycles - non-inline function unlocked increment of a uint64_t: 7.5 cycles - counter_u64_add(): 14 cycles - non-inline counter_u64_add():18 cycles Add many more when critical_enter()/exit() is needed. I thought that a direct increment of a uint32_t took only 3 cycles. This is the documented time for i486. 4 cycles latency is documented for AthlonxXP/64. The carry check for incrementing a uint64_t is pipelined on most modern i386, so it adds very little to this. Nevertheless, the correct implementation of counters, once you have the complexity of a counter API and can't just do counter++, is to use small counters and run a daemon to accumulate them in larger counters before they overflow. pcpu accesses should allow simple counter++ accesses to work for the smaller counters (except their address is in pcpu space). But I don't see how sysctl accesses can work without lots of context switches to reach strictly per-CPU context. The current accumulation of pcpu counters in places like vcnt() doesn't do that -- it accesses pcpu counters for other CPUs, so has races. The races are more serious for accumulating counters into larger ones. Then the smaller ones need to be cleared atomically with copying them. The accumulation daemon(s) can run per-CPU more easily than sysctls, since the daemons don't need to run as frequently as sysctls might. Bruce ___ svn-src-head@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/svn-src-head To unsubscribe, send any mail to "svn-src-head-unsubscr...@freebsd.org"
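A rough userland model of the "small per-CPU counters plus folding daemon" scheme described above (my sketch, with C11 atomics standing in for the kernel primitives; NCPU and the layout are placeholders). The fold uses an atomic exchange so the copy and the clear cannot be torn apart, which is the requirement stated above; a kernel variant would use a plain %fs:-relative add on the fast path instead, which is exactly what reintroduces the accumulation races being discussed.

#include <stdatomic.h>
#include <stdint.h>

#define	NCPU	4			/* placeholder, not mp_ncpus */

/* One 32-bit slot per CPU, padded to its own cache line. */
struct small_slot {
	_Atomic uint32_t	val;
	char			pad[60];
};

struct counter64 {
	struct small_slot	slot[NCPU];
	uint64_t		total;	/* folded total, daemon-owned */
};

/* Fast path: a cheap 32-bit add on the caller's own slot. */
static inline void
counter_add(struct counter64 *c, int cpu, uint32_t inc)
{
	atomic_fetch_add_explicit(&c->slot[cpu].val, inc,
	    memory_order_relaxed);
}

/*
 * Folding daemon: read-and-clear each slot atomically and push the
 * value into the wide total.  It must run often enough that no slot
 * can wrap 32 bits between passes.
 */
static void
counter_fold(struct counter64 *c)
{
	int i;

	for (i = 0; i < NCPU; i++)
		c->total += atomic_exchange_explicit(&c->slot[i].val, 0,
		    memory_order_relaxed);
}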
Re: svn commit: r252032 - head/sys/amd64/include
On Thu, 20 Jun 2013, Konstantin Belousov wrote: Log: Allow immediate operand. .. Modified: head/sys/amd64/include/counter.h == --- head/sys/amd64/include/counter.hThu Jun 20 14:20:03 2013 (r252031) +++ head/sys/amd64/include/counter.hThu Jun 20 14:30:04 2013 (r252032) @@ -44,7 +44,7 @@ counter_u64_add(counter_u64_t c, int64_t __asm __volatile("addq\t%1,%%gs:(%0)" : - : "r" ((char *)c - (char *)&__pcpu[0]), "r" (inc) + : "r" ((char *)c - (char *)&__pcpu[0]), "ri" (inc) : "memory", "cc"); } I don't like the quality of these asms. pcpu.h on both amd64 and i386 uses a hackish output operand and no memory clobber, and no cc clobber. The output operand is hackish because the variable is in a different address space. But since the variable is in a different address space, the memory clobber is very unlikely to protect anything -- it could only protect accesses to the variable without using %gs, and it is practically impossible to have such accesses in scope, since we are using %gs because it is the only way that we have in scope to access the variable. The i386 version is much larger and uglier. It has mounds of style bugs. See {amd64.i386}/include/atomic.h for normal style of multi-line asms. I don't like that either. I like a style with backslash-newlines, hard tabs and soft newlines ('...\n\$') used in i386/isa/npx.c. The backslash-newlines are and hard tabs less ugly and more readable than double quotes and soft tabs ('"...\n\t$"'). Escapes in string constants are unfortunately necessary since gcc broke its extension of allowing hard newlines in string constants. The old good way, using the gcc extension, allowed copying extern asm: directly into inline asm (code for a dummy loop): asm(" movl$100,%%ecx 1: decl%%ecx jne 1b "); Without the extension, this has to be uglified and requires a lot of editing to convert. Using backslash-newline and the same indentation for the source and generated asm file, but with trailing tabs to line up: asm(" \n\ movl$100,%%ecx \n\ 1: \n\ decl%%ecx \n\ jne 1b \n\ ");/* clobbers omitted */ The style in machine/atomic.h adds lots of quotes and breaks the formatting of the generated asm file by omitting the newlines, and has to add another ugliness to recover from that -- now it needs semicolons instead of newlines: asm( " movl$100,%%ecx; " "1:" /* should use newline/semi */ " decl%%ecx ; " " jne 1b " ); /* might be on previous line */ This requires much more editing. Finally, the style in i386/include/counter.h changes the semicolons to soft newlines to improve the output and removes most of the formatting in the source but adds soft tabs to give the indentation in the output in another way: asm( "movl $100,%%ecx\n\t" "1:\n\t" "decl %%ecx\n\t" "jne 1b\n\t" ); The i386 version of the counter asm doesn't support the immediate constraint for technical reasons. 64 bit counters are too large and slow to use on i386, especially when they are implemented as they are without races. Bruce ___ svn-src-head@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/svn-src-head To unsubscribe, send any mail to "svn-src-head-unsubscr...@freebsd.org"