Re: [PATCH v5 01/12] spinlock: A new lockref structure for lockless update of refcount
On 07/15/2013 07:47 PM, Thomas Gleixner wrote:
> On Mon, 15 Jul 2013, Waiman Long wrote:
> > On 07/15/2013 10:41 AM, Thomas Gleixner wrote:
> > > On Mon, 8 Jul 2013, Waiman Long wrote:
> > > Sigh. GENERIC means, that you use the generic implementation, ARCH
> > > means the architecture has a private implementation of that code.
> > >
> > > The generic implementation can use arch specific includes and if
> > > there is none we simply fall back to the asm-generic one.
> > I used the ARCH+GENERIC to mean using the generic code but with arch
> > specific include.
> And what's the point of that? I just explained it to you that you do
> not need the ARCH=y and GENERIC=y at all.

As I said in my previous mail, I can remove the ARCH+GENERIC option.

> > > > > Let's start with a simple version because it IS simple and
> > > > > easy to analyze and debug and then incrementally build
> > > > > improvements on it instead of creating a heuristics monster
> > > > > in the first place, i.e.:
> > > > >
> > > > >	if (!spin_is_locked(&lr->lock) && try_cmpxchg_once(lr))
> > > > >		return 0;
> > > > >	return 1;
> > > > >
> > > > > Take numbers for this on a zoo of different machines: Intel
> > > > > and AMD, old and new.
> > > > >
> > > > > Then you can add an incremental patch on that, which adds
> > > > > loops and hoops. Along with numbers on the same zoo of
> > > > > machines.
> > > > I originally tried to do a cmpxchg without waiting and there
> > > > was practically no performance gain. I believe that as soon as
> > > Well, you did not see a difference on your particular machine.
> > > Still we don't have an idea how all of this works on a set of
> > > different machines. There is a world beside 8 socket servers.
> > I understand that. I can live with try_cmpxchg_once, though doing
> > it twice will give a slightly better performance.
> I asked you several times now to explain and document the whole thing
> with numbers instead of your handwaving "slightly better performance"
> arguments.

I will provide performance data for 1 and 2 retries in my next patch
version.

> > However, without waiting for the lock to be free, this patch won't
> > do much good. To keep it simple, I can remove the ability to do
> > customization while doing cmpxchg once and wait until the lock is
> > free. Please let me know if this is acceptable to you.
> No, it's not acceptable at all if you are not able to provide data
> for 1,2,4,8 socket machines (from single core to your precious big
> boxes). It's that simple. We are not accepting patches which optimize
> for a single use case and might negatively affect 99,% of the
> existing users which have no access to this kind of hardware unless
> proven otherwise.

I did provide performance data for 1,2,4 and 8 socket configurations
in my commit message. I used numactl to simulate different socket
configurations by forcing the code to use only a subset of the total
number of sockets. I know that is not ideal, but I think it should be
close enough. I will provide performance data on a more common 2
socket test machine that I have.

Yes, I don't provide data for the single-thread use case. I will also
provide that data in my next version by measuring the average time for
doing low-level reference count updates using lock and lockless update
like what I had done for the qrwlock patch. For the single-thread
case, I don't believe any real workload will show any appreciable
difference in performance due to the differing reference count update
mechanisms.

> > > Also what's the approach to tune that? Running some random
> > > testbench and monitor the results for various settings?
> > >
> > > If that's the case we better have that as variables with generic
> > > initial values in the code, which can be modified by sysctl.
> > As I said above, I can remove the customization. I may reintroduce
> > user customization using sysctl as you suggested in a follow-up
> > patch after this one is merged.
> And I asked for a step by step approach in the first review, but you
> decided to ignore that. And now you think that it's acceptable for
> you as long as you get what you want. That's not the way it works,
> really.

I am trying to provide what you are asking for while at the same time
meeting my own needs.

> > > > > > +	getnstimeofday(&tv2);
> > > > > > +	ns = (tv2.tv_sec - tv1.tv_sec) * NSEC_PER_SEC +
> > > > > > +	     (tv2.tv_nsec - tv1.tv_nsec);
> > > > > > +	pr_info("lockref wait loop time = %lu ns\n", ns);
> > > > > Using getnstimeofday() for timestamping and spamming the logs
> > > > > with printouts? You can't be serious about that?
> > > > >
> > > > > That's what tracepoints are for. Tracing is the only way to
> > > > > get proper numbers which take preemption, interrupts etc.
> > > > > into account without hurting runtime performance.
> > > > The _SHOW_WAIT_LOOP_TIME is for debugging and instrumentation
> > > > purposes only during the development cycle. It is not supposed
> > > > to be turned on in a production system. I will document that in
> > > > the code.
> > > No, no, no! Again: That's what tracepoints are for.
> > >
> > > This kind of debugging is completely pointless. Tracepoints are
> > > designed to be low overhead and can be enabled on production
> > > systems. Your debugging depends on slow timestamps against
> > > CLOCK_REALTIME and an even slower output via printk.
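For reference, the "try the lockless update once, else take the lock"
fast path being argued about above can be modeled in plain C11 atomics.
This is a userspace sketch only: the packed lock-bit/count layout and
the names (struct lockref_model, lockref_get_fast) are illustrative,
not the actual lockref patch under review.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Userspace model of the "cmpxchg once" fast path: a lock bit and a
 * reference count packed into one atomic word, so a single
 * compare-and-exchange can bump the count while verifying that the
 * lock is not held.
 */
struct lockref_model {
	_Atomic uint64_t lock_count;	/* bit 0: locked, bits 1..63: count */
};

/* One lockless attempt; on failure the caller falls back to the lock. */
static bool lockref_get_fast(struct lockref_model *lr)
{
	uint64_t old = atomic_load(&lr->lock_count);

	if (old & 1)		/* lock held: don't even try the cmpxchg */
		return false;
	/* Succeeds only if nobody locked or updated the word meanwhile. */
	return atomic_compare_exchange_strong(&lr->lock_count, &old, old + 2);
}
```

The point of packing both fields into one word is that the cmpxchg
fails not only when another updater races, but also when the lock bit
was set between the load and the exchange.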
Re: [PATCH v5 01/12] spinlock: A new lockref structure for lockless update of refcount
On Mon, 15 Jul 2013, Waiman Long wrote:
> On 07/15/2013 10:41 AM, Thomas Gleixner wrote:
> > On Mon, 8 Jul 2013, Waiman Long wrote:
> > Sigh. GENERIC means, that you use the generic implementation, ARCH
> > means the architecture has a private implementation of that code.
> >
> > The generic implementation can use arch specific includes and if
> > there is none we simply fall back to the asm-generic one.
> I used the ARCH+GENERIC to mean using the generic code but with arch
> specific include.

And what's the point of that? I just explained it to you that you do
not need the ARCH=y and GENERIC=y at all.

> > > > Let's start with a simple version because it IS simple and easy
> > > > to analyze and debug and then incrementally build improvements
> > > > on it instead of creating a heuristics monster in the first
> > > > place, i.e.:
> > > >
> > > >	if (!spin_is_locked(&lr->lock) && try_cmpxchg_once(lr))
> > > >		return 0;
> > > >	return 1;
> > > >
> > > > Take numbers for this on a zoo of different machines: Intel and
> > > > AMD, old and new.
> > > >
> > > > Then you can add an incremental patch on that, which adds loops
> > > > and hoops. Along with numbers on the same zoo of machines.
> > > I originally tried to do a cmpxchg without waiting and there was
> > > practically no performance gain. I believe that as soon as
> > Well, you did not see a difference on your particular machine.
> > Still we don't have an idea how all of this works on a set of
> > different machines. There is a world beside 8 socket servers.
> I understand that. I can live with try_cmpxchg_once, though doing it
> twice will give a slightly better performance. However, without
> waiting for the lock to be free, this patch won't do much good.

I asked you several times now to explain and document the whole thing
with numbers instead of your handwaving "slightly better performance"
arguments. I CANNOT verify that at all simply because I have no access
to a 8 socket machine.

Hint: Ask your boss to send me one :)

> To keep it simple, I can remove the ability to do customization while
> doing cmpxchg once and wait until the lock is free. Please let me
> know if this is acceptable to you.

No, it's not acceptable at all if you are not able to provide data for
1,2,4,8 socket machines (from single core to your precious big
boxes). It's that simple. We are not accepting patches which optimize
for a single use case and might negatively affect 99,% of the existing
users which have no access to this kind of hardware unless proven
otherwise.

As I told you versus the qrwlocks: You are missing to explain WHY it
improves your particular workloads and WHY this change won't affect
other workloads. All I can see from your explanations is: Works better
on 8 socket big iron. I have yet to see any numbers that this won't
affect the vast majority of users.

> > Also what's the approach to tune that? Running some random
> > testbench and monitor the results for various settings?
> >
> > If that's the case we better have that as variables with generic
> > initial values in the code, which can be modified by sysctl.
> As I said above, I can remove the customization. I may reintroduce
> user customization using sysctl as you suggested in a follow-up patch
> after this one is merged.

And I asked for a step by step approach in the first review, but you
decided to ignore that. And now you think that it's acceptable for you
as long as you get what you want. That's not the way it works, really.

> > > > > +	getnstimeofday(&tv2);
> > > > > +	ns = (tv2.tv_sec - tv1.tv_sec) * NSEC_PER_SEC +
> > > > > +	     (tv2.tv_nsec - tv1.tv_nsec);
> > > > > +	pr_info("lockref wait loop time = %lu ns\n", ns);
> > > > Using getnstimeofday() for timestamping and spamming the logs
> > > > with printouts? You can't be serious about that?
> > > >
> > > > That's what tracepoints are for. Tracing is the only way to get
> > > > proper numbers which take preemption, interrupts etc. into
> > > > account without hurting runtime performance.
> > > The _SHOW_WAIT_LOOP_TIME is for debugging and instrumentation
> > > purposes only during the development cycle. It is not supposed to
> > > be turned on in a production system. I will document that in the
> > > code.
> > No, no, no! Again: That's what tracepoints are for.
> >
> > This kind of debugging is completely pointless. Tracepoints are
> > designed to be low overhead and can be enabled on production
> > systems.
> >
> > Your debugging depends on slow timestamps against CLOCK_REALTIME
> > and an even slower output via printk. How useful is that if you
> > want to really instrument the behaviour of this code?
> This code is not critical and I can certainly remove it.

Did you even try to understand what I wrote? I did not ask you to
remove instrumentation. I asked you to use useful instrumentation
instead of some totally useless crap.

Thanks,

	tglx
Re: [PATCH v5 01/12] spinlock: A new lockref structure for lockless update of refcount
On 07/15/2013 10:41 AM, Thomas Gleixner wrote:
> On Mon, 8 Jul 2013, Waiman Long wrote:
> > On 07/05/2013 02:59 PM, Thomas Gleixner wrote:
> > > On Fri, 5 Jul 2013, Waiman Long wrote:
> > I think it is OK to add the GENERIC option, but I would like to
> > make available a slightly different set of options:
> > 1) Always take the lock
> > 2) Use the generic implementation with the default parameters
> > 3) Use the generic implementation with a customized set of
> >    parameters
> > 4) Use an architecture specific implementation.
> >
> > 2) set only GENERIC_SPINLOCK_REFCOUNT
> > 3) set both GENERIC_SPINLOCK_REFCOUNT and ARCH_SPINLOCK_REFCOUNT
> > 4) set only ARCH_SPINLOCK_REFCOUNT
> >
> > The customized parameters will be set in the
> > "asm/spinlock_refcount.h" file. Currently, there are two parameters
> > that can be customized for each architecture:
> > 1) How much time will the function wait until the lock is free
> > 2) How many attempts to do a lockless cmpxchg to update the
> >    reference count
> Sigh. GENERIC means, that you use the generic implementation, ARCH
> means the architecture has a private implementation of that code.
>
> The generic implementation can use arch specific includes and if
> there is none we simply fall back to the asm-generic one.

I used the ARCH+GENERIC to mean using the generic code but with arch
specific include.

> > > Let's start with a simple version because it IS simple and easy
> > > to analyze and debug and then incrementally build improvements on
> > > it instead of creating a heuristics monster in the first place,
> > > i.e.:
> > >
> > >	if (!spin_is_locked(&lr->lock) && try_cmpxchg_once(lr))
> > >		return 0;
> > >	return 1;
> > >
> > > Take numbers for this on a zoo of different machines: Intel and
> > > AMD, old and new.
> > >
> > > Then you can add an incremental patch on that, which adds loops
> > > and hoops. Along with numbers on the same zoo of machines.
> > I originally tried to do a cmpxchg without waiting and there was
> > practically no performance gain. I believe that as soon as
> Well, you did not see a difference on your particular machine. Still
> we don't have an idea how all of this works on a set of different
> machines. There is a world beside 8 socket servers.

I understand that. I can live with try_cmpxchg_once, though doing it
twice will give a slightly better performance. However, without
waiting for the lock to be free, this patch won't do much good. To
keep it simple, I can remove the ability to do customization while
doing cmpxchg once and wait until the lock is free. Please let me know
if this is acceptable to you.

> > contention happens, it will force all the upcoming reference count
> > update threads to take the lock eliminating any potential
> > performance gain that we can have. To make it simple, I can change
> > the default to wait indefinitely until the lock is free instead of
> > looping a certain number of times, but I still like the option of
> > letting each architecture decide how many times they want to try. I
> > agree that the actual waiting time even for one architecture
> > depends on the actual CPU chip that is being used. I have done some
> > experiments on that. As long as the count is large enough,
> > increasing the loop count doesn't have any significant impact on
> > performance any more. The main reason for having a finite time is
> > to avoid having the waiting thread wait virtually forever if the
> > lock happens to be busy for a long time.
> Again, if we make this tunable then we still want documentation for
> the behaviour on small, medium and large machines.
>
> Also what's the approach to tune that? Running some random testbench
> and monitor the results for various settings?
>
> If that's the case we better have that as variables with generic
> initial values in the code, which can be modified by sysctl.

As I said above, I can remove the customization. I may reintroduce
user customization using sysctl as you suggested in a follow-up patch
after this one is merged.

> > > > +	getnstimeofday(&tv2);
> > > > +	ns = (tv2.tv_sec - tv1.tv_sec) * NSEC_PER_SEC +
> > > > +	     (tv2.tv_nsec - tv1.tv_nsec);
> > > > +	pr_info("lockref wait loop time = %lu ns\n", ns);
> > > Using getnstimeofday() for timestamping and spamming the logs
> > > with printouts? You can't be serious about that?
> > >
> > > That's what tracepoints are for. Tracing is the only way to get
> > > proper numbers which take preemption, interrupts etc. into
> > > account without hurting runtime performance.
> > The _SHOW_WAIT_LOOP_TIME is for debugging and instrumentation
> > purposes only during the development cycle. It is not supposed to
> > be turned on in a production system. I will document that in the
> > code.
> No, no, no! Again: That's what tracepoints are for.
>
> This kind of debugging is completely pointless. Tracepoints are
> designed to be low overhead and can be enabled on production systems.
>
> Your debugging depends on slow timestamps against CLOCK_REALTIME and
> an even slower output via printk. How useful is that if you want to
> really instrument the behaviour of this code?

This code is not critical and I can certainly remove it.

Regards,
Longman
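The tunable variant being debated (bounded retries plus waiting for
the lock to be free, rather than a single attempt) can be sketched the
same way. Again a self-contained userspace model with illustrative
names and parameters; in the posted patch these knobs were the
per-architecture wait-time and retry-count settings:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Same packed model as before: bit 0 is the lock, bits 1..63 the count. */
struct lockref_model {
	_Atomic uint64_t lock_count;
};

/*
 * Retry the lockless cmpxchg up to max_tries times, spinning up to
 * max_spins loads while the lock bit is set.  Returns false when the
 * caller should fall back to taking the spinlock.
 */
static bool lockref_get_retry(struct lockref_model *lr,
			      int max_tries, int max_spins)
{
	for (int attempt = 0; attempt < max_tries; attempt++) {
		uint64_t old = atomic_load(&lr->lock_count);
		int spins = 0;

		/* Wait (boundedly) for the lock bit to clear ... */
		while ((old & 1) && ++spins < max_spins)
			old = atomic_load(&lr->lock_count);
		if (old & 1)
			return false;	/* still busy: give up, take the lock */

		/* ... then attempt the lockless count update. */
		if (atomic_compare_exchange_strong(&lr->lock_count,
						   &old, old + 2))
			return true;
	}
	return false;			/* lost too many races: fall back */
}
```

The finite bounds are exactly the point of contention above: they
avoid spinning "virtually forever" on a long-held lock, at the price
of two magic numbers that would need per-machine data to justify.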
Re: [PATCH v5 01/12] spinlock: A new lockref structure for lockless update of refcount
On Mon, 8 Jul 2013, Waiman Long wrote:
> On 07/05/2013 02:59 PM, Thomas Gleixner wrote:
> > On Fri, 5 Jul 2013, Waiman Long wrote:
> I think it is OK to add the GENERIC option, but I would like to make
> available a slightly different set of options:
> 1) Always take the lock
> 2) Use the generic implementation with the default parameters
> 3) Use the generic implementation with a customized set of parameters
> 4) Use an architecture specific implementation.
>
> 2) set only GENERIC_SPINLOCK_REFCOUNT
> 3) set both GENERIC_SPINLOCK_REFCOUNT and ARCH_SPINLOCK_REFCOUNT
> 4) set only ARCH_SPINLOCK_REFCOUNT
>
> The customized parameters will be set in the
> "asm/spinlock_refcount.h" file. Currently, there are two parameters
> that can be customized for each architecture:
> 1) How much time will the function wait until the lock is free
> 2) How many attempts to do a lockless cmpxchg to update the reference
>    count

Sigh. GENERIC means, that you use the generic implementation, ARCH
means the architecture has a private implementation of that code.

The generic implementation can use arch specific includes and if there
is none we simply fall back to the asm-generic one.

> > Let's start with a simple version because it IS simple and easy to
> > analyze and debug and then incrementally build improvements on it
> > instead of creating a heuristics monster in the first place, i.e.:
> >
> >	if (!spin_is_locked(&lr->lock) && try_cmpxchg_once(lr))
> >		return 0;
> >	return 1;
> >
> > Take numbers for this on a zoo of different machines: Intel and
> > AMD, old and new.
> >
> > Then you can add an incremental patch on that, which adds loops and
> > hoops. Along with numbers on the same zoo of machines.
> I originally tried to do a cmpxchg without waiting and there was
> practically no performance gain. I believe that as soon as

Well, you did not see a difference on your particular machine. Still
we don't have an idea how all of this works on a set of different
machines. There is a world beside 8 socket servers.

> contention happens, it will force all the upcoming reference count
> update threads to take the lock eliminating any potential performance
> gain that we can have. To make it simple, I can change the default to
> wait indefinitely until the lock is free instead of looping a certain
> number of times, but I still like the option of letting each
> architecture decide how many times they want to try. I agree that the
> actual waiting time even for one architecture depends on the actual
> CPU chip that is being used. I have done some experiments on that. As
> long as the count is large enough, increasing the loop count doesn't
> have any significant impact on performance any more. The main reason
> for having a finite time is to avoid having the waiting thread wait
> virtually forever if the lock happens to be busy for a long time.

Again, if we make this tunable then we still want documentation for
the behaviour on small, medium and large machines.

Also what's the approach to tune that? Running some random testbench
and monitor the results for various settings?

If that's the case we better have that as variables with generic
initial values in the code, which can be modified by sysctl.

> > > +	getnstimeofday(&tv2);
> > > +	ns = (tv2.tv_sec - tv1.tv_sec) * NSEC_PER_SEC +
> > > +	     (tv2.tv_nsec - tv1.tv_nsec);
> > > +	pr_info("lockref wait loop time = %lu ns\n", ns);
> > Using getnstimeofday() for timestamping and spamming the logs with
> > printouts? You can't be serious about that?
> >
> > That's what tracepoints are for. Tracing is the only way to get
> > proper numbers which take preemption, interrupts etc. into account
> > without hurting runtime performance.
> The _SHOW_WAIT_LOOP_TIME is for debugging and instrumentation
> purposes only during the development cycle. It is not supposed to be
> turned on in a production system. I will document that in the code.

No, no, no! Again: That's what tracepoints are for.

This kind of debugging is completely pointless. Tracepoints are
designed to be low overhead and can be enabled on production systems.

Your debugging depends on slow timestamps against CLOCK_REALTIME and
an even slower output via printk. How useful is that if you want to
really instrument the behaviour of this code?

Thanks,

	tglx
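The tracepoint-based instrumentation Thomas asks for would look
roughly like the following kernel-side sketch. Everything here is
hypothetical (the event name lockref_wait, its fields, and the header
location are invented for illustration; no such tracepoint was part of
the posted series) — it only shows the shape of the low-overhead
alternative to getnstimeofday() + pr_info():

```
/* Hypothetical include/trace/events/lockref.h fragment. */
#undef TRACE_SYSTEM
#define TRACE_SYSTEM lockref

#include <linux/tracepoint.h>

TRACE_EVENT(lockref_wait,

	TP_PROTO(unsigned int retries, u64 wait_ns),

	TP_ARGS(retries, wait_ns),

	TP_STRUCT__entry(
		__field(unsigned int,	retries)
		__field(u64,		wait_ns)
	),

	TP_fast_assign(
		__entry->retries = retries;
		__entry->wait_ns = wait_ns;
	),

	/*
	 * Rendered in the trace ring buffer and readable via ftrace or
	 * perf, on production systems, instead of spamming the printk
	 * log with CLOCK_REALTIME deltas.
	 */
	TP_printk("retries=%u wait_ns=%llu",
		  __entry->retries,
		  (unsigned long long)__entry->wait_ns)
);
```

The call site would then be a single `trace_lockref_wait(retries, ns)`
that compiles to a patched-out no-op when the event is disabled, which
is the low-overhead property being argued for above.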
Re: [PATCH v5 01/12] spinlock: A new lockref structure for lockless update of refcount
On Mon, 8 Jul 2013, Waiman Long wrote: On 07/05/2013 02:59 PM, Thomas Gleixner wrote: On Fri, 5 Jul 2013, Waiman Long wrote: I think it is OK to add the GENERIC option, but I would like to make available a slightly different set of options: 1) Always take the lock 2) Use the generic implementation with the default parameters 3) Use the generic implementation with a customized set of parameters 4) Use an architecture specific implementation. 2) set only GENERIC_SPINLOCK_REFCOUNT 3) set both GENERIC_SPINLOCK_REFCOUNT and ARCH_SPINLOCK_REFCOUNT 4) set only ARCH_SPINLOCK_REFCOUNT The customized parameters will be set in the asm/spinlock_refcount.h file. Currently ,there is 2 parameters that can be customized for each architecture: 1) How much time will the function wait until the lock is free 2) How many attempts to do a lockless cmpxchg to update reference count Sigh. GENERIC means, that you use the generic implementation, ARCH means the architecture has a private implementation of that code. The generic implementation can use arch specific includes and if there is none we simply fallback to the asm-generic one. Let's start with a simple version because it IS simple and easy to analyze and debug and then incrementally build improvements on it instead of creating an heuristics monster in the first place, i.e.: if (!spin_is_locked(lr-lock) try_cmpxchg_once(lr)) return 0; return 1; Take numbers for this on a zoo of different machines: Intel and AMD, old and new. Then you can add an incremental patch on that, which add loops and hoops. Along with numbers on the same zoo of machines. I originally tried to do a cmpxchg without waiting and there was practically no performance gain. I believe that as soon as Well, you did not see a difference on your particular machine. Still we don't have an idea how all of this works on a set of different machines. There is a world beside 8 socket servers. 
contention happens, it will force all the upcoming reference count update threads to take the lock eliminating any potential performance gain that we can have. To make it simple, I can change the default to wait indefinitely until the lock is free instead of looping a certain number of times, but I still like the option of letting each architecture to decide how many times they want to try. I agree that the actual waiting time even for one architecture is depending on the actual CPU chip that is being used. I have done some experiment on that. As long as the count is large enough, increasing the loop count doesn't have any significant impact on performance any more. The main reason for having a finite time is to avoid having the waiting thread to wait virtually forever if the lock happens to be busy for a long time. Again, if we make this tunable then we still want documentation for the behaviour on small, medium and large machines. Also what's the approach to tune that? Running some random testbench and monitor the results for various settings? If that's the case we better have a that as variables with generic initial values in the code, which can be modified by sysctl. + getnstimeofday(tv2); + ns = (tv2.tv_sec - tv1.tv_sec) * NSEC_PER_SEC + + (tv2.tv_nsec - tv1.tv_nsec); + pr_info(lockref wait loop time = %lu ns\n, ns); Using getnstimeofday() for timestamping and spamming the logs with printouts? You can't be serious about that? Thats what tracepoints are for. Tracing is the only way to get proper numbers which take preemption, interrupts etc. into account without hurting runtime performace. The _SHOW_WAIT_LOOP_TIME is for debugging and instrumentation purpose only during development cycle. It is not supposed to be turned on at production system. I will document that in the code. No, no, no! Again: That's what tracepoints are for. This kind of debugging is completely pointless. Tracepoints are designed to be low overhead and can be enabled on production systems. 
Your debugging depends on slow timestamps against CLOCK_REALTIME and an even slower output via printk. How useful is that if you want to really instrument the behaviour of this code? Thanks, tglx -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v5 01/12] spinlock: A new lockref structure for lockless update of refcount
On 07/15/2013 10:41 AM, Thomas Gleixner wrote: On Mon, 8 Jul 2013, Waiman Long wrote: On 07/05/2013 02:59 PM, Thomas Gleixner wrote: On Fri, 5 Jul 2013, Waiman Long wrote: I think it is OK to add the GENERIC option, but I would like to make available a slightly different set of options: 1) Always take the lock 2) Use the generic implementation with the default parameters 3) Use the generic implementation with a customized set of parameters 4) Use an architecture specific implementation. 2) set only GENERIC_SPINLOCK_REFCOUNT 3) set both GENERIC_SPINLOCK_REFCOUNT and ARCH_SPINLOCK_REFCOUNT 4) set only ARCH_SPINLOCK_REFCOUNT The customized parameters will be set in the asm/spinlock_refcount.h file. Currently ,there is 2 parameters that can be customized for each architecture: 1) How much time will the function wait until the lock is free 2) How many attempts to do a lockless cmpxchg to update reference count Sigh. GENERIC means, that you use the generic implementation, ARCH means the architecture has a private implementation of that code. The generic implementation can use arch specific includes and if there is none we simply fallback to the asm-generic one. I used the ARCH+GENERIC to mean using the generic code but with arch specific include. Let's start with a simple version because it IS simple and easy to analyze and debug and then incrementally build improvements on it instead of creating an heuristics monster in the first place, i.e.: if (!spin_is_locked(lr-lock) try_cmpxchg_once(lr)) return 0; return 1; Take numbers for this on a zoo of different machines: Intel and AMD, old and new. Then you can add an incremental patch on that, which add loops and hoops. Along with numbers on the same zoo of machines. I originally tried to do a cmpxchg without waiting and there was practically no performance gain. I believe that as soon as Well, you did not see a difference on your particular machine. 
Still we don't have an idea how all of this works on a set of different machines. There is a world beside 8 socket servers. I understand that. I can live with try_cmpxchg_once, though doing it twice will give a slightly better performance. However, without waiting for the lock to be free, this patch won't do much good. To keep it simple, I can remove the ability to do customization while doing cmpxchg once and wait until the lock is free. Please let me know if this is acceptable to you. contention happens, it will force all the upcoming reference count update threads to take the lock eliminating any potential performance gain that we can have. To make it simple, I can change the default to wait indefinitely until the lock is free instead of looping a certain number of times, but I still like the option of letting each architecture to decide how many times they want to try. I agree that the actual waiting time even for one architecture is depending on the actual CPU chip that is being used. I have done some experiment on that. As long as the count is large enough, increasing the loop count doesn't have any significant impact on performance any more. The main reason for having a finite time is to avoid having the waiting thread to wait virtually forever if the lock happens to be busy for a long time. Again, if we make this tunable then we still want documentation for the behaviour on small, medium and large machines. Also what's the approach to tune that? Running some random testbench and monitor the results for various settings? If that's the case we better have a that as variables with generic initial values in the code, which can be modified by sysctl. As I said above, I can remove the customization. I may reintroduce user customization using sysctl as you suggested in the a follow up patch after this one is merged. 
+ getnstimeofday(tv2); + ns = (tv2.tv_sec - tv1.tv_sec) * NSEC_PER_SEC + +(tv2.tv_nsec - tv1.tv_nsec); + pr_info(lockref wait loop time = %lu ns\n, ns); Using getnstimeofday() for timestamping and spamming the logs with printouts? You can't be serious about that? Thats what tracepoints are for. Tracing is the only way to get proper numbers which take preemption, interrupts etc. into account without hurting runtime performace. The _SHOW_WAIT_LOOP_TIME is for debugging and instrumentation purpose only during development cycle. It is not supposed to be turned on at production system. I will document that in the code. No, no, no! Again: That's what tracepoints are for. This kind of debugging is completely pointless. Tracepoints are designed to be low overhead and can be enabled on production systems. Your debugging depends on slow timestamps against CLOCK_REALTIME and an even slower output via printk. How useful is that if you want to really instrument the behaviour of this code? This code is not critical and I can certainly remove it. Regards, Longman -- To unsubscribe from
Re: [PATCH v5 01/12] spinlock: A new lockref structure for lockless update of refcount
On Mon, 15 Jul 2013, Waiman Long wrote:

On 07/15/2013 10:41 AM, Thomas Gleixner wrote:

On Mon, 8 Jul 2013, Waiman Long wrote:

Sigh. GENERIC means that you use the generic implementation; ARCH means the architecture has a private implementation of that code. The generic implementation can use arch-specific includes, and if there is none we simply fall back to the asm-generic one.

I used ARCH+GENERIC to mean using the generic code but with an arch-specific include.

And what's the point of that? I just explained to you that you do not need ARCH=y and GENERIC=y at all.

Let's start with a simple version because it IS simple and easy to analyze and debug, and then incrementally build improvements on it instead of creating a heuristics monster in the first place, i.e.:

	if (!spin_is_locked(&lr->lock) && try_cmpxchg_once(lr))
		return 0;
	return 1;

Take numbers for this on a zoo of different machines: Intel and AMD, old and new. Then you can add an incremental patch on that, which adds loops and hoops, along with numbers on the same zoo of machines.

I originally tried to do a cmpxchg without waiting and there was practically no performance gain. I believe that as soon as ...

Well, you did not see a difference on your particular machine. Still we don't have an idea how all of this works on a set of different machines. There is a world besides 8-socket servers.

I understand that. I can live with try_cmpxchg_once, though doing it twice will give slightly better performance. However, without ...

I asked you several times now to explain and document the whole thing with numbers instead of your handwaving "slightly better performance" arguments. I CANNOT verify that at all, simply because I have no access to an 8-socket machine. Hint: Ask your boss to send me one :)

... waiting for the lock to be free, this patch won't do much good. To keep it simple, I can remove the ability to do customization while doing cmpxchg once and waiting until the lock is free. Please let me know if this is acceptable to you.

No, it's not acceptable at all if you are not able to provide data for 1, 2, 4, 8 socket machines (from single core to your precious big boxes). It's that simple. We are not accepting patches which optimize for a single use case and might negatively affect 99.9% of the existing users, who have no access to this kind of hardware, unless proven otherwise.

As I told you versus the qrwlocks: you have yet to explain WHY it improves your particular workloads and WHY this change won't affect other workloads. All I can see from your explanations is: works better on 8-socket big iron. I have yet to see any numbers showing that this won't affect the vast majority of users.

Also, what's the approach to tune that? Running some random testbench and monitoring the results for various settings? If that's the case, we better have that as variables with generic initial values in the code, which can be modified by sysctl.

As I said above, I can remove the customization. I may reintroduce user customization using sysctl, as you suggested, in a follow-up patch after this one is merged.

And I asked for a step-by-step approach in the first review, but you decided to ignore that. And now you think that it's acceptable for you as long as you get what you want. That's not the way it works, really.

+	getnstimeofday(&tv2);
+	ns = (tv2.tv_sec - tv1.tv_sec) * NSEC_PER_SEC +
+	     (tv2.tv_nsec - tv1.tv_nsec);
+	pr_info("lockref wait loop time = %lu ns\n", ns);

Using getnstimeofday() for timestamping and spamming the logs with printouts? You can't be serious about that? That's what tracepoints are for. Tracing is the only way to get proper numbers which take preemption, interrupts etc. into account without hurting runtime performance.

The _SHOW_WAIT_LOOP_TIME is for debugging and instrumentation purposes only during the development cycle. It is not supposed to be turned on in a production system. I will document that in the code.

No, no, no!

Again: that's what tracepoints are for. This kind of debugging is completely pointless. Tracepoints are designed to be low overhead and can be enabled on production systems. Your debugging depends on slow timestamps against CLOCK_REALTIME and an even slower output via printk. How useful is that if you want to really instrument the behaviour of this code?

This code is not critical and I can certainly remove it.

Did you even try to understand what I wrote? I did not ask you to remove the instrumentation. I asked you to use useful instrumentation instead of some totally useless crap.

Thanks,

	tglx
I am trying to provide what you are asking for while at the same time meeting my own needs.

+	getnstimeofday(&tv2);
+	ns = (tv2.tv_sec - tv1.tv_sec) * NSEC_PER_SEC +
+	     (tv2.tv_nsec - tv1.tv_nsec);
+	pr_info("lockref wait loop time = %lu ns\n", ns);

Using getnstimeofday() for timestamping and spamming the logs with printouts? You can't be serious about that? That's what tracepoints are for. Tracing is the only way to get proper numbers which take preemption, interrupts etc. into account without hurting runtime performance.

The _SHOW_WAIT_LOOP_TIME is for debugging and instrumentation purposes only during the development cycle. It is not supposed to be turned on in a production system. I will document that in the code.

No, no, no! Again: that's what tracepoints are for. This kind of debugging is completely pointless. Tracepoints are designed to be low overhead and can be enabled on production systems. Your debugging depends on slow timestamps against CLOCK_REALTIME and an even slower output via printk.
Re: [PATCH v5 01/12] spinlock: A new lockref structure for lockless update of refcount
On 07/05/2013 02:59 PM, Thomas Gleixner wrote:

> On Fri, 5 Jul 2013, Waiman Long wrote:
>> + * If the spinlock & reference count optimization feature is disabled,
>> + * the spinlock and reference count are accessed separately on their own.
>> + */
>> +struct lockref {
>> +	unsigned int refcnt;	/* Reference count */
>> +	spinlock_t lock;
>> +};
>> +
>> +/*
>> + * Struct lockref helper functions
>> + */
>> +/*
>
> Function documentation starts with /**.

I will fix that.

>> + * lockref_get - increments reference count while not locked
>
> This should be: Increments reference count unconditionally.

Will change that.

>> + * @lockcnt: pointer to lockref structure
>> + */
>> +static __always_inline void
>> +lockref_get(struct lockref *lockcnt)
>
> Please avoid these line breaks if the line fits in 80 chars.

Will make the change.

>> +{
>> +	spin_lock(&lockcnt->lock);
>> +	lockcnt->refcnt++;
>> +	spin_unlock(&lockcnt->lock);
>> +}
>> +/*
>> + * lockref_put_or_locked - decrements count unless count <= 1 before decrement
>> + *			   otherwise the lock will be taken
>
> lockref_put_or_lock, please. Docbook cannot work with multiline comments
> for the function name. So make that shorter and add a longer explanation
> below the @argument docs.

Will fix that.

>> + * @lockcnt: pointer to lockref structure
>> + * Return: 1 if count updated successfully or 0 if count <= 1 and lock taken
>> + */
>> +static __always_inline int
>> +lockref_put_or_locked(struct lockref *lockcnt)
>> +{
>> +	spin_lock(&lockcnt->lock);
>> +	if (likely(lockcnt->refcnt > 1)) {
>> +		lockcnt->refcnt--;
>> +		spin_unlock(&lockcnt->lock);
>> +		return 1;
>> +	}
>> +	return 0;	/* Count is 1 & lock taken */
>
> Please no tail comments. They are horrible to parse and it's obvious
> from the code.

Will remove the tail comments.

>> +}
>> +
>> +#endif /* CONFIG_SPINLOCK_REFCOUNT */
>> +#endif /* __LINUX_SPINLOCK_REFCOUNT_H */
>> diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
>> index 44511d1..d1f8670 100644
>> --- a/kernel/Kconfig.locks
>> +++ b/kernel/Kconfig.locks
>> @@ -223,3 +223,8 @@ endif
>>  config MUTEX_SPIN_ON_OWNER
>>  	def_bool y
>>  	depends on SMP && !DEBUG_MUTEXES
>> +
>> +config SPINLOCK_REFCOUNT
>> +	def_bool y
>> +	depends on ARCH_SPINLOCK_REFCOUNT && SMP
>
> This looks wrong. We want three options:
>
>  1) Always take the lock
>  2) Use the generic implementation
>  3) Use an architecture specific implementation
>
> So we want:
>
> config GENERIC_SPINLOCK_REFCOUNT
> 	bool
>
> config ARCH_SPINLOCK_REFCOUNT
> 	bool
>
> config SPINLOCK_REFCOUNT
> 	def_bool y
> 	depends on GENERIC_SPINLOCK_REFCOUNT || ARCH_SPINLOCK_REFCOUNT
> 	depends on ...
>
> So an architecture can select GENERIC_SPINLOCK_REFCOUNT to get the
> generic implementation, or ARCH_SPINLOCK_REFCOUNT if it provides its
> own special version. And lib/spinlock_refcount.o depends on
> GENERIC_SPINLOCK_REFCOUNT.

I think it is OK to add the GENERIC option, but I would like to make available a slightly different set of options:

1) Always take the lock
2) Use the generic implementation with the default parameters
3) Use the generic implementation with a customized set of parameters
4) Use an architecture specific implementation

For 2), set only GENERIC_SPINLOCK_REFCOUNT. For 3), set both GENERIC_SPINLOCK_REFCOUNT and ARCH_SPINLOCK_REFCOUNT. For 4), set only ARCH_SPINLOCK_REFCOUNT.

The customized parameters will be set in the "asm/spinlock_refcount.h" file. Currently, there are two parameters that can be customized for each architecture:

1) How much time the function will wait until the lock is free
2) How many attempts to do a lockless cmpxchg to update the reference count

+/*
+ * The maximum number of waiting spins while the lock is being acquired,
+ * before trying to attempt a lockless update. The purpose of the timeout
+ * is to limit the amount of unfairness to the thread that is doing the
+ * reference count update. Otherwise, it is theoretically possible for that
+ * thread to be starved for a really long time, causing all kinds of
+ * problems. If it times out and the lock is still not free, the code will
+ * fall back to the traditional way of queuing up to acquire the lock
+ * before updating the count.
+ *
+ * The actual time spent in the wait loop very much depends on the CPU being
+ * used. On a 2.4GHz Westmere CPU, the execution time of a PAUSE instruction
+ * (cpu_relax) in a 4k loop is about 16us. The lock checking and looping
+ * overhead is about 8us. With 4 cpu_relax's in the loop, it will wait
+ * about 72us before the count reaches 0. With cacheline contention, the
+ * wait time can go up to 3x as much (about 210us). Having multiple
+ * cpu_relax's in the wait loop does seem to reduce cacheline contention
+ * somewhat and give slightly better performance.
+ *
+ * The preset timeout value is rather arbitrary and really depends on the CPU
+ * being used. If customization is needed, we could add a config variable for
+ * that. The exact value is not that important. A longer wait time will
+ * increase the chance of doing a lockless update, but it results in more
+ * unfairness to the waiting thread, and vice versa.
Re: [PATCH v5 01/12] spinlock: A new lockref structure for lockless update of refcount
On Fri, 5 Jul 2013, Waiman Long wrote:

> + * If the spinlock & reference count optimization feature is disabled,
> + * the spinlock and reference count are accessed separately on their own.
> + */
> +struct lockref {
> +	unsigned int refcnt;	/* Reference count */
> +	spinlock_t lock;
> +};
> +
> +/*
> + * Struct lockref helper functions
> + */
> +/*

Function documentation starts with /**.

> + * lockref_get - increments reference count while not locked

This should be: Increments reference count unconditionally.

> + * @lockcnt: pointer to lockref structure
> + */
> +static __always_inline void
> +lockref_get(struct lockref *lockcnt)

Please avoid these line breaks if the line fits in 80 chars.

> +{
> +	spin_lock(&lockcnt->lock);
> +	lockcnt->refcnt++;
> +	spin_unlock(&lockcnt->lock);
> +}
> +/*
> + * lockref_put_or_locked - decrements count unless count <= 1 before decrement
> + *			   otherwise the lock will be taken

lockref_put_or_lock, please. Docbook cannot work with multiline comments for the function name. So make that shorter and add a longer explanation below the @argument docs.

> + * @lockcnt: pointer to lockref structure
> + * Return: 1 if count updated successfully or 0 if count <= 1 and lock taken
> + */
> +static __always_inline int
> +lockref_put_or_locked(struct lockref *lockcnt)
> +{
> +	spin_lock(&lockcnt->lock);
> +	if (likely(lockcnt->refcnt > 1)) {
> +		lockcnt->refcnt--;
> +		spin_unlock(&lockcnt->lock);
> +		return 1;
> +	}
> +	return 0;	/* Count is 1 & lock taken */

Please no tail comments. They are horrible to parse and it's obvious from the code.

> +}
> +
> +#endif /* CONFIG_SPINLOCK_REFCOUNT */
> +#endif /* __LINUX_SPINLOCK_REFCOUNT_H */
> diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
> index 44511d1..d1f8670 100644
> --- a/kernel/Kconfig.locks
> +++ b/kernel/Kconfig.locks
> @@ -223,3 +223,8 @@ endif
>  config MUTEX_SPIN_ON_OWNER
>  	def_bool y
>  	depends on SMP && !DEBUG_MUTEXES
> +
> +config SPINLOCK_REFCOUNT
> +	def_bool y
> +	depends on ARCH_SPINLOCK_REFCOUNT && SMP

This looks wrong. We want three options:

 1) Always take the lock
 2) Use the generic implementation
 3) Use an architecture specific implementation

So we want:

config GENERIC_SPINLOCK_REFCOUNT
	bool

config ARCH_SPINLOCK_REFCOUNT
	bool

config SPINLOCK_REFCOUNT
	def_bool y
	depends on GENERIC_SPINLOCK_REFCOUNT || ARCH_SPINLOCK_REFCOUNT
	depends on ...

So an architecture can select GENERIC_SPINLOCK_REFCOUNT to get the generic implementation, or ARCH_SPINLOCK_REFCOUNT if it provides its own special version. And lib/spinlock_refcount.o depends on GENERIC_SPINLOCK_REFCOUNT.

> +/*
> + * The maximum number of waiting spins while the lock is being acquired,
> + * before trying to attempt a lockless update. The purpose of the timeout
> + * is to limit the amount of unfairness to the thread that is doing the
> + * reference count update. Otherwise, it is theoretically possible for that
> + * thread to be starved for a really long time, causing all kinds of
> + * problems. If it times out and the lock is still not free, the code will
> + * fall back to the traditional way of queuing up to acquire the lock
> + * before updating the count.
> + *
> + * The actual time spent in the wait loop very much depends on the CPU being
> + * used. On a 2.4GHz Westmere CPU, the execution time of a PAUSE instruction
> + * (cpu_relax) in a 4k loop is about 16us. The lock checking and looping
> + * overhead is about 8us. With 4 cpu_relax's in the loop, it will wait
> + * about 72us before the count reaches 0. With cacheline contention, the
> + * wait time can go up to 3x as much (about 210us). Having multiple
> + * cpu_relax's in the wait loop does seem to reduce cacheline contention
> + * somewhat and give slightly better performance.
> + *
> + * The preset timeout value is rather arbitrary and really depends on the CPU
> + * being used. If customization is needed, we could add a config variable for
> + * that. The exact value is not that important. A longer wait time will
> + * increase the chance of doing a lockless update, but it results in more
> + * unfairness to the waiting thread, and vice versa.

As I told you before, it's a horrible idea. This is completely depending on the CPU, the cache implementation and whatever. These heuristics can never be made correct. They are prone to fail, and in the worst case you end up with a regression on some systems.

Let's start with a simple version because it IS simple and easy to analyze and debug, and then incrementally build improvements on it instead of creating a heuristics monster in the first place, i.e.:

	if (!spin_is_locked(&lr->lock) && try_cmpxchg_once(lr))
		return 0;
	return 1;

Take numbers for this on a zoo of different machines: Intel and AMD, old and new. Then you can add an incremental patch on that, which adds loops and hoops, along with numbers on the same zoo of machines.
[PATCH v5 01/12] spinlock: A new lockref structure for lockless update of refcount
This patch introduces a new set of spinlock_refcount.h header files to be included by kernel code that wants to do a faster lockless update of a reference count protected by a spinlock.

The new lockref structure consists of just the spinlock and the reference count data. Helper functions are defined in the new header file to access the content of the new structure. There is a generic structure defined for all architectures, but each architecture can also optionally define its own structure and use its own helper functions.

Two new config parameters are introduced:

1. SPINLOCK_REFCOUNT
2. ARCH_SPINLOCK_REFCOUNT

The first one is defined in kernel/Kconfig.locks and is used to enable or disable the faster lockless reference count update optimization. The second one has to be defined in each architecture's Kconfig file to enable the optimization for that architecture. Therefore, each architecture has to opt in for this optimization or it won't get it. This allows each architecture plenty of time to test it out before deciding to use it or replace it with a better architecture-specific solution.

This optimization won't work on non-SMP systems or when spinlock debugging is turned on. As a result, it is turned off when either of them is true. It also won't work with full preempt-RT, and so should be turned off in that case.

The current patch allows 3 levels of access to the new lockref structure:

1. The lockless update optimization is turned off (SPINLOCK_REFCOUNT=n).
2. The lockless update optimization is turned on and the generic version is used (SPINLOCK_REFCOUNT=y and ARCH_SPINLOCK_REFCOUNT=y).
3. The lockless update optimization is turned on and the architecture provides its own version.

To maximize the chance of doing a lockless update in the generic version, the inlined __lockref_add_unless() function will wait for a certain amount of time if the lock is not free before trying to do the update.
The new code also attempts the lockless atomic update twice before falling back to the old code path of acquiring a lock before doing the update. This is because there will still be a fair amount of contention with only one attempt.

Signed-off-by: Waiman Long

---
 include/asm-generic/spinlock_refcount.h |   46 ++
 include/linux/spinlock_refcount.h       |  107 ++
 kernel/Kconfig.locks                    |    5 +
 lib/Makefile                            |    2 +
 lib/spinlock_refcount.c                 |  229 +++
 5 files changed, 389 insertions(+), 0 deletions(-)
 create mode 100644 include/asm-generic/spinlock_refcount.h
 create mode 100644 include/linux/spinlock_refcount.h
 create mode 100644 lib/spinlock_refcount.c

diff --git a/include/asm-generic/spinlock_refcount.h b/include/asm-generic/spinlock_refcount.h
new file mode 100644
index 000..469b96b
--- /dev/null
+++ b/include/asm-generic/spinlock_refcount.h
@@ -0,0 +1,46 @@
+/*
+ * Spinlock with reference count combo
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * (c) Copyright 2013 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long
+ */
+#ifndef __ASM_GENERIC_SPINLOCK_REFCOUNT_H
+#define __ASM_GENERIC_SPINLOCK_REFCOUNT_H
+
+/*
+ * The lockref structure defines a combined spinlock with reference count
+ * data structure to be embedded in a larger structure. The combined data
+ * structure is always 8-byte aligned. So proper placement of this structure
+ * in the larger embedding data structure is needed to ensure that there is
+ * no hole in it.
+ */
+struct __aligned(sizeof(u64)) lockref {
+	union {
+		u64 __lock_count;
+		struct {
+			unsigned int	refcnt;	/* Reference count */
+			spinlock_t	lock;
+		};
+	};
+};
+
+/*
+ * Struct lockref helper functions
+ */
+extern void lockref_get(struct lockref *lockcnt);
+extern int lockref_put(struct lockref *lockcnt);
+extern int lockref_get_not_zero(struct lockref *lockcnt);
+extern int lockref_put_or_locked(struct lockref *lockcnt);
+
+#endif /* __ASM_GENERIC_SPINLOCK_REFCOUNT_H */

diff --git a/include/linux/spinlock_refcount.h b/include/linux/spinlock_refcount.h
new file mode 100644
index 000..28b0b89
--- /dev/null
+++ b/include/linux/spinlock_refcount.h
@@ -0,0 +1,107 @@
+/*
+ * Spinlock with reference count combo
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License