Re: [PATCH v5 01/12] spinlock: A new lockref structure for lockless update of refcount

2013-07-15 Thread Waiman Long

On 07/15/2013 07:47 PM, Thomas Gleixner wrote:

On Mon, 15 Jul 2013, Waiman Long wrote:

On 07/15/2013 10:41 AM, Thomas Gleixner wrote:

On Mon, 8 Jul 2013, Waiman Long wrote:
Sigh. GENERIC means, that you use the generic implementation, ARCH
means the architecture has a private implementation of that code.

The generic implementation can use arch specific includes and if there
is none we simply fallback to the asm-generic one.

I used the ARCH+GENERIC to mean using the generic code but with arch specific
include.

And what's the point of that? I just explained it to you that you do
not need the ARCH=y and GENERIC=y at all.


As I said in my previous mail, I can remove the ARCH+GENERIC option.


Let's start with a simple version because it IS simple and easy
to analyze and debug and then incrementally build improvements on it
instead of creating an heuristics monster in the first place, i.e.:

  if (!spin_is_locked(&lr->lock) && try_cmpxchg_once(lr))
     return 0;
  return 1;

Take numbers for this on a zoo of different machines: Intel and AMD,
old and new.

Then you can add an incremental patch on that, which add loops and
hoops. Along with numbers on the same zoo of machines.

I originally tried to do a cmpxchg without waiting and there was
practically no performance gain. I believe that as soon as

Well, you did not see a difference on your particular machine. Still
we don't have an idea how all of this works on a set of different
machines. There is a world beside 8 socket servers.

I understand that. I can live with try_cmpxchg_once, though doing it
twice will give a slightly better performance. However, without

I asked you several times now to explain and document the whole thing
with numbers instead of your handwaving "slightly better performance"
arguments.


I will provide performance data for 1 and 2 retries in my next patch 
version.





waiting for the lock to be free, this patch won't do much good. To
keep it simple, I can remove the ability to do customization while
doing cmpxchg once and wait until the lock is free. Please let me
know if this is acceptable to you.

No, it's not acceptable at all if you are not able to provide data for
1,2,4,8 socket machines (from single core to your precious big
boxes). It's that simple. We are not accepting patches which optimize
for a single use case and might negatively affect 99,% of the
existing users which have no access to this kind of hardware unless
proven otherwise.


I did provide performance data for 1, 2, 4 and 8 socket configurations in
my commit message. I used numactl to simulate different socket
configurations by forcing the code to use only a subset of the total number
of sockets. I know that is not ideal, but I think it should be close
enough. I will provide performance data on a more common 2 socket test
machine that I have.


Yes, I didn't provide data for the single-thread use case. I will also
provide that data in my next version by measuring the average time for
doing a low-level reference count update using the lock and the lockless
update, like what I had done for the qrwlock patch. For the single-thread
case, I don't believe any real workload will show an appreciable difference
in performance due to the differing reference count update mechanisms.



Also what's the approach to tune that? Running some random testbench
and monitoring the results for various settings?

If that's the case we had better have that as variables with generic
initial values in the code, which can be modified by sysctl.

As I said above, I can remove the customization. I may reintroduce user
customization using sysctl as you suggested in a follow-up patch after
this one is merged.

And I asked for a step by step approach in the first review, but you
decided to ignore that. And now you think that it's acceptable for you
as long as you get what you want. That's not the way it works, really.


I am trying to provide what you are asking for while at the same time
meeting my own needs.



+   getnstimeofday(&tv2);
+   ns = (tv2.tv_sec - tv1.tv_sec) * NSEC_PER_SEC +
+        (tv2.tv_nsec - tv1.tv_nsec);
+   pr_info("lockref wait loop time = %lu ns\n", ns);

Using getnstimeofday() for timestamping and spamming the logs with
printouts? You can't be serious about that?


That's what tracepoints are for. Tracing is the only way to get proper
numbers which take preemption, interrupts etc. into account without
hurting runtime performance.

The _SHOW_WAIT_LOOP_TIME is for debugging and instrumentation purposes only
during the development cycle. It is not supposed to be turned on in a production
system. I will document that in the code.

No, no, no! Again: That's what tracepoints are for.

This kind of debugging is completely pointless. Tracepoints are
designed to be low overhead and can be enabled on production
systems.

Your debugging depends on slow timestamps against CLOCK_REALTIME and
an even slower output via printk. How useful is that if you want to
really instrument the behaviour of this code?

Re: [PATCH v5 01/12] spinlock: A new lockref structure for lockless update of refcount

2013-07-15 Thread Thomas Gleixner
On Mon, 15 Jul 2013, Waiman Long wrote:
> On 07/15/2013 10:41 AM, Thomas Gleixner wrote:
> > On Mon, 8 Jul 2013, Waiman Long wrote:
> > Sigh. GENERIC means, that you use the generic implementation, ARCH
> > means the architecture has a private implementation of that code.
> > 
> > The generic implementation can use arch specific includes and if there
> > is none we simply fallback to the asm-generic one.
> 
> I used the ARCH+GENERIC to mean using the generic code but with arch specific
> include.

And what's the point of that? I just explained it to you that you do
not need the ARCH=y and GENERIC=y at all.
 
> > > > Let's start with a simple version because it IS simple and easy
> > > > to analyze and debug and then incrementally build improvements on it
> > > > instead of creating an heuristics monster in the first place, i.e.:
> > > > 
> > > >  if (!spin_is_locked(&lr->lock) && try_cmpxchg_once(lr))
> > > >     return 0;
> > > >  return 1;
> > > > 
> > > > Take numbers for this on a zoo of different machines: Intel and AMD,
> > > > old and new.
> > > > 
> > > > Then you can add an incremental patch on that, which add loops and
> > > > hoops. Along with numbers on the same zoo of machines.
> > > I originally tried to do a cmpxchg without waiting and there was
> > > practically no performance gain. I believe that as soon as
> > Well, you did not see a difference on your particular machine. Still
> > we don't have an idea how all of this works on a set of different
> > machines. There is a world beside 8 socket servers.
> 
> I understand that. I can live with try_cmpxchg_once, though doing it
> twice will give a slightly better performance. However, without

I asked you several times now to explain and document the whole thing
with numbers instead of your handwaving "slightly better performance"
arguments.

I CANNOT verify that at all simply because I have no access to an 8
socket machine. Hint: Ask your boss to send me one :)

> waiting for the lock to be free, this patch won't do much good. To
> keep it simple, I can remove the ability to do customization while
> doing cmpxchg once and wait until the lock is free. Please let me
> know if this is acceptable to you.

No, it's not acceptable at all if you are not able to provide data for
1,2,4,8 socket machines (from single core to your precious big
boxes). It's that simple. We are not accepting patches which optimize
for a single use case and might negatively affect 99,% of the
existing users which have no access to this kind of hardware unless
proven otherwise.

As I told you versus the qrwlocks: You are missing an explanation of WHY it
improves your particular workloads and WHY this change won't affect
other workloads. All I can see from your explanations is: Works better
on 8 socket big iron. I have yet to see any numbers showing that this won't
affect the vast majority of users.

> > Also what's the approach to tune that? Running some random testbench
> > and monitoring the results for various settings?
> > 
> > If that's the case we had better have that as variables with generic
> > initial values in the code, which can be modified by sysctl.
> 
> As I said above, I can remove the customization. I may reintroduce user
> customization using sysctl as you suggested in the a follow up patch after
> this one is merged.

And I asked for a step by step approach in the first review, but you
decided to ignore that. And now you think that it's acceptable for you
as long as you get what you want. That's not the way it works, really.

> > > > > +	getnstimeofday(&tv2);
> > > > > + ns = (tv2.tv_sec - tv1.tv_sec) * NSEC_PER_SEC +
> > > > > +  (tv2.tv_nsec - tv1.tv_nsec);
> > > > > + pr_info("lockref wait loop time = %lu ns\n", ns);
> > > > Using getnstimeofday() for timestamping and spamming the logs with
> > > > printouts? You can't be serious about that?
> > > > 
> > > > That's what tracepoints are for. Tracing is the only way to get proper
> > > > numbers which take preemption, interrupts etc. into account without
> > > > hurting runtime performance.
> > > The _SHOW_WAIT_LOOP_TIME is for debugging and instrumentation purposes only
> > > during the development cycle. It is not supposed to be turned on in a production
> > > system. I will document that in the code.
> > No, no, no! Again: That's what tracepoints are for.
> > 
> > This kind of debugging is completely pointless. Tracepoints are
> > designed to be low overhead and can be enabled on production
> > systems.
> > 
> > Your debugging depends on slow timestamps against CLOCK_REALTIME and
> > an even slower output via printk. How useful is that if you want to
> > really instrument the behaviour of this code?
> 
> This code is not critical and I can certainly remove it.

Did you even try to understand what I wrote? I did not ask you to
remove instrumentation. I asked you to use useful instrumentation
instead of some totally useless crap.

Thanks,

tglx



Re: [PATCH v5 01/12] spinlock: A new lockref structure for lockless update of refcount

2013-07-15 Thread Waiman Long

On 07/15/2013 10:41 AM, Thomas Gleixner wrote:

On Mon, 8 Jul 2013, Waiman Long wrote:

On 07/05/2013 02:59 PM, Thomas Gleixner wrote:

On Fri, 5 Jul 2013, Waiman Long wrote:

I think it is OK to add the GENERIC option, but I would like to make available
a slightly different set of options:
1) Always take the lock
2) Use the generic implementation with the default parameters
3) Use the generic implementation with a customized set of parameters
4) Use an architecture specific implementation.

2) set only GENERIC_SPINLOCK_REFCOUNT
3) set both GENERIC_SPINLOCK_REFCOUNT and ARCH_SPINLOCK_REFCOUNT
4) set only ARCH_SPINLOCK_REFCOUNT

The customized parameters will be set in the "asm/spinlock_refcount.h" file.
Currently, there are 2 parameters that can be customized for each architecture:
1) How much time will the function wait until the lock is free
2) How many attempts to do a lockless cmpxchg to update reference count

Sigh. GENERIC means, that you use the generic implementation, ARCH
means the architecture has a private implementation of that code.

The generic implementation can use arch specific includes and if there
is none we simply fallback to the asm-generic one.


I used the ARCH+GENERIC to mean using the generic code but with arch 
specific include.



Let's start with a simple version because it IS simple and easy
to analyze and debug and then incrementally build improvements on it
instead of creating an heuristics monster in the first place, i.e.:

 if (!spin_is_locked(&lr->lock) && try_cmpxchg_once(lr))
    return 0;
 return 1;

Take numbers for this on a zoo of different machines: Intel and AMD,
old and new.

Then you can add an incremental patch on that, which add loops and
hoops. Along with numbers on the same zoo of machines.

I originally tried to do a cmpxchg without waiting and there was
practically no performance gain. I believe that as soon as

Well, you did not see a difference on your particular machine. Still
we don't have an idea how all of this works on a set of different
machines. There is a world beside 8 socket servers.


I understand that. I can live with try_cmpxchg_once, though doing it 
twice will give a slightly better performance. However, without waiting 
for the lock to be free, this patch won't do much good. To keep it 
simple, I can remove the ability to do customization while doing cmpxchg 
once and wait until the lock is free. Please let me know if this is 
acceptable to you.



contention happens, it will force all the upcoming reference count
update threads to take the lock, eliminating any potential
performance gain that we can have. To make it simple, I can change
the default to wait indefinitely until the lock is free instead of
looping a certain number of times, but I still like the option of
letting each architecture decide how many times they want to
try. I agree that the actual waiting time even for one architecture
depends on the actual CPU chip that is being used. I have done
some experiments on that. As long as the count is large enough,
increasing the loop count doesn't have any significant impact on
performance any more. The main reason for having a finite time is to
avoid having the waiting thread wait virtually forever if the
lock happens to be busy for a long time.

Again, if we make this tunable then we still want documentation for
the behaviour on small, medium and large machines.

Also what's the approach to tune that? Running some random testbench
and monitoring the results for various settings?

If that's the case we had better have that as variables with generic
initial values in the code, which can be modified by sysctl.


As I said above, I can remove the customization. I may reintroduce user
customization using sysctl as you suggested in a follow-up patch
after this one is merged.





+   getnstimeofday(&tv2);
+   ns = (tv2.tv_sec - tv1.tv_sec) * NSEC_PER_SEC +
+        (tv2.tv_nsec - tv1.tv_nsec);
+   pr_info("lockref wait loop time = %lu ns\n", ns);

Using getnstimeofday() for timestamping and spamming the logs with
printouts? You can't be serious about that?

That's what tracepoints are for. Tracing is the only way to get proper
numbers which take preemption, interrupts etc. into account without
hurting runtime performance.

The _SHOW_WAIT_LOOP_TIME is for debugging and instrumentation purposes only
during the development cycle. It is not supposed to be turned on in a production
system. I will document that in the code.

No, no, no! Again: That's what tracepoints are for.

This kind of debugging is completely pointless. Tracepoints are
designed to be low overhead and can be enabled on production
systems.

Your debugging depends on slow timestamps against CLOCK_REALTIME and
an even slower output via printk. How useful is that if you want to
really instrument the behaviour of this code?


This code is not critical and I can certainly remove it.

Regards,
Longman

Re: [PATCH v5 01/12] spinlock: A new lockref structure for lockless update of refcount

2013-07-15 Thread Thomas Gleixner
On Mon, 8 Jul 2013, Waiman Long wrote:
> On 07/05/2013 02:59 PM, Thomas Gleixner wrote:
> > On Fri, 5 Jul 2013, Waiman Long wrote:
> I think it is OK to add the GENERIC option, but I would like to make available
> a slightly different set of options:
> 1) Always take the lock
> 2) Use the generic implementation with the default parameters
> 3) Use the generic implementation with a customized set of parameters
> 4) Use an architecture specific implementation.
> 
> 2) set only GENERIC_SPINLOCK_REFCOUNT
> 3) set both GENERIC_SPINLOCK_REFCOUNT and ARCH_SPINLOCK_REFCOUNT
> 4) set only ARCH_SPINLOCK_REFCOUNT
> 
> The customized parameters will be set in the "asm/spinlock_refcount.h" file.
> Currently ,there is 2 parameters that can be customized for each architecture:
> 1) How much time will the function wait until the lock is free
> 2) How many attempts to do a lockless cmpxchg to update reference count

Sigh. GENERIC means, that you use the generic implementation, ARCH
means the architecture has a private implementation of that code.

The generic implementation can use arch specific includes and if there
is none we simply fallback to the asm-generic one.

> > Let's start with a simple version because it IS simple and easy
> > to analyze and debug and then incrementally build improvements on it
> > instead of creating an heuristics monster in the first place, i.e.:
> > 
> > if (!spin_is_locked(&lr->lock) && try_cmpxchg_once(lr))
> >return 0;
> > return 1;
> > 
> > Take numbers for this on a zoo of different machines: Intel and AMD,
> > old and new.
> > 
> > Then you can add an incremental patch on that, which add loops and
> > hoops. Along with numbers on the same zoo of machines.
> 
> I originally tried to do a cmpxchg without waiting and there was
> practically no performance gain. I believe that as soon as

Well, you did not see a difference on your particular machine. Still
we don't have an idea how all of this works on a set of different
machines. There is a world beside 8 socket servers.

> contention happens, it will force all the upcoming reference count
> update threads to take the lock eliminating any potential
> performance gain that we can have. To make it simple, I can change
> the default to wait indefinitely until the lock is free instead of
> looping a certain number of times, but I still like the option of
> letting each architecture to decide how many times they want to
> try. I agree that the actual waiting time even for one architecture
> is depending on the actual CPU chip that is being used. I have done
> some experiment on that. As long as the count is large enough,
> increasing the loop count doesn't have any significant impact on
> performance any more. The main reason for having a finite time is to
> avoid having the waiting thread to wait virtually forever if the
> lock happens to be busy for a long time.

Again, if we make this tunable then we still want documentation for
the behaviour on small, medium and large machines.

Also what's the approach to tune that? Running some random testbench
and monitoring the results for various settings?

If that's the case we had better have that as variables with generic
initial values in the code, which can be modified by sysctl.


> > > +	getnstimeofday(&tv2);
> > > + ns = (tv2.tv_sec - tv1.tv_sec) * NSEC_PER_SEC +
> > > +  (tv2.tv_nsec - tv1.tv_nsec);
> > > + pr_info("lockref wait loop time = %lu ns\n", ns);
> > Using getnstimeofday() for timestamping and spamming the logs with
> > printouts? You can't be serious about that?
> > 
> > That's what tracepoints are for. Tracing is the only way to get proper
> > numbers which take preemption, interrupts etc. into account without
> > hurting runtime performance.
> 
> The _SHOW_WAIT_LOOP_TIME is for debugging and instrumentation purposes only
> during the development cycle. It is not supposed to be turned on in a production
> system. I will document that in the code.

No, no, no! Again: That's what tracepoints are for.

This kind of debugging is completely pointless. Tracepoints are
designed to be low overhead and can be enabled on production
systems.

Your debugging depends on slow timestamps against CLOCK_REALTIME and
an even slower output via printk. How useful is that if you want to
really instrument the behaviour of this code?
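
For reference, a minimal sketch of what such a tracepoint could look like; the
event name, fields and file placement here are illustrative only and are not
something proposed in this thread:

#undef TRACE_SYSTEM
#define TRACE_SYSTEM lockref

#if !defined(_TRACE_LOCKREF_H) || defined(TRACE_HEADER_MULTI_READ)
#define _TRACE_LOCKREF_H

#include <linux/tracepoint.h>

/*
 * Illustrative event: records how many lockless retries a lockref update
 * needed. It can be enabled at runtime from the tracing events directory
 * and costs next to nothing when disabled.
 */
TRACE_EVENT(lockref_wait,

	TP_PROTO(unsigned int refcnt, int retries),

	TP_ARGS(refcnt, retries),

	TP_STRUCT__entry(
		__field(unsigned int,	refcnt)
		__field(int,		retries)
	),

	TP_fast_assign(
		__entry->refcnt  = refcnt;
		__entry->retries = retries;
	),

	TP_printk("refcnt=%u retries=%d", __entry->refcnt, __entry->retries)
);

#endif /* _TRACE_LOCKREF_H */

#include <trace/define_trace.h>

The wait loop would then call trace_lockref_wait() at the point of interest
instead of using getnstimeofday() and pr_info().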

Thanks,

tglx

Re: [PATCH v5 01/12] spinlock: A new lockref structure for lockless update of refcount

2013-07-08 Thread Waiman Long

On 07/05/2013 02:59 PM, Thomas Gleixner wrote:

On Fri, 5 Jul 2013, Waiman Long wrote:

+ * If the spinlock & reference count optimization feature is disabled,
+ * the spinlock and reference count are accessed separately on its own.
+ */
+struct lockref {
+   unsigned int refcnt;/* Reference count */
+   spinlock_t   lock;
+};
+
+/*
+ * Struct lockref helper functions
+ */
+/*

Function documentation starts with /**


I will fix that.


+ * lockref_get - increments reference count while not locked

This should be: Increments reference count unconditionally.


Will change that.


+ * @lockcnt: pointer to lockref structure
+ */
+static __always_inline void
+lockref_get(struct lockref *lockcnt)

Please avoid these line breaks if the line fits in 80 chars.


Will make the change.


+{
+   spin_lock(&lockcnt->lock);
+   lockcnt->refcnt++;
+   spin_unlock(&lockcnt->lock);
+}
+/*
+ * lockref_put_or_locked - decrements count unless count <= 1 before decrement
+ *                         otherwise the lock will be taken

   lockref_put_or_lock please

Docbook cannot work with multiline comments for the function name.

So make that shorter and add a longer explanation below the @argument
docs.


Will fix that.


+ * @lockcnt: pointer to lockref structure
+ * Return: 1 if count updated successfully or 0 if count <= 1 and lock taken
+ */
+static __always_inline int
+lockref_put_or_locked(struct lockref *lockcnt)
+{
+   spin_lock(&lockcnt->lock);
+   if (likely(lockcnt->refcnt > 1)) {
+   lockcnt->refcnt--;
+   spin_unlock(&lockcnt->lock);
+   return 1;
+   }
+   return 0;   /* Count is 1 & lock taken */

Please no tail comments. They are horrible to parse and it's obvious
from the code.


Will remove the tail comments.


+}
+
+#endif /* CONFIG_SPINLOCK_REFCOUNT */
+#endif /* __LINUX_SPINLOCK_REFCOUNT_H */
diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index 44511d1..d1f8670 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -223,3 +223,8 @@ endif
  config MUTEX_SPIN_ON_OWNER
def_bool y
depends on SMP && !DEBUG_MUTEXES
+
+config SPINLOCK_REFCOUNT
+   def_bool y
+   depends on ARCH_SPINLOCK_REFCOUNT && SMP

This looks wrong. We want three options:

1) Always take the lock
2) Use the generic implementation
3) Use an architecture specific implementation

So we want

config GENERIC_SPINLOCK_REFCOUNT
  bool

config ARCH_SPINLOCK_REFCOUNT
  bool

config SPINLOCK_REFCOUNT
  def_bool y
  depends on GENERIC_SPINLOCK_REFCOUNT || ARCH_SPINLOCK_REFCOUNT
  depends on .

So an architecture can select GENERIC_SPINLOCK_REFCOUNT to get the
generic implementation or ARCH_SPINLOCK_REFCOUNT if it provides its
own special version.

And lib/spinlock_refcount.o depends on GENERIC_SPINLOCK_REFCOUNT


I think it is OK to add the GENERIC option, but I would like to make 
available a slightly different set of options:

1) Always take the lock
2) Use the generic implementation with the default parameters
3) Use the generic implementation with a customized set of parameters
4) Use an architecture specific implementation.

2) set only GENERIC_SPINLOCK_REFCOUNT
3) set both GENERIC_SPINLOCK_REFCOUNT and ARCH_SPINLOCK_REFCOUNT
4) set only ARCH_SPINLOCK_REFCOUNT

The customized parameters will be set in the "asm/spinlock_refcount.h"
file (a rough sketch of such overrides follows the list below). Currently,
there are 2 parameters that can be customized for each architecture:

1) How much time will the function wait until the lock is free
2) How many attempts to do a lockless cmpxchg to update reference count
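
A minimal sketch of how such per-architecture overrides might look in
asm/spinlock_refcount.h; the macro names and values are illustrative only,
not the ones actually used in the patch:

/*
 * Illustrative per-arch overrides only; the generic code would fall back
 * to these defaults when the architecture header does not define them.
 */
#ifndef LOCKREF_SPIN_WAIT_MAX
#define LOCKREF_SPIN_WAIT_MAX	4096	/* iterations to wait for a free lock */
#endif

#ifndef LOCKREF_RETRY_COUNT
#define LOCKREF_RETRY_COUNT	2	/* lockless cmpxchg attempts */
#endif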


+/*
+ * The maximum number of spins to wait while the lock is busy before
+ * trying to attempt a lockless update. The purpose of the timeout is to
+ * limit the amount of unfairness to the thread that is doing the reference
+ * count update. Otherwise, it is theoretically possible for that thread to
+ * be starved for a really long time causing all kinds of problems. If it
+ * times out and the lock is still not free, the code will fall back to the
+ * traditional way of queuing up to acquire the lock before updating the count.
+ *
+ * The actual time spent in the wait loop very much depends on the CPU being
+ * used. On a 2.4GHz Westmere CPU, the execution time of a PAUSE instruction
+ * (cpu_relax) in a 4k loop is about 16us. The lock checking and looping
+ * overhead is about 8us. With 4 cpu_relax's in the loop, it will wait
+ * about 72us before the count reaches 0. With cacheline contention, the
+ * wait time can go up to 3x as much (about 210us). Having multiple
+ * cpu_relax's in the wait loop does seem to reduce cacheline contention
+ * somewhat and give slightly better performance.
+ *
+ * The preset timeout value is rather arbitrary and really depends on the CPU
+ * being used. If customization is needed, we could add a config variable for
+ * that. The exact value is not that important. A longer wait time will
+ * increase the chance of doing a lockless update, but it results in more
+ * unfairness to the waiting thread and vice versa.
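
A rough sketch of the bounded wait-and-retry scheme described in the comment
above, assuming the union-based struct lockref from the patch's asm-generic
header and the illustrative LOCKREF_* macros sketched earlier; the function
name and details are illustrative, not the actual code from this series:

#include <linux/spinlock.h>
#include <linux/spinlock_refcount.h>

static __always_inline int lockref_get_lockless_sketch(struct lockref *lr)
{
	struct lockref old, new;
	int retry, spins;

	for (retry = 0; retry < LOCKREF_RETRY_COUNT; retry++) {
		/* Bounded wait for the embedded spinlock to become free */
		for (spins = 0; spin_is_locked(&lr->lock); spins++) {
			if (spins >= LOCKREF_SPIN_WAIT_MAX)
				return 0;	/* timed out, caller takes the lock */
			cpu_relax();
		}

		/* Snapshot the combined 64-bit lock + count word */
		old.__lock_count = ACCESS_ONCE(lr->__lock_count);
		if (spin_is_locked(&old.lock))
			continue;		/* lock was grabbed again, retry */

		new.__lock_count = old.__lock_count;
		new.refcnt++;
		/*
		 * The cmpxchg succeeds only if the whole word is unchanged,
		 * i.e. the lock is still free and the count was not modified
		 * concurrently.
		 */
		if (cmpxchg64(&lr->__lock_count, old.__lock_count,
			      new.__lock_count) == old.__lock_count)
			return 1;		/* lockless update done */
	}
	return 0;	/* too much contention, caller falls back to the lock */
}

When this returns 0, the caller would take lr->lock and update the count the
traditional way.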


Re: [PATCH v5 01/12] spinlock: A new lockref structure for lockless update of refcount

2013-07-05 Thread Thomas Gleixner
On Fri, 5 Jul 2013, Waiman Long wrote:
> + * If the spinlock & reference count optimization feature is disabled,
> + * the spinlock and reference count are accessed separately on its own.
> + */
> +struct lockref {
> + unsigned int refcnt;/* Reference count */
> + spinlock_t   lock;
> +};
> +
> +/*
> + * Struct lockref helper functions
> + */
> +/*

Function documentation starts with /**

> + * lockref_get - increments reference count while not locked

This should be: Increments reference count unconditionally.

> + * @lockcnt: pointer to lockref structure
> + */
> +static __always_inline void
> +lockref_get(struct lockref *lockcnt)

Please avoid these line breaks if the line fits in 80 chars.

> +{
> +	spin_lock(&lockcnt->lock);
> +	lockcnt->refcnt++;
> +	spin_unlock(&lockcnt->lock);
> +}

> +/*
> + * lockref_put_or_locked - decrements count unless count <= 1 before 
> decrement
> + *  otherwise the lock will be taken

  lockref_put_or_lock please

Docbook cannot work with multiline comments for the function name.

So make that shorter and add a longer explanation below the @argument
docs.

> + * @lockcnt: pointer to lockref structure
> + * Return: 1 if count updated successfully or 0 if count <= 1 and lock taken
> + */
> +static __always_inline int
> +lockref_put_or_locked(struct lockref *lockcnt)
> +{
> +	spin_lock(&lockcnt->lock);
> +	if (likely(lockcnt->refcnt > 1)) {
> +		lockcnt->refcnt--;
> +		spin_unlock(&lockcnt->lock);
> + return 1;
> + }
> + return 0;   /* Count is 1 & lock taken */

Please no tail comments. They are horrible to parse and it's obvious
from the code.

> +}
> +
> +#endif /* CONFIG_SPINLOCK_REFCOUNT */
> +#endif /* __LINUX_SPINLOCK_REFCOUNT_H */
> diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
> index 44511d1..d1f8670 100644
> --- a/kernel/Kconfig.locks
> +++ b/kernel/Kconfig.locks
> @@ -223,3 +223,8 @@ endif
>  config MUTEX_SPIN_ON_OWNER
>   def_bool y
>   depends on SMP && !DEBUG_MUTEXES
> +
> +config SPINLOCK_REFCOUNT
> + def_bool y
> + depends on ARCH_SPINLOCK_REFCOUNT && SMP

This looks wrong. We want three options:

1) Always take the lock
2) Use the generic implementation
3) Use an architecture specific implementation

So we want

   config GENERIC_SPINLOCK_REFCOUNT
  bool

   config ARCH_SPINLOCK_REFCOUNT
  bool

   config SPINLOCK_REFCOUNT
  def_bool y
  depends on GENERIC_SPINLOCK_REFCOUNT || ARCH_SPINLOCK_REFCOUNT
  depends on .
  
So an architecture can select GENERIC_SPINLOCK_REFCOUNT to get the
generic implementation or ARCH_SPINLOCK_REFCOUNT if it provides its
own special version.

And lib/spinlock_refcount.o depends on GENERIC_SPINLOCK_REFCOUNT

> +/*
> + * The maximum number of waiting spins when the lock was acquiring before
> + * trying to attempt a lockless update. The purpose of the timeout is to
> + * limit the amount of unfairness to the thread that is doing the reference
> + * count update. Otherwise, it is theoretically possible for that thread to
> + * be starved for a really long time causing all kind of problems. If it
> + * times out and the lock is still not free, the code will fall back to the
> + * traditional way of queuing up to acquire the lock before updating the 
> count.
> + *
> + * The actual time spent in the wait loop very much depends on the CPU being
> + * used. On a 2.4GHz Westmere CPU, the execution time of a PAUSE instruction
> + * (cpu_relax) in a 4k loop is about 16us. The lock checking and looping
> + * overhead is about 8us. With 4 cpu_relax's in the loop, it will wait
> + * about 72us before the count reaches 0. With cacheline contention, the
> + * wait time can go up to 3x as much (about 210us). Having multiple
> + * cpu_relax's in the wait loop does seem to reduce cacheline contention
> + * somewhat and give slightly better performance.
> + *
> + * The preset timeout value is rather arbitrary and really depends on the CPU
> + * being used. If customization is needed, we could add a config variable for
> + * that. The exact value is not that important. A longer wait time will
> + * increase the chance of doing a lockless update, but it results in more
> + * unfairness to the waiting thread and vice versa.

As I told you before it's a horrible idea.

This depends completely on the CPU, the cache implementation and
whatever. These heuristics can never be made correct. They are prone
to fail and in the worst case you end up with a regression on some
systems.

Let's start with a simple version because it IS simple and easy
to analyze and debug and then incrementally build improvements on it
instead of creating an heuristics monster in the first place, i.e.:

   if (!spin_is_locked(&lr->lock) && try_cmpxchg_once(lr))
  return 0;
   return 1;

Take numbers for this on a zoo of different machines: Intel and AMD,
old and new.

Then you can add an incremental patch on that, which adds loops and
hoops. Along with numbers on the same zoo of machines.
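
A sketch of what that simple version could look like on top of the combined
64-bit lock/count word from the patch; try_cmpxchg_once() is spelled out here
purely for illustration and is not code from the series:

static inline int try_cmpxchg_once(struct lockref *lr)
{
	struct lockref old, new;

	old.__lock_count = ACCESS_ONCE(lr->__lock_count);
	if (spin_is_locked(&old.lock))
		return 0;			/* lock is held, give up */
	new.__lock_count = old.__lock_count;
	new.refcnt++;
	/* One attempt only: succeed iff nothing changed in the meantime */
	return cmpxchg64(&lr->__lock_count, old.__lock_count,
			 new.__lock_count) == old.__lock_count;
}

static inline int lockref_get_simple(struct lockref *lr)
{
	if (!spin_is_locked(&lr->lock) && try_cmpxchg_once(lr))
		return 0;	/* lockless fast path succeeded */
	return 1;		/* caller falls back to spin_lock() */
}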

[PATCH v5 01/12] spinlock: A new lockref structure for lockless update of refcount

2013-07-05 Thread Waiman Long
This patch introduces a new set of spinlock_refcount.h header files to
be included by kernel codes that want to do a faster lockless update
of reference count protected by a spinlock.

The new lockref structure consists of just the spinlock and the
reference count data. Helper functions are defined in the new
<linux/spinlock_refcount.h> header file to access the content of
the new structure. There is a generic structure defined for all
architectures, but each architecture can also optionally define its
own structure and use its own helper functions.

Two new config parameters are introduced:
1. SPINLOCK_REFCOUNT
2. ARCH_SPINLOCK_REFCOUNT

The first one is defined in the kernel/Kconfig.locks which is used
to enable or disable the faster lockless reference count update
optimization. The second one has to be defined in each of the
architecture's Kconfig file to enable the optimization for that
architecture. Therefore, each architecture has to opt-in for this
optimization or it won't get it. This allows each architecture plenty
of time to test it out before deciding to use it or replace it with
a better architecture specific solution.

This optimization won't work on a non-SMP system or when spinlock
debugging is turned on. As a result, it is turned off if either of
them is true. It also won't work for full preempt-RT and so should
not be turned on in that case.

The current patch allows 3 levels of access to the new lockref
structure:

1. The lockless update optimization is turned off (SPINLOCK_REFCOUNT=n).
2. The lockless update optimization is turned on and the generic version
   is used (SPINLOCK_REFCOUNT=y and ARCH_SPINLOCK_REFCOUNT=y).
3. The lockless update optimization is turned on and the architecture
   provides its own version.

To maximize the chance of doing a lockless update in the generic version,
the inlined __lockref_add_unless() function will wait for a certain
amount of time for the lock to become free before trying to do the update.

The new code also attempts the lockless atomic update twice before
falling back to the old code path of acquiring the lock before doing
the update. This is because there will still be a fair amount of
contention with only one attempt.

Signed-off-by: Waiman Long 
---
 include/asm-generic/spinlock_refcount.h |   46 ++
 include/linux/spinlock_refcount.h   |  107 ++
 kernel/Kconfig.locks|5 +
 lib/Makefile|2 +
 lib/spinlock_refcount.c |  229 +++
 5 files changed, 389 insertions(+), 0 deletions(-)
 create mode 100644 include/asm-generic/spinlock_refcount.h
 create mode 100644 include/linux/spinlock_refcount.h
 create mode 100644 lib/spinlock_refcount.c

diff --git a/include/asm-generic/spinlock_refcount.h b/include/asm-generic/spinlock_refcount.h
new file mode 100644
index 000..469b96b
--- /dev/null
+++ b/include/asm-generic/spinlock_refcount.h
@@ -0,0 +1,46 @@
+/*
+ * Spinlock with reference count combo
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (c) Copyright 2013 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long waiman.l...@hp.com
+ */
+#ifndef __ASM_GENERIC_SPINLOCK_REFCOUNT_H
+#define __ASM_GENERIC_SPINLOCK_REFCOUNT_H
+
+/*
+ * The lockref structure defines a combined spinlock with reference count
+ * data structure to be embedded in a larger structure. The combined data
+ * structure is always 8-byte aligned. So proper placement of this structure
+ * in the larger embedding data structure is needed to ensure that there is
+ * no hole in it.
+ */
+struct __aligned(sizeof(u64)) lockref {
+   union {
+   u64 __lock_count;
+   struct {
+   unsigned int    refcnt; /* Reference count */
+   spinlock_t  lock;
+   };
+   };
+};
+
+/*
+ * Struct lockref helper functions
+ */
+extern void lockref_get(struct lockref *lockcnt);
+extern int  lockref_put(struct lockref *lockcnt);
+extern int  lockref_get_not_zero(struct lockref *lockcnt);
+extern int  lockref_put_or_locked(struct lockref *lockcnt);
+
+#endif /* __ASM_GENERIC_SPINLOCK_REFCOUNT_H */
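
(Editorial aside, not part of the patch: a minimal usage sketch of the
helpers declared above on a hypothetical embedding object. The structure,
field names and free_myobj() are made up for illustration.)

struct myobj {
        struct lockref  ref;    /* combined lock + reference count */
        /* ... other fields protected by ref.lock ... */
};

extern void free_myobj(struct myobj *obj);      /* hypothetical destructor */

static void myobj_get(struct myobj *obj)
{
        lockref_get(&obj->ref);
}

static void myobj_put(struct myobj *obj)
{
        /* fast path: lockless decrement while the count is > 1 */
        if (lockref_put_or_locked(&obj->ref))
                return;
        /* slow path: the count was 1 and ref.lock is now held */
        spin_unlock(&obj->ref.lock);
        free_myobj(obj);
}
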
diff --git a/include/linux/spinlock_refcount.h b/include/linux/spinlock_refcount.h
new file mode 100644
index 000..28b0b89
--- /dev/null
+++ b/include/linux/spinlock_refcount.h
@@ -0,0 +1,107 @@
+/*
+ * Spinlock with reference count combo
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License 

Re: [PATCH v5 01/12] spinlock: A new lockref structure for lockless update of refcount

2013-07-05 Thread Thomas Gleixner
On Fri, 5 Jul 2013, Waiman Long wrote:
 + * If the spinlock & reference count optimization feature is disabled,
 + * the spinlock and reference count are accessed separately on its own.
 + */
 +struct lockref {
 + unsigned int refcnt; /* Reference count */
 + spinlock_t   lock;
 +};
 +
 +/*
 + * Struct lockref helper functions
 + */
 +/*

Function documentation starts with /**

 + * lockref_get - increments reference count while not locked

This should be: Increments reference count unconditionally.

 + * @lockcnt: pointer to lockref structure
 + */
 +static __always_inline void
 +lockref_get(struct lockref *lockcnt)

Please avoid these line breaks if the line fits in 80 chars.
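
(For reference, joined onto a single line the quoted declaration would read:)

static __always_inline void lockref_get(struct lockref *lockcnt)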

 +{
 + spin_lock(&lockcnt->lock);
 + lockcnt->refcnt++;
 + spin_unlock(&lockcnt->lock);
 +}

 +/*
 + * lockref_put_or_locked - decrements count unless count <= 1 before decrement
 + *  otherwise the lock will be taken

  lockref_put_or_lock please

Docbook cannot work with multiline comments for the function name.

So make that shorter and add a longer explanation below the @argument
docs.
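
(For illustration, a kernel-doc comment of the shape being asked for might
look like the following; the wording is only a sketch.)

/**
 * lockref_put_or_lock - decrement the refcount unless it would reach zero
 * @lockcnt: pointer to lockref structure
 *
 * Decrements the reference count unless it is <= 1 before the decrement.
 * In that case the count is left untouched and the embedded spinlock is
 * taken instead, so the caller can decide what to do with the last
 * reference while holding the lock.
 *
 * Return: 1 if the count was decremented, 0 if the lock was taken.
 */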

 + * @lockcnt: pointer to lockref structure
 + * Return: 1 if count updated successfully or 0 if count <= 1 and lock taken
 + */
 +static __always_inline int
 +lockref_put_or_locked(struct lockref *lockcnt)
 +{
 + spin_lock(&lockcnt->lock);
 + if (likely(lockcnt->refcnt > 1)) {
 + lockcnt->refcnt--;
 + spin_unlock(&lockcnt->lock);
 + return 1;
 + }
 + return 0;   /* Count is 1 & lock taken */

Please no tail comments. They are horrible to parse and it's obvious
from the code.
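
(I.e. the same information hoisted above the statement instead of trailing
it, for example:)

        /* refcnt was 1, so the lock has been taken instead */
        return 0;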

 +}
 +
 +#endif /* CONFIG_SPINLOCK_REFCOUNT */
 +#endif /* __LINUX_SPINLOCK_REFCOUNT_H */
 diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
 index 44511d1..d1f8670 100644
 --- a/kernel/Kconfig.locks
 +++ b/kernel/Kconfig.locks
 @@ -223,3 +223,8 @@ endif
  config MUTEX_SPIN_ON_OWNER
   def_bool y
  depends on SMP && !DEBUG_MUTEXES
 +
 +config SPINLOCK_REFCOUNT
 + def_bool y
 + depends on ARCH_SPINLOCK_REFCOUNT && SMP

This looks wrong. We want three options:

1) Always take the lock
2) Use the generic implementation
3) Use an architecture specific implementation

So we want

   config GENERIC_SPINLOCK_REFCOUNT
  bool

   config ARCH_SPINLOCK_REFCOUNT
  bool

   config SPINLOCK_REFCOUNT
  def_bool y
  depends on GENERIC_SPINLOCK_REFCOUNT || ARCH_SPINLOCK_REFCOUNT
  depends on .
  
So an architecture can select GENERIC_SPINLOCK_REFCOUNT to get the
generic implementation or ARCH_SPINLOCK_REFCOUNT if it provides its
own special version.

And lib/spinlock_refcount.o depends on GENERIC_SPINLOCK_REFCOUNT

 +/*
 + * The maximum number of spins to wait for the lock to be released before
 + * attempting a lockless update. The purpose of the timeout is to
 + * limit the amount of unfairness to the thread that is doing the reference
 + * count update. Otherwise, it is theoretically possible for that thread to
 + * be starved for a really long time causing all kind of problems. If it
 + * times out and the lock is still not free, the code will fall back to the
 + * traditional way of queuing up to acquire the lock before updating the count.
 + *
 + * The actual time spent in the wait loop very much depends on the CPU being
 + * used. On a 2.4GHz Westmere CPU, the execution time of a PAUSE instruction
 + * (cpu_relax) in a 4k loop is about 16us. The lock checking and looping
 + * overhead is about 8us. With 4 cpu_relax's in the loop, it will wait
 + * about 72us before the count reaches 0. With cacheline contention, the
 + * wait time can go up to 3x as much (about 210us). Having multiple
 + * cpu_relax's in the wait loop does seem to reduce cacheline contention
 + * somewhat and give slightly better performance.
 + *
 + * The preset timeout value is rather arbitrary and really depends on the CPU
 + * being used. If customization is needed, we could add a config variable for
 + * that. The exact value is not that important. A longer wait time will
 + * increase the chance of doing a lockless update, but it results in more
 + * unfairness to the waiting thread and vice versa.

As I told you before it's a horrible idea.

This is completely depending on the CPU, the cache implementation and
whatever. These heuristics can never be made correct. They are prone
to fail and in the worst case you end up with a regression on some
systems.

Let's start with a simple version because it IS simple and easy
to analyze and debug and then incrementally build improvements on it
instead of creating an heuristics monster in the first place, i.e.:

   if (!spin_is_locked(&lr->lock) && try_cmpxchg_once(lr))
  return 0;
   return 1;

Take numbers for this on a zoo of different machines: Intel and AMD,
old and new.

Then you can add an incremental patch on that, which add loops and
hoops. Along with numbers on the same zoo of machines.