Re: sem_otime trashing

2013-06-01 Thread Mike Galbraith
On Sat, 2013-06-01 at 21:02 +0200, Manfred Spraul wrote: 
> Hi Rik,
> 
> I finally managed to get EFI boot, i.e. I'm now able to test on my i3 
> (2core+HT).
> 
> With semscale (i.e.: just overhead, perform semop=0 operations), the 
> scalability from 1 to 2 cores is good, but not linear:
> # semscale 10 | grep "interleave 2"
> > Cpus 1, interleave 2 delay 0: 35502103 in 10 secs
> > Cpus 2, interleave 2 delay 0: 53990954 in 10 secs
> ---
>   +53% when adding the 2nd core
> (interleave 2 to force to use different cores)
> 
> Did you consider moving sem_otime into the individual semaphores?
> I did that (gross patch attached), and the performance is significantly 
> better:
> 
> # semscale 10 | grep "interleave 2"
> Cpus 1, interleave 2 delay 0: 35585634 in 10 secs
> Cpus 2, interleave 2 delay 0: 70410230 in 10 secs
>   ---
>  +99% scalability when adding the 2nd core
> 
> Unfortunately I won't be able to read my mails next week, but the effect 
> was too significant not to share it immediately.

64 core box.

Previous numbers: 
vogelweide:/abuild/mike/:[0]# uname -r
3.8.13-rt9-rtm
vogelweide:/abuild/mike/:[0]# ./semop-multi 256 64
cpus 64, threads: 256, semaphores: 64, test duration: 30 secs
total operations: 33553800, ops/sec 1118460

New numbers:
vogelweide:/abuild/mike/:[0]# !./semop-multi
./semop-multi 256 64
cpus 64, threads: 256, semaphores: 64, test duration: 30 secs
total operations: 129474934, ops/sec 4315831

But, the box RCU-stalled on me.  It looks like the scalability patches
are a bit racy RCU-wise in an -rt kernel (oh dear).  So I rebuilt as plain
old PREEMPT to eliminate the -rt funnies.



Previous numbers: 
vogelweide:/abuild/mike/:[0]# ./semop-multi 256 64
cpus 64, threads: 256, semaphores: 64, test duration: 30 secs
total operations: 22053968, ops/sec 735132

vogelweide:/abuild/mike/:[0]# ./osim 64 256 100 0 0
osim <sems> <tasks> <loops> <busy-in> <busy-out>
osim: using a semaphore array with 64 semaphores.
osim: using 256 tasks.
osim: each thread loops 3907 times
osim: each thread busyloops 0 loops outside and 0 loops inside.
total execution time: 1.858765 seconds for 1000192 loops
per loop execution time: 1.858 usec

New numbers:
vogelweide:/abuild/mike/:[0]# !./semop
./semop-multi 256 64
cpus 64, threads: 256, semaphores: 64, test duration: 30 secs
total operations: 45521478, ops/sec 1517382
vogelweide:/abuild/mike/:[0]# !./osim
./osim 64 256 100 0 0
osim <sems> <tasks> <loops> <busy-in> <busy-out>
osim: using a semaphore array with 64 semaphores.
osim: using 256 tasks.
osim: each thread loops 3907 times
osim: each thread busyloops 0 loops outside and 0 loops inside.
total execution time: 0.350682 seconds for 1000192 loops
per loop execution time: 0.350 usec

(1.8->0.3?.. box, you ain't a race horse, you're a plow horse)

vogelweide:/abuild/mike/:[0]# ./osim 64 256 100 0 0
osim <sems> <tasks> <loops> <busy-in> <busy-out>
osim: using a semaphore array with 64 semaphores.
osim: using 256 tasks.
osim: each thread loops 3907 times
osim: each thread busyloops 0 loops outside and 0 loops inside.
total execution time: 0.276405 seconds for 1000192 loops
per loop execution time: 0.276 usec
vogelweide:/abuild/mike/:[0]# ./osim 64 256 100 0 0
osim <sems> <tasks> <loops> <busy-in> <busy-out>
osim: using a semaphore array with 64 semaphores.
osim: using 256 tasks.
osim: each thread loops 3907 times
osim: each thread busyloops 0 loops outside and 0 loops inside.
total execution time: 0.370041 seconds for 1000192 loops
per loop execution time: 0.369 usec
vogelweide:/abuild/mike/:[0]# ./osim 64 256 100 0 0
osim <sems> <tasks> <loops> <busy-in> <busy-out>
osim: using a semaphore array with 64 semaphores.
osim: using 256 tasks.
osim: each thread loops 3907 times
osim: each thread busyloops 0 loops outside and 0 loops inside.
total execution time: 0.502396 seconds for 1000192 loops
per loop execution time: 0.502 usec

(runtime)

vogelweide:/abuild/mike/:[0]# ./osim 64 256 10000000 0 0
osim <sems> <tasks> <loops> <busy-in> <busy-out>
osim: using a semaphore array with 64 semaphores.
osim: using 256 tasks.
osim: each thread loops 39063 times
osim: each thread busyloops 0 loops outside and 0 loops inside.
total execution time: 3.354423 seconds for 10000128 loops
per loop execution time: 0.335 usec
vogelweide:/abuild/mike/:[0]# ./osim 64 256 100000000 0 0
osim <sems> <tasks> <loops> <busy-in> <busy-out>
osim: using a semaphore array with 64 semaphores.
osim: using 256 tasks.
osim: each thread loops 390625 times
osim: each thread busyloops 0 loops outside and 0 loops inside.
total execution time: 41.180479 seconds for 100000000 loops
per loop execution time: 0.411 usec

Box likes your idea.



sem_otime trashing

2013-06-01 Thread Manfred Spraul

Hi Rik,

I finally managed to get EFI boot, i.e. I'm now able to test on my i3 
(2core+HT).


With semscale (i.e.: just overhead, perform semop=0 operations), the 
scalability from 1 to 2 cores is good, but not linear:

# semscale 10 | grep "interleave 2"

Cpus 1, interleave 2 delay 0: 35502103 in 10 secs
Cpus 2, interleave 2 delay 0: 53990954 in 10 secs

---
 +53% when adding the 2nd core
(interleave 2 to force to use different cores)
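
For reference, "semop=0 operations" here means semop() calls with sem_op = 0
("wait for zero"): on a zero-valued semaphore they complete immediately and
modify nothing, so the loop measures pure semop() overhead.  A minimal sketch
in that spirit is below; it is an illustration only, not the actual semscale
source, and the thread count, CPU pinning and reporting are simplified:

/*
 * sketch.c - illustration only, not the actual semscale source.
 * Each thread hammers its own semaphore with sem_op = 0 ("wait for
 * zero") operations; on a zero-valued semaphore these return
 * immediately without modifying anything, so the count is pure
 * semop() overhead.  Build: gcc -O2 -pthread sketch.c
 */
#include <stdio.h>
#include <pthread.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

#define NTHREADS 2		/* one semaphore per thread */
#define DURATION 10		/* seconds, as in "semscale 10" */

static int semid;
static volatile int stop;

static void *worker(void *arg)
{
	struct sembuf op = {
		.sem_num = (unsigned short)(long)arg,	/* this thread's semaphore */
		.sem_op  = 0,				/* wait for zero */
		.sem_flg = 0,
	};
	unsigned long ops = 0;

	while (!stop) {
		if (semop(semid, &op, 1) < 0) {
			perror("semop");
			break;
		}
		ops++;
	}
	return (void *)ops;
}

int main(void)
{
	pthread_t tid[NTHREADS];
	unsigned long total = 0;
	long i;

	/* Newly created SysV semaphores start with semval == 0 on Linux. */
	semid = semget(IPC_PRIVATE, NTHREADS, 0600 | IPC_CREAT);
	if (semid < 0) {
		perror("semget");
		return 1;
	}
	for (i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, worker, (void *)i);
	sleep(DURATION);
	stop = 1;
	for (i = 0; i < NTHREADS; i++) {
		void *ret;
		pthread_join(tid[i], &ret);
		total += (unsigned long)ret;
	}
	printf("Cpus %d: %lu ops in %d secs\n", NTHREADS, total, DURATION);
	semctl(semid, 0, IPC_RMID);
	return 0;
}
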

Did you consider moving sem_otime into the individual semaphores?
I did that (gross patch attached), and the performance is significantly 
better:


# semscale 10 | grep "interleave 2"
Cpus 1, interleave 2 delay 0: 35585634 in 10 secs
Cpus 2, interleave 2 delay 0: 70410230 in 10 secs
 ---
+99% scalability when adding the 2nd core

Unfortunately I won't be able to read my mails next week, but the effect 
was too significant not to share it immediately.


--
Manfred
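
What the numbers point at is cache-line contention: when operations complete,
do_smart_update() stamps sma->sem_otime (visible in the hunk below), so under
load cores keep writing the same struct sem_array cache line and it bounces
between them.  The sketch below is a simplified userspace analogue of the fix,
not kernel code (the names and the 64-byte line size are assumptions): each
semaphore keeps its own padded timestamp slot, and the array-wide value is
reconstructed as the maximum, just like get_semotime() in the patch.

/*
 * false_sharing_sketch.c - simplified userspace analogue of the idea,
 * not kernel code; names and the 64-byte line size are assumptions.
 * Per-slot timestamps live on their own cache lines, and the
 * array-wide "last operation time" is derived as their maximum.
 */
#include <stdio.h>
#include <time.h>

#define CACHE_LINE 64
#define NSEMS      4

struct fake_sem {
	time_t otime;			/* per-semaphore last-op time */
	char pad[CACHE_LINE - sizeof(time_t)];
} __attribute__((aligned(CACHE_LINE)));

static struct fake_sem sems[NSEMS];

/* An operation touches only the slot it acted on, never a shared field. */
static void record_op(int semnum)
{
	sems[semnum].otime = time(NULL);
}

/* Array-wide otime, reconstructed on demand (the get_semotime() trick). */
static time_t array_otime(void)
{
	time_t res = sems[0].otime;
	int i;

	for (i = 1; i < NSEMS; i++)
		if (sems[i].otime > res)
			res = sems[i].otime;
	return res;
}

int main(void)
{
	record_op(2);
	printf("array otime: %ld\n", (long)array_otime());
	return 0;
}
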
diff --git a/Makefile b/Makefile
index 73e20db..42137ab 100644
--- a/Makefile
+++ b/Makefile
@@ -1,7 +1,7 @@
 VERSION = 3
 PATCHLEVEL = 10
 SUBLEVEL = 0
-EXTRAVERSION = -rc3
+EXTRAVERSION = -rc3-otime
 NAME = Unicycling Gorilla
 
 # *DOCUMENTATION*
diff --git a/include/linux/sem.h b/include/linux/sem.h
index 55e17f6..976ce3a 100644
--- a/include/linux/sem.h
+++ b/include/linux/sem.h
@@ -12,7 +12,6 @@ struct task_struct;
 struct sem_array {
 struct kern_ipc_perm   ____cacheline_aligned_in_smp
sem_perm;   /* permissions .. see ipc.h */
-   time_t  sem_otime;  /* last semop time */
time_t  sem_ctime;  /* last change time */
 struct sem  *sem_base;  /* ptr to first semaphore in array */
struct list_headpending_alter;  /* pending operations */
diff --git a/ipc/sem.c b/ipc/sem.c
index 1dbb2fa..e5f000f 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -92,6 +92,7 @@
 
 /* One semaphore structure for each semaphore in the system. */
 struct sem {
+   char    filler[64];
int semval; /* current value */
int sempid; /* pid of last operation */
spinlock_t  lock;   /* spinlock for fine-grained semtimedop */
@@ -99,7 +100,8 @@ struct sem {
/* that alter the semaphore */
struct list_head pending_const; /* pending single-sop operations */
/* that do not alter the semaphore*/
-};
+   time_t  sem_otime;  /* candidate for sem_otime */
+} ____cacheline_aligned_in_smp;
 
 /* One queue for each sleeping process in the system. */
 struct sem_queue {
@@ -919,8 +921,14 @@ static void do_smart_update(struct sem_array *sma, struct sembuf *sops, int nsop
}
}
}
-   if (otime)
-   sma->sem_otime = get_seconds();
+   if (otime) {
+   if (sops == NULL) {
+   sma->sem_base[0].sem_otime = get_seconds();
+   } else {
+   sma->sem_base[sops[0].sem_num].sem_otime =
+   get_seconds();
+   }
+   }
 }
 
 
@@ -1066,6 +1074,21 @@ static unsigned long copy_semid_to_user(void __user *buf, struct semid64_ds *in,
}
 }
 
+static time_t get_semotime(struct sem_array *sma)
+{
+   int i;
+   time_t res;
+
+   res = sma->sem_base[0].sem_otime;
+   for (i = 1; i < sma->sem_nsems; i++) {
+   time_t to = sma->sem_base[i].sem_otime;
+
+   if (to > res)
+   res = to;
+   }
+   return res;
+}
+
 static int semctl_nolock(struct ipc_namespace *ns, int semid,
 int cmd, int version, void __user *p)
 {
@@ -1139,9 +1162,9 @@ static int semctl_nolock(struct ipc_namespace *ns, int semid,
goto out_unlock;
 
 kernel_to_ipc64_perm(&sma->sem_perm, &tbuf.sem_perm);
-   tbuf.sem_otime  = sma->sem_otime;
-   tbuf.sem_ctime  = sma->sem_ctime;
-   tbuf.sem_nsems  = sma->sem_nsems;
+   tbuf.sem_otime = get_semotime(sma);
+   tbuf.sem_ctime = sma->sem_ctime;
+   tbuf.sem_nsems = sma->sem_nsems;
rcu_read_unlock();
 if (copy_semid_to_user(p, &tbuf, version))
return -EFAULT;
@@ -2029,6 +2052,9 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void *it)
 {
struct user_namespace *user_ns = seq_user_ns(s);
struct sem_array *sma = it;
+   time_t sem_otime;
+
+   sem_otime = get_semotime(sma);
 
return seq_printf(s,
  "%10d %10d  %4o %10u %5u %5u %5u %5u %10lu %10lu\n",
@@ -2040,7 +2066,7 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void *it)
  from_kgid_munged(user_ns, sma->sem_perm.gid),
  from_kuid_munged(user_ns, sma->sem_perm.cuid),
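
The user-visible semantics don't change: semctl(IPC_STAT) and /proc/sysvipc/sem
now report the maximum of the per-semaphore times via get_semotime(), as the
two hunks above show.  A quick way to eyeball sem_otime from userspace is a
sketch like the one below (throwaway set, no semop() performed on it; glibc
leaves union semun to be defined by the caller, see semctl(2)):

/*
 * stat_otime.c - sketch: read sem_otime back through semctl(IPC_STAT).
 * Creates a throwaway set; a freshly created set reports sem_otime == 0
 * until a semop() has been performed on it.
 */
#include <stdio.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/* glibc leaves this to the caller; see semctl(2). */
union semun {
	int val;
	struct semid_ds *buf;
	unsigned short *array;
};

int main(void)
{
	struct semid_ds ds;
	union semun arg = { .buf = &ds };
	int semid = semget(IPC_PRIVATE, 1, 0600 | IPC_CREAT);

	if (semid < 0) {
		perror("semget");
		return 1;
	}
	if (semctl(semid, 0, IPC_STAT, arg) < 0) {
		perror("semctl(IPC_STAT)");
	} else {
		printf("sem_otime: %ld  sem_ctime: %ld\n",
		       (long)ds.sem_otime, (long)ds.sem_ctime);
	}
	semctl(semid, 0, IPC_RMID);
	return 0;
}
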
 
