sem.c: performance improvements, FIFO

Manfred Spraul Fri, 14 Jun 2013 08:49:47 -0700

Hi all,

On 06/10/2013 07:16 PM, Manfred Spraul wrote:

Hi Andrew,


I have cleaned up/improved my updates to sysv sem.
Could you replace my patches in -akpm with this series?

- 1: cacheline align output from ipc_rcu_alloc
- 2: cacheline align semaphore structures
- 3: seperate-wait-for-zero-and-alter-tasks
- 4: Always-use-only-one-queue-for-alter-operations
- 5: Replace the global sem_otime with a distributed otime
- 6: Rename-try_atomic_semop-to-perform_atomic

Just to keep everyone updated:
I have updated my testapp:
https://github.com/manfred-colorfu/ipcscale/blob/master/sem-waitzero.cpp

Something like this gives a nice output:

# sem-waitzero -t 5 -m 0 | grep 'Cpus' | gawk '{printf("%f -%s\n",$7/$2,$0);}' | sort -n -r


The first number is the number of operations per cpu during 5 seconds.
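
For readers without gawk at hand, the same per-cpu figure can be derived with a few lines of Python. This is only a sketch of what the pipeline above computes (total operations divided by the number of cpus, sorted in descending order); the names `SUMMARY_RE`, `ops_per_cpu` and `rank` are made up for this illustration and are not part of the test app.

```python
import re

# Matches summary lines printed by the test app, for example:
#   "Cpus 2, interleave 4 delay 0: 49014675 in 5 secs"
# The pattern is an assumption based on the output quoted in this mail.
SUMMARY_RE = re.compile(
    r"Cpus\s+(\d+),\s+interleave\s+(\d+)\s+delay\s+(\d+):\s+(\d+)\s+in\s*(\d+)\s+secs"
)


def ops_per_cpu(line):
    """Return (cpus, total_ops / cpus) for one summary line, or None."""
    m = SUMMARY_RE.search(line)
    if m is None:
        return None
    cpus = int(m.group(1))
    total_ops = int(m.group(4))
    return cpus, total_ops / cpus


def rank(lines):
    """Sort the parsed summary lines by operations per cpu, best first."""
    parsed = [r for r in map(ops_per_cpu, lines) if r is not None]
    return sorted(parsed, key=lambda r: r[1], reverse=True)
```

Feeding it the lines that contain 'Cpus' gives the same ordering as the `sort -n -r` at the end of the pipeline.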

Mike was kind enough to run it on a 32-core (4-socket) Intel system:
- master doesn't scale at all when multiple sockets are used:
    interleave 4: (i.e.: use cpu 0, then 4, then 8 (2nd socket), then 12):
        34,717586.000000 - Cpus 1, interleave 4 delay 0: 34717586 in 5 secs
        24,507337.500000 - Cpus 2, interleave 4 delay 0: 49014675 in 5 secs
         3,487540.000000 - Cpus 3, interleave 4 delay 0: 10462620 in 5 secs
         2,708145.000000 - Cpus 4, interleave 4 delay 0: 10832580 in 5 secs
    interleave 8: (i.e.: use cpu 0, then 8 (2nd socket)):
        34,587329.000000 - Cpus 1, interleave 8 delay 0: 34587329 in 5 secs
         7,746981.500000 - Cpus 2, interleave 8 delay 0: 15493963 in 5 secs

- with my patches applied, it scales linearly - but only sometimes
    example for good scaling (18 threads in parallel - linear scaling):

        33,928616.111111 - Cpus 18, interleave 8 delay 0: 610715090 in 5 secs

    example for bad scaling:
        5,829109.600000 - Cpus 5, interleave 8 delay 0: 29145548 in 5 secs
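
One way to put a number on "scales" versus "doesn't scale" is to divide the per-cpu throughput of a run by the throughput of the single-cpu run. The helper below is a hypothetical illustration of that arithmetic, using only figures quoted above; it is not something the test app prints.

```python
def scaling_efficiency(total_ops, cpus, single_cpu_ops):
    """Per-cpu throughput of a run relative to the single-cpu run.

    1.0 means perfectly linear scaling; values near 0 mean that adding
    cpus made every cpu much slower.
    """
    return (total_ops / cpus) / single_cpu_ops


# master, interleave 4, 4 cpus vs. 1 cpu (numbers from the list above)
master_4 = scaling_efficiency(10832580, 4, 34717586)

# patched kernel, interleave 8, 18 cpus vs. the interleave 8 single-cpu run
patched_18 = scaling_efficiency(610715090, 18, 34587329)

print(f"master, 4 cpus:   {master_4:.3f}")
print(f"patched, 18 cpus: {patched_18:.3f}")
```

With these inputs the master run comes out at roughly 0.08 and the good patched run at roughly 0.98.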

To me, it looks like a livelock somewhere:
Good example: all threads contribute the same amount to the final result:

Result matrix:
  Thread   0: 33476433
  Thread   1: 33697100
  Thread   2: 33514249
  Thread   3: 33657413
  Thread   4: 33727959
  Thread   5: 33580684
  Thread   6: 33530294
  Thread   7: 33666761
  Thread   8: 33749836
  Thread   9: 32636493
  Thread  10: 33550620
  Thread  11: 33403314
  Thread  12: 33594457
  Thread  13: 33331920
  Thread  14: 33503588
  Thread  15: 33585348
Cpus 16, interleave 8 delay 0: 536206469 in 5 secs

Bad example: one thread is as fast as it should be, others are slow:

Result matrix:
  Thread   0: 31629540
  Thread   1:  5336968
  Thread   2:  6404314
  Thread   3:  9190595
  Thread   4:  9681006
  Thread   5:  9935421
  Thread   6:  9424324
Cpus 7, interleave 8 delay 0: 81602168 in 5 secs


The results are not stable: the same test is sometimes fast, sometimes slow.
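
To tell good runs from bad ones automatically, the per-thread counts in the result matrix can be reduced to a single imbalance figure. The following is a minimal sketch, assuming the "Thread N: count" layout shown above; `thread_counts` and `imbalance` are names invented for this example.

```python
import re

# Matches result-matrix rows such as "  Thread   3:  9190595".
THREAD_RE = re.compile(r"Thread\s+(\d+):\s+(\d+)")


def thread_counts(text):
    """Extract the per-thread operation counts from a result matrix."""
    return [int(m.group(2)) for m in THREAD_RE.finditer(text)]


def imbalance(counts):
    """Ratio between the fastest and the slowest thread (1.0 = even)."""
    return max(counts) / min(counts)


bad_run = """
  Thread   0: 31629540
  Thread   1:  5336968
  Thread   2:  6404314
  Thread   3:  9190595
  Thread   4:  9681006
  Thread   5:  9935421
  Thread   6:  9424324
"""

counts = thread_counts(bad_run)
print(sum(counts), round(imbalance(counts), 2))
```

For the bad run the fastest thread does almost six times the work of the slowest one, while the same ratio for the good 16-thread run is about 1.03.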

I have no idea where the livelock could be, and I wasn't able to notice anything on my i3 laptop.


Thus: Who has an idea?

What I can say is that the livelock can't be in do_smart_update(): The function is never called.


--
    Manfred

