From: Ma Ling
Wire latency (RC delay) dominates modern computer performance;
conventional serialized work causes serious cache-line ping-pong,
so the process spends a lot of time and power to complete,
especially on multi-core platforms.
However, if the serialized work is sent to one core and executed
Is the performance improvement acceptable, or are there more comments on this patch?
Thanks
Ling
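A minimal user-space sketch of the delegation idea above (illustrative
only; the names and the work-item structure are assumptions, not the
actual kernel patch, and a real version would also wait for completion):

/* Hypothetical sketch: waiters publish work items and one core (the
 * combiner) executes all pending critical sections, so the shared data
 * stays hot in a single cache instead of ping-ponging. */
#include <stdatomic.h>
#include <stddef.h>

struct work {
	void (*fn)(void *arg);          /* critical-section body */
	void *arg;
	_Atomic(struct work *) next;
};

static _Atomic(struct work *) queue_tail;

void execute_serialized(struct work *w)
{
	struct work *prev;

	atomic_store(&w->next, NULL);
	prev = atomic_exchange(&queue_tail, w);
	if (prev) {
		/* A combiner is already running; hand the work off.
		 * (A real implementation would wait for completion.) */
		atomic_store(&prev->next, w);
		return;
	}

	/* We are the combiner: run our own and any later work items. */
	for (;;) {
		struct work *expected, *next;

		w->fn(w->arg);
		expected = w;
		if (atomic_compare_exchange_strong(&queue_tail, &expected, NULL))
			break;          /* no successor: queue drained */
		while (!(next = atomic_load(&w->next)))
			;               /* successor is still linking itself */
		w = next;
	}
}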
2016-04-05 11:44 GMT+08:00 Ling Ma :
> Hi Longman,
>
>> with some modest increase in performance. That can be hard to justify. Maybe
>> you should find other use cases that involve less
Hi Longman,
> with some modest increase in performance. That can be hard to justify. Maybe
> you should find other use cases that involve less changes, but still have
> noticeable performance improvement. That will make it easier to be accepted.
The attachment is for another use case with the new l
        ORG         NEW
    8721413    43584658
   38097842    43235392
TOTAL 874675292  1005486126
So the data tell us the new mechanism can improve performance by about 15%
(1005486126/874675292 ≈ 1.15), and the change can be fairly justified.
Thanks
Ling
2016-02-04 5:42 GMT+08:00 Waiman Long :
> On 02/02/
       ORG         NEW
                44229887
  38108158      43142689
  37771900      43228168
  37652536      43901042
  37649114      43172690
  37591314      43380004
  38539678      43435592
Total 1026910602  1174810406
Thanks
Ling
2016-02-03 12:40 GMT+08:00 Ling Ma :
> Longman,
>
> The attachment includes the user-space code (thread.c) and the kernel
> patch(ali_work
version according to your comments.
Thanks
Ling
2016-01-19 23:36 GMT+08:00 Waiman Long :
> On 01/19/2016 03:52 AM, Ling Ma wrote:
>>
>> Is the performance improvement acceptable, or are there more comments on
>> this patch?
>>
>> Thanks
>> Ling
>>
>>
>
From: Ma Ling
Hi ALL,
Wire latency (RC delay) dominates modern computer performance;
conventional serialized work causes serious cache-line ping-pong,
so the process spends a lot of time and power to complete,
especially on multi-core platforms.
However, if the serialized work is sent to one core and
hanism soon.
Appreciate your comments.
Thanks
Ling
2015-12-01 4:55 GMT+08:00 Waiman Long :
> On 11/30/2015 01:17 AM, Ling Ma wrote:
>>
>> Any comments? Is the patch acceptable?
>>
>> Thanks
>> Ling
>>
>>
> Ling,
>
> The core idea of your curr
Any comments? Is the patch acceptable?
Thanks
Ling
2015-11-26 17:00 GMT+08:00 Ling Ma :
> Running thread.c on a clean 4.3.0-rc4 kernel, perf top -G also indicates
> that the cache_flusharray and cache_alloc_refill functions spend 25.6% of
> their time in queued_spin_lock_slowpath in total. It means the comp
Running thread.c on a clean 4.3.0-rc4 kernel, perf top -G also indicates
that the cache_flusharray and cache_alloc_refill functions spend 25.6% of
their time in queued_spin_lock_slowpath in total. It means the compared data
from our spinlock-test.patch are reliable.
Thanks
Ling
2015-11-26 11:49 GMT+08:00 Ling Ma
scenarios to use the new spin lock,
because the bottleneck comes from only one or two scenarios; we modify only
those, and the other lock scenarios continue to use the lock in qspinlock.h.
We must modify the code,
otherwise the operation would stay hooked in the queue and never be woken up.
Thanks
Ling
2015-11-26 3
Any comments about it?
Thanks
Ling
2015-11-23 17:41 GMT+08:00 Ling Ma :
> Hi Longman,
>
> Attachments include user space application thread.c and kernel patch
> spinlock-test.patch based on kernel 4.3.0-rc4
>
> we run thread.c with kernel patch, test original and new spinlo
+08:00 Waiman Long :
>
> On 11/05/2015 11:28 PM, Ling Ma wrote:
>>
>> Longman
>>
>> Thanks for your suggestion.
>> We will look for a real scenario to test; could you please suggest
>> some benchmarks for spinlocks?
>>
>> Regards
>> Ling
Longman
Thanks for your suggestion.
We will look for a real scenario to test; could you please suggest
some benchmarks for spinlocks?
Regards
Ling
>
> Your new spinlock code completely changes the API and the semantics of the
> existing spinlock calls. That requires changes to thousands of pla
Hi All,
(resending for linux-kernel@vger.kernel.org)
Spinlocks cause cache-line ping-pong between cores,
and we have to spend a lot of time to get serialized execution.
However, if we hand the serialized work to one core,
it will save us much time.
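As a toy illustration of that ping-pong cost (a hypothetical
microbenchmark, not the attached code; build with cc -O2 -pthread):

/* N threads increment one counter under a spinlock; the lock word and
 * the counter's cache line bounce between cores on every acquisition. */
#include <pthread.h>
#include <stdio.h>

#define THREADS 4
#define ITERS   1000000L

static pthread_spinlock_t lock;
static volatile long counter;           /* shared line that ping-pongs */

static void *worker(void *unused)
{
	long i;

	(void)unused;
	for (i = 0; i < ITERS; i++) {
		pthread_spin_lock(&lock);
		counter++;
		pthread_spin_unlock(&lock);
	}
	return NULL;
}

int main(void)
{
	pthread_t tid[THREADS];
	int i;

	pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);
	for (i = 0; i < THREADS; i++)
		pthread_create(&tid[i], NULL, worker, NULL);
	for (i = 0; i < THREADS; i++)
		pthread_join(tid[i], NULL);
	printf("counter = %ld\n", counter);
	return 0;
}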
In the attachment we changed code based on q
>
> I did see some performance improvement when I used your test program on a
> Haswell-EX system. It seems like the use of cmpxchg has forced the changed
> memory values to be visible to other processors earlier. I also ran your
> test on an older machine with Westmere-EX processors. This time, I
2015-10-20 17:16 GMT+08:00 Peter Zijlstra :
> On Tue, Oct 20, 2015 at 11:24:02AM +0800, Ling Ma wrote:
>> 2015-10-19 17:46 GMT+08:00 Peter Zijlstra :
>> > On Mon, Oct 19, 2015 at 10:27:22AM +0800, ling.ma.prog...@gmail.com wrote:
>> >> From: Ma Ling
>> &g
Ok, we will put the spinlock test into the perf bench.
Thanks
Ling
2015-10-20 16:48 GMT+08:00 Ingo Molnar :
>
> * Ling Ma wrote:
>
>> > So it would be nice to create a new user-space spinlock testing facility,
>> > via
>> > a new 'perf bench spinlock
2015-10-19 17:46 GMT+08:00 Peter Zijlstra :
> On Mon, Oct 19, 2015 at 10:27:22AM +0800, ling.ma.prog...@gmail.com wrote:
>> From: Ma Ling
>>
>> All load instructions can run speculatively, but they have to follow the
>> memory ordering rules across multiple cores, as below:
>> _x = _y = 0
>>
>> Processor 0
2015-10-20 1:18 GMT+08:00 Waiman Long :
> On 10/18/2015 10:27 PM, ling.ma.prog...@gmail.com wrote:
>>
>> From: Ma Ling
>>
>> All load instructions can run speculatively, but they have to follow the
>> memory ordering rules across multiple cores, as below:
>> _x = _y = 0
>>
>> Processor 0
2015-10-19 17:46 GMT+08:00 Peter Zijlstra :
> On Mon, Oct 19, 2015 at 10:27:22AM +0800, ling.ma.prog...@gmail.com wrote:
>> From: Ma Ling
>>
>> All load instructions can run speculatively, but they have to follow the
>> memory ordering rules across multiple cores, as below:
>> _x = _y = 0
>>
>> Processor 0
2015-10-19 17:33 GMT+08:00 Peter Zijlstra :
> On Mon, Oct 19, 2015 at 10:27:22AM +0800, ling.ma.prog...@gmail.com wrote:
>> From: Ma Ling
>>
>> All load instructions can run speculatively, but they have to follow the
>> memory ordering rules across multiple cores, as below:
>> _x = _y = 0
>>
>> Processor 0
>
> So it would be nice to create a new user-space spinlock testing facility,
> via a new 'perf bench spinlock' feature or so. That way others can test
> and validate your results on different hardware as well.
>
Attached is the spinlock test module. The queued spinlock will run very
slowly in user sp
From: Ma Ling
All load instructions can run speculatively, but they have to follow the
memory ordering rules across multiple cores, as below:
_x = _y = 0
Processor 0              Processor 1
mov r1, [_y]   // M1     mov [_x], 1   // M3
mov r2, [_x]   // M2     mov
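The same litmus test rendered with C11 atomics (a sketch only; M4 is
assumed to be the truncated store to _y, and relaxed ordering is used so
the reordering under discussion is permitted by the language):

#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

static atomic_int _x, _y;               /* _x = _y = 0 */
static int r1, r2;

static void *p0(void *unused)           /* Processor 0 */
{
	(void)unused;
	r1 = atomic_load_explicit(&_y, memory_order_relaxed);  /* M1 */
	r2 = atomic_load_explicit(&_x, memory_order_relaxed);  /* M2 */
	return NULL;
}

static void *p1(void *unused)           /* Processor 1 */
{
	(void)unused;
	atomic_store_explicit(&_x, 1, memory_order_relaxed);   /* M3 */
	atomic_store_explicit(&_y, 1, memory_order_relaxed);   /* M4, assumed */
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, p0, NULL);
	pthread_create(&b, NULL, p1, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	/* r1 == 1 && r2 == 0 would mean the loads took effect out of
	 * program order (M2 before M1). */
	printf("r1=%d r2=%d\n", r1, r2);
	return 0;
}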
Kernel version 3.14 shows memcpy and memset occurring 19622 and 14189
times respectively,
so memset is still important for us, correct?
Thanks
Ling
2014-04-14 6:03 GMT+08:00, Andi Kleen :
> On Sun, Apr 13, 2014 at 11:11:59PM +0800, Ling Ma wrote:
>> Any further comments?
>
> It w
Any further comments?
Thanks
Ling
2014-04-08 22:00 GMT+08:00, Ling Ma :
> Andi,
>
> Below are the compared results on an older machine (CPU info is attached):
> they show the new code gets better performance, up to 1.6x.
>
> Bytes:  ORG_TIME:  NEW_TIME:  ORG vs NEW:
>    7      0.87                   1.61
>  356      1.61       1.02        1.57
>  601      1.78       1.22        1.45
>  958      2.04       1.47        1.38
> 1024      2.07       1.48        1.39
> 2048      2.80       2.21        1.26
Thanks
Ling
2014-04-08 0:42 GMT+08:00, Andi Kleen :
> ling.ma.prog...@gmail.com writes:
>
>> From: Ling Ma
>>
>> In
Appended the test suite;
after untarring, please run the ./test command.
thanks
2014-04-07 22:50 GMT+08:00, ling.ma.prog...@gmail.com :
> From: Ling Ma
>
> In this patch we manage to reduce branch mispredictions by
> avoiding branch instructions and forcing the destination to be aligned
> wit
From: Ling Ma
In this patch we manage to reduce branch mispredictions by
avoiding branch instructions and forcing the destination to be aligned,
using general 64-bit instructions.
The comparison results below show we improve performance by up to 1.8x
(we modified the test suite from Ondra; sent after this
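A rough C sketch of the branch-avoidance idea for one size class (an
illustration only; the helper name is made up and the real patch is
x86-64 assembly):

#include <stdint.h>
#include <string.h>

/* Copy 8..16 bytes with two possibly-overlapping 8-byte moves instead
 * of a branchy per-size dispatch: no data-dependent branches to
 * mispredict. */
static void copy_8_16(void *dst, const void *src, size_t n)
{
	uint64_t head, tail;

	memcpy(&head, src, 8);                        /* first 8 bytes */
	memcpy(&tail, (const char *)src + n - 8, 8);  /* last 8; may overlap */
	memcpy(dst, &head, 8);
	memcpy((char *)dst + n - 8, &tail, 8);
}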
Hi Ingo
Thanks for your correction.
Since most 32-bit CPUs are low-end parts (with smaller caches), they
should put more emphasis on i-cache misses, so I chose -Os for
them.
I will test it and send out the results ASAP.
Regards
Ling
2013/1/27, Ingo Molnar :
>
> * ling.ma.prog...@gmail.com wrote:
From: Ma Ling
Currently we use -O2 as the compiler option for better performance,
although it enlarges code size; in modern CPUs, larger instruction
and unified caches and sophisticated instruction prefetching weaken the
cost of instruction cache misses, meanwhile flags such as -falign-functions,
-falign-jumps, -falig
From: Ma Ling
Currently we use -O2 as the compiler option for better performance,
although it enlarges code size; in modern CPUs, larger instruction
and unified caches and sophisticated instruction prefetching weaken the
cost of instruction cache misses, meanwhile flags such as -falign-functions,
-falign-jumps, -falig
Hi Ingo,
Using netperf we double-checked on an older Nehalem platform too, as below:

-O2 on NHM:

Performance counter stats for 'netperf' (3 runs):

   3779.262214  task-clock         #  0.378 CPUs utilized   ( +- 0.37% )
        47,580  context-switches   #  0.013 M/sec
From: Ma Ling
Currently we use -O2 as the compiler option for better performance,
although it enlarges code size; in modern CPUs, larger instruction
and unified caches and sophisticated instruction prefetching weaken the
cost of instruction cache misses, meanwhile flags such as -falign-functions,
-falign-jumps, -falig
Hi Eric,
Attached is the benchmark test-cwf.c (cc -o test-cwf test-cwf.c). The result
shows that when a last-level-cache (LLC) miss occurs and the CPU fetches data
from memory, the critical word as the first 64-bit member of the cache line
has better performance (costs 158290336 cycles) than other positions (offset
0x10 costs 164100732
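A minimal sketch of the measurement idea (hypothetical; the attached
test-cwf.c is the real benchmark; x86-only, and timing single loads with
rdtsc is noisy, so treat it as an illustration):

#include <x86intrin.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define LINES 4096

static char buf[LINES * 64] __attribute__((aligned(64)));

/* Evict each line, then time one load at 'offset' within the line so
 * the fetch always misses the LLC. */
static uint64_t probe(size_t offset)
{
	volatile char *p;
	uint64_t start, cycles = 0;
	size_t i;

	for (i = 0; i < LINES; i++) {
		p = &buf[i * 64 + offset];
		_mm_clflush(&buf[i * 64]);      /* force an LLC miss */
		_mm_mfence();
		start = __rdtsc();
		(void)*p;                       /* fetch the chosen word */
		cycles += __rdtsc() - start;
	}
	return cycles;
}

int main(void)
{
	printf("offset 0x00: %llu cycles\n", (unsigned long long)probe(0));
	printf("offset 0x10: %llu cycles\n", (unsigned long long)probe(0x10));
	return 0;
}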
> networking patches should be sent to netdev.
>
> (I understand this patch is more a generic one, but at least CC netdev)
Ling: OK, this is my first inet patch; I will send it to netdev later.
> You give no performance numbers for this change...
Ling: after I get the machine, I will send out the test results
From: Ma Ling
In order to reduce memory latency when a last-level-cache miss occurs,
modern CPUs, e.g. x86 and ARM, introduced Critical Word First (CWF) or
Early Restart (ER) to get data ASAP. With CWF, if the critical word is the
first member in the cache line, memory feeds the CPU the critical word,
then fills the other
d
From: Ma Ling
CISC code has higher instruction density, saving memory and
improving the i-cache hit rate. However, decode becomes a challenge:
only one multiple-uop (2~3) instruction can be decoded per cycle,
and instructions containing more than 4 uops (rep movsq/b) have to be handled
by the MS-ROM;
the proc
Attached are the memcpy micro-benchmark, CPU info, and comparison results
between rep movsq/b and memcpy on Atom and IVB.
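For concreteness, the compact CISC form under discussion as a GCC
inline-asm sketch (illustrative only; x86-64, not taken from the patch):

#include <stddef.h>

/* 'rep movsb' is one dense instruction, but it is decoded through the
 * MS-ROM, which is the cost discussed above. */
static void copy_rep_movsb(void *dst, const void *src, size_t n)
{
	asm volatile("rep movsb"
		     : "+D"(dst), "+S"(src), "+c"(n)
		     :
		     : "memory");
}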
Thanks
Ling
2012/10/23, ling.ma.prog...@gmail.com :
> From: Ma Ling
>
> CISC code has higher instruction density, saving memory and
> improving the i-cache hit rate. However, decode becomes a chal
From: Ma Ling
CISC code has higher instruction density, saving memory and
improving the i-cache hit rate. However, decode becomes a challenge:
only one multiple-uop (2~3) instruction can be decoded per cycle,
and instructions containing more than 4 uops (rep movsq/b) have to be handled
by the MS-ROM;
the proc
From: Ma Ling
Modern CPUs use fast-string instructions to accelerate copy performance
by combining data into 128-bit chunks, so we modify the comments and code style.
Signed-off-by: Ma Ling
---
In this version, update comments from Borislav Petkov
Thanks
Ling
arch/x86/lib/copy_page_64.S | 120 ++
From: Ma Ling
Load and write operations occupy about 35% and 10%, respectively,
of most industry benchmarks. Each fetched, 16-byte-aligned block of code
includes about 4 instructions, implying 1.40 (0.35 * 4) loads and 0.40 writes.
Modern CPUs support 2 loads and 1 write per cycle, so throughput from writes
is the bottleneck f
From: Ma Ling
Load and write operations occupy about 35% and 10%, respectively,
of most industry benchmarks. Each fetched, 16-byte-aligned block of code
includes about 4 instructions, implying 1.40 (0.35 * 4) loads and 0.40 writes.
Modern CPUs support 2 loads and 1 write per cycle, so throughput from writes
is the bottlenec
From: Ma Ling
Modern CPUs use fast-string instructions to accelerate copy performance
by combining data into 128-bit chunks, so we modify the comments and code style.
Signed-off-by: Ma Ling
---
arch/x86/lib/copy_page_64.S | 119 +--
1 files changed, 59 insertions(+