From: Ma Ling
Wire latency (RC delay) dominates modern computer performance;
conventional serialized work causes serious cache-line ping-pong,
so the process spends a lot of time and power to complete,
especially on multi-core platforms.
However, if the serialized work is sent to one core and executed
Is the performance improvement acceptable, or are there more comments on this patch?
Thanks
Ling
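A minimal user-space sketch of the delegation idea above (illustrative
only; the names and the work-item structure are assumptions, not the
actual kernel patch, and a real version would also wait for completion):

/* Hypothetical sketch: waiters publish work items and one core (the
 * combiner) executes all pending critical sections, so the shared data
 * stays hot in a single cache instead of ping-ponging. */
#include <stdatomic.h>
#include <stddef.h>

struct work {
	void (*fn)(void *arg);          /* critical-section body */
	void *arg;
	_Atomic(struct work *) next;
};

static _Atomic(struct work *) queue_tail;

void execute_serialized(struct work *w)
{
	struct work *prev;

	atomic_store(&w->next, NULL);
	prev = atomic_exchange(&queue_tail, w);
	if (prev) {
		/* A combiner is already running; hand the work off.
		 * (A real implementation would wait for completion.) */
		atomic_store(&prev->next, w);
		return;
	}

	/* We are the combiner: run our own and any later work items. */
	for (;;) {
		struct work *expected, *next;

		w->fn(w->arg);
		expected = w;
		if (atomic_compare_exchange_strong(&queue_tail, &expected, NULL))
			break;          /* no successor: queue drained */
		while (!(next = atomic_load(&w->next)))
			;               /* successor is still linking itself */
		w = next;
	}
}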
2016-04-05 11:44 GMT+08:00 Ling Ma :
> Hi Longman,
>
>> with some modest increase in performance. That can be hard to justify. Maybe
>> you should find other use cases that involve less
Hi Longman,
> with some modest increase in performance. That can be hard to justify. Maybe
> you should find other use cases that involve less changes, but still have
> noticeable performance improvement. That will make it easier to be accepted.
The attachment is for another use case with the new l
        ORG         NEW
    8721413    43584658
   38097842    43235392
TOTAL 874675292  1005486126
So the data tell us the new mechanism can improve performance by about 15%
(1005486126/874675292 ≈ 1.15), and the change can be fairly justified.
Thanks
Ling
2016-02-04 5:42 GMT+08:00 Waiman Long :
> On 02/02/
       ORG         NEW
                44229887
  38108158      43142689
  37771900      43228168
  37652536      43901042
  37649114      43172690
  37591314      43380004
  38539678      43435592
Total 1026910602  1174810406
Thanks
Ling
2016-02-03 12:40 GMT+08:00 Ling Ma :
> Longman,
>
> The attachment includes the user-space code (thread.c) and the kernel
> patch(ali_work
version according to your comments.
Thanks
Ling
2016-01-19 23:36 GMT+08:00 Waiman Long :
> On 01/19/2016 03:52 AM, Ling Ma wrote:
>>
>> Is the performance improvement acceptable, or are there more comments on
>> this patch?
>>
>> Thanks
>> Ling
>>
>>
>
From: Ma Ling
Hi ALL,
Wire latency (RC delay) dominates modern computer performance;
conventional serialized work causes serious cache-line ping-pong,
so the process spends a lot of time and power to complete,
especially on multi-core platforms.
However, if the serialized work is sent to one core and
hanism soon.
Appreciate your comments.
Thanks
Ling
2015-12-01 4:55 GMT+08:00 Waiman Long :
> On 11/30/2015 01:17 AM, Ling Ma wrote:
>>
>> Any comments? Is the patch acceptable?
>>
>> Thanks
>> Ling
>>
>>
> Ling,
>
> The core idea of your curr
Any comments? Is the patch acceptable?
Thanks
Ling
2015-11-26 17:00 GMT+08:00 Ling Ma :
> Running thread.c on a clean 4.3.0-rc4 kernel, perf top -G also indicates
> that the cache_flusharray and cache_alloc_refill functions spend 25.6% of
> their time in queued_spin_lock_slowpath in total. It means the comp
Running thread.c on a clean 4.3.0-rc4 kernel, perf top -G also indicates
that the cache_flusharray and cache_alloc_refill functions spend 25.6% of
their time in queued_spin_lock_slowpath in total. It means the compared data
from our spinlock-test.patch are reliable.
Thanks
Ling
2015-11-26 11:49 GMT+08:00 Ling Ma
scenarios to use the new spin lock,
because the bottleneck comes from only one or two scenarios; we modify only
those, and the other lock scenarios continue to use the lock in qspinlock.h.
We must modify the code,
otherwise the operation would stay hooked in the queue and never be woken up.
Thanks
Ling
2015-11-26 3
Any comments about it?
Thanks
Ling
2015-11-23 17:41 GMT+08:00 Ling Ma :
> Hi Longman,
>
> Attachments include user space application thread.c and kernel patch
> spinlock-test.patch based on kernel 4.3.0-rc4
>
> we run thread.c with kernel patch, test original and new spinlo
+08:00 Waiman Long :
>
> On 11/05/2015 11:28 PM, Ling Ma wrote:
>>
>> Longman
>>
>> Thanks for your suggestion.
>> We will look for a real scenario to test; could you please suggest
>> some benchmarks for spinlocks?
>>
>> Regards
>> Ling
Longman
Thanks for your suggestion.
We will look for a real scenario to test; could you please suggest
some benchmarks for spinlocks?
Regards
Ling
>
> Your new spinlock code completely changes the API and the semantics of the
> existing spinlock calls. That requires changes to thousands of pla
Hi All,
(resending for linux-kernel@vger.kernel.org)
Spinlocks cause cache-line ping-pong between cores,
and we have to spend a lot of time to get serialized execution.
However, if we hand the serialized work to one core,
it will save us much time.
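As a toy illustration of that ping-pong cost (a hypothetical
microbenchmark, not the attached code; build with cc -O2 -pthread):

/* N threads increment one counter under a spinlock; the lock word and
 * the counter's cache line bounce between cores on every acquisition. */
#include <pthread.h>
#include <stdio.h>

#define THREADS 4
#define ITERS   1000000L

static pthread_spinlock_t lock;
static volatile long counter;           /* shared line that ping-pongs */

static void *worker(void *unused)
{
	long i;

	(void)unused;
	for (i = 0; i < ITERS; i++) {
		pthread_spin_lock(&lock);
		counter++;
		pthread_spin_unlock(&lock);
	}
	return NULL;
}

int main(void)
{
	pthread_t tid[THREADS];
	int i;

	pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);
	for (i = 0; i < THREADS; i++)
		pthread_create(&tid[i], NULL, worker, NULL);
	for (i = 0; i < THREADS; i++)
		pthread_join(tid[i], NULL);
	printf("counter = %ld\n", counter);
	return 0;
}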
In the attachment we changed code based on q
>
> I did see some performance improvement when I used your test program on a
> Haswell-EX system. It seems like the use of cmpxchg has forced the changed
> memory values to be visible to other processors earlier. I also ran your
> test on an older machine with Westmere-EX processors. This time, I
2015-10-20 17:16 GMT+08:00 Peter Zijlstra :
> On Tue, Oct 20, 2015 at 11:24:02AM +0800, Ling Ma wrote:
>> 2015-10-19 17:46 GMT+08:00 Peter Zijlstra :
>> > On Mon, Oct 19, 2015 at 10:27:22AM +0800, ling.ma.prog...@gmail.com wrote:
>> >> From: Ma Ling
>> &g
Ok, we will put the spinlock test into the perf bench.
Thanks
Ling
2015-10-20 16:48 GMT+08:00 Ingo Molnar :
>
> * Ling Ma wrote:
>
>> > So it would be nice to create a new user-space spinlock testing facility,
>> > via
>> > a new 'perf bench spinlock
2015-10-19 17:46 GMT+08:00 Peter Zijlstra :
> On Mon, Oct 19, 2015 at 10:27:22AM +0800, ling.ma.prog...@gmail.com wrote:
>> From: Ma Ling
>>
>> All load instructions can run speculatively, but they have to follow the
>> memory ordering rules across multiple cores, as below:
>> _x = _y = 0
>>
>> Processor 0
2015-10-20 1:18 GMT+08:00 Waiman Long :
> On 10/18/2015 10:27 PM, ling.ma.prog...@gmail.com wrote:
>>
>> From: Ma Ling
>>
>> All load instructions can run speculatively, but they have to follow the
>> memory ordering rules across multiple cores, as below:
>> _x = _y = 0
>>
>> Processor 0
2015-10-19 17:46 GMT+08:00 Peter Zijlstra :
> On Mon, Oct 19, 2015 at 10:27:22AM +0800, ling.ma.prog...@gmail.com wrote:
>> From: Ma Ling
>>
>> All load instructions can run speculatively, but they have to follow the
>> memory ordering rules across multiple cores, as below:
>> _x = _y = 0
>>
>> Processor 0
2015-10-19 17:33 GMT+08:00 Peter Zijlstra :
> On Mon, Oct 19, 2015 at 10:27:22AM +0800, ling.ma.prog...@gmail.com wrote:
>> From: Ma Ling
>>
>> All load instructions can run speculatively, but they have to follow the
>> memory ordering rules across multiple cores, as below:
>> _x = _y = 0
>>
>> Processor 0
>
> So it would be nice to create a new user-space spinlock testing facility,
> via a new 'perf bench spinlock' feature or so. That way others can test
> and validate your results on different hardware as well.
>
Attached is the spinlock test module. The queued spinlock will run very
slowly in user sp
From: Ma Ling
All load instructions can run speculatively, but they have to follow the
memory ordering rules across multiple cores, as below:
_x = _y = 0
Processor 0              Processor 1
mov r1, [_y]   // M1     mov [_x], 1   // M3
mov r2, [_x]   // M2     mov
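The same litmus test rendered with C11 atomics (a sketch only; M4 is
assumed to be the truncated store to _y, and relaxed ordering is used so
the reordering under discussion is permitted by the language):

#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

static atomic_int _x, _y;               /* _x = _y = 0 */
static int r1, r2;

static void *p0(void *unused)           /* Processor 0 */
{
	(void)unused;
	r1 = atomic_load_explicit(&_y, memory_order_relaxed);  /* M1 */
	r2 = atomic_load_explicit(&_x, memory_order_relaxed);  /* M2 */
	return NULL;
}

static void *p1(void *unused)           /* Processor 1 */
{
	(void)unused;
	atomic_store_explicit(&_x, 1, memory_order_relaxed);   /* M3 */
	atomic_store_explicit(&_y, 1, memory_order_relaxed);   /* M4, assumed */
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, p0, NULL);
	pthread_create(&b, NULL, p1, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	/* r1 == 1 && r2 == 0 would mean the loads took effect out of
	 * program order (M2 before M1). */
	printf("r1=%d r2=%d\n", r1, r2);
	return 0;
}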
Kernel version 3.14 shows memcpy and memset occurring 19622 and 14189
times respectively,
so memset is still important for us, correct?
Thanks
Ling
2014-04-14 6:03 GMT+08:00, Andi Kleen :
> On Sun, Apr 13, 2014 at 11:11:59PM +0800, Ling Ma wrote:
>> Any further comments?
>
> It w
Any further comments?
Thanks
Ling
2014-04-08 22:00 GMT+08:00, Ling Ma :
> Andi,
>
> Below are the compared results on an older machine (CPU info is attached):
> they show the new code gets better performance, up to 1.6x.
>
> Bytes:  ORG_TIME:  NEW_TIME:  ORG vs NEW:
>    7      0.87                   1.61
>  356      1.61       1.02        1.57
>  601      1.78       1.22        1.45
>  958      2.04       1.47        1.38
> 1024      2.07       1.48        1.39
> 2048      2.80       2.21        1.26
Thanks
Ling
2014-04-08 0:42 GMT+08:00, Andi Kleen :
> ling.ma.prog...@gmail.com writes:
>
>> From: Ling Ma
>>
>> In
Appended the test suite;
after untarring, please run the ./test command.
thanks
2014-04-07 22:50 GMT+08:00, ling.ma.prog...@gmail.com :
> From: Ling Ma
>
> In this patch we manage to reduce branch mispredictions by
> avoiding branch instructions and forcing the destination to be aligned
> wit
From: Ling Ma
In this patch we manage to reduce branch mispredictions by
avoiding branch instructions and forcing the destination to be aligned,
using general 64-bit instructions.
The comparison results below show we improve performance by up to 1.8x
(we modified the test suite from Ondra; sent after this
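A rough C sketch of the branch-avoidance idea for one size class (an
illustration only; the helper name is made up and the real patch is
x86-64 assembly):

#include <stdint.h>
#include <string.h>

/* Copy 8..16 bytes with two possibly-overlapping 8-byte moves instead
 * of a branchy per-size dispatch: no data-dependent branches to
 * mispredict. */
static void copy_8_16(void *dst, const void *src, size_t n)
{
	uint64_t head, tail;

	memcpy(&head, src, 8);                        /* first 8 bytes */
	memcpy(&tail, (const char *)src + n - 8, 8);  /* last 8; may overlap */
	memcpy(dst, &head, 8);
	memcpy((char *)dst + n - 8, &tail, 8);
}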
Hi Ingo
Thanks for your correction.
Since most 32-bit CPUs are low-end parts (with smaller caches), they
should put more emphasis on i-cache misses, so I chose -Os for
them.
I will test it and send out the results ASAP.
Regards
Ling
2013/1/27, Ingo Molnar :
>
> * ling.ma.prog...@gmail.com wrote:
From: Ma Ling
Currently we use -O2 as the compiler option for better performance,
although it enlarges code size; in modern CPUs, larger instruction
and unified caches and sophisticated instruction prefetching weaken the
cost of instruction cache misses, meanwhile flags such as -falign-functions,
-falign-jumps, -falig
From: Ma Ling
Currently we use -O2 as the compiler option for better performance,
although it enlarges code size; in modern CPUs, larger instruction
and unified caches and sophisticated instruction prefetching weaken the
cost of instruction cache misses, meanwhile flags such as -falign-functions,
-falign-jumps, -falig
Hi Ingo,
Using netperf we double-checked on an older Nehalem platform too, as below:

-O2 on NHM:

Performance counter stats for 'netperf' (3 runs):

   3779.262214  task-clock         #  0.378 CPUs utilized   ( +- 0.37% )
        47,580  context-switches   #  0.013 M/sec
From: Ma Ling
Currently we use -O2 as the compiler option for better performance,
although it enlarges code size; in modern CPUs, larger instruction
and unified caches and sophisticated instruction prefetching weaken the
cost of instruction cache misses, meanwhile flags such as -falign-functions,
-falign-jumps, -falig
Hi Eric,
Attached is the benchmark test-cwf.c (cc -o test-cwf test-cwf.c). The result
shows that when a last-level-cache (LLC) miss occurs and the CPU fetches data
from memory, the critical word as the first 64-bit member of the cache line
has better performance (costs 158290336 cycles) than other positions (offset
0x10 costs 164100732
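A minimal sketch of the measurement idea (hypothetical; the attached
test-cwf.c is the real benchmark; x86-only, and timing single loads with
rdtsc is noisy, so treat it as an illustration):

#include <x86intrin.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define LINES 4096

static char buf[LINES * 64] __attribute__((aligned(64)));

/* Evict each line, then time one load at 'offset' within the line so
 * the fetch always misses the LLC. */
static uint64_t probe(size_t offset)
{
	volatile char *p;
	uint64_t start, cycles = 0;
	size_t i;

	for (i = 0; i < LINES; i++) {
		p = &buf[i * 64 + offset];
		_mm_clflush(&buf[i * 64]);      /* force an LLC miss */
		_mm_mfence();
		start = __rdtsc();
		(void)*p;                       /* fetch the chosen word */
		cycles += __rdtsc() - start;
	}
	return cycles;
}

int main(void)
{
	printf("offset 0x00: %llu cycles\n", (unsigned long long)probe(0));
	printf("offset 0x10: %llu cycles\n", (unsigned long long)probe(0x10));
	return 0;
}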
> networking patches should be sent to netdev.
>
> (I understand this patch is more a generic one, but at least CC netdev)
Ling: OK, this is my first inet patch; I will send it to netdev later.
> You give no performance numbers for this change...
Ling: after I get the machine, I will send out the test results
From: Ma Ling
In order to reduce memory latency when a last-level-cache miss occurs,
modern CPUs, e.g. x86 and ARM, introduced Critical Word First (CWF) or
Early Restart (ER) to get data ASAP. With CWF, if the critical word is the
first member in the cache line, memory feeds the CPU the critical word,
then fills the other
d
From: Ma Ling
CISC code has higher instruction density, saving memory and
improving the i-cache hit rate. However, decode becomes a challenge:
only one multiple-uop (2~3) instruction can be decoded per cycle,
and instructions containing more than 4 uops (rep movsq/b) have to be handled
by the MS-ROM;
the proc
Attached are the memcpy micro-benchmark, CPU info, and comparison results
between rep movsq/b and memcpy on Atom and IVB.
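For concreteness, the compact CISC form under discussion as a GCC
inline-asm sketch (illustrative only; x86-64, not taken from the patch):

#include <stddef.h>

/* 'rep movsb' is one dense instruction, but it is decoded through the
 * MS-ROM, which is the cost discussed above. */
static void copy_rep_movsb(void *dst, const void *src, size_t n)
{
	asm volatile("rep movsb"
		     : "+D"(dst), "+S"(src), "+c"(n)
		     :
		     : "memory");
}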
Thanks
Ling
2012/10/23, ling.ma.prog...@gmail.com :
> From: Ma Ling
>
> CISC code has higher instruction density, saving memory and
> improving the i-cache hit rate. However, decode becomes a chal
From: Ma Ling
CISC code has higher instruction density, saving memory and
improving the i-cache hit rate. However, decode becomes a challenge:
only one multiple-uop (2~3) instruction can be decoded per cycle,
and instructions containing more than 4 uops (rep movsq/b) have to be handled
by the MS-ROM;
the proc
From: Ma Ling
Modern CPUs use fast-string instructions to accelerate copy performance
by combining data into 128-bit chunks, so we modify the comments and code style.
Signed-off-by: Ma Ling
---
In this version, update comments from Borislav Petkov
Thanks
Ling
arch/x86/lib/copy_page_64.S | 120 ++
From: Ma Ling
Load and write operations occupy about 35% and 10%, respectively,
of most industry benchmarks. Each fetched, 16-byte-aligned block of code
includes about 4 instructions, implying 1.40 (0.35 * 4) loads and 0.40 writes.
Modern CPUs support 2 loads and 1 write per cycle, so throughput from writes
is the bottleneck f
From: Ma Ling
Load and write operations occupy about 35% and 10%, respectively,
of most industry benchmarks. Each fetched, 16-byte-aligned block of code
includes about 4 instructions, implying 1.40 (0.35 * 4) loads and 0.40 writes.
Modern CPUs support 2 loads and 1 write per cycle, so throughput from writes
is the bottlenec
From: Ma Ling
Modern CPUs use fast-string instructions to accelerate copy performance
by combining data into 128-bit chunks, so we modify the comments and code style.
Signed-off-by: Ma Ling
---
arch/x86/lib/copy_page_64.S | 119 +--
1 files changed, 59 insertions(+