[RFC PATCH] aliworkqueue: Adaptive lock integration on multi-core platform

2016-04-14 Thread ling . ma . program
From: Ma Ling Wire latency (RC delay) dominates modern computer performance; conventional serialized work causes serious cache-line ping-pong, so a process spends a lot of time and power to complete, especially on multi-core platforms. However, if the serialized works are sent to one core and

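The delegation idea the RFC above describes — send the serialized work to one core instead of bouncing shared cache lines between all cores — can be sketched with C11 atomics. Everything here (the `ali_wq` names, the per-CPU slot layout, `NR_CPUS`) is an illustrative assumption, not the patch's actual API:

```c
#include <stdatomic.h>
#include <stddef.h>

/* Minimal sketch of lock integration ("combining"): each CPU publishes
 * its critical-section work into a per-CPU slot, and whichever CPU owns
 * the lock executes *all* pending work, so the protected data stays hot
 * in one core's cache instead of ping-ponging between cores. */

#define NR_CPUS 4

struct work_slot {
    void *arg;
    _Atomic(void (*)(void *)) fn;   /* non-NULL while a request is pending */
};

struct ali_wq {
    _Atomic int owner_busy;         /* 1 while some CPU is draining slots */
    struct work_slot slot[NR_CPUS];
};

/* Publish fn(arg) from `cpu`, then wait until some owner has run it. */
void ali_wq_run(struct ali_wq *wq, int cpu, void (*fn)(void *), void *arg)
{
    wq->slot[cpu].arg = arg;                    /* arg first, ...        */
    atomic_store(&wq->slot[cpu].fn, fn);        /* ...then publish fn    */

    while (atomic_load(&wq->slot[cpu].fn) != NULL) {
        int free = 0;
        if (!atomic_compare_exchange_strong(&wq->owner_busy, &free, 1))
            continue;                           /* another CPU combines  */
        /* We are the owner: drain every pending request in one pass. */
        for (int i = 0; i < NR_CPUS; i++) {
            void (*f)(void *) = atomic_exchange(&wq->slot[i].fn, NULL);
            if (f)
                f(wq->slot[i].arg);
        }
        atomic_store(&wq->owner_busy, 0);
    }
}

/* Single-threaded demo: push three increments through the queue. */
static long demo_counter;
static void demo_inc(void *arg) { (void)arg; demo_counter++; }

long ali_wq_demo(void)
{
    struct ali_wq wq = { 0 };
    demo_counter = 0;
    for (int i = 0; i < 3; i++)
        ali_wq_run(&wq, 0, demo_inc, NULL);
    return demo_counter;
}
```

The payoff claimed by the RFC comes from the drain loop: one core walks all slots while the shared state stays in its private cache, where a conventional lock would move those lines to each waiter in turn.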
[RFC PATCH] alispinlock: acceleration from lock integration on multi-core platform

2015-12-31 Thread ling . ma . program
From: Ma Ling Hi ALL, Wire latency (RC delay) dominates modern computer performance; conventional serialized work causes serious cache-line ping-pong, so a process spends a lot of time and power to complete, especially on multi-core platforms. However, if the serialized works are sent to one core

[RFC PATCH] qspinlock: Improve performance by reducing load instruction rollback

2015-10-18 Thread ling . ma . program
From: Ma Ling All load instructions can run speculatively, but they have to follow the memory-ordering rules across cores, as below: _x = _y = 0; Processor 0: mov r1, [_y] //M1; mov r2, [_x] //M2. Processor 1: mov [_x], 1 //M3; mov

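The two-processor snippet above is the classic message-passing litmus test; the preview truncates the last instruction, which is presumably the matching `mov [_y], 1` (M4) — that reconstruction, and all helper names below, are assumptions. A sketch in C11 atomics, using seq_cst loads/stores, shows the rule a speculative load must not violate (hence the rollback the patch title refers to): if M1 observes the store to `_y`, M2 must observe the earlier store to `_x`.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static _Atomic int x, y;
static int r1, r2;

static void *reader(void *arg)         /* Processor 0 */
{
    (void)arg;
    r1 = atomic_load(&y);              /* M1 */
    r2 = atomic_load(&x);              /* M2 */
    return NULL;
}

static void *writer(void *arg)         /* Processor 1 */
{
    (void)arg;
    atomic_store(&x, 1);               /* M3 */
    atomic_store(&y, 1);               /* M4 (assumed; truncated above) */
    return NULL;
}

/* Run the race `iters` times; count the forbidden outcome
 * r1 == 1 && r2 == 0, which seq_cst ordering rules out. */
int forbidden_count(int iters)
{
    int bad = 0;
    for (int i = 0; i < iters; i++) {
        atomic_store(&x, 0);
        atomic_store(&y, 0);
        pthread_t a, b;
        pthread_create(&a, NULL, reader, NULL);
        pthread_create(&b, NULL, writer, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        if (r1 == 1 && r2 == 0)
            bad++;
    }
    return bad;
}
```

On hardware, M2 may be issued speculatively before M1; if the cache line of `_x` is then invalidated, the core must roll the load back and replay it to keep this ordering intact — the cost the patch tries to reduce.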
[PATCH RFC] x86: Improve memset with general 64bit instruction

2014-04-07 Thread ling . ma . program
From: Ling Ma In this patch we manage to reduce branch mispredictions by avoiding branch instructions and by forcing the destination to be aligned, using general 64-bit instructions. The comparison results below show we improve performance by up to 1.8x (we modified the test suite from Ondra, sent after this

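The two ideas in the memset patch — align the destination first, then use plain 64-bit stores for the bulk — can be sketched in portable C. This is an illustrative reimplementation, not the patch's x86 assembly; `memset64` is a made-up name:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Fill n bytes at dst with byte c using aligned 8-byte stores. */
void *memset64(void *dst, int c, size_t n)
{
    unsigned char *p = dst;
    /* Replicate the byte into all 8 lanes of a 64-bit pattern. */
    uint64_t pat = 0x0101010101010101ULL * (unsigned char)c;

    /* Head: byte stores until the destination is 8-byte aligned. */
    while (n && ((uintptr_t)p & 7)) {
        *p++ = (unsigned char)c;
        n--;
    }
    /* Body: aligned 64-bit stores; the hot loop has one branch. */
    while (n >= 8) {
        memcpy(p, &pat, 8);   /* compiles to a single 64-bit mov */
        p += 8;
        n -= 8;
    }
    /* Tail: remaining 0..7 bytes. */
    while (n--)
        *p++ = (unsigned char)c;
    return dst;
}
```

The alignment prologue is what lets the body run without misaligned-store penalties, and keeping the body to a single backward branch is the "avoid branch instructions" part of the patch description.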
[PATCH V2] [x86]: Compiler Option Os is better on latest x86

2013-01-26 Thread ling . ma . program
From: Ma Ling Currently we use O2 as the compiler option for better performance, although it enlarges code size; in modern CPUs, larger instruction and unified caches plus sophisticated instruction prefetch mitigate instruction-cache misses, meanwhile flags such as -falign-functions, -falign-jumps,

[PATCH] [x86]: Compiler Option Os is better on latest x86

2013-01-25 Thread ling . ma . program
From: Ma Ling Currently we use O2 as the compiler option for better performance, although it enlarges code size; in modern CPUs, larger instruction and unified caches plus sophisticated instruction prefetch mitigate instruction-cache misses, meanwhile flags such as -falign-functions, -falign-jumps,

[Suggestion] [x86]: Compiler Option Os is better on latest x86

2012-12-25 Thread ling . ma . program
From: Ma Ling Currently we use O2 as the compiler option for better performance, although it enlarges code size; in modern CPUs, larger instruction and unified caches plus sophisticated instruction prefetch mitigate instruction-cache misses, meanwhile flags such as -falign-functions, -falign-jumps,

[PATCH RFC] [INET]: Get critical word in first 64bit of cache line

2012-11-25 Thread ling . ma . program
From: Ma Ling In order to reduce memory latency when a last-level cache miss occurs, modern CPUs (e.g. x86 and ARM) introduced Critical Word First (CWF) or Early Restart (ER) to get data ASAP. For CWF, if the critical word is the first member in the cache line, memory feeds the CPU the critical word, then fills the others

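The layout rule the patch applies can be sketched as a struct whose hottest field occupies the first 64 bits of a cache-line-aligned object, so CWF delivers it first after a miss. The struct and field names are illustrative (the actual patch touches INET structures), and the GCC `aligned` attribute stands in for the kernel's `____cacheline_aligned`:

```c
#include <stddef.h>

#define CACHELINE 64

/* Hot field first: with Critical Word First, the memory system returns
 * the missing (critical) word before the rest of the line, so the CPU
 * can resume as soon as the first 8 bytes arrive. */
struct hot_first {
    unsigned long hot_counter;                        /* first 64 bits */
    char cold_payload[CACHELINE - sizeof(unsigned long)];
} __attribute__((aligned(CACHELINE)));
```

If `hot_counter` sat in the middle of the line instead, the CPU would stall until the burst reached that word, forfeiting the CWF/ER benefit the snippet describes.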
[PATCH RFC V2] [x86] Optimize small size memcpy by avoiding long latency from decode stage

2012-10-22 Thread ling . ma . program
From: Ma Ling CISC code has higher instruction density, saving memory and improving i-cache hit rate. However, decoding becomes a challenge: only one multiple-uop (2~3) instruction can be decoded per cycle, and instructions containing more than 4 uops (rep movsq/b) have to be handled by MS-ROM, the

[PATCH RFC] [x86] Optimize small size memcpy by avoiding long latency from decode stage

2012-10-18 Thread ling . ma . program
From: Ma Ling CISC CISC code has higher instruction density, saving memory and improving i-cache hit rate. However, decoding becomes a challenge: only one multiple-uop (2~3) instruction can be decoded per cycle, and instructions containing more than 4 uops (rep movsq/b) have to be handled by MS-ROM, the

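The small-copy technique the two memcpy RFCs above rely on — avoid the microcoded `rep movsq/b`, whose MS-ROM handling adds long decode latency, and instead cover a short range with a pair of possibly overlapping wide moves — can be sketched in C. The patch itself is x86 assembly; `copy_le16` and its 8..16-byte range are an illustrative assumption:

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

/* Copy n bytes (8 <= n <= 16) with two 8-byte moves that may overlap
 * in the middle.  Each memcpy of a known 8-byte size compiles to one
 * 64-bit load/store pair, so the whole copy is a handful of simple,
 * single-uop instructions that never touch MS-ROM. */
void copy_le16(void *dst, const void *src, size_t n)
{
    uint64_t head, tail;
    memcpy(&head, src, 8);                          /* first 8 bytes  */
    memcpy(&tail, (const char *)src + n - 8, 8);    /* last 8 bytes   */
    memcpy(dst, &head, 8);
    memcpy((char *)dst + n - 8, &tail, 8);
}
```

The overlap trick is what removes the length-dependent branching: one fixed instruction sequence handles every size in the range, which is exactly the decode-friendly shape the snippets argue for.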