From: Ma Ling
Wire latency (RC delay) dominates modern computer performance.
Conventional serialized work causes severe cache line ping-pong,
and the process spends a lot of time and power to complete,
especially on multi-core platforms.
However, if the serialized work is
From: Ma Ling
Wire latency (RC delay) dominates modern computer performance.
Conventional serialized work causes severe cache line ping-pong,
and the process spends a lot of time and power to complete,
especially on multi-core platforms.
However, if the serialized work is sent to one core and
From: Ma Ling
Hi ALL,
Wire latency (RC delay) dominates modern computer performance.
Conventional serialized work causes severe cache line ping-pong,
and the process spends a lot of time and power to complete,
especially on multi-core platforms.
However, if the serialized work is sent to one core
From: Ma Ling
Hi ALL,
Wire latency (RC delay) dominates modern computer performance.
Conventional serialized work causes severe cache line ping-pong,
and the process spends a lot of time and power to complete,
especially on multi-core platforms.
However, if the serialized
From: Ma Ling
All load instructions can run speculatively, but they must follow
the memory ordering rules across multiple cores, as shown below:
_x = _y = 0
Processor 0                  Processor 1
mov r1, [_y]  //M1           mov [_x], 1  //M3
mov r2, [_x]  //M2           mov
From: Ma Ling
All load instructions can run speculatively, but they must follow
the memory ordering rules across multiple cores, as shown below:
_x = _y = 0
Processor 0                  Processor 1
mov r1, [_y]  //M1           mov [_x], 1  //M3
mov r2, [_x]
From: Ling Ma
In this patch we reduce branch mispredictions by
avoiding branch instructions and forcing the destination to be aligned
with general 64-bit instructions.
The comparison results below show a performance improvement of up to 1.8x
(We modified the test suite from Ondra, send after this
From: Ling Ma ling...@alibaba-inc.com
In this patch we reduce branch mispredictions by
avoiding branch instructions and forcing the destination to be aligned
with general 64-bit instructions.
The comparison results below show a performance improvement of up to 1.8x
(We modified the test suite from
From: Ma Ling
Currently we use -O2 as the compiler option for better performance.
Although it enlarges code size, in modern CPUs larger instruction
and unified caches and sophisticated instruction prefetch reduce the impact of instruction
cache misses. Meanwhile, flags such as -falign-functions, -falign-jumps,
From: Ma Ling ling...@alipay.com
Currently we use -O2 as the compiler option for better performance.
Although it enlarges code size, in modern CPUs larger instruction
and unified caches and sophisticated instruction prefetch reduce the impact of instruction
cache misses. Meanwhile, flags such as -falign-functions,
From: Ma Ling
Currently we use -O2 as the compiler option for better performance.
Although it enlarges code size, in modern CPUs larger instruction
and unified caches and sophisticated instruction prefetch reduce the impact of instruction
cache misses. Meanwhile, flags such as -falign-functions, -falign-jumps,
From: Ma Ling ling...@alipay.com
Currently we use -O2 as the compiler option for better performance.
Although it enlarges code size, in modern CPUs larger instruction
and unified caches and sophisticated instruction prefetch reduce the impact of instruction
cache misses. Meanwhile, flags such as -falign-functions,
From: Ma Ling
Currently we use -O2 as the compiler option for better performance.
Although it enlarges code size, in modern CPUs larger instruction
and unified caches and sophisticated instruction prefetch reduce the impact of instruction
cache misses. Meanwhile, flags such as -falign-functions, -falign-jumps,
From: Ma Ling ling...@alipay.com
Currently we use -O2 as the compiler option for better performance.
Although it enlarges code size, in modern CPUs larger instruction
and unified caches and sophisticated instruction prefetch reduce the impact of instruction
cache misses. Meanwhile, flags such as -falign-functions,
From: Ma Ling
To reduce memory latency when a last-level cache miss occurs,
modern CPUs, e.g. x86 and ARM, introduced Critical Word First (CWF) or
Early Restart (ER) to get data ASAP. For CWF, if the critical word is the first member
in the cache line, memory feeds the CPU with the critical word, then fills the others
From: Ma Ling ling.ma.prog...@gmail.com
To reduce memory latency when a last-level cache miss occurs,
modern CPUs, e.g. x86 and ARM, introduced Critical Word First (CWF) or
Early Restart (ER) to get data ASAP. For CWF, if the critical word is the first member
in the cache line, memory feeds the CPU with
From: Ma Ling
CISC code has higher instruction density, saving memory and
improving the i-cache hit rate. However, decoding becomes a challenge:
only one multiple-uop (2~3) instruction can be decoded per cycle,
and instructions containing more than 4 uops (rep movsq/b) have to be handled by
MS-ROM,
the
From: Ma Ling ling.ma.prog...@gmail.com
CISC code has higher instruction density, saving memory and
improving the i-cache hit rate. However, decoding becomes a challenge:
only one multiple-uop (2~3) instruction can be decoded per cycle,
and instructions containing more than 4 uops (rep movsq/b) have to be
From: Ma Ling
CISC code has higher instruction density, saving memory and
improving the i-cache hit rate. However, decoding becomes a challenge:
only one multiple-uop (2~3) instruction can be decoded per cycle,
and instructions containing more than 4 uops (rep movsq/b) have to be handled by
MS-ROM,
the
From: Ma Ling ling.ma.prog...@gmail.com
CISC code has higher instruction density, saving memory and
improving the i-cache hit rate. However, decoding becomes a challenge:
only one multiple-uop (2~3) instruction can be decoded per cycle,
and instructions containing more than 4 uops (rep movsq/b) have to be
20 matches