I should also mention that scheduling a load very early will increase
register pressure, since the loaded value has to be kept live across more
instructions. Stack spills and reloads hurt in a hot/tight code sequence.
On Tue, Jan 17, 2017 at 7:08 PM Vitaly Davidovich wrote:
I understand. My point was that you could induce tearing of a single field
via that scenario, and not just on a CPU that lacks load/store granularity
at that size, where too wide a load/store can impact neighboring fields.
On Tue, Jan 17, 2017 at 7:15 PM Dave Cheney wrote:
> Yeah, "word tearing" is an overloaded term. You can also consider splitting
> a word across cachelines as potentially causing a tear, since a store/load
> then involves two cachelines.
If a word is split across cache lines, it is by definition not
aligned, so the guarantees of an atomic write don't apply.
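As a quick sanity check of that point (my own sketch, not from the thread, assuming 64-byte cache lines as on typical x86): a naturally aligned 8-byte value can never straddle a line, because its start address is a multiple of 8 and its last byte is only 7 bytes further along.

```java
// AlignmentCheck.java -- arithmetic sanity check (assumes 64-byte cache
// lines; the class name is mine). For every 8-byte-aligned address, the
// first and last byte of an 8-byte value land on the same cache line.
public class AlignmentCheck {
    public static void main(String[] args) {
        final long LINE = 64;
        for (long addr = 0; addr < (1 << 20); addr += 8) { // all aligned addresses
            if (addr / LINE != (addr + 7) / LINE)          // line of first vs. last byte
                throw new AssertionError("value at " + addr + " straddles a line");
        }
        System.out.println("OK");  // no aligned 8-byte value crosses a line boundary
    }
}
```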
The cache miss latency can be hidden either by issuing this load ahead
of time or by other instructions that can execute while the load is
outstanding. So breaking dependency chains is good, but extending the
distance like this seems odd and may hurt common cases. If ICC does this
Hmm, I've never seen such scheduling (doesn't mean it doesn't exist, of
course) for OoO cores. Besides what Aleksey said about macro-fusion, what
happens to the flags register between the cmp and the jmp?
It's also hard to look at these few instructions in isolation. For
example, the
In most cases the heaviest instructions are loads/stores. So I believe
in this case it's better to try to hide load latency than to enable
macro-fusion. BTW, I'm not sure about SKL/SKX, but for the previous
generations macro-fusion depends on code alignment.
--Sergey
On Wed, Jan 18, 2017 at 2:44
(triggered again)
On 01/18/2017 12:33 AM, Sergey Melnikov wrote:
> mov (%rax), %rbx
> cmp %rbx, %rdx
> jxx Lxxx
>
> But if you schedule them this way
>
> mov (%rax), %rbx
> cmp %rbx, %rdx
> ... few instructions
> jxx Lxxx
...doesn't this give up on macro-fusion, and effectively sets up for a
>> Pretty sure OOO cores will do a good job themselves at scheduling,
provided you don't bottleneck in instruction fetch/decode phases or create
other pipeline hazards. If you artificially increase the distance between
dependent instructions, you may cause instructions to hang out in the
reorder buffer.
On Tue, Jan 17, 2017 at 3:39 PM, Aleksey Shipilev <
aleksey.shipi...@gmail.com> wrote:
> On 01/17/2017 12:55 PM, Vitaly Davidovich wrote:
> > Atomicity of values isn't something I'd assume happens automatically.
> Word
> > tearing isn't observable from single threaded code.
>
Hi Gil,
Your slides are really inspiring, especially for JIT code. Now it's
comparable with code produced by static C/C++ compilers. Have you compared
the performance of this code with code produced by ICC (Intel's compiler),
for example?
BTW, it may be better for performance to schedule
On 01/17/2017 12:55 PM, Vitaly Davidovich wrote:
> Atomicity of values isn't something I'd assume happens automatically. Word
> tearing isn't observable from single threaded code.
On 01/17/2017 09:17 PM, Michael Barker wrote:
> That was my understanding too. Normal load/stores on 32 bit JVMs
>
> Atomicity of values isn't something I'd assume happens automatically.
> Word tearing isn't observable from single threaded code.
>
That was my understanding too. Normal loads/stores on 32-bit JVMs would
tear 64-bit values. Although, I think object references are guaranteed to
be written atomically.
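The 32-bit tearing scenario above can be sketched with a small racy checker (my own sketch; the class and field names are hypothetical). Per JLS §17.7, a `volatile` long is read and written atomically even on 32-bit JVMs, so this program never observes a mixed bit pattern; drop the `volatile` and a 32-bit JVM is permitted to tear the two halves.

```java
// TearCheck.java -- minimal tearing checker (names are mine, not from
// the thread). A writer alternates between the all-zeros and all-ones
// bit patterns; a torn read would show a mix of the two.
public class TearCheck {
    static volatile long value;     // JLS 17.7: volatile long writes/reads are atomic
    static volatile boolean stop;

    public static void main(String[] args) throws Exception {
        Thread writer = new Thread(() -> {
            while (!stop) {         // flip between 0x0000...0 and 0xFFFF...F
                value = 0L;
                value = -1L;
            }
        });
        writer.start();
        long torn = 0;
        for (int i = 0; i < 1_000_000; i++) {
            long v = value;         // a tear would yield a value that is neither
            if (v != 0L && v != -1L) torn++;
        }
        stop = true;
        writer.join();
        System.out.println(torn == 0 ? "OK" : "torn reads: " + torn);
    }
}
```

With `volatile` in place this prints `OK` on any conforming JVM; the interesting experiment is removing it and running on a 32-bit VM.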
Atomicity of values isn't something I'd assume happens automatically. Word
tearing isn't observable from single threaded code.
I think the only thing you can safely and portably assume is the high-level
"single-threaded observable behavior will occur" statement. It's also
interesting to note
"what about all the encoders/decoders or any program that rely on data access
patterns to pretend to be and remain "fast"?"
There's no problem in reordering while maintaining observable effects, right?
You should assume a compiler interprets "observable effects" to mean "order
imposed by memory