"- You can assume atomicity of values (no word tearing)"
This seems to have tickled people, I apologise for my imprecise wording.
Better wording would be:
"You can assume atomicity of read/writes, and no word tearing, to the extent
these are promised to you by the spec"
- except plain (non-volatile) long/double reads/writes, which the spec allows to tear (JLS 17.7)
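To make that concrete, here is a minimal sketch (class and field names are mine)
of the one tear the spec does permit: per JLS 17.7 a plain long write may be
performed as two separate 32-bit writes on some (typically 32-bit) JVMs, while a
volatile long write must be atomic.

    class LongTearing {
        static long plain;           // JLS 17.7: plain long writes may tear
        static volatile long safe;   // volatile long writes must be atomic

        public static void main(String[] args) {
            Thread writer = new Thread(() -> {
                for (long i = 0; i < 100_000_000L; i++) {
                    long v = ((i & 1) == 0) ? 0L : -1L;  // all-zero vs all-one bits
                    plain = v;
                    safe = v;
                }
            });
            writer.start();
            while (writer.isAlive()) {
                long p = plain;
                if (p != 0L && p != -1L)  // a mix of halves means the read tore
                    System.out.println("torn plain read: " + Long.toHexString(p));
                long s = safe;            // guaranteed to be 0L or -1L
                assert s == 0L || s == -1L;
            }
        }
    }

On a 64-bit HotSpot you will almost certainly never see the message; the point
is only that nothing in the spec forbids it for the plain field.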
I should also mention that scheduling loads very early will increase
register pressure, since the loaded value has to be kept live across more
instructions. Stack spills and reloads suck in a hot/tight code sequence.
On Tue, Jan 17, 2017 at 7:08 PM Vitaly Davidovich wrote:
I understand. My point was that you could induce tearing of a single field via
that scenario, and not just via a CPU whose load/store granularity is too
coarse, where a too-wide load/store can impact neighboring fields.
On Tue, Jan 17, 2017 at 7:15 PM Dave Cheney wrote:
> Yeah, "word tearing" is an overloaded term. You can also consider splitting
> a word across cachelines as potentially causing a tear, since a store/load
> then involves two cachelines.
If a word was split across cache lines, it is by definition not
aligned, so the guarantees of an atomic write don't apply.
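For illustration, a sketch of how one could construct such a misaligned slot in
pure Java with a direct ByteBuffer (the offset and the 64-byte line size are my
assumptions, and the buffer's base alignment is up to the JVM, so the straddle
isn't guaranteed). Unaligned access is legal through this API, but concurrent
unaligned access carries no atomicity guarantee, which is exactly the
cross-cacheline tear described above:

    import java.nio.ByteBuffer;

    class MisalignedLong {
        public static void main(String[] args) {
            ByteBuffer buf = ByteBuffer.allocateDirect(128);
            // With 64-byte cache lines, an 8-byte value at offset 60 straddles
            // the boundary at byte 64 (if the buffer itself starts on a line
            // boundary, which is not promised).
            final int offset = 60;

            new Thread(() -> {
                for (long i = 0; i < 100_000_000L; i++)
                    buf.putLong(offset, ((i & 1) == 0) ? 0L : -1L);
            }).start();

            for (int i = 0; i < 100_000_000; i++) {
                long v = buf.getLong(offset);
                if (v != 0L && v != -1L)  // neither value was written whole
                    System.out.println("possible tear: " + Long.toHexString(v));
            }
        }
    }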
The cache miss latency can be hidden either by this load being done ahead
of time, or if there are other instructions that can execute while this load
is outstanding. So breaking dependency chains is good, but extending the
distance like this seems weird and may hurt common cases. If ICC does this,
I'd be curious why.
Hmm, I've never seen such scheduling (doesn't mean it doesn't exist of
course) for OoO cores. Besides what Aleksey said about macro fusion, what
happens to the flags register in between the cmp and the jmp?
It's also hard to look at these few instructions in isolation. For
example, the
In most cases the heaviest instructions are loads/stores. So, I believe,
in this case it's better to try to hide the load latency than to enable
macro-fusion. BTW, I'm not sure about SKL/SKX, but for the previous
generations macro-fusion depends on code alignment.
--Sergey
On Wed, Jan 18, 2017 at 2:44
(triggered again)
On 01/18/2017 12:33 AM, Sergey Melnikov wrote:
> mov (%rax), %rbx
> cmp %rbx, %rdx
> jxx Lxxx
>
> But if you schedule them this way
>
> mov (%rax), %rbx
> cmp %rbx, %rdx
> ... few instructions
> jxx Lxxx
...doesn't this give up on macro-fusion, and effectively sets up for a flags
dependency that has to survive the intervening instructions?
>> Pretty sure OOO cores will do a good job themselves for scheduling,
provided you don't bottleneck in the instruction fetch/decode phases or create
other pipeline hazards. If you artificially increase the distance between
dependent instructions, you may cause instructions to hang out in the reorder
buffer longer.
On Tue, Jan 17, 2017 at 3:39 PM, Aleksey Shipilev <
aleksey.shipi...@gmail.com> wrote:
> On 01/17/2017 12:55 PM, Vitaly Davidovich wrote:
> > Atomicity of values isn't something I'd assume happens automatically. Word
> > tearing isn't observable from single threaded code.
>
Hi Gil,
Your slides are really inspiring, especially regarding JIT code. It's now
comparable with code produced by static C/C++ compilers. Have you compared
the performance of this code with code produced by ICC (Intel's compiler),
for example?
BTW, it may be better for performance to schedule these instructions
differently; see the mov/cmp/jxx example quoted earlier in the thread.
On 01/17/2017 12:55 PM, Vitaly Davidovich wrote:
> Atomicity of values isn't something I'd assume happens automatically. Word
> tearing isn't observable from single threaded code.
On 01/17/2017 09:17 PM, Michael Barker wrote:
> That was my understanding too. Normal load/stores on 32 bit JVMs would
> tear 64 bit values.
>
> Atomicity of values isn't something I'd assume happens automatically.
> Word tearing isn't observable from single threaded code.
>
That was my understanding too. Normal load/stores on 32 bit JVMs would
tear 64 bit values. Although, I think object references are guaranteed to
be written atomically.
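For completeness, a sketch of the portable alternatives when a 64-bit value
really is shared across threads (names are mine): volatile long/double and the
atomic classes are guaranteed tear-free on any conforming JVM, including
32-bit ones.

    import java.util.concurrent.atomic.AtomicLong;

    class TearFree {
        static volatile long stamp;                        // volatile forbids tearing (JLS 17.7)
        static final AtomicLong count = new AtomicLong();  // atomic, even for RMW updates

        static void update(long now) {
            stamp = now;             // one atomic 64-bit write, even on a 32-bit JVM
            count.incrementAndGet(); // atomic read-modify-write
        }
    }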
Atomicity of values isn't something I'd assume happens automatically. Word
tearing isn't observable from single threaded code.
I think the only thing you can safely and portably assume is the high level
"single threaded observable behavior will occur" statement. It's also
interesting to note
"what about all the encoders/decoders or any program that rely on data access
patterns to pretend to be and remain "fast"?"
There's no problem in reordering while maintaining observable effects, right?
You should assume a compiler interprets "observable effects" to mean "order
imposed by the memory model".
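A sketch of what that means in Java terms (names are mine): the two plain
stores below are independent within the writer thread, so the compiler may
commit them in either order; single-threaded code can't tell the difference,
but a concurrent reader can.

    class ReorderingVisible {
        static int data;      // plain fields: no ordering promised between them
        static boolean ready;

        static void writer() {
            data = 42;
            ready = true;     // may become visible before the write to data
        }

        static void reader() {
            if (ready)
                System.out.println(data);  // may legally print 0
        }
    }

Making ready volatile establishes a happens-before edge from the data store to
any read that observes ready == true, which rules the reordering out.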
Thanks,
"Assume nothing" is a pretty scientific approach,I like it :)
But this (absence of) assumption lead me to think about another couple of
things: what about all the encoders/decoders or any program that rely on data
access patterns to pretend to be and remain "fast"?
Writing mechanichal
The compiler's reordering generally DOES NOT depend on the hardware.
Optimizations that result in reordering generally occur well before
instruction selection, and will happen in the same way for different
hardware architectures. E.g. on x86, PowerPC, and ARM, HotSpot, gcc, and
clang will all perform essentially the same reorderings.
Depends on which hardware. For instance, x86/64 is very specific about
what memory operations can be reordered (for cacheable operations), and two
stores aren't reordered. The only reordering is stores followed by loads,
where the load can appear to reorder with the preceding store.
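The classic litmus test for that one permitted reordering (field names are
mine): with plain fields, each thread's store can still be sitting in its store
buffer when the other thread's load executes, so r1 == r2 == 0 is observable
even on x86; declaring x and y volatile forbids it.

    class StoreLoadLitmus {
        static int x, y, r1, r2;

        public static void main(String[] args) throws InterruptedException {
            for (int i = 0; i < 1_000_000; i++) {
                x = 0; y = 0;
                Thread t1 = new Thread(() -> { x = 1; r1 = y; });
                Thread t2 = new Thread(() -> { y = 1; r2 = x; });
                t1.start(); t2.start();
                t1.join();  t2.join();
                if (r1 == 0 && r2 == 0)  // both loads beat the other store's drain
                    System.out.println("store->load reordering at iteration " + i);
            }
        }
    }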
Doesn't hardware already reorder memory writes along 64 byte boundaries?
They're called cache lines.
Dave
On Tue, 17 Jan 2017, 05:35 Tavian Barnes wrote:
> On Monday, 16 January 2017 12:38:01 UTC-5, Francesco Nigro wrote:
>
> I'm missing something for sure, because if it