Re: Operation Reordering

2017-01-18 Thread 'Nitsan Wakart' via mechanical-sympathy
"- You can assume atomicity of values (no word tearing)" This seems to have tickled people, I apologise for my imprecise wording. Better wording would be: "You can assume atomicity of read/writes, and no word tearing, to the extent these are promised to you by the spec" - long/double plain

Re: Operation Reordering

2017-01-17 Thread Vitaly Davidovich
And should also mention that doing very early load scheduling will increase register pressure as that value will need to be kept live across more instructions. Stack spills and reloads suck in a hot/tight code sequence. On Tue, Jan 17, 2017 at 7:08 PM Vitaly Davidovich wrote:

Re: Operation Reordering

2017-01-17 Thread Vitaly Davidovich
I understand. My point was you could induce tearing of a single field via that scenario and not just via a CPU that doesn't have low level load/store granularity that can impact neighbors due to too wide of a load/store. On Tue, Jan 17, 2017 at 7:15 PM Dave Cheney wrote: > >

Re: Operation Reordering

2017-01-17 Thread Dave Cheney
> Yeah, "word tearing" is an overloaded term. You can also consider splitting > a word across cachelines as potentially causing a tear as a store/load > involves two cachelines If a word was split across cache lines, it is by definition not aligned, so the guarantees of an atomic write don't

Re: Operation Reordering

2017-01-17 Thread Vitaly Davidovich
The cache miss latency can be hidden either by this load being done ahead of time or if there're other instructions that can execute while this load is outstanding. So breaking dependency chains is good, but extending the distance like this seems weird and may hurt common cases. If ICC does this

Re: Operation Reordering

2017-01-17 Thread Vitaly Davidovich
Hmm, I've never seen such scheduling (doesn't mean it doesn't exist of course) for OoO cores. Besides what Aleksey said about macro fusion, what happens to the flags register in between the cmp and the jmp? It's also hard to look at these few instructions in isolation. For example, the

Re: Operation Reordering

2017-01-17 Thread Sergey Melnikov
In most cases the heaviest instructions are loads/stores. So, I believe, in this case it's better to try to hide load latency than to enable macro-fusion. BTW, I'm not sure about SKL/SKX, but for the previous generations macro-fusion depends on code alignment. --Sergey On Wed, Jan 18, 2017 at 2:44

Re: Operation Reordering

2017-01-17 Thread Aleksey Shipilev
(triggered again) On 01/18/2017 12:33 AM, Sergey Melnikov wrote:
> mov (%rax), %rbx
> cmp %rbx, %rdx
> jxx Lxxx
>
> But if you schedule them this way
>
> mov (%rax), %rbx
> cmp %rbx, %rdx
> ... few instructions
> jxx Lxxx

...doesn't this give up on macro-fusion, and effectively sets up for a

Re: Operation Reordering

2017-01-17 Thread Sergey Melnikov
>> Pretty sure OOO cores will do a good job themselves for scheduling provided you don't bottleneck in instruction fetch/decode phases or create other pipeline hazards. If you artificially increase distance between dependent instructions, you may cause instructions to hang out in the reorder

Re: Operation Reordering

2017-01-17 Thread Vitaly Davidovich
On Tue, Jan 17, 2017 at 3:39 PM, Aleksey Shipilev <aleksey.shipi...@gmail.com> wrote:
> On 01/17/2017 12:55 PM, Vitaly Davidovich wrote:
> > Atomicity of values isn't something I'd assume happens automatically. Word
> > tearing isn't observable from single threaded code.
>
> On 01/17/2017

Re: Operation Reordering

2017-01-17 Thread Sergey Melnikov
Hi Gil, your slides are really inspiring, especially for JIT code. Now it's comparable with code produced by static C/C++ compilers. Have you compared the performance of this code with code produced by ICC (Intel's compiler), for example? BTW, it may be better for performance to schedule

Re: Operation Reordering

2017-01-17 Thread Aleksey Shipilev
On 01/17/2017 12:55 PM, Vitaly Davidovich wrote:
> Atomicity of values isn't something I'd assume happens automatically. Word
> tearing isn't observable from single threaded code.

On 01/17/2017 09:17 PM, Michael Barker wrote:
> That was my understanding too. Normal load/stores on 32 bit JVMs

Re: Operation Reordering

2017-01-17 Thread Michael Barker
> Atomicity of values isn't something I'd assume happens automatically.
> Word tearing isn't observable from single threaded code.

That was my understanding too. Normal load/stores on 32 bit JVMs would tear 64 bit values. Although, I think object references are guaranteed to be written

Re: Operation Reordering

2017-01-17 Thread Vitaly Davidovich
Atomicity of values isn't something I'd assume happens automatically. Word tearing isn't observable from single threaded code. I think the only thing you can safely and portably assume is the high level "single threaded observable behavior will occur" statement. It's also interesting to note

Re: Operation Reordering

2017-01-17 Thread 'Nitsan Wakart' via mechanical-sympathy
"what about all the encoders/decoders or any program that rely on data access patterns to pretend to be and remain "fast"?" There's no problem in reordering while maintaining observable effects, right? You should assume a compiler interprets "observable effects" to mean "order imposed by memory

Re: Operation Reordering

2017-01-16 Thread Francesco Nigro
Thanks, "Assume nothing" is a pretty scientific approach,I like it :) But this (absence of) assumption lead me to think about another couple of things: what about all the encoders/decoders or any program that rely on data access patterns to pretend to be and remain "fast"? Writing mechanichal

Re: Operation Reordering

2017-01-16 Thread Gil Tene
The compiler's reordering generally DOES NOT depend on the hardware. Optimizations that result in reordering generally occur well before instruction selection, and will happen in the same way for different hardware architectures. E.g. on X86, PowerPC, and ARM, HotSpot, gcc, and clang will all

Re: Operation Reordering

2017-01-16 Thread Vitaly Davidovich
Depends on which hardware. For instance, x86/64 is very specific about what memory operations can be reordered (for cacheable operations), and two stores aren't reordered. The only reordering is stores followed by loads, where the load can appear to reorder with the preceding store. On Mon, Jan
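The one reordering Vitaly describes for x86 can be probed with the classic store->load (Dekker-style) litmus test. This is only a toy sketch, not a proper harness like jcstress; with plain thread startup dominating each iteration, the reordered outcome may show up rarely or not at all on a given run:

```java
// Each thread stores to one plain field, then loads the other. Even on
// x86, which preserves store-store order, a load may appear to pass the
// other thread's store, so r1 == 0 && r2 == 0 is a legal outcome.
public class StoreLoadLitmus {
    static int x, y, r1, r2;

    public static void main(String[] args) throws InterruptedException {
        int bothZero = 0;
        for (int i = 0; i < 2_000; i++) {
            x = 0; y = 0;
            Thread a = new Thread(() -> { x = 1; r1 = y; });
            Thread b = new Thread(() -> { y = 1; r2 = x; });
            a.start(); b.start();
            a.join(); b.join(); // join gives a happens-before for r1/r2
            if (r1 == 0 && r2 == 0) bothZero++;
        }
        // The count is timing-dependent; zero observations are also legal.
        System.out.println("r1==0 && r2==0 observed in " + bothZero + " of 2000 runs");
        System.out.println("done");
    }
}
```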

Re: Operation Reordering

2017-01-16 Thread Dave Cheney
Doesn't hardware already reorder memory writes along 64 byte boundaries? They're called cache lines.

Dave

On Tue, 17 Jan 2017, 05:35 Tavian Barnes wrote:
> On Monday, 16 January 2017 12:38:01 UTC-5, Francesco Nigro wrote:
> > I'm missing something for sure, because if it