Re: [i386] Scalar DImode instructions on XMM registers
On 05/25/2015 09:27 AM, Ilya Enkovich wrote:
> 2015-05-22 15:01 GMT+03:00 Ilya Enkovich :
>> 2015-05-22 11:53 GMT+03:00 Ilya Enkovich :
>>> 2015-05-21 22:08 GMT+03:00 Vladimir Makarov :
>>>> So, Ilya, to solve the problem you need to avoid sharing subregs
>>>> for the correct LRA/reload work.
>>>
>>> Thanks a lot for your help!  I'll fix it.
>>>
>>> Ilya
>>
>> I've fixed SUBREG sharing and got a missing spill.  I added
>> --enable-checking=rtl to check other possible bugs.  Spill/fill code
>> still seems incorrect because different sizes are used.  Shouldn't
>> block me though.
>>
>> .L5:
>> 	movl	16(%esp), %eax
>> 	addl	$8, %esi
>> 	movl	20(%esp), %edx
>> 	movl	%eax, (%esp)
>> 	movl	%edx, 4(%esp)
>> 	call	counter@PLT
>> 	movq	-8(%esi), %xmm0
>> 	**movdqa	16(%esp), %xmm2**
>> 	pand	%xmm0, %xmm2
>> 	movdqa	%xmm2, %xmm0
>> 	movd	%xmm2, %edx
>> 	**movq	%xmm2, 16(%esp)**
>> 	psrlq	$32, %xmm0
>> 	movd	%xmm0, %eax
>> 	orl	%edx, %eax
>> 	jne	.L5
>>
>> Thanks,
>> Ilya
>
> I was wrong assuming reloads with wrong size shouldn't block me.  These
> reloads require memory to be aligned, which is not always true.  Here is
> what I have in RTL now:
>
> (insn 2 7 3 2 (set (reg/v:DI 93 [ l ])
>         (mem/c:DI (reg/f:SI 16 argp) [1 l+0 S8 A32])) test.c:5 89 {*movdi_internal}
>      (nil))
> ...
> (insn 27 26 52 6 (set (subreg:V2DI (reg:DI 87 [ D.1822 ]) 0)
>         (ior:V2DI (subreg:V2DI (reg:DI 99 [ D.1822 ]) 0)
>             (subreg:V2DI (reg/v:DI 93 [ l ]) 0))) test.c:11 3489 {*iorv2di3}
>      (expr_list:REG_DEAD (reg:DI 99 [ D.1822 ])
>         (expr_list:REG_DEAD (reg/v:DI 93 [ l ])
>             (nil))))
>
> After reload I get:
>
> (insn 2 7 75 2 (set (reg/v:DI 0 ax [orig:93 l ] [93])
>         (mem/c:DI (plus:SI (reg/f:SI 7 sp)
>                 (const_int 24 [0x18])) [1 l+0 S8 A32])) test.c:5 89 {*movdi_internal}
>      (nil))
> (insn 75 2 3 2 (set (mem/c:DI (reg/f:SI 7 sp) [3 %sfp+-16 S8 A64])
>         (reg/v:DI 0 ax [orig:93 l ] [93])) test.c:5 89 {*movdi_internal}
>      (nil))
> ...
> (insn 27 26 52 6 (set (reg:V2DI 21 xmm0 [orig:87 D.1822 ] [87])
>         (ior:V2DI (reg:V2DI 21 xmm0 [orig:99 D.1822 ] [99])
>             (mem/c:V2DI (reg/f:SI 7 sp) [3 %sfp+-16 S16 A64]))) test.c:11 3489 {*iorv2di3}
>
> The 'por' instruction requires memory to be aligned and fails in a
> bigger testcase.
> There is also movdqa generated for esp by reload.  May it mean I still
> have some inconsistencies in the produced RTL?  Probably I should
> somehow transform loads and stores?

I'd start by looking at the AP->SP elimination step: what the defined stack alignment is, and whether or not a dynamic stack realignment is needed.  If you don't have all that set up properly prior to the allocators, then they're not going to know what objects to align nor how to align them.

jeff
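For reference, a minimal self-contained testcase in the spirit of the dump above. This is a reconstruction of mine, not the actual test.c: the names `l` and `counter` come from the RTL/assembly, but the exact loop body is a guess. It is a 64-bit AND reduction that a 32-bit x86 target with SSE2 may carry out on XMM registers (the `pand`/`psrlq`/`movd` sequence in the dump), which is exactly where the spill-slot alignment question arises:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the counter() call visible in the assembly.
static int calls;
static void counter(uint64_t v) { (void)v; ++calls; }

// 64-bit AND reduction: on -m32 with SSE2, the DImode '&' can be done
// in an XMM register via a V2DImode subreg and pand, instead of as two
// 32-bit scalar ops; the orl/jne pair in the dump tests the two halves.
uint64_t and_until_zero(uint64_t l, const uint64_t *p)
{
    do {
        counter(l);
        l &= *p++;      // candidate for subreg:V2DI + pand
    } while (l != 0);
    return l;
}
```

With input `{0xff, 0x0f, 0}` and initial value `0xffff`, the loop runs three times and returns 0.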
Re: [c++std-parallel-1651] Re: Compilers and RCU readers: Once more unto the breach!
On Tue, May 26, 2015 at 07:08:35PM +0200, Torvald Riegel wrote:
> On Tue, 2015-05-19 at 17:55 -0700, Paul E. McKenney wrote:
> > http://www.rdrop.com/users/paulmck/RCU/consume.2015.05.18a.pdf
>
> I have been discussing Section 7.9 with Paul during last week.
>
> While I think that 7.9 helps narrow down the problem somewhat, I'm still
> concerned that it effectively requires compilers to either track
> dependencies or conservatively prevent optimizations like value
> speculation and specialization based on that.  Neither is a good thing
> for a compiler.

I do believe that we can find some middle ground.

> 7.9 adds requirements that dependency chains stop if the program itself
> informs the compiler about the value of something in the dependency
> chain (e.g., as shown in Figure 33).  Also, if a correct program that
> does not have undefined behavior must use a particular value, this is
> also considered as "informing" the compiler about that value.  For
> example:
>
>   int arr[2];
>   int* x = foo.load(mo_consume);
>   if (x > arr)     // implies same object/array, so x is in arr[]
>     int r1 = *x;   // compiler knows x == arr + 1
>
> The program, assuming no undefined behavior, first tells the compiler
> that x should be within arr, and then the comparison tells the compiler
> that x != arr, so x == arr + 1 must hold because there are just two
> elements in the array.

The updated version of Section 7.9 says that if undefined behavior allows the compiler to deduce the exact pointer value, as in the case you show above, the dependency chain is broken.

> Having these requirements is generally good, but we don't yet know how
> to specify this properly.  For example, I suppose we'd need to also say
> that the compiler cannot assume to know anything about a value returned
> from an mo_consume load; otherwise, nothing prevents a compiler from
> using knowledge about the stores that the mo_consume load can read from
> (see Section 7.2).
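To make the quoted `arr[2]` fragment concrete, here is a compilable rendering of it (the surrounding declarations and initial values are mine; `foo`, `arr`, `x`, and `r1` are the names from the example). The point of contention is that once the `x > arr` comparison succeeds, the compiler can deduce `x == arr + 1` and replace the dependent load with a direct access to `arr[1]`, which breaks the dependency chain that a consume load is supposed to provide:

```cpp
#include <atomic>
#include <cassert>

int arr[2] = {10, 20};
std::atomic<int*> foo{&arr[1]};

int reader()
{
    // The mo_consume load; an rcu_dereference()-style access relies on
    // the address dependency from x to the load below.
    int *x = foo.load(std::memory_order_consume);
    int r1 = 0;
    if (x > arr)      // per 7.9: this comparison "informs" the compiler,
        r1 = *x;      // which may now deduce x == arr + 1 and break the chain
    return r1;
}
```

Note that mainstream compilers today implement `memory_order_consume` by promoting it to `memory_order_acquire`, so the dependency-tracking question is about what a faithful implementation would have to do.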
I expect that the Linux kernel's rcu_dereference() primitives would map to volatile memory_order_consume loads for this reason.

> Also, a compiler is still required to not do value-speculation or
> optimizations based on that.  For example, this program:
>
>   void op(type *p)
>   {
>     foo /= p->a;
>     bar = p->b;
>   }
>
>   void bar()
>   {
>     pointer = ppp.load(mo_consume);
>     op(pointer);
>   }
>
> ... must not be transformed into this program, even if the compiler
> knows that global_var->a == 1:
>
>   void op(type *p) { /* unchanged */ }
>
>   void bar()
>   {
>     pointer = ppp.load(mo_consume);
>     if (pointer != global_var) {
>       op(pointer);
>     } else { // specialization for global_var
>       // compiler knows global_var->a == 1;
>       // compiler uses global_var directly, inlines, optimizes:
>       bar = global_var->b;
>     }
>   }
>
> The compiler could optimize out the division if pointer == global_var,
> but it must not access field b directly through global_var.  This would
> be pretty awkward; the compiler may work based on an optimized
> expression in the specialization (ie, create code that assumes
> global_var instead of pointer) but it would still have to carry around
> and use the non-optimized expression.

Exactly how valuable is this sort of optimization in real code?  And how likely is the compiler to actually be able to apply it?  (I nevertheless will take another pass through the Linux kernel looking for global variables being added to RCU-protected linked structures.  Putting a global into an RCU-protected structure seems more likely than is an RCU-protected pointer into a two-element array.)

> This wouldn't be as bad if it were easily constrained to code sequences
> that really need the dependencies.  However, 7.9 does not effectively
> contain dependencies to only the code that really needs them, IMO.
> Unless the compiler proves otherwise, it has to assume that a load from
> a pointer carries a dependency.  Proving that is often very hard because
> it requires points-to analysis; 7.9 restricts this to intra-thread
> analysis but that is still nontrivial.
>
> Michael Matz had a similar concern (in terms of what it results in).

Again, I will be looking through the Linux kernel for vulnerabilities to this sort of transformation.  However, I am having a very hard time seeing how the compiler is going to know that much about the vast majority of the Linux-kernel use cases.  The linked structures are allocated on the heap, not in arrays or in globals.

> Given that mo_consume is useful but a very specialized feature, I
> wouldn't be optimistic that 7.9 would actually be supported by many
> compilers.  The trade-off between having to track dependencies or having
> to disallow optimizations is a bad one to make.  The simple way out for
> a compiler would be to just emit mo_acquire instead of mo_consume and be
> done with all -- and this might be the most practical decision overall,
> or the default general-purpose implementation.  At least I haven't heard
> any compiler implementer say that they think it's obviously worth
> implementing.
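The quoted op()/bar() example, fleshed out into compilable form. The type definition, initial values, and variable spellings (`bar_val`/`foo_val` instead of the example's clashing `bar` names, `g` standing in for `global_var`) are adaptations of mine. The forbidden rewrite is shown in the comment: specializing for `pointer == &g` would read `g.b` directly, without going through the consume-loaded pointer that carries the dependency:

```cpp
#include <atomic>
#include <cassert>

struct type { int a; int b; };

type g{1, 42};                  // stands in for global_var, with a == 1
std::atomic<type*> ppp{&g};
int foo_val = 100, bar_val;

void op(type *p)
{
    foo_val /= p->a;            // the division may be specialized away
    bar_val = p->b;             // ...but this must stay a load through p,
}                               // which carries the consume dependency

void reader()
{
    type *pointer = ppp.load(std::memory_order_consume);
    // A value-speculating compiler must NOT rewrite this call as
    //   if (pointer == &g) { bar_val = g.b; }  else op(pointer);
    // because g.b would then be read without the dependency ordering.
    op(pointer);
}
```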
gcc-5-20150526 is now available
Snapshot gcc-5-20150526 is now available on
  ftp://gcc.gnu.org/pub/gcc/snapshots/5-20150526/
and on various mirrors, see http://gcc.gnu.org/mirrors.html for details.

This snapshot has been generated from the GCC 5 SVN branch
with the following options: svn://gcc.gnu.org/svn/gcc/branches/gcc-5-branch revision 223720

You'll find:

 gcc-5-20150526.tar.bz2               Complete GCC

  MD5=3dfffc8efcbfc069d41239cb7578b054
  SHA1=bae3ce5b6c8f61bd1b273ae6908032b08aa416d7

Diffs from 5-20150519 are available in the diffs/ subdirectory.

When a particular snapshot is ready for public consumption the LATEST-5 link is updated and a message is sent to the gcc list.  Please do not use a snapshot before it has been announced that way.
Re: [c++std-parallel-1641] Re: Compilers and RCU readers: Once more unto the breach!
On Thu, 2015-05-21 at 13:42 -0700, Linus Torvalds wrote:
> On Thu, May 21, 2015 at 1:02 PM, Paul E. McKenney wrote:
> >
> > The compiler can (and does) speculate non-atomic non-volatile writes
> > in some cases, but I do not believe that it is permitted to speculate
> > either volatile or atomic writes.
>
> I do *not* believe that a compiler is ever allowed to speculate *any*
> writes - volatile or not - unless the compiler can prove that the end
> result is either single-threaded, or the write in question is
> guaranteed to only be visible in that thread (ie local stack variable
> etc).

It must not speculate volatile accesses.  It could speculate non-volatiles, even if those are atomic and observable by other threads, but that would require further work/checks on all potential observers of those (ie, to still satisfy as-if).  Thus, compilers are unlikely to do such speculation, I'd say.  The as-if rule (ie, equality of observable behavior (ie, volatiles, ...) to that of the abstract machine) makes all this clear.

> Also, I do think that the whole "consume" read should be explained
> better to compiler writers.  Right now the language (including very
> much in the "restricted dependency" model) is described in very
> abstract terms.  Yet those abstract terms are actually very subtle and
> complex, and very opaque to a compiler writer.

I believe the issues around the existing specification of mo_consume were pointed out by compiler folks.  It's a complex problem, and I'm all for more explanations, but I did get the impression that the compiler writers in ISO C++ Study Group 1 do have a good understanding of the problem.

> I personally think the whole "abstract machine" model of the C
> language is a mistake.  It would be much better to talk about things in
> terms of actual code generation and actual issues.  Make all the
> problems much more concrete, with actual examples of how memory
> ordering matters on different architectures.
As someone working for a toolchain team, I don't see how the abstract-machine-based specification is a problem at all, nor have I seen compiler writers struggling with it.  It does give precise rules for code generation.  The level of abstraction is a good thing for most programs because for those, the primary concern is that the observable behavior and end result are computed -- it's secondary, and a QoI matter, how that happens.  In contrast, if you specify at the level of code generation, you'd have to foresee how code generation might look in the future, including predicting future optimizations and all that.  That doesn't look future-proof to me.  I do realize that this may be less than ideal for cases when one would want to use a C compiler more like a convenient assembler.  But that case isn't the 99%, I believe.
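A minimal illustration of the write-speculation point being argued above (my own example, not one from the thread): hoisting a conditional store is legal under as-if only if the compiler can prove no other thread can observe the variable, because the speculative version makes a transient value visible:

```cpp
#include <cassert>

int x;               // plain (non-atomic, non-volatile) shared variable

void maybe_set(bool c)
{
    if (c)
        x = 1;
    // A compiler must NOT "speculate" the store as
    //     x = 1;
    //     if (!c) x = 0;
    // unless it proves x is unobservable by other threads: otherwise
    // another thread could see the transient x == 1 even when c is
    // false, changing observable behavior and violating as-if.
}
```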
Re: [c++std-parallel-1611] Compilers and RCU readers: Once more unto the breach!
On Tue, 2015-05-19 at 17:55 -0700, Paul E. McKenney wrote:
> http://www.rdrop.com/users/paulmck/RCU/consume.2015.05.18a.pdf

I have been discussing Section 7.9 with Paul during last week.

While I think that 7.9 helps narrow down the problem somewhat, I'm still concerned that it effectively requires compilers to either track dependencies or conservatively prevent optimizations like value speculation and specialization based on that.  Neither is a good thing for a compiler.

7.9 adds requirements that dependency chains stop if the program itself informs the compiler about the value of something in the dependency chain (e.g., as shown in Figure 33).  Also, if a correct program that does not have undefined behavior must use a particular value, this is also considered as "informing" the compiler about that value.  For example:

  int arr[2];
  int* x = foo.load(mo_consume);
  if (x > arr)     // implies same object/array, so x is in arr[]
    int r1 = *x;   // compiler knows x == arr + 1

The program, assuming no undefined behavior, first tells the compiler that x should be within arr, and then the comparison tells the compiler that x != arr, so x == arr + 1 must hold because there are just two elements in the array.

Having these requirements is generally good, but we don't yet know how to specify this properly.  For example, I suppose we'd need to also say that the compiler cannot assume to know anything about a value returned from an mo_consume load; otherwise, nothing prevents a compiler from using knowledge about the stores that the mo_consume load can read from (see Section 7.2).

Also, a compiler is still required to not do value-speculation or optimizations based on that.  For example, this program:

  void op(type *p)
  {
    foo /= p->a;
    bar = p->b;
  }

  void bar()
  {
    pointer = ppp.load(mo_consume);
    op(pointer);
  }

... must not be transformed into this program, even if the compiler knows that global_var->a == 1:

  void op(type *p) { /* unchanged */ }

  void bar()
  {
    pointer = ppp.load(mo_consume);
    if (pointer != global_var) {
      op(pointer);
    } else { // specialization for global_var
      // compiler knows global_var->a == 1;
      // compiler uses global_var directly, inlines, optimizes:
      bar = global_var->b;
    }
  }

The compiler could optimize out the division if pointer == global_var, but it must not access field b directly through global_var.  This would be pretty awkward; the compiler may work based on an optimized expression in the specialization (ie, create code that assumes global_var instead of pointer) but it would still have to carry around and use the non-optimized expression.

This wouldn't be as bad if it were easily constrained to code sequences that really need the dependencies.  However, 7.9 does not effectively contain dependencies to only the code that really needs them, IMO.  Unless the compiler proves otherwise, it has to assume that a load from a pointer carries a dependency.  Proving that is often very hard because it requires points-to analysis; 7.9 restricts this to intra-thread analysis but that is still nontrivial.

Michael Matz had a similar concern (in terms of what it results in).

Given that mo_consume is useful but a very specialized feature, I wouldn't be optimistic that 7.9 would actually be supported by many compilers.  The trade-off between having to track dependencies or having to disallow optimizations is a bad one to make.  The simple way out for a compiler would be to just emit mo_acquire instead of mo_consume and be done with all -- and this might be the most practical decision overall, or the default general-purpose implementation.  At least I haven't heard any compiler implementer say that they think it's obviously worth implementing.  I also don't think 7.9 is ready for ISO standardization yet (or any of the other alternatives mentioned in the paper).
Standardizing a feature that we're not sure will actually be implemented is not a good thing to do; it's too costly for all involved parties (compiler writers *and* users).  IMO, the approach outlined in Section 7.7 is still the most promising contender in the long run.  It avoids the perhaps more pervasive changes that a type-system-based approach like the one in Section 7.2 might result in, yet still informs the compiler where dependencies are actually used and which chain of expressions would be involved in that.  Tracking is probably simplified, as dependencies are never open-ended and potentially leaking into various other regions of code.  It seems easier to specify in a standard because we just need the programmer to annotate the intent and the rest is compiler QoI.  It would require users to annotate their use of dependencies, but they don't need to follow further rules; performance tuning of the code so it actually makes use of dependencies is mostly a compiler QoI thing, and if the compiler can't maintain a dependency, it can issue warnings and thus make the tuning interactive for the user.  Of
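As a data point for the annotation-based direction discussed above: C++11 already contains a small annotation of this kind, the `[[carries_dependency]]` attribute (the surrounding code here is a sketch of mine, not from the paper). It marks a function parameter or return value as carrying a consume dependency across the call boundary, so an implementation tracking dependencies need not strengthen the load or emit a fence at the call:

```cpp
#include <atomic>
#include <cassert>

struct node { int payload; };

// The attribute asserts that the dependency from the consume load
// flows through p into the body, across the call boundary.
int read_payload(node *p [[carries_dependency]])
{
    return p->payload;   // dependency-ordered after the producer's stores
}

node n{7};
std::atomic<node*> head{&n};

int reader()
{
    return read_payload(head.load(std::memory_order_consume));
}
```

Since current compilers promote consume to acquire anyway, the attribute is today mostly inert, which arguably illustrates the adoption problem described in the thread.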
Re: Balanced partition map for Firefox
On Tue, 19 May 2015, Martin Liška wrote:
> Hello.
>
> I've just noticed that we, for the default configuration, produce just
> 30 partitions.  I'm wondering whether that's fine, or whether it would
> be necessary to re-tune the partitioning algorithm to produce a better
> balanced map?
>
> Attached patch is used to produce the following dump:
>
> Partition sizes:
> partition 0 contains 9806 (5.42)% symbols and 232445 (2.37)% insns
> partition 1 contains 15004 (8.30)% symbols and 389297 (3.96)% insns
> partition 2 contains 13954 (7.71)% symbols and 390076 (3.97)% insns
> partition 3 contains 14349 (7.93)% symbols and 390476 (3.97)% insns
> partition 4 contains 13852 (7.66)% symbols and 391346 (3.98)% insns
> partition 5 contains 10766 (5.95)% symbols and 278110 (2.83)% insns
> partition 6 contains 11465 (6.34)% symbols and 396298 (4.03)% insns
> partition 7 contains 16467 (9.10)% symbols and 396043 (4.03)% insns
> partition 8 contains 12959 (7.16)% symbols and 316753 (3.22)% insns
> partition 9 contains 17422 (9.63)% symbols and 402809 (4.10)% insns
> partition 10 contains 15431 (8.53)% symbols and 404822 (4.12)% insns
> partition 11 contains 15967 (8.83)% symbols and 342655 (3.49)% insns
> partition 12 contains 12325 (6.81)% symbols and 409573 (4.17)% insns
> partition 13 contains 11876 (6.57)% symbols and 411484 (4.19)% insns
> partition 14 contains 20902 (11.56)% symbols and 391188 (3.98)% insns
> partition 15 contains 18894 (10.45)% symbols and 339148 (3.45)% insns
> partition 16 contains 27028 (14.94)% symbols and 426811 (4.34)% insns
> partition 17 contains 19626 (10.85)% symbols and 431548 (4.39)% insns
> partition 18 contains 23864 (13.19)% symbols and 437657 (4.45)% insns
> partition 19 contains 28677 (15.86)% symbols and 445054 (4.53)% insns
> partition 20 contains 32558 (18.00)% symbols and 457975 (4.66)% insns
> partition 21 contains 37598 (20.79)% symbols and 470463 (4.79)% insns
> partition 22 contains 21612 (11.95)% symbols and 488384 (4.97)% insns
> partition 23 contains 18981 (10.49)% symbols and 493152 (5.02)% insns
> partition 24 contains 20591 (11.38)% symbols and 493380 (5.02)% insns
> partition 25 contains 20721 (11.46)% symbols and 496018 (5.05)% insns
> partition 26 contains 26171 (14.47)% symbols and 479232 (4.88)% insns
> partition 27 contains 29242 (16.17)% symbols and 530613 (5.40)% insns
> partition 28 contains 35817 (19.80)% symbols and 563768 (5.74)% insns
> partition 29 contains 42662 (23.59)% symbols and 741133 (7.54)% insns
>
> As seen, there are partitions that are about 3x bigger than others.
> What do you think about installing the patch to trunk?  If yes, I'll
> test the patch and write a ChangeLog entry.

I think a patch like this is fine (lto-partition.h parts are missing).

Thanks,
Richard.

> Thanks,
> Martin

--
Richard Biener
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Dilip Upmanyu, Graham Norton, HRB 21284 (AG Nuernberg)
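For context on what a "better balanced map" could mean algorithmically, here is a generic greedy sketch (this is not GCC's actual lto-partition code, just the classic longest-processing-time heuristic such tuning approximates): sort the symbols by instruction count, then always place the next one into the currently smallest partition:

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <queue>
#include <vector>

// Greedy LPT balancing: returns per-partition insn totals, smallest
// first.  With symbol sizes sorted descending, the worst-case imbalance
// of this heuristic is provably small (max/avg <= 4/3 for equal-cost
// bins), which is why it is a common baseline for partition tuning.
std::vector<long> balance(std::vector<long> sizes, int nparts)
{
    std::sort(sizes.begin(), sizes.end(), std::greater<long>());
    // min-heap over current partition insn counts
    std::priority_queue<long, std::vector<long>, std::greater<long>> parts;
    for (int i = 0; i < nparts; ++i)
        parts.push(0);
    for (long s : sizes) {
        long smallest = parts.top();   // fill the lightest partition
        parts.pop();
        parts.push(smallest + s);
    }
    std::vector<long> totals;
    while (!parts.empty()) {
        totals.push_back(parts.top());
        parts.pop();
    }
    return totals;
}
```

For example, `balance({9, 8, 7, 6, 5, 4}, 3)` yields three partitions of 13 insns each, whereas a naive in-order split can be far more skewed.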