On Sat, Sep 2, 2017 at 6:04 PM, Hal Finkel <hfin...@anl.gov> wrote:
>
> On 08/22/2017 10:56 PM, Wei Mi via llvm-commits wrote:
>>
>> On Tue, Aug 22, 2017 at 7:03 PM, Xinliang David Li <davi...@google.com>
>> wrote:
>>>
>>> On Tue, Aug 22, 2017 at 6:37 PM, Chandler Carruth via Phabricator
>>> <revi...@reviews.llvm.org> wrote:
>>>>
>>>> chandlerc added a comment.
>>>>
>>>> I'm really not a fan of the degree of complexity and subtlety that this
>>>> introduces into the frontend, all to allow particular backend
>>>> optimizations.
>>>>
>>>> I feel like this is Clang working around a fundamental deficiency in
>>>> LLVM, and we should instead find a way to fix this in LLVM itself.
>>>>
>>>> As has been pointed out before, user code can synthesize large integers
>>>> that small bit sequences are extracted from, and Clang and LLVM should
>>>> handle those just as well as actual bitfields.
>>>>
>>>> Can we see how far we can push the LLVM side before we add complexity to
>>>> Clang here? I understand that there remain challenges on the LLVM side,
>>>> but I don't think those challenges take *all* of the LLVM improvements
>>>> off the table. I don't think we've exhausted all ways of improving the
>>>> LLVM changes being proposed, and I think we should still land all of
>>>> those and re-evaluate how important these issues are when all of that is
>>>> in place.
>>>
>>> The main challenge of doing this in LLVM is that inter-procedural
>>> analysis (and possibly cross-module analysis) is needed (for
>>> store-forwarding issues).
>>>
>>> Wei, perhaps you can provide a concrete test case to illustrate the issue
>>> so that reviewers have a good understanding.
>>>
>>> David
>>
>> Here is a runnable testcase:
>> -------------------- 1.cc ------------------------
>> class A {
>> public:
>>   unsigned long f1:2;
>>   unsigned long f2:6;
>>   unsigned long f3:8;
>>   unsigned long f4:4;
>> };
>> A a;
>> unsigned long b;
>> unsigned long N = 1000000000;
>>
>> __attribute__((noinline))
>> void foo() {
>>   a.f3 = 3;
>> }
>>
>> __attribute__((noinline))
>> void goo() {
>>   b = a.f3;
>> }
>>
>> int main() {
>>   unsigned long i;
>>   for (i = 0; i < N; i++) {
>>     foo();
>>     goo();
>>   }
>> }
>> ------------------------------------------------------------
>> Now trunk takes about twice the running time compared with trunk + this
>> patch. That is because trunk shrinks the store of a.f3 in foo (done by the
>> DAGCombiner) but does not shrink the load of a.f3 in goo, so store
>> forwarding is blocked.
>
> I can confirm that this is true on Haswell and also on a POWER8.
> Interestingly, on a POWER7, the performance is the same either way (on the
> good side). I ran the test case as presented, and also a version where I
> replaced f3 with a non-bitfield unsigned char member. Thinking that the
> POWER7 result might be because it's big-endian, I flipped the order of the
> fields, and found that the version where f3 is not a bitfield is faster
> than otherwise, but only by 12.5%.
>
> Why, in this case, don't we shrink the load? It seems like we should (and
> it seems like a straightforward case).
>
> Thanks again,
> Hal
Hal, thanks for trying the test. Yes, it is straightforward to shrink the
load in that test. I can change the testcase a little to show why it is
sometimes difficult to shrink the load:

class A {
public:
  unsigned long f1:16;
  unsigned long f2:16;
  unsigned long f3:16;
  unsigned long f4:8;
};
A a;
bool b;
unsigned long N = 1000000000;

__attribute__((noinline))
void foo() {
  a.f4 = 3;
}

__attribute__((noinline))
void goo() {
  b = (a.f4 == 0 && a.f3 == (unsigned short)-1);
}

int main() {
  unsigned long i;
  for (i = 0; i < N; i++) {
    foo();
    goo();
  }
}

For the load of a.f4 in goo, it is difficult to motivate its shrinking
after InstCombine, because the comparison with a.f3 and the comparison
with a.f4 have been merged into a single wide mask-and-compare:

define void @_Z3goov() local_unnamed_addr #0 {
  %1 = load i64, i64* bitcast (%class.A* @a to i64*), align 8
  %2 = and i64 %1, 0xffffff00000000
  %3 = icmp eq i64 %2, 0xffff00000000
  %4 = zext i1 %3 to i8
  store i8 %4, i8* @b, align 1, !tbaa !2
  ret void
}

Thanks,
Wei.

>>
>> The testcase shows the potential problem of store shrinking. Before we
>> decide to do store shrinking, we need to know that all the related loads
>> will be shrunk as well, and that requires IPA analysis. Otherwise, when
>> load shrinking is blocked in some difficult case (like the InstCombine
>> case described in
>> https://www.mail-archive.com/cfe-commits@lists.llvm.org/msg65085.html),
>> a performance regression will happen.
>>
>> Wei.
>>
>>>>
>>>> Repository:
>>>>   rL LLVM
>>>>
>>>> https://reviews.llvm.org/D36562
>>>>
>> _______________________________________________
>> llvm-commits mailing list
>> llvm-comm...@lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits
>
> --
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits