On 08/22/2017 10:56 PM, Wei Mi via llvm-commits wrote:
On Tue, Aug 22, 2017 at 7:03 PM, Xinliang David Li <davi...@google.com> wrote:

On Tue, Aug 22, 2017 at 6:37 PM, Chandler Carruth via Phabricator
<revi...@reviews.llvm.org> wrote:
chandlerc added a comment.

I'm really not a fan of the degree of complexity and subtlety that this
introduces into the frontend, all to allow particular backend optimizations.

I feel like this is Clang working around a fundamental deficiency in LLVM
and we should instead find a way to fix this in LLVM itself.

As has been pointed out before, user code can synthesize large integers
that small bit sequences are extracted from, and Clang and LLVM should
handle those just as well as actual bitfields.
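For example (just an illustrative sketch with made-up names, not code from the patch), the same load-the-wide-value-then-extract-a-few-bits pattern can be written by hand without any bit-field declarations, so an IR-level fix would cover both spellings:

---------------- hand_extract.cc (illustrative sketch) ----------------
#include <cstdint>

uint64_t packed;  // stands in for the storage unit of a bit-field group

// Hand-written equivalent of reading an 8-bit field at bit offset 8.
uint8_t read_field() {
  return static_cast<uint8_t>((packed >> 8) & 0xff);
}

// Hand-written equivalent of writing that field.
void write_field(uint8_t v) {
  packed = (packed & ~(0xffULL << 8)) | (static_cast<uint64_t>(v) << 8);
}
------------------------------------------------------------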

Can we see how far we can push the LLVM side before we add complexity to
Clang here? I understand that there remain challenges on the LLVM side, but I
don't think those challenges take *all* of the LLVM improvements off the
table, I don't think we've exhausted all ways of improving the LLVM changes
being proposed, and I think we should still land all of those and
re-evaluate how important these issues are once all of that is in place.

The main challenge of doing this in LLVM is that inter-procedural (and
possibly cross-module) analysis is needed to handle the store-forwarding issues.

Wei, perhaps you can provide a concrete test case to illustrate the issue so
that reviewers have a good understanding.

David
Here is a runnable test case:
-------------------- 1.cc ------------------------
class A {
public:
   unsigned long f1:2;
   unsigned long f2:6;
   unsigned long f3:8;
   unsigned long f4:4;
};
A a;
unsigned long b;
unsigned long N = 1000000000;

__attribute__((noinline))
void foo() {
   a.f3 = 3;     // store to the 8-bit bit-field
}

__attribute__((noinline))
void goo() {
   b = a.f3;     // load of the same bit-field
}

int main() {
   unsigned long i;
   for (i = 0; i < N; i++) {
     foo();
     goo();
   }
}
------------------------------------------------------------
With this test, trunk takes about twice the running time of trunk + this
patch. That is because trunk shrinks the store of a.f3 in foo (done by the
DAGCombiner) but does not shrink the load of a.f3 in goo, so store-to-load
forwarding is blocked.
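Roughly, the mismatch looks like the following (an illustrative C++ sketch of the resulting memory accesses, assuming a little-endian layout where f3 lives at byte offset 1; the real transformation happens on the SelectionDAG):

---------------- mismatch.cc (illustrative sketch) ----------------
#include <cstdint>
#include <cstring>

uint64_t storage;  // the 8-byte storage unit holding f1..f4

void foo_shrunk() {
  // The store of a.f3 is narrowed to a single byte store.
  uint8_t byte = 3;
  std::memcpy(reinterpret_cast<char *>(&storage) + 1, &byte, 1);
}

uint64_t goo_not_shrunk() {
  // The load is still the full 8-byte load followed by shift/mask.
  // It overlaps, but is not fully contained in, the preceding 1-byte
  // store, so the CPU cannot forward the stored data to the load.
  uint64_t whole = storage;
  return (whole >> 8) & 0xff;
}
------------------------------------------------------------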

I can confirm that this is true on Haswell and also on a POWER8. Interestingly, on a POWER7, the performance is the same either way (on the good side). I ran the test case both as presented and with f3 replaced by a non-bit-field unsigned char member. Thinking that the POWER7 result might be because it is big-endian, I flipped the order of the fields, and found that the version where f3 is not a bit-field is faster than the bit-field version, but only by 12.5%.

Why, in this case, don't we shrink the load? It seems like we should (and it seems like a straightforward case).

Thanks again,
Hal


The test case shows the potential problem with store shrinking. Before
we decide to shrink a store, we need to know that all the related loads
will be shrunk as well, and that requires inter-procedural analysis.
Otherwise, when load shrinking is blocked for some difficult case (like the
InstCombine case described in
https://www.mail-archive.com/cfe-commits@lists.llvm.org/msg65085.html),
a performance regression will occur.
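As an illustration of the kind of case where load shrinking gets blocked (a hypothetical example, not the one from the thread linked above): if several fields are read from the same storage unit, the optimizer may keep one wide load and extract every field from it, and that load cannot simply be narrowed to a single byte, so a shrunk store elsewhere would face a wide load here:

---------------- blocked_load.cc (illustrative sketch) ----------------
struct B {
  unsigned long f1 : 2;
  unsigned long f2 : 6;
  unsigned long f3 : 8;
  unsigned long f4 : 4;
};

unsigned long sum_fields(const B &b) {
  // All three reads come from the same 8-byte storage unit; the optimizer
  // may emit a single wide load and extract each field from it, so the
  // f3 access cannot be shrunk to a byte load here.
  return b.f1 + b.f3 + b.f4;
}
------------------------------------------------------------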

Wei.




Repository:
   rL LLVM

https://reviews.llvm.org/D36562




--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory

