On 09/03/2017 10:38 PM, Xinliang David Li wrote:
The cost of a store-forwarding stall is usually much higher than that of a load hitting the L1 cache. For instance, on Haswell the latter is ~4 cycles, while a store-forwarding stall costs about 10 cycles more than a successful store forward, i.e. roughly 15 cycles. In some scenarios, store-forwarding stalls can cost as much as 50 cycles. See Agner Fog's documentation.
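
For illustration, forwarding generally succeeds only when the load's bytes are entirely covered by the older store; a narrow store followed by a wider, partially overlapping load is the failing pattern. A minimal sketch (illustrative names and layout, not taken from the benchmark):

#include <cstdint>
#include <cstring>

uint64_t word;  // illustrative 8-byte location

void narrow_store(uint8_t v) {
    // 1-byte store into byte 1 of 'word'.
    std::memcpy(reinterpret_cast<char *>(&word) + 1, &v, 1);
}

uint64_t wide_load() {
    // 8-byte load that is only partially covered by the store above: the
    // data cannot be forwarded from the store buffer, so the load waits
    // for the store to reach the cache (the store-forwarding stall).
    return word;
}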

I understand. As I see it, there are two potential ways to fix this problem:

1. You can make the store wider, to match the size of the wide load, thus permitting forwarding.
2. You can make the load smaller, to match the size of the small store, thus permitting forwarding (both options are sketched below).
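
Concretely, the two options amount to something like this (a rough sketch only, assuming the field occupies bits 8..15 of a 64-bit, little-endian storage unit; the names are illustrative):

#include <cstdint>
#include <cstring>

uint64_t container;  // stands in for the struct's 8-byte storage unit

// Option 1: widen the store to the full container (a read-modify-write),
// so a later 8-byte load is fully covered and can be forwarded.
void widened_store(uint64_t field_val) {
    uint64_t tmp = container;                                 // extra load here
    tmp = (tmp & ~0xFF00ull) | ((field_val & 0xFFull) << 8);  // update bits 8..15
    container = tmp;                                          // 8-byte store
}

// Option 2: keep the narrow 1-byte store, and shrink later loads so they
// read only the byte the store wrote (byte 1 on little-endian).
void narrow_store(uint8_t field_val) {
    std::memcpy(reinterpret_cast<char *>(&container) + 1, &field_val, 1);
}

uint8_t shrunken_load() {
    uint8_t v;
    std::memcpy(&v, reinterpret_cast<const char *>(&container) + 1, 1);
    return v;
}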

At least in this benchmark, which is a better solution?

Thanks again,
Hal


In other words, the optimizer needs to be taught to avoid defeating this HW pipeline feature whenever possible.

David

On Sun, Sep 3, 2017 at 6:32 PM, Hal Finkel <hfin...@anl.gov> wrote:


    On 09/03/2017 03:44 PM, Wei Mi wrote:

        On Sat, Sep 2, 2017 at 6:04 PM, Hal Finkel <hfin...@anl.gov> wrote:

            On 08/22/2017 10:56 PM, Wei Mi via llvm-commits wrote:

                On Tue, Aug 22, 2017 at 7:03 PM, Xinliang David Li <davi...@google.com> wrote:


                    On Tue, Aug 22, 2017 at 6:37 PM, Chandler Carruth via Phabricator <revi...@reviews.llvm.org> wrote:

                        chandlerc added a comment.

                        I'm really not a fan of the degree of
                        complexity and subtlety that this
                        introduces into the frontend, all to allow
                        particular backend
                        optimizations.

                        I feel like this is Clang working around a
                        fundamental deficiency in
                        LLVM
                        and we should instead find a way to fix this
                        in LLVM itself.

                        As has been pointed out before, user code can
                        synthesize large integers
                        that small bit sequences are extracted from,
                        and Clang and LLVM should
                        handle those just as well as actual bitfields.

                        Can we see how far we can push the LLVM side
                        before we add complexity to
                        Clang here? I understand that there remain
                        challenges to LLVM's stuff,
                        but I
                        don't think those challenges make *all* of the
                        LLVM improvements off the
                        table, I don't think we've exhausted all ways
                        of improving the LLVM
                        changes
                        being proposed, and I think we should still
                        land all of those and
                        re-evaluate how important these issues are
                        when all of that is in place.


                    The main challenge of doing this in LLVM is that
                    inter-procedural (and possibly cross-module) analysis
                    is needed to handle the store-forwarding issues.

                    Wei, perhaps you can provide a concrete test case to
                    illustrate the issue so that reviewers have a good
                    understanding.

                    David

                Here is a runnable testcase:
                -------------------- 1.cc ------------------------
                class A {
                public:
                    unsigned long f1:2;
                    unsigned long f2:6;
                    unsigned long f3:8;
                    unsigned long f4:4;
                };
                A a;
                unsigned long b;
                unsigned long N = 1000000000;

                __attribute__((noinline))
                void foo() {
                    a.f3 = 3;
                }

                __attribute__((noinline))
                void goo() {
                    b = a.f3;
                }

                int main() {
                    unsigned long i;
                    for (i = 0; i < N; i++) {
                      foo();
                      goo();
                    }
                }
                ------------------------------------------------------------
                Now trunk takes about twice the running time compared with
                trunk + this patch. That is because trunk shrinks the store
                of a.f3 in foo (done by DAGCombiner) but does not shrink
                the load of a.f3 in goo, so store forwarding will be
                blocked.
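
                Roughly, the mismatch looks like this (an illustrative
                sketch, not actual compiler output; it assumes a
                little-endian target, where f3 occupies bits 8..15, i.e.
                byte 1, of the 64-bit allocation unit):

                #include <cstdint>
                #include <cstring>

                unsigned char a_bytes[8];   // stands in for A a's storage
                unsigned long b_copy;       // stands in for b

                void foo_after_store_shrinking() {
                    unsigned char v = 3;
                    std::memcpy(a_bytes + 1, &v, 1);   // 1-byte store of f3
                }

                void goo_without_load_shrinking() {
                    uint64_t w;
                    std::memcpy(&w, a_bytes, 8);       // 8-byte load of the whole unit
                    b_copy = (w >> 8) & 0xFF;          // extract f3
                    // The 8-byte load is only partially covered by the
                    // 1-byte store in foo, so forwarding fails on every
                    // iteration of the loop in main.
                }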


            I can confirm that this is true on Haswell and also on a
            POWER8. Interestingly, on a POWER7, the performance is the
            same either way (on the good side). I ran the test case both
            as presented and with f3 replaced by a non-bitfield unsigned
            char member. Thinking that the POWER7 result might be because
            it's big-endian, I flipped the order of the fields, and found
            that the version where f3 is not a bitfield is faster than
            otherwise, but only by 12.5%.

            Why, in this case, don't we shrink the load? It seems like
            we should (and it
            seems like a straightforward case).

            Thanks again,
            Hal

        Hal, thanks for trying the test.

        Yes, it is straightforward to shrink the load in the test. I can
        change the testcase a little to show why it is sometimes
        difficult to shrink the load:

        class A {
        public:
           unsigned long f1:16;
           unsigned long f2:16;
           unsigned long f3:16;
           unsigned long f4:8;
        };
        A a;
        bool b;
        unsigned long N = 1000000000;

        __attribute__((noinline))
        void foo() {
           a.f4 = 3;
        }

        __attribute__((noinline))
        void goo() {
           b = (a.f4 == 0 && a.f3 == (unsigned short)-1);
        }

        int main() {
           unsigned long i;
           for (i = 0; i < N; i++) {
             foo();
             goo();
           }
        }

        For the load of a.f4 in goo, it is difficult to motivate
        shrinking it after instcombine, because the comparison with a.f3
        and the comparison with a.f4 are merged:

        define void @_Z3goov() local_unnamed_addr #0 {
           %1 = load i64, i64* bitcast (%class.A* @a to i64*), align 8
           %2 = and i64 %1, 0xffffff00000000
           %3 = icmp eq i64 %2, 0xffff00000000
           %4 = zext i1 %3 to i8
           store i8 %4, i8* @b, align 1, !tbaa !2
           ret void
        }
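
        For comparison, shrinking the load here would mean undoing that
        merge and splitting the access back into two narrow loads,
        roughly as below (an illustrative sketch, not compiler output; it
        assumes the 16/16/16/8 layout above on a little-endian target, so
        f3 occupies bytes 4..5 and f4 occupies byte 6):

        #include <cstdint>
        #include <cstring>

        unsigned char a_bytes[8];   // stands in for A a's storage
        bool b_flag;                // stands in for b

        void goo_with_split_loads() {
            uint16_t f3;
            uint8_t f4;
            std::memcpy(&f3, a_bytes + 4, 2);  // 2-byte load of f3
            std::memcpy(&f4, a_bytes + 6, 1);  // 1-byte load of f4 (covered by foo's store)
            b_flag = (f4 == 0) && (f3 == 0xFFFF);
        }

        That is two narrow loads instead of one wide load, so the shrink
        trades extra load traffic for successful store forwarding.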


    Exactly. But this behavior is desirable, locally. There's no good
    answer here: we either generate extra load traffic here (because we
    need to load the fields separately), or we widen the store (which
    generates extra load traffic there, because the widened store becomes
    a read-modify-write). Do you know, in terms of performance, which is
    better in this case (i.e., is it better to widen the store or split
    the load)?

     -Hal



        Thanks,
        Wei.

                The testcase shows the potential problem of store
                shrinking. Before we decide to do store shrinking, we need
                to know that all the related loads will be shrunk, and
                that requires inter-procedural analysis. Otherwise, when
                load shrinking is blocked in some difficult case (like the
                instcombine case described in
                https://www.mail-archive.com/cfe-commits@lists.llvm.org/msg65085.html),
                a performance regression will happen.

                Wei.



                        Repository:
                            rL LLVM

                        https://reviews.llvm.org/D36562





            --
            Hal Finkel
            Lead, Compiler Technology and Programming Languages
            Leadership Computing Facility
            Argonne National Laboratory


    --
    Hal Finkel
    Lead, Compiler Technology and Programming Languages
    Leadership Computing Facility
    Argonne National Laboratory



--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory

