On Fri, Apr 29, 2011 at 7:50 PM, Sriraman Tallam <tmsri...@google.com> wrote:
>
> On Fri, Apr 29, 2011 at 1:56 AM, Richard Guenther
> <richard.guent...@gmail.com> wrote:
>>
>> On Fri, Apr 29, 2011 at 4:52 AM, Sriraman Tallam <tmsri...@google.com>
>> wrote:
>> > I want this patch to be considered for google/main now. This is intended
>> > to be submitted to trunk for review soon.
>> > This patch has been tested with crosstool bootstrap using buildit and by
>> > running all tests.
>> >
>> >
>> > Patch Description :
>> > ==================
>> >
>> > Frequently executed functions in programs are some times compiled
>> > into different versions to take advantage of some specific
>> > advanced features present in some of the types of platforms on
>> > which the program is executed. For instance, SSE4 versus no-SSE4.
>> > Existing techniques to do multi-versioning, for example using
>> > indirect calls, block compiler optimizations, mainly inlining.
>>
>> The general idea of supporting versioning is good, thanks for working on
>> it.
>> Comments below.
>>
>> > This patch adds a new GCC pass to support multi-versioned calls
>> > through a new builtin, __builtin_dispatch.  The __builtin_dispatch
>> > call is converted into a simple if-then construct to call the
>> > appropriate version at run-time. There are no indirect calls so
>> > inlining is not affected.
>>
>> I am not sure that combining the function choice and its invocation
>> to a single builtin is good.  First, you use variadic arguments for
>> the actual function arguments which will cause vararg promotions
>> to apply and thus it will be impossible to dispatch to functions
>> taking single precision float values (and dependent on ABI details
>> many more similar issues).  Second, you restrict yourself to
>> only two versions of a function - that looks quite limited and this
>> restriction is not easy to lift with your proposed interface.
>>
>> Thus, I would have kept regular (but indirect) calls in the source
>> program and only exposed the dispatch via a builtin, like ...
>>
>> > This patch also supports a function unswitching optimization via
>> > cloning of functions, only along hot paths, that call multi-versioned
>> > functions so that the hot multi-versioned paths can be independently
>> > optimized for performance. The cloning optimization is switched on
>> > via a flag, -fclone-hot-version-paths.
>> >
>> > Only two versions are supported in this patch. Example use :
>> >
>> >  int popcnt_sse4(unsigned int x) __attribute__((__target__("popcnt")));
>> >  int popcnt_sse4(unsigned int x)
>> >  {
>> >    int count = __builtin_popcount(x);
>> >    return count;
>> >  }
>> >
>> >  int popcnt(unsigned int x) __attribute__((__target__("no-popcnt")));
>> >  int popcnt(unsigned int x)
>> >  {
>> >    int count = __builtin_popcount(x);
>> >    return count;
>> >  }
>> >
>> >  int testsse() __attribute__((version_selector));
>> >  int main ()
>> >  {
>> >    ...
>> >    /* The multi-versioned call. */
>> >    ret = __builtin_dispatch (testsse,  (void*)popcnt_sse4,
>> > (void*)popcnt, 25);
>> >    ...
>> >  }
>>
>> int main()
>> {
>>  int (*ppcnt)(unsigned int) = __builtin_dispatch (testsse,
>> popcnt_sse4, popcnt);
>>  ret = (*ppcnt) (25);
>> }
>>
>> which would allow the testsse function to return the argument number of
>> the function to select.
>>
>> [snip]
>>
>> >  When to use the "version_selector" attribute ?
>> >  -----------------------------------------------
>> >
>> >  Functions are marked with attribute "version_selector" only if
>> >  they are run-time constants.  Example of such functions would
>> >  be those that test if a specific feature is available on a
>> >  particular architecture.  Such functions must return a positive
>> >  integer. For two-way functions, those that test if a feature
>> >  is present or not must return 1 or 0 respectively.
>> >
>> >  This patch will make constructors that call these function once
>> >  and assign the outcome to a global variable. From then on, only
>> >  the value of the global will be used to decide which version to
>> >  execute.
>>
>> That's a nice idea, but why not simply process all functions with
>> a const attribute and no arguments this way?  IMHO
>>
>> int testsse (void) __attribute__((const));
>>
>> is as good and avoids the new attribute completely.  The pass would
>> also benefit other uses of this idiom (giving a way to have global
>> dynamic initializers in C).  The functions may not be strictly 'const'
>> in a way the compiler autodetects this attribute but it presents
>> exactly the promises to the compiler that allow this optimization.
>
>
> Since, const functions cannot, not according to the definition atleast ,
>  read files or memory, I was thinking along the lines of making it a  "pure"
> function (For example, The version_selector function could be reading
> "/proc/cpuinfo" and looking for certain features).

Well, const functions "can" read files or memory, just its result may not
vary when that memory or files change (or rather the compiler assumes
that if you give the "const" premise).  With just a "pure" function the
compiler cannot possibly elide all calls to a read from a program-start
initialized return cache.

So you really need "const".

> The reason I invented a
> new attribute was because I assumed it is not legal to hoist the call site
> of a pure function into a constructor and substitute the value computed
> simply because it may not be valid to call the pure function ahead of the
> intended time in the execution.

That's true, which is why you need to use "const" which gives the
compiler that freedom.

> Basically, the semantics of the "version_selector" is that the value
> returned is a run-time constant and that it can be called at any point of
> time during the execution. However, it could be reading files and memory.
> Chatting with Ian, it looks like I can just tag this function as const and
> get away with it.

Yes.

> Also, I do not need to do this for every const function in general because
> *inlining* will take care of it. Also, evaluating it ahead-of-time may be
> unnecessary and could hurt in general if the function will not be likely
> executed. So, I could only mark feature test functions as const and have the
> multi-versioning patch hoist these functions alone because they could
> potentially be reading memory and even with inlining there could be  a
> overhead.

Well, the only use of const functions without arguments is for such
functions you want to promote to a static initializer and for maybe
wrapping fancy constants (I didn't see use of the latter).  Because
a const function without arguments is just a "name" for a constant
(in your case a runtime constant).  If inlining can take care of it
it would be a compile-time constant, which you can easily check
before doing the transformation (just check if the function body
is simply returning a constant).

>
>>
>> Thus, please split this piece out into a separate patch and use
>> the const attribute.
>>
>> >  What happens with cloning, -fclone-hot-version-paths ?
>> >  -----------------------------------------------------
>>
>> Now, here you lost me somewhat, because I didn't look into the
>> patch details and I am missing an example on how the lowered
>> IL looks before that cloning.  So for now I suppose this
>> -fclone-hot-version-paths
>> is to expose direct calls to the IL.  If you would lower
>> __builtin_dispatch
>> to a style like
>>
>>  int sel = selector ();
>>  switch (sel)
>>     {
>>     case 0:
>>       fn = popcnt;
>>      break;
>>     case 1:
>>       fn = popcnt_sse4;
>>       break;
>>     }
>>   res = (*fn) (25);
>>
>> then rather than a new pass specialized for __builtin_dispatch existing
>> optimization passes that are able to do tracing like VRP and DOM
>> via jump-threading or the tracer pass should be able to do this
>> optimization for you.  If they do not use FDO in a good way it is better
>> to improve them for this case instead of writing a new pass.
>
> I have shown dumps  for a sample program to better illustrate what happens
> with multi-versioning. It is a toy program used for illustration.
> foo.cc
> =====
> extern int __attribute__ ((version_selector))
> featureTest ();
> int foo (int i)
> {
>   return i;
> }
> int bar (int i)
> {
>   return (11 - i);
> }
> /* This calculates the sum of numbers from 1 to 10 in 2 different ways. */
> int __attribute__ ((hot))
> problem ()
> {
>   int ret = 0;
>   int j = 1;
>   for (j = 1; j<=10; j++)
>     ret += __builtin_dispatch (featureTest, (void *)foo, (void *)bar, j);
>   return ret;
> }
> int __attribute__ ((hot))
> main ()
> {
>   return problem() - 55;
> }
> Here, irrespective of the outcome of featureTest, problem computes the sum
> of the 1st 10 numbers, and hence the value returned is 55.
> Now, without cloning, the IR for  function "problem" after converting the
> __builtin_dispatch is :
> ===================================================================
>  pretmp.8_2 = _Z11featureTestv_version_selector_global;
> <bb 3>:
>   if (pretmp.8_2 > 0)
>     goto <bb 5>;
>   else
>     goto <bb 4>;
> <bb 4>:
>    D.2144_1 = (unsigned int) j_24;
>   D.2145_21 = 11 - D.2144_1;
>   j_22 = (int) D.2145_21;
> <bb 5>:
>   ret_9 = ret_23 + j_19;
>
>   j_10 = j_24 + 1;
>   if (j_10 != 11)
>     goto <bb 3>;
>   else
>     goto <bb 6>;
> <bb 6>:
>    return ret_25;

Ok, so you lower not to a single indirect call but to several direct calls.
We have no pass that currently transforms the former into the latter
but tree-ssa-phiprop.c could be trivially extended to do that (it
hoists indirect loads of a PHI result up to the defs of the PHI args
if that results in direct loads).

> =================================================
> The loop stays and will be executed.
> whereas with cloning, problem is made into 2 clones for each version :
> int _Z7problemv_clone_0() ()
> {
>   bool D.2120;
>   int D.2119;
>   int ret;
>   int j;
> <bb 6>:
> <bb 2>:
>     goto <bb 4>;
> <bb 3>:
>   D.2119_2 = foo (j_1);   /* Only calls foo */
>   ret_4 = D.2119_2 + ret_3;
>    j_5 = j_1 + 1;
> <bb 4>:
>   if (j_1 <= 10)
>     goto <bb 3>;
>   else
>     goto <bb 5>;
> <bb 5>:
>   return ret_3;
> }
> and
> int _Z7problemv_clone_1() ()
> {
>   bool D.2127;
>   int D.2126;
>   int ret;
>   int j;
> <bb 6>:
> <bb 2>:
>    goto <bb 4>;
> <bb 3>:
>   D.2126_2 = bar (j_1); /* Only calls bar */
>   ret_4 = D.2126_2 + ret_3;
>    j_5 = j_1 + 1;
> <bb 4>:
>   if (j_1 <= 10)
>     goto <bb 3>;
>   else
>     goto <bb 5>;
> <bb 5>:
>   return ret_3;
> }

When do you apply this cloning?  It must be before inlining, right?

>  and this gets optimized into :
> int _Z7problemv_clone_0() ()
> {
>   return 55;
> }
> int _Z7problemv_clone_1() ()
> {
>   return 55;
> }

I would have expected that loop unswitching is able to do the same
optimization.

> Now, function problem is converted into :
> int problem ()
> {
>   D.2129_10 = __builtin_dispatch (featureTest, _Z7problemv_clone_0,
> _Z7problemv_clone_1);
> }
> that is, call the right clone based on the outcome. Please note the re-use
> of the __builtin_dispatch here to do this.
> and after conversion and optimization becomes :
> int problem ()
> {
>   return 55;
> }

Where did you detect that the two clones are equal?  Or is the above
__builtin_dispatch never in the IL and lowered to two direct fncalls
directly?

> This gets inlined to main, and main becomes :
> int main ()
> {
>   return 0;
> }
> This has been added as a test case.

I think you are wrong in re-inventing what existing passes should be
able to do just fine (inlining and unswitching).  I realize that doing
things earlier is always appealing, but it feels wrong to add a lot of
code for a new GCC extension that existing code doesn't use instead
of trying to handle the patterns that exist in the wild in a better way.

Richard.

> Thanks,
> -Sri.
>>
>> Note that we already have special FDO support for indirect to direct
>> call promotion, so that might work already.
>>
>> Thus, with all this, __builtin_dispatch would be more like syntactic
>> sugar to avoid writing above switch statement yourself.  If you restrict
>> that sugar to a constant number of candidates it can be replaced by
>> a macro (or several ones, specialized for the number of candidates).
>>
>> Richard.
>>
>> >  For the call graph that looks like this before cloning :
>> >
>> > func_cold ----> func_hot ----> findOnes ----> IsPopCntSupported ---->
>> > popcnt
>> >                                                |
>> >                                                -----> no_popcnt
>> >
>> >  where popcnt and no_popcnt are the multi-versioned functions, then
>> > after cloning :
>> >
>> > func_cold ---->IsPopCntSupported ----> func_hot_clone0 ---->
>> > findOnes_clone0 ----> popcnt
>> >                              |
>> >                               ----> func_hot_clone1 ---->
>> > findOnes_clone1 ----> no_popcnt
>> >
>> >
>> >
>> >
>> >  Flags : -fclone-hot-version-paths does function unswitching via
>> > cloning.
>> >          --param=num-mversn-clones=<num> allows to specify the number of
>> >          functions that should be cloned.
>> >          --param=mversn-clone-depth=<num> allows to specify the length
>> > of
>> >          the call graph path that should be cloned.  num = 0 implies
>> > only
>> >          leaf node that contains the __builtin_dispatch statement must
>> > be
>> >          cloned.
>> >
>> > ============================================================================================
>> >
>> > Google ref 45947, 45986
>> >
>> > 2011-04-28  Sriraman Tallam  <tmsri...@google.com>
>> >
>> >        * c-family/c-common.c   (revision 173122)
>> > (handle_version_selector_attribute): New function.
>> >        (c_common_attribute_table): New attribute "version_selector".
>> >        * tree-pass.h   (revision 173122)
>> > (pass_tree_convert_builtin_dispatch): New pass.
>> >        (pass_ipa_multiversion_dispatch): New pass.
>> >        * testsuite/gcc.dg/mversn7.c    (revision 0): New test case.
>> >        * testsuite/gcc.dg/mversn4.c    (revision 0): New test case.
>> >        * testsuite/gcc.dg/mversn4.h    (revision 0): New test case.
>> >        * testsuite/gcc.dg/mversn4a.c   (revision 0): New test case.
>> >        * testsuite/gcc.dg/torture/mversn1.c    (revision 0): New test
>> > case.
>> >        * testsuite/gcc.dg/mversn2.c    (revision 0): New test case.
>> >        * testsuite/gcc.dg/mversn6.c    (revision 0): New test case.
>> >        * testsuite/gcc.dg/mversn3.c    (revision 0): New test case.
>> >        * testsuite/g++.dg/mversn8.C    (revision 0): New test case.
>> >        * testsuite/g++.dg/mversn10a.C  (revision 0): New test case.
>> >        * testsuite/g++.dg/mversn14a.C  (revision 0): New test case.
>> >        * testsuite/g++.dg/tree-prof/mversn13.C (revision 0): New test
>> > case.
>> >        * testsuite/g++.dg/tree-prof/mversn15.C (revision 0): New test
>> > case.
>> >        * testsuite/g++.dg/tree-prof/mversn15a.C        (revision 0): New
>> > test case.
>> >        * testsuite/g++.dg/mversn9.C    (revision 0): New test case.
>> >        * testsuite/g++.dg/mversn10.C   (revision 0): New test case.
>> >        * testsuite/g++.dg/mversn12.C   (revision 0): New test case.
>> >        * testsuite/g++.dg/mversn14.C   (revision 0): New test case.
>> >        * testsuite/g++.dg/mversn16.C   (revision 0): New test case.
>> >        * testsuite/g++.dg/torture/mversn11.C   (revision 0): New test
>> > case.
>> >        * testsuite/g++.dg/torture/mversn5.C    (revision 0): New test
>> > case.
>> >        * testsuite/g++.dg/torture/mversn5.h    (revision 0): New test
>> > case.
>> >        * testsuite/g++.dg/torture/mversn5a.C   (revision 0): New test
>> > case.
>> >        * builtin-types.def     (revision 173122) (BT_PTR_FN_INT): New
>> > pointer type.
>> >        (BT_FN_INT_PTR_FN_INT_PTR_PTR_VAR): New function type for
>> > __builtin_dispatch.
>> >        * builtins.def  (revision 173122) (BUILT_IN_DISPATCH): New
>> > builtin to
>> >        support multi-version calls.
>> >        * mversn-dispatch.c     (revision 0): New file.
>> >        * timevar.def   (revision 173122) (TV_MVERSN_DISPATCH): New time
>> > var.
>> >        * common.opt    (revision 173122) (fclone-hot-version-paths): New
>> > flag.
>> >        * Makefile.in   (revision 173122) (mversn-dispatch.o): New rule.
>> >        * passes.c      (revision 173122) (init_optimization_passes): Add
>> > the new
>> >        multi-version and dispatch passes to the pass list.
>> >        * params.def    (revision 173122)
>> > (PARAM_NUMBER_OF_MVERSN_CLONES): Define.
>> >        (PARAM_MVERSN_CLONE_CGRAPH_DEPTH): Define.
>> >        * doc/invoke.texi       (revision 173122) (mversn-clone-depth):
>> > Document.
>> >        (num-mversn-clones): Document.
>> >        (fclone-hot-version-paths): Document.
>
>

Reply via email to