RE: [og12] libgomp: Document OpenMP 'pinned' memory (was: [PATCH] libgomp, openmp: pinned memory)

2023-03-27 Thread Stubbs, Andrew via Gcc-patches
> -Original Message-
> From: Thomas Schwinge 
> Sent: 24 March 2023 15:50
> To: gcc-patches@gcc.gnu.org; Andrew Stubbs ;
> Tobias Burnus 
> Subject: [og12] libgomp: Document OpenMP 'pinned' memory (was: [PATCH]
> libgomp, openmp: pinned memory)
> 
> Hi!
> 
> On 2022-01-04T15:32:17+, Andrew Stubbs 
> wrote:
> > This patch implements the OpenMP pinned memory trait [...]
> 
> I figure it may be helpful to document the current og12 state of affairs; does
> the attached "libgomp: Document OpenMP 'pinned' memory" look good to
> you?

I don't really know what "allocated via the device" means? I mean, I presume 
you mean "via CUDA", but I don't think this is obvious to the average reader.

Maybe "allocation is optimized for the device" or some such thing?

Andrew


RE: Attempt to register OpenMP pinned memory using a device instead of 'mlock' (was: [PATCH] libgomp, openmp: pinned memory)

2023-02-16 Thread Stubbs, Andrew via Gcc-patches
> -Original Message-
> From: Thomas Schwinge 
> Sent: 16 February 2023 15:33
> To: Andrew Stubbs ; Jakub Jelinek ;
> Tobias Burnus ; gcc-patches@gcc.gnu.org
> Subject: Attempt to register OpenMP pinned memory using a device instead of
> 'mlock' (was: [PATCH] libgomp, openmp: pinned memory)
> 
> Hi!
> 
> On 2022-06-09T11:38:22+0200, I wrote:
> > On 2022-06-07T13:28:33+0100, Andrew Stubbs  wrote:
> >> On 07/06/2022 13:10, Jakub Jelinek wrote:
> >>> On Tue, Jun 07, 2022 at 12:05:40PM +0100, Andrew Stubbs wrote:
>  Following some feedback from users of the OG11 branch I think I need to
>  withdraw this patch, for now.
> 
>  The memory pinned via the mlock call does not give the expected performance
>  boost. I had not expected that it would do much in my test setup, given that
>  the machine has a lot of RAM and my benchmarks are small, but others have
>  tried more and on varying machines and architectures.
> >>>
> >>> I don't understand why there should be any expected performance boost (at
> >>> least not unless the machine starts swapping out pages),
> >>> { omp_atk_pinned, true } is solely about the requirement that the memory
> >>> can't be swapped out.
> >>
> >> It seems like it takes a faster path through the NVidia drivers. This is
> >> a black box, for me, but that seems like a plausible explanation. The
> >> results are different on x86_64 and powerpc hosts (such as the Summit
> >> supercomputer).
> >
> > For example, it's documented that 'cuMemHostAlloc',
> > <https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g572ca4011bfcb25034888a14d4e035b9>,
> > "Allocates page-locked host memory".  The crucial thing, though, what
> > makes this different from 'malloc' plus 'mlock' is, that "The driver
> > tracks the virtual memory ranges allocated with this function and
> > automatically accelerates calls to functions such as cuMemcpyHtoD().
> > Since the memory can be accessed directly by the device, it can be read
> > or written with much higher bandwidth than pageable memory obtained with
> > functions such as malloc()".
> >
> > Similar, for example, for 'cuMemAllocHost',
> > <https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1gdd8311286d2c2691605362c689bc64e0>.
> >
> > This, to me, would explain why "the mlock call does not give the expected
> > performance boost", in comparison with 'cuMemAllocHost'/'cuMemHostAlloc';
> > with 'mlock' you're missing the "tracks the virtual memory ranges"
> > aspect.
> >
> > Also, by means of the Nvidia Driver allocating the memory, I suppose
> > using this interface likely circumvents any "annoying" 'ulimit'
> > limitations?  I get this impression, because documentation continues
> > stating that "Allocating excessive amounts of memory with
> > cuMemAllocHost() may degrade system performance, since it reduces the
> > amount of memory available to the system for paging.  As a result, this
> > function is best used sparingly to allocate staging areas for data
> > exchange between host and device".
> >
>  It seems that it isn't enough for the memory to be pinned, it has to be
>  pinned using the Cuda API to get the performance boost.
> >>>
> >>> For performance boost of what kind of code?
> >>> I don't understand how Cuda API could be useful (or can be used at all) if
> >>> offloading to NVPTX isn't involved.  The fact that somebody asks for host
> >>> memory allocation with omp_atk_pinned set to true doesn't mean it will be
> >>> in any way related to NVPTX offloading (unless it is in NVPTX target region
> >>> obviously, but then mlock isn't available, so sure, if there is something
> >>> CUDA can provide for that case, nice).
> >>
> >> This is specifically for NVPTX offload, of course, but then that's what
> >> our customer is paying for.
> >>
> >> The expectation, from users, is that memory pinning will give the
> >> benefits specific to the active device. We can certainly make that
> >> happen when there is only one (flavour of) offload device present. I had
> >> hoped it could be one way for all, but it looks like not.
> >
> > Aren't there CUDA Driver interfaces for that?  That is:
> 

Re: [Patch] gcn: Add __builtin_gcn_{get_stack_limit,first_call_this_thread_p}

2022-11-21 Thread Stubbs, Andrew via Gcc-patches

On 21/11/2022 13:41, Tobias Burnus wrote:

On 19.11.22 11:46, Tobias Burnus wrote:

+   stacklimit = stackbase + seg_size*64;

(this should be '*seg_size' not 'seg_size' and the name should be
s/seg_size/seg_size_ptr/.)

I have updated the comment and ...

(Reading it, I think it should be '..._MEM(SImode,' and
'..._MULT(SImode' instead of DImode.)

Additionally, there was a problem of bytes vs. bits in:

My understanding is that
dispatch_ptr->private_segment_size == *((char*)dispatch_ptr + 192)


which is wrong - it's 192 bits, which is only 24 bytes!
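
In other words (a sketch only, assuming the usual HSA kernel dispatch packet
layout, where private_segment_size is a 32-bit field 192 bits, i.e. 24 bytes,
into the packet):

  uint32_t private_segment_size
    = *(uint32_t *) ((char *) dispatch_ptr + 24);   /* byte offset, not 192 */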

Finally, in the first_call_this_thread_p() call, I mixed up EQ vs. NE at 
one place.


BTW: It seems as if there is no problem with zero extension, if I look 
at the assembler result.


Updated version. Consists of: GCC patch adding the builtins,
the newlib patch using those (unchanged; used for testing + to be 
submitted), and

a 'test.c' using the builtins and its dump produced with amdgcn's
'cc1 -O2' to show the resulting assembly.

Tested with libgomp on gfx908 offloading and getting only the known fails:
(libgomp.c-c++-common/teams-2.c, libgomp.fortran/async_io_*.f90,
libgomp.oacc-c-c++-common/{deep-copy-10.c,static-variable-1.c,vprop.c})

OK for mainline?


OK, provided it has been tested in both stand-alone and offload modes, 
and the newlib tests too.


Andrew


RE: [committed 6/6] amdgcn: vector testsuite tweaks

2022-10-28 Thread Stubbs, Andrew
> -Original Message-
> Looking into commit r13-3225-gbd9a05594d227cde79a67dc715bd9d82e9c464e9
> "amdgcn: vector testsuite tweaks" for a moment, I also did wonder about
> the following changes, because for 'vect_multiple_sizes' (for example,
> x86_64-pc-linux-gnu) that seems to lose more specific testing;
> previously: 'scan-tree-dump-times' exactly once, now: 'scan-tree-dump'
> any number of times.  But I've no clue about that myself, so just
> mentioning this, in case somebody else has an opinion.  ;-)

When vect_multiple_sizes is true the number of times the pattern appears will 
be greater than normal.  Most likely the pattern will appear once for each 
vector size.  In the case of GCN, a pattern that normally appears 4 times now 
appears 24 times.

The alternative would be to have a whole set of patterns for each configuration 
of each target that can have the multiple sizes.  That or change the 
implementation of 'scan-tree-dump-times' to support expressions of some kind, 
but even then the expressions would get hairy.
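
For illustration, the shape of the tweak (the pattern string and count here are
just placeholders, not the ones from the commit) is:

/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target { ! vect_multiple_sizes } } } } */
/* { dg-final { scan-tree-dump "vectorized 1 loops" "vect" { target vect_multiple_sizes } } } */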

Andrew


RE: [PATCH] amdgcn: Add support for additional natively supported floating-point operations

2022-09-09 Thread Stubbs, Andrew
> -Original Message-
> I agree - for example powerpc has -mrecip= to control which instructions
> to use (float/double rsqrt or inverse) and -mrecip-precision to
> specify whether further iteration is done or not.
> 
> x86 has similar but does always perform Newton-Raphson iteration,
> documenting 2 ulp instead of 0.5 ulp precision.
> 
> Your suggested huge reduction in precision isn't usually acceptable
> and should always be explicitly enabled.

There isn't a problem with *this* patch (although we do have existing accuracy 
issues thanks to previous documents lacking the information).

The "inaccurate" instructions are single-precision only, and therefore 
acceptable with -ffast-math.

Kwok intends to provide vectorized library calls for the double-precision and 
-fno-fast-math cases.

In general I want to avoid adding extra arch-specific options; partly because 
approximately no one will use them, and partly because the amdgcn compiler is 
almost always hidden behind an x86_64 compiler.

Andrew


RE: [PATCH] libgomp, openmp: pinned memory

2022-06-09 Thread Stubbs, Andrew
> For example, it's documented that 'cuMemHostAlloc',
> <https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g572ca4011bfcb25034888a14d4e035b9>,
> "Allocates page-locked host memory".  The crucial thing, though, what
> makes this different from 'malloc' plus 'mlock' is, that "The driver
> tracks the virtual memory ranges allocated with this function and
> automatically accelerates calls to functions such as cuMemcpyHtoD().
> Since the memory can be accessed directly by the device, it can be read
> or written with much higher bandwidth than pageable memory obtained with
> functions such as malloc()".

OK, interesting. I had not seen this, but I think it confirms that the 
performance difference comes from within CUDA, and that regular locked memory is 
not so great.

> Also, by means of the Nvidia Driver allocating the memory, I suppose
> using this interface likely circumvents any "annoying" 'ulimit'
> limitations?

Yes, this is the case.

> If not directly *allocating and registering* such memory via
> 'cuMemAllocHost'/'cuMemHostAlloc', you should still be able to only
> *register* your standard 'malloc'ed etc. memory via 'cuMemHostRegister',
> <https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1gf0a9fe11544326dabd743b7aa6b54223>:
> "Page-locks the memory range specified [...] and maps it for the
> device(s) [...].  This memory range also is added to the same tracking
> mechanism as cuMemHostAlloc to automatically accelerate [...]"?  (No
> manual 'mlock'ing involved in that case, too; presumably again using this
> interface likely circumvents any "annoying" 'ulimit' limitations?)
> 
> Such a *register* abstraction can then be implemented by all the libgomp
> offloading plugins: they just call the respective
> CUDA/HSA/etc. functions to register such (existing, 'malloc'ed, etc.)
> memory.
> 
> ..., but maybe I'm missing some crucial "detail" here?

I'm investigating this stuff for the AMD USM implementation as well right now. 
It might be a good way to handle static and stack data too. Or not.
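
For what it's worth, the *register* flavour would look roughly like this (a
sketch using the CUDA driver API; the helper name is made up, and it assumes a
context is already current):

  #include <stdlib.h>
  #include <cuda.h>

  /* Page-lock an ordinary malloc'ed buffer via the driver, so it gets the
     same tracking as cuMemHostAlloc memory.  */
  static void *
  malloc_and_register (size_t size)
  {
    void *buf = malloc (size);
    if (buf && cuMemHostRegister (buf, size, 0) != CUDA_SUCCESS)
      {
        free (buf);
        return NULL;
      }
    return buf;   /* release with cuMemHostUnregister (buf); free (buf);  */
  }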

Andrew


RE: [PATCH] libgomp, openmp: pinned memory

2022-06-09 Thread Stubbs, Andrew
> The question is only what to do with 'requires unified_shared_memory' –
> and a non-multi-device allocator.

The compiler emits an error at compile time if you attempt to use both 
-foffload-memory=pinned and USM, because they’re not compatible. You're fine to 
use both explicit allocators in the same program, but the "pinnedness" of USM 
allocations is a matter for Cuda to care about (cuMemAllocManaged) and has 
nothing to do with this discussion.

The OpenMP pinned memory feature is intended to accelerate normal mappings, as 
far as I can tell.
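
For reference, the explicit-allocator usage is just the standard OpenMP 5.x
allocator API, something like (a snippet only; the buffer size is arbitrary):

  omp_alloctrait_t traits[] = { { omp_atk_pinned, omp_atv_true } };
  omp_allocator_handle_t al = omp_init_allocator (omp_default_mem_space,
                                                  1, traits);
  void *buf = omp_alloc (1024, al);        /* pinned host memory */
  /* ... use buf, e.g. in map clauses ... */
  omp_free (buf, al);
  omp_destroy_allocator (al);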

Andrew


RE: [wwwdocs] gcc-12/changes.html (GCN): >1 workers per gang

2021-08-16 Thread Stubbs, Andrew
> In other words:  For gangs > #CUs or >1 gang per CU, the following patch
> is needed:
>[OG11] https://gcc.gnu.org/g:4dcd1e1f4e6b451aac44f919b8eb3ac49292b308
>[email] https://gcc.gnu.org/pipermail/gcc-patches/2020-July/550102.html
>   "not suitable for mainline until the multiple-worker support is merged
> there"
> 
> @Andrew + @Julian:  Do you intend to commit it relatively soon?
> Regarding the wwwdocs patch, I can hold off until that commit or reword
> it to only cover the workers part.

Were these not part of the patch set Thomas was working on?

Andrew


Re: [PATCH] emit-rtl.c: Allow vector subreg of float vectors

2020-08-10 Thread Stubbs, Andrew


On 10 Aug 2020 17:23, Richard Sandiford  wrote:
Andrew Stubbs  writes:
> On 06/08/2020 04:54, Richard Sandiford wrote:
>>> diff --git a/gcc/emit-rtl.c b/gcc/emit-rtl.c
>>> index f9b0e9714d9..d7067989ad7 100644
>>> --- a/gcc/emit-rtl.c
>>> +++ b/gcc/emit-rtl.c
>>> @@ -947,6 +947,11 @@ validate_subreg (machine_mode omode, machine_mode 
>>> imode,
>>> else if (VECTOR_MODE_P (omode)
>>> && GET_MODE_INNER (omode) == GET_MODE_INNER (imode))
>>>   ;
>>> +  /* Allow sections of vectors, both smaller and paradoxical.  (This just
>>> + works for integers, but needs to be explicitly allowed for floats.)  
>>> */
>>> +  else if (VECTOR_MODE_P (omode) && VECTOR_MODE_P (imode)
>>> +  && GET_MODE_INNER (omode) == GET_MODE_INNER (imode))
>>> +;
>>
>> Might be missing something, but isn't this a subcondition of the
>> previous “else if”?  It looks like that ought to catch every case
>> that the new one does.
>
> Apparently, Hongtao and Uroš fixed my problem while I was working on this.
>
> Yes, my patch does the same (although I would question whether it's safe
> to use "GET_MODE_INNER (imode)" without having first confirmed "imode"
> is a vector mode).

FWIW, skipping the VECTOR_MODE_P test means that we also allow vector
paradoxical subregs of scalars, which can be useful if you're trying
to make a vector in which only element 0 is initialised.  (Only works
for element 0 though.)  IIRC this is what some of the x86 patterns do.

Guess it means we also allow vector subregs of complex values,
but that kind-of seems OK/useful too.

Is GET_MODE_INNER valid for scalars though?

Andrew


Re: [PATCH] [og10] amdgcn: Add waitcnt after LDS write instructions

2020-06-29 Thread Stubbs, Andrew
On 29 Jun 2020 22:03, "Brown, Julian"  wrote:
On Mon, 29 Jun 2020 21:32:41 +0100
Andrew Stubbs  wrote:
> In particular, it seems logical that any barrier should be a memory
> barrier, so inserting it in the barrier pattern is not a big deal.
> IIRC, only OpenACC is using that anyway (OpenMP has explicit asm
> inserts in libgomp).

I'd be happier with that idea if ds_{read,write} operations were *only*
used for broadcasting -- but they're not, they may also be used for
(some) gang-private variables and for reduction temporaries. I don't
have a test case for either of those at present demonstrating bad
behaviour with no waitcnt, but I guess it's theoretically possible for
there to be one, at least.

If there's no barrier then a few cycles this way or that shouldn't make any 
difference, surely?

The only exception I can think of might be atomic release operators, but those 
do a cache flush already, so there shouldn't be any issue with a slightly 
delayed DS operation. Maybe there should be a wait instruction before those too.

The "proper" solution is a general (& "optimal") waitcnt insertion
pass, I think, that works with other memory operations as well as the
DS ones.

Well, yes, that would be nice. The read waits are surely the worst performance 
loss. It's not a trivial task though, and AMD refused to fund it as a directed 
services task last winter.

Andrew


Re: [PATCH][ARM] -m{cpu,tune,arch}=native

2011-08-30 Thread Stubbs, Andrew
On 29/08/11 04:29, Michael Hope wrote:
 On Sat, Aug 27, 2011 at 3:19 AM, Andrew Stubbs <a...@codesourcery.com> wrote:
 Hi all,

 This patch adds support for -mcpu=native, -mtune=native, and -march=native
 for ARM Linux hosts.

 So far, it only recognises Cortex-A8 and Cortex-A9, so I really need to find
 out what the magic part numbers are for other cpus before this patch is
 complete. I couldn't just find this information listed anywhere. I think
 there are a lot of clues in the kernel code, but it's hard to mine and it
 mostly only goes as far as the architecture version, not the individual cpu.

 Could you send linaro-dev@ an email and ask them where this API is documented?

API? It reads /proc/cpuinfo, and that just spits out the magic numbers 
from the processor registers. Linaro-dev might know more magic numbers 
though.
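
For example, on a Cortex-A9 board the relevant lines look something like this
(implementer 0x41 is ARM, the part number is the magic in question, and the
Features line carries the FPU/NEON flags):

  CPU implementer : 0x41
  CPU part        : 0xc09
  Features        : swp half thumb fastmult vfp edsp neon vfpv3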

 Hmm.  Should there be a -mfpu=native that picks between soft, VFP, and
 NEON based on the host?

Yes, we could do that also, based on the CPU feature flags. I'll add it 
to my list of things to do, but save it for another patch.

Andrew


Re: [PATCH (3/7)] Widening multiply-and-accumulate pattern matching

2011-07-01 Thread Stubbs, Andrew
On 28/06/11 17:37, Michael Matz wrote:
 What I want (and I'm not totally clear on what this actually means) is
   to be able to optimize all the cases where the end result will be the
   same as the compiler produces now (using multiple multiply, shift, and
   add operations).
 Okay, then you really want to look through value-preserving conversions.

   Ok, so that's an obvious statement, but the point is that, right now,
    the compiler does nothing special when you cast from int -> unsigned
   int, or vice-versa, and I want to capture that somehow. There are some
   exceptions, I'm sure, but what are they?
 Same-sized signed -> unsigned conversions aren't value preserving:
unsigned char c = 255; (signed char)c == -1; 255 != -1
 unsigned -> larger sized signed is value preserving
unsigned char c = 255; (signed short)c == 255;
 signed -> unsigned never is value preserving

OK, so I've tried implementing this, and I find I hit against a problem:

Given this test case:

   unsigned long long
   foo (unsigned long long a, signed char *b, signed char *c)
   {
 return a + *b * *c;
   }

Those rules say that it should not be suitable for optimization because 
there's an implicit cast from signed int to unsigned long long.

Without any widening multiplications allowed, GCC gives this code (for ARM):

   ldrsb   r2, [r2, #0]
   ldrsb   r3, [r3, #0]
   mul r2, r2, r3
   adds    r0, r0, r2
   adc r1, r1, r2, asr #31

This is exactly what a signed widening multiply-and-accumulate with 
smlalbb would have done!

OK, so the types in the testcase are a bit contrived, but my point is 
that I want to be able to use the widening-mult instructions everywhere 
that they would produce the same output as gcc would otherwise, and gcc 
just doesn't seem that interested in signed-unsigned conversions.

So, I'm happy to put in checks to ensure that truncations are not 
ignored, but I'm really not sure what's the right thing to do with the 
extends and signedness switches.

Any suggestions?

Andrew


Re: [PATCH (3/7)] Widening multiply-and-accumulate pattern matching

2011-07-01 Thread Stubbs, Andrew
On 01/07/11 13:33, Paolo Bonzini wrote:
 Got it now! Casts from signed to unsigned are not value-preserving, but
 they are bit-preserving: s32->s64 obviously is, and s32->u64 has the
 same result bit-by-bit as the s64 result. The fact that s64 has an
 implicit ... in front, while an u64 has an implicit ... does not
 matter.

But, the ... and ... are not implicit. They are very real, and 
if applied incorrectly will change the result, I think.

 Is this the meaning of the predicate you want? I think so, based on the
 discussion, but it's hard to say without seeing the cases enumerated
 (i.e. a patch).

The purpose of this predicate is to determine whether any type 
conversions that occur between the output of a widening multiply, and 
the input of an addition have any bearing on the end result.

We know what the effective output type of the multiply is (the size is 
2x the input type, and it is signed if either one of the inputs is 
signed), and we know what the input type of the addition is, but any 
amount of junk can lie in between. The problem is determining if it *is* 
junk.

In an ideal world there would only be two cases to consider:

   1. No conversion needed.

   2. A single sign-extend or zero-extend (according to the type of the 
inputs) to match the input size of the addition.

Anything else would be unsuitable for optimization. Of course, it's 
never that simple, but it should still be possible to boil down a list 
of conversions to one of these cases, if it's valid.

The signedness of the input to the addition is not significant - the 
code would be the same either way. But, I is important not to try to 
zero-extend something that started out signed, and not to sign-extend 
something that started out unsigned.
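
Something like this is the shape of the check I mean (the name is made up, and
it glosses over walking the individual conversion statements):

  /* MULT_TYPE is the effective result type of the widening multiply;
     ADD_TYPE is the type consumed by the addition.  The conversion chain
     between them can only be harmless if it never narrows below
     MULT_TYPE's precision; each intermediate step must also extend
     according to MULT_TYPE's signedness, which the real code would have
     to verify statement by statement.  */
  static bool
  conversion_chain_harmless_p (tree mult_type, tree add_type)
  {
    return TYPE_PRECISION (add_type) >= TYPE_PRECISION (mult_type);
  }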

 However, perhaps there is a catch. We can do the following thought
 experiment. What would happen if you had multiple widening multiplies?
 Like 8-bit signed to 64-bit unsigned and then 64-bit unsigned to 128-bit
 unsigned? I believe in this case you couldn't optimize 8-bit signed to
 128-bit unsigned. Would your code do it?

My code does not attempt to combine multiple multiplies. In any case, if 
you have two multiplications, surely you have at least three input 
values, so they can't be combined?

It does attempt to combine a multiply and an addition, where a suitable 
madd* insn is available. (This is not new; I'm just trying to do it in 
more cases.)

I have considered the case where you have (a * b) + (c * d), but have 
not yet coded anything for it. At present, the code will simply choose 
whichever multiply happens to find itself the first input operand of the 
plus, and ignores the other, even if the first turns out not to be a 
suitable candidate.

Andrew


Re: [PATCH (3/7)] Widening multiply-and-accumulate pattern matching

2011-07-01 Thread Stubbs, Andrew
On 01/07/11 15:40, Paolo Bonzini wrote:
 On 07/01/2011 03:30 PM, Stubbs, Andrew wrote:
  However, perhaps there is a catch. We can do the following thought
  experiment. What would happen if you had multiple widening multiplies?
  Like 8-bit signed to 64-bit unsigned and then 64-bit unsigned to
 128-bit
  unsigned? I believe in this case you couldn't optimize 8-bit signed to
  128-bit unsigned. Would your code do it?
 My code does not attempt to combine multiple multiplies. In any case, if
 you have two multiplications, surely you have at least three input
 values, so they can't be combined?

 What about (u128)c + (u64)((s8)a * (s8)b)? You cannot convert this to
 (u128)c + (u128)((s8)a * (s8)b).

Oh I see, sorry. Yes, that's exactly what I'm trying to do here.

No, wait, I don't see. Where are these multiple widening multiplies 
you're talking about? I only see one multiply?

Andrew


Re: [PATCH (3/7)] Widening multiply-and-accumulate pattern matching

2011-07-01 Thread Stubbs, Andrew
On 01/07/11 14:30, Stubbs, Andrew wrote:
 Got it now! Casts from signed to unsigned are not value-preserving, but
   they are bit-preserving: s32->s64 obviously is, and s32->u64 has the
   same result bit-by-bit as the s64 result. The fact that s64 has an
   implicit ... in front, while an u64 has an implicit ... does not
   matter.
 But, the ... and ... are not implicit. They are very real, and
 if applied incorrectly will change the result, I think.

Wait, I'm clearly confused 

When I try a s32->u64 conversion, the expand pass generates a 
sign_extend insn.

Clearly it's the source type that determines the extension type, not the 
destination type ... and I'm a dunce!
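
For example:

  unsigned long long x = (unsigned long long) i;   /* i is a signed int */

expands with a sign_extend, because 'i' is signed, even though the destination
is unsigned.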

Thanks :)

Andrew


Re: [PATCH (3/7)] Widening multiply-and-accumulate pattern matching

2011-07-01 Thread Stubbs, Andrew
On 01/07/11 16:54, Paolo Bonzini wrote:
 On 07/01/2011 04:55 PM, Stubbs, Andrew wrote:
 
  What about (u128)c + (u64)((s8)a * (s8)b)? You cannot convert this to
  (u128)c + (u128)((s8)a * (s8)b).
 Oh I see, sorry. Yes, that's exactly what I'm trying to do here.

 No, wait, I don't see. Where are these multiple widening multiplies
 you're talking about? I only see one multiply?

 I meant one multiplication with multiple widening steps. Not clear at
 all, sorry.

Yes, I see now, the whole purpose of my patch set is widening by more 
than one mode.

The case of the multiply-and-accumulate is the only way there can be 
more than one step though. Widening multiplies themselves are always 
handled as one unit.

Andrew


Re: [PATCH (3/7)] Widening multiply-and-accumulate pattern matching

2011-06-24 Thread Stubbs, Andrew
On 24/06/11 09:28, Richard Guenther wrote:
   To be clear, it only skips past NOP_EXPR. Is it not the case that what
   you're describing would need a CONVERT_EXPR?
 NOP_EXPR is the same as CONVERT_EXPR.

Are you sure?

I thought this was safe because the internals manual says:

   NOP_EXPR
   These nodes are used to represent conversions that do not require any
   code-generation 

   CONVERT_EXPR
   These nodes are similar to NOP_EXPRs, but are used in those
   situations where code may need to be generated 

So, I tried this example:

int
foo (int a, short b, short c)
{
   int bc = b * c;
   return a + (short)bc;
}

Both before and after my patch, GCC gives:

 mul r2, r1, r2
 sxtah   r0, r0, r2

(where SXTAH means sign-extend the third operand from HImode to SImode 
and add to the second operand.)

The dump after the widening_mult pass is:

foo (int a, short int b, short int c)
{
   int bc;
   int D.2018;
   short int D.2017;
   int D.2016;
   int D.2015;
   int D.2014;

bb 2:
   D.2014_2 = (int) b_1(D);
   D.2015_4 = (int) c_3(D);
   bc_5 = b_1(D) w* c_3(D);
   D.2017_6 = (short int) bc_5;
   D.2018_7 = (int) D.2017_6;
   D.2016_9 = D.2018_7 + a_8(D);
   return D.2016_9;

}

Where you can clearly see that the addition has not been recognised as a 
multiply-and-accumulate.

When I step through convert_plusminus_to_widen, I can see that the 
reason it has not matched is because D.2017_6 = (short int) bc_5 is 
encoded with a CONVERT_EXPR, just as the manual said it would be.

So, according to the manual, and my (admittedly limited) experiments, 
skipping over NOP_EXPR does appear to be safe.

But you say that it isn't safe. So now I'm confused. :(

I can certainly add checks to make sure that the skipped operations 
actually don't make any important changes to the value, but do I need to?

Andrew


Re: [PATCH (3/7)] Widening multiply-and-accumulate pattern matching

2011-06-24 Thread Stubbs, Andrew
On 24/06/11 16:47, Richard Guenther wrote:
   I can certainly add checks to make sure that the skipped operations
   actually don't make any important changes to the value, but do I need to?
 Yes.

Ok, I'll go away and do that then.

BTW, I see useless_type_conversion_p, but that's not quite what I want. 
Is there an equivalent existing function to determine whether a 
conversion changes the logical/arithmetic meaning of a type?

I mean, conversion to a wider mode is not useless, but it is harmless, 
whereas conversion to a narrower mode may truncate the value.
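
In C terms, what I mean is:

  short s = 1234;
  int   i = s;    /* widening: harmless, the value is preserved */
  short t = i;    /* narrowing: may truncate the value          */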

Andrew