Jeff Law <l...@redhat.com> writes:
> On 05/05/2017 06:13 AM, Richard Sandiford wrote:
>> Hi Jeff,
>> 
>> Jeff Law <l...@redhat.com> writes:
>>> +/* Compute the number of elements that we can trim from the head and
>>> +   tail of ORIG resulting in a bitmap that is a superset of LIVE.
>>> +
>>> +   Store the number of elements trimmed from the head and tail in
>>> +   TRIM_HEAD and TRIM_TAIL.  */
>>> +
>>> +static void
>>> +compute_trims (ao_ref *ref, sbitmap live, int *trim_head, int *trim_tail)
>>> +{
>>> +  /* We use sbitmaps biased such that ref->offset is bit zero and the
>>> +     bitmap extends through ref->size.  So we know that in the original
>>> +     bitmap bits 0..ref->size were true.  We don't actually need the
>>> +     bitmap, just the REF to compute the trims.  */
>>> +
>>> +  /* Now identify how much, if any of the tail we can chop off.  */
>>> +  *trim_tail = 0;
>>> +  int last_orig = (ref->size / BITS_PER_UNIT) - 1;
>>> +  int last_live = bitmap_last_set_bit (live);
>>> +  *trim_tail = (last_orig - last_live) & ~0x1;
>>> +
>>> +  /* Identify how much, if any of the head we can chop off.  */
>>> +  int first_orig = 0;
>>> +  int first_live = bitmap_first_set_bit (live);
>>> +  *trim_head = (first_live - first_orig) & ~0x1;
>>> +}
>> 
>> Can you remember why you needed to force the lengths to be even
>> (the & ~0x1s)?
>> I was wondering whether it might have been because trimming single bytes
>> interferes with the later strlen optimisations, which the patch I just
>> posted should fix.
>> 
>> I guess there's also a risk that trimming a byte from a memcpy that has
>> a "nice" length could make things less efficient, but that could go both
>> ways: changing a memcpy of 9 bytes to a memcpy of 8 bytes would be good,
>> while changing from 8 to 7 might not be.  The same goes for even lengths
>> too though, like 10->8 (good) and 16->14 (maybe not a win).  FWIW, it
>> looks like the strlen pass uses:
>> 
>>        /* Don't adjust the length if it is divisible by 4, it is more
>>           efficient to store the extra '\0' in that case.  */
>>        if ((tree_to_uhwi (len) & 3) == 0)
>>          return;
>> 
>> for that.
>> 
>> Tested on aarch64-linux-gnu and x86_64-linux-gnu.  OK if the strlen
>> patch is OK?
> It was primarily to avoid mucking up alignments of the start of the copy 
> or leaving residuals at the end of a copy.  It's an idea I saw while 
> scanning the LLVM implementation of DSE.  The fact that it avoids 
> mucking things up for tree-ssa-strlen was an unplanned side effect.

OK.  Why 2 bytes though?  I wouldn't have expected that misaligning
to x+2 would be significantly better than misaligning to x+3.
And subtracting odd values could sometimes give a nicer alignment
than we had before.
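
To make that concrete (illustrative numbers only, assuming the copy
originally starts at an 8-byte-aligned address x):

    trim 2 bytes: new start is x + 2, alignment drops from 8 to 2
    trim 3 bytes: new start is x + 3, alignment drops from 8 to 1
    but from a start of x + 3, trimming 5 bytes gives x + 8,
    which is better aligned than before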

Would it make sense to limit the head trims to multiples of the pointer
alignment, or perhaps the minimum of the pointer alignment and the word
size?  And limit tail trims based on the alignment of the new size,
rather than the alignment of the value that's being subtracted from
the size?
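
Very roughly, something like the sketch below (purely illustrative:
ALIGN stands in for however we would actually obtain the pointer
alignment in bytes, and the tail rule is just one way of reading
"don't let the new size become less aligned than the old one"):

    /* Sketch only: clamp the *TRIM_HEAD and *TRIM_TAIL computed by
       compute_trims.  ALIGN is the pointer alignment in bytes and
       ORIG_SIZE the original copy size in bytes; both are assumed to
       be supplied by the caller.  */

    static void
    clamp_trims (int align, int orig_size, int *trim_head, int *trim_tail)
    {
      /* Head: trim only in multiples of the pointer alignment, capped
         at the word size, so the start of the copy never becomes less
         aligned than it was.  */
      int step = MIN (align, UNITS_PER_WORD);
      *trim_head &= -step;

      /* Tail: back off until the new size is at least as aligned as
         the old size, so e.g. 9 -> 8 and 10 -> 8 survive but 8 -> 7
         and 16 -> 14 do not.  (SIZE & -SIZE is the largest power of
         two dividing SIZE.)  */
      int new_size = orig_size - *trim_tail;
      while (*trim_tail > 0
             && (new_size & -new_size) < (orig_size & -orig_size))
        {
          *trim_tail -= 1;
          new_size += 1;
        }
    }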

> I never did any real benchmarking either way.  If you've got any hard 
> data which shows it's a bad idea, then let's remove it and deal with the 
> tree-ssa-strlen stuff (as I noted you'd done this morning).

TBH I don't have any performance data either.  This was just something
that came up with the SVE changes, where the offsets could have a
runtime component.

I can preserve the current mask fairly easily if we want to keep it.
I just wasn't quite sure how to justify it in the comment.
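
If we do keep the mask, perhaps the comment could simply own up to it
being a heuristic, e.g. (wording mine):

    /* Trimming in even amounts keeps the parity of the start address
       and of the residual byte count unchanged; this is a heuristic
       borrowed from LLVM's DSE rather than something we have
       performance data for.  */
    *trim_tail = (last_orig - last_live) & ~0x1;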

Thanks,
Richard
