Re: GNU Tools Cauldron 2022

2022-09-28 Thread Vineet Gupta

Hi,

On 9/16/22 00:47, Jan Hubicka via Gcc wrote:

Hello,

We are pleased to invite you all to the next GNU Tools Cauldron,
taking place in Paris on September 16-18, 2022.  We are looking forward
to meet you again after three years!

As for the previous instances, we have setup a wiki page for
details:

  https://gcc.gnu.org/wiki/cauldron2022  



I was interested in watching the RISC-V Vector presentation:
First presentation of Day 2 @ S9 [1].
But it seems it was not recorded, or am I looking in the wrong place?

Also, Juzhe, is it possible to share the slides? They seem to be
missing on the Cauldron page [2].


Thx,
-Vineet

[1] https://www.youtube.com/watch?v=dOIwE2932XI
[2] https://gcc.gnu.org/wiki/cauldron2022


Re: GNU Tools Cauldron 2022

2022-09-28 Thread Vineet Gupta

On 9/28/22 10:03, Vineet Gupta wrote:

We are pleased to invite you all to the next GNU Tools Cauldron,
taking place in Paris on September 16-18, 2022.  We are looking forward
to meet you again after three years!

As for the previous instances, we have setup a wiki page for
details:

https://gcc.gnu.org/wiki/cauldron2022


I was interested in watching the RISC-V Vector presentation:
First presentation of Day 2 @ S9 [1].
But it seems it was not recorded, or am I looking in the wrong place?


Ah, it seems it starts later, around the 57 minute mark.

Thx


Fences/Barriers when mixing C++ atomics and non-atomics

2022-10-13 Thread Vineet Gupta

Hi,

I have a testcase (from real workloads) involving C++ atomics and am trying
to understand the codegen (gcc 12) for RVWMO and x86.
It mixes atomics with non-atomics, so it is not obvious what the behavior
is intended to be, hence the explicit CC of subject matter experts
(apologies for that in advance).


The test has a non-atomic store followed by an atomic_load(SEQ_CST). I
assume that an unadorned direct access defaults to the safest/most
conservative seq_cst.


   extern int g;
   std::atomic<int> a;

   int bar_noaccessor(int n, int *n2)
   {
    *n2 = g;
    return n + a;
   }

   int bar_seqcst(int n, int *n2)
   {
    *n2 = g;
    return n + a.load(std::memory_order_seq_cst);
   }

On RV (rvwmo), with current gcc 12 we get 2 full fences around the load 
as prescribed by Privileged Spec, Chapter A, Table A.6 (Mappings from 
C/C++ to RISC-V primitives).


   _Z10bar_seqcstiPi:
   .LFB382:
    .cfi_startproc
    lui    a5,%hi(g)
    lw    a5,%lo(g)(a5)
    sw    a5,0(a1)
    fence    iorw,iorw
    lui    a5,%hi(a)
    lw    a5,%lo(a)(a5)
    fence    iorw,iorw
    addw    a0,a5,a0
    ret


OTOH, for x86 (same default toggles) there are no barriers at all.

   _Z10bar_seqcstiPi:
    endbr64
    movl    g(%rip), %eax
    movl    %eax, (%rsi)
    movl    a(%rip), %eax
    addl    %edi, %eax
    ret


My naive intuition was that x86 TSO would require a fence before a
load(seq_cst) for a prior store, even if that store was non-atomic, to
ensure the load didn't bubble up ahead of the store.


Perhaps this begs the general question of intermixing non-atomic
accesses with atomics, and whether that is undefined behavior or some
such. I skimmed through the C++14 specification chapter on the Atomic
Operations library, but nothing jumped out on the topic.


Or is it much deeper, related to the as-if rule or something?

Thx,
-Vineet


Re: Fences/Barriers when mixing C++ atomics and non-atomics

2022-10-13 Thread Vineet Gupta

Hi Hans,

On 10/13/22 13:54, Hans Boehm wrote:
The generated code here is correct in both cases. In the RISC-V case, 
I believe it is conservative, at a minimum, in that atomics should not 
imply IO ordering. We had an earlier discussion, which seemed to have 
consensus in favor of that opinion. I believe clang does not enforce 
IO ordering.


You can think of a "sequentially consistent" load roughly as enforcing 
two properties:


1) It behaves as an "acquire" load. Later (in program order) memory 
operations do not advance past it. This is implicit for x86. It 
requires the trailing fence on RISC-V, which could probably be 
weakened to r,rw.


Acq implies later things won't leak out, but prior things could still
leak in, meaning a prior write could happen after the load, which
contradicts what the user is asking for by load(seq_cst) on x86?




2) It ensures that seq_cst operations are fully ordered. This means 
that, in addition to (1), and the corresponding fence for stores, 
every seq_cst store must be separated from a seq_cst load by at least 
a w,r fence, so a seq_cst store followed by a seq_cst load is not 
reordered.


This makes sense when both the store and the load are seq_cst.
But the question is what happens when that store is non-atomic. IOW if
we had a store(relaxed) -> load(seq_cst), would the generated code still
ensure that the load had a full barrier to prevent



w,r fences are discouraged on RISC-V, and probably no better than 
rw,rw, so that's how the leading fence got there. (Again the io 
ordering should disappear. It's the responsibility of IO code to 
insert that explicitly, rather than paying for it everywhere.)


Thanks for explaining the RV semantics.



x86 does (2) by associating that fence with stores instead of loads, 
either by using explicit fences after stores, or by turning stores 
into xchg.


That makes sense as x86 has ld->ld and ld -> st architecturally ordered, 
so any fences ought to be associated with st.


Thx,
-Vineet

RISC-V could do the same. And I believe that if the current A 
extension were the final word on the architecture, it should. But that 
convention is not compatible with the later introduction of an 
"acquire load", which I think is essential for performance, at least 
on larger cores. So I think the two fence mapping for loads should be 
maintained for now, as I suggested in the document I posted to the list.


Hans

On Thu, Oct 13, 2022 at 12:31 PM Vineet Gupta  
wrote:


Hi,

I have a testcase (from real workloads) involving C++ atomics and
trying
to understand the codegen (gcc 12) for RVWMO and x86.
It does mix atomics with non-atomics so not obvious what the
behavior is
intended to be hence some explicit CC of subject matter experts
(apologies for that in advance).

Test has a non-atomic store followed by an atomic_load(SEQ_CST). I
assume that unadorned direct access defaults to
safest/conservative seq_cst.

    extern int g;
    std::atomic<int> a;

    int bar_noaccessor(int n, int *n2)
    {
         *n2 = g;
         return n + a;
    }

    int bar_seqcst(int n, int *n2)
    {
         *n2 = g;
         return n + a.load(std::memory_order_seq_cst);
    }

On RV (rvwmo), with current gcc 12 we get 2 full fences around the
load
as prescribed by Privileged Spec, Chapter A, Table A.6 (Mappings from
C/C++ to RISC-V primitives).

    _Z10bar_seqcstiPi:
    .LFB382:
         .cfi_startproc
         lui    a5,%hi(g)
         lw    a5,%lo(g)(a5)
         sw    a5,0(a1)
    *fence    iorw,iorw*
         lui    a5,%hi(a)
         lw    a5,%lo(a)(a5)
    *fence    iorw,iorw*
         addw    a0,a5,a0
         ret


OTOH, for x86 (same default toggles) there's no barriers at all.

    _Z10bar_seqcstiPi:
     endbr64
         movl    g(%rip), %eax
         movl    %eax, (%rsi)
         movl    a(%rip), %eax
         addl    %edi, %eax
         ret


My naive intuition was x86 TSO would require a fence before
load(seq_cst) for a prior store, even if that store was non
atomic, so
ensure load didn't bubble up ahead of store.

Perhaps this begs the general question of intermixing non atomic
accesses with atomics and if that is undefined behavior or some
such. I
skimmed through C++14 specification chapter Atomic Operations library
but nothing's jumping out on the topic.

Or is it much deeper, related to As-if rule or something.

Thx,
-Vineet



Re: Fences/Barriers when mixing C++ atomics and non-atomics

2022-10-13 Thread Vineet Gupta




On 10/13/22 13:30, Uros Bizjak wrote:

OTOH, for x86 (same default toggles) there's no barriers at all.

 _Z10bar_seqcstiPi:
  endbr64
  movl    g(%rip), %eax
  movl    %eax, (%rsi)
  movl    a(%rip), %eax
  addl    %edi, %eax
  ret


Regarding x86 memory model, please see Intel® 64 and IA-32 Architectures
Software Developer’s Manual, Volume 3A, section 8.2 [1]

[1]https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html


My naive intuition was x86 TSO would require a fence before
load(seq_cst) for a prior store, even if that store was non atomic, so
ensure load didn't bubble up ahead of store.

As documented in the SDM above, the x86 memory model guarantees that

• Reads are not reordered with other reads.
• Writes are not reordered with older reads.
• Writes to memory are not reordered with other writes, with the
following exceptions:
...
• Reads may be reordered with older writes to different locations but
not with older writes to the same location.


So my example is the last case, where an older write is followed by a
read to a different location, and thus the two could potentially be
reordered.


Redundant constants in coremark crc8 for RISCV/aarch64 (no-if-conversion)

2022-10-14 Thread Vineet Gupta

Hi,

When analyzing the coremark build for RISC-V, I noticed redundant
constants not being eliminated. While this is a recurrent issue with RV,
this specific instance is not unique to RV, as I can trigger similar
output on aarch64 with -fno-if-conversion; hence it is something which
could be addressed in common passes.


-O3 -march=rv64gc_zba_zbb_zbc_zbs

crcu8:
    xor     a3,a0,a1
    andi    a3,a3,1
    srli    a4,a0,1
    srli    a5,a1,1
    beq     a3,zero,.L2

    li      a3,-24576   # 0x_A000
    addi    a3,a3,1     # 0x_A001
    xor     a5,a5,a3
    zext.h  a5,a5

.L2:
    xor     a4,a4,a5
    andi    a4,a4,1
    srli    a3,a0,2
    srli    a5,a5,1
    beq     a4,zero,.L3

    li      a4,-24576   # 0x_A000
    addi    a4,a4,1     # 0x_A001
    xor     a5,a5,a4
    zext.h  a5,a5

.L3:
    xor     a3,a3,a5
    andi    a3,a3,1
    srli    a4,a0,3
    srli    a5,a5,1
    beq     a3,zero,.L4

    li      a3,-24576   # 0x_A000
    addi    a3,a3,1     # 0x_A001
...
...

I see that with small tests, cse1 is able to substitute a redundant
constant reg with an equivalent old reg.


cse_insn
   make_regs_equiv()
   ..
   canon_reg()

e.g.
void foo(void)
{
   bar(2072, 2096);
}

foo:
    li      a0,4096
    addi    a1,a0,-2000
    addi    a0,a0,-2024
    tail    bar

And it seems to even work across basic blocks.

e.g.

void foo(int x) # -O2
{
bar1(2072);
foo2();
if (x)
bar2(2096);
}

foo:
 ...
li s1, 4096
addi a0, s1, -2024
 ...
addi a0, a1, -2000
tail bar2

It seems the cse redundancy elimination logic is scoped to the "paths"
created by cse_find_path(), and those seem to chain only 2 basic blocks.


;; Following path with 17 sets: 2 3
;; Following path with 15 sets: 4 5
;; Following path with 15 sets: 6 7
;; Following path with 15 sets: 8 9
;; Following path with 15 sets: 10 11
;; Following path with 15 sets: 12 13
;; Following path with 15 sets: 14 15
;; Following path with 13 sets: 16 17
;; Following path with 2 sets: 18
...

While in this case the BBs containing the dups are 6, 8, ...

(note 36 35 37 6 [bb 6] NOTE_INSN_BASIC_BLOCK)
(insn 37 36 38 6 (set (reg:DI 196)
(const_int -24576 [0xa000])) 
"../../coremark-crc8.c":23:17 -1

 (nil))
...
(note 54 53 55 8 [bb 8] NOTE_INSN_BASIC_BLOCK)
(insn 55 54 56 8 (set (reg:DI 207)
(const_int -24576 [0xa000])) 
"../../coremark-crc8.c":23:17 -1

 (nil))

The path construction logic in cse seems non-trivial for someone just
starting to dig into that code, so I'd appreciate some insight.


Also I'm wondering if cse is the right place to do this at all, or 
should it be done in gcse/PRE ?


TIA
-Vineet

P.S. With the proposed RV cond ops extension, if-conversion could elide
this specific instance of the problem; however redundant constants are a
common enough nuisance in RV codegen which we need to tackle, especially
when it is typically a 2 instruction sequence.

typedef unsigned short ee_u16;
typedef unsigned char ee_u8;

ee_u16
crcu8(ee_u8 data, ee_u16 crc)
{
    ee_u8 i = 0, x16 = 0, carry = 0;

    for (i = 0; i < 8; i++)
    {
        x16 = (ee_u8)((data & 1) ^ ((ee_u8)crc & 1));
        data >>= 1;

        if (x16 == 1)
        {
            crc ^= 0x4002;
            carry = 1;
        }
        else
            carry = 0;
        crc >>= 1;
        if (carry)
            crc |= 0x8000;	// set MSB bit
        else
            crc &= 0x7fff;	// clear MSB bit
    }
    return crc;
}


Re: Redundant constants in coremark crc8 for RISCV/aarch64 (no-if-conversion)

2022-10-18 Thread Vineet Gupta

Hi Jeff,

On 10/14/22 09:54, Jeff Law via Gcc wrote:

...

.L2:
xor    a4,a4,a5
andi    a4,a4,1
srli    a3,a0,2
srli    a5,a5,1
beq    a4,zero,.L3

li    a4,-24576    # 0x_A000
addi    a4,a4,1    # 0x_A001
xor    a5,a5,a4
zext.h    a5,a5

.L3:
xor    a3,a3,a5
andi    a3,a3,1
srli    a4,a0,3
srli    a5,a5,1
beq    a3,zero,.L4

li    a3,-24576    # 0x_A000
addi    a3,a3,1    # 0x_A001
...
...

I see that with small tests, cse1 is able to substitute a redundant
constant reg with an equivalent old reg.


I find it easier to reason about this stuff with a graphical CFG, so a 
bit of ascii art...



        2
       / \
      3-->4
         / \
        5-->6



Yeah, a picture is worth a thousand words :-)


Where BB4 corresponds to .L2 and BB6 corresponds to .L3. Evaluation of 
the constants occurs in BB3 and BB5.


And evaluation here means use of the constant (vs. its definition)?

CSE isn't going to catch this.  The best way to think about CSE's 
capabilities is that it can work on extended basic blocks. An 
extended basic block can have jumps out, but not jumps in.  There are 3 
EBBs in this code.  (1,2), (4,5) and 6.    So BB4 is in a different EBB 
than BB3.  So the evaluation in BB3 can't be used by CSE in the EBB 
containing BB4, BB5.


Thanks for the detailed explanation.

PRE/GCSE is better suited for this scenario, but it has a critical 
constraint.  In particular our PRE formulation is never allowed to put 
an evaluation of an expression on a path that didn't have one before. So
while there clearly a redundancy on the path 2->3->4->5 (BB3 and BB5), 
there is nowhere we could put an evaluation that would reduce the number 
of evaluation on that path without introducing an evaluation on paths 
that didn't have one.  So consider 2->4->6.  On that path there are zero 
evaluations.  So we can't place an eval in BB2 because that will cause 
evaluations on 2->4->6 which didn't have any evaluations.


OK. How does PRE calculate all possible paths to consider, say in your
example 2-3-4-5 and 2-4-6? Is that just indicative, or would those
actually be the ones PRE calculates for this case? Would there be more?


There isn't a great place in GCC to handle this right now.  If the 
constraints were relaxed in PRE, then we'd have a chance, but getting 
the cost model right is going to be tough.


It would have been better (for this specific case) if loop unrolling
were not done so early. The tree pass cunroll is flattening it out and
leaving it for the rest of the tree/RTL passes to pick up the pieces and
remove any redundancies, if at all. It obviously needs to be early if we
are injecting 7x more instructions, but that seems like a lot to unravel.


FWIW -fno-unroll-loops only seems to work at -O2. At -O3 it always 
unrolls. Is that expected ?


If this seems worthwhile and you have ideas to do this any better, I'd 
be happy to work on this with some guidance.


Thx,
-Vineet


Re: Redundant constants in coremark crc8 for RISCV/aarch64 (no-if-conversion)

2022-10-18 Thread Vineet Gupta



On 10/18/22 16:36, Jeff Law wrote:
There isn't a great place in GCC to handle this right now.  If the 
constraints were relaxed in PRE, then we'd have a chance, but 
getting the cost model right is going to be tough.


It would have been better (for this specific case) if loop unrolling 
was not being done so early. The tree pass cunroll is flattening it 
out and leaving for rest of the all tree/rtl passes to pick up the 
pieces and remove any redundancies, if at all. It obviously needs to 
be early if we are injecting 7x more instructions, but seems like a 
lot to unravel.


Yup.  If that loop gets unrolled, it's going to be a mess.  It will 
almost certainly make this problem worse as each iteration is going to 
have a pair of constants loaded and no good way to remove them.


That's the original problem I started this thread with. I'd snipped the
disassembly as it would have been too much text, but basically on RV the
Coremark crc8 loop of const 8 iterations gets unrolled, including 8
extraneous insn pairs to load the same constant, which is preposterous.
Other arches side-step this by using if-conversion / cond moves, the
latter currently WIP in RV International. x86 w/o if-convert seems OK
since the const can be encoded in the xor insn.


OTOH given that the gimple/tree pass cunroll is doing the culprit loop
unrolling and introducing the redundant const 8 times, can it be
addressed there somehow?
tree_estimate_loop_size() seems to identify a constant expression, not
just an operand. Can it be taught to identify a "non-trivial const" and
hoist/code-move the expression? Sorry, just rambling here, most likely
nonsense.






FWIW -fno-unroll-loops only seems to work at -O2. At -O3 it always 
unrolls. Is that expected ?


The only case I'm immediately aware of where this wouldn't work would 
be if -O3 came after -fno-unroll-loops.


Weird that gcc-12, gcc-11, gcc-10 all seem to be silently ignoring
-fno-unroll-loops despite it following -O3. Perhaps a different toggle
is needed to suppress the issue.


Thx,
-Vineet


query about commit 666fdc46bc8489 ("RISC-V: Fix bad insn splits with paradoxical subregs")

2022-11-04 Thread Vineet Gupta

Hi Jakub,

I had a question about the aforementioned commit in RV backend.

(define_split
  [(set (match_operand:GPR 0 "register_operand")
(and:GPR (match_operand:GPR 1 "register_operand")
   (match_operand:GPR 2 "p2m1_shift_operand")))
+   (clobber (match_operand:GPR 3 "register_operand"))]
  ""
-  [(set (match_dup 0)
+  [(set (match_dup 3)
   (ashift:GPR (match_dup 1) (match_dup 2)))

Is there something specific to this split which warrants this, or should
any split patterns involving shifts have this to avoid the
shifting-by-more-than-SUBREG_REG problem?


Also could you please explain where the clobber itself is allocated ?

This came up when discussing the solution to PR/106602 [1]


Thx,
-Vineet

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106602


Re: query about commit 666fdc46bc8489 ("RISC-V: Fix bad insn splits with paradoxical subregs")

2022-11-04 Thread Vineet Gupta

On 11/4/22 16:13, Jeff Law wrote:


On 11/4/22 16:59, Vineet Gupta wrote:

Hi Jakub,

I had a question about the aforementioned commit in RV backend.

(define_split
  [(set (match_operand:GPR 0 "register_operand")
(and:GPR (match_operand:GPR 1 "register_operand")
   (match_operand:GPR 2 "p2m1_shift_operand")))
+   (clobber (match_operand:GPR 3 "register_operand"))]
  ""
-  [(set (match_dup 0)
+  [(set (match_dup 3)
   (ashift:GPR (match_dup 1) (match_dup 2)))

Is there something specific to this split which warrants this, or
should any split patterns involving shifts have this to avoid the
shifting-by-more-than-SUBREG_REG problem?


Not sure.  Note it was Jim Wilson's change, not Jakub's change AFAICT:


commit 666fdc46bc848984ee7d2906f2dfe10e1ee5d535
Author: Jim Wilson 
Date:   Sat Jun 30 21:52:01 2018 +

    RISC-V: Add patterns to convert AND mask to two shifts.

    gcc/
    * config/riscv/predicates.md (p2m1_shift_operand): New.
    (high_mask_shift_operand): New.


Indeed Jim introduced the pattern with 666fdc46bc8, but the clobber was
added later in 36ec3f57d305 ("RISC-V: Fix bad insn splits with
paradoxical subregs"). That commit is attributed to Jakub, and with Jim
not being super active these days, I tried reaching out to this cc list.


Sorry, I pasted the wrong sha-id in my orig msg; it needs to be 36ec3f57d305.




You might find further discussion in the gcc-patches archives.


I'll dig some more.





Also could you please explain where the clobber itself is allocated ?


I'd expect the combiner.   There's going to be 2 or more insns that 
are combined to create this pattern where the output of one feeds a 
subsequent insn and dies.  So it's effectively a temporary.  Combine 
will re-use that temporary as the clobbered operand.


Makes sense.

Thx,
-Vineet



-Ofast/-ffast-math and SPEC 511.pov miscompilation with gcc 13.0

2023-02-02 Thread Vineet Gupta

Hi,

I've noticed SPEC 2017 511.pov failing for RV64 on bleeding-edge gcc.
This is with -Ofast -march=rv64gcv_zba_zbb_zbc_zbs.
The problem also happens with -O3 -ffast-math (and goes away with
fast-math removed).


I've bisected it to b7fd7fb50111 ("frange: drop endpoints to min/max
representable numbers for -ffinite-math-only").
With that commit, evrp is eliding a bunch of if conditions as they
pertain to inf (in the code snippet below, it elides code corresponding
to lines 1401-1418, with line 1417 being elided causing the failure).


I think I know the answer, but just wanted to confirm if that is the
intended behavior given -Ofast / -ffast-math.


Thx,
-Vineet


|New->Angle = __builtin_huge_val(); switch(New->Type) ||{ ||... 
||1246: if(Parse_Camera_Mods(New) == false) ||EXIT ||... ||} 1401: 
if (New->Up[X] == __builtin_huge_val|()|) { ... } 1417: if 
(New->Angle != |||__builtin_huge_val|()) ||{ ||1419: if ((New->Type == 
PERSPECTIVE_CAMERA) |||


Re: -Ofast/-ffast-math and SPEC 511.pov miscompilation with gcc 13.0

2023-02-02 Thread Vineet Gupta




On 2/2/23 15:38, Vineet Gupta wrote:

Hi,

I've noticed SPEC 2017, 511.pov failing for RV64 on bleeding edge gcc.
This is with -Ofast -march=rv64gcv_zba_zbb_zbc_zbs.
Problem also happens with -O3 -ffast-math (and goes away with 
fast-math removed)


I've bisected it to b7fd7fb50111 ("frange: drop endpoints to min/max 
representable numbers for -ffinite-math-only")
With that commit evrp is eliding a bunch of if conditions as they 
pertain to inf (in code snippet below, it elides code corresponding to 
lines 1401-1418 with line 1417 being elided causing the failure).


I think I know the answer, but  just wanted to confirm if that is the 
intended behavior given -Ofast / -ffast-math


Thx,
-Vineet


Apologies, mailer munged the code snippet

    New->Angle = __builtin_huge_val();
...
   switch(New->Type)
   {
...
1246:    if(Parse_Camera_Mods(New) == false)
  EXIT
...
    }

1401:   if (New->Up != __builtin_huge_val())
    {
...
    }

1417:   if (New->Angle != __builtin_huge_val())
    {
1419:   if ((New->Type == PERSPECTIVE_CAMERA) ||


Re: -Ofast/-ffast-math and SPEC 511.pov miscompilation with gcc 13.0

2023-02-03 Thread Vineet Gupta



On 2/2/23 23:36, Richard Biener wrote:

There's a CLOSED INVALID bug in bugzilla
about the povray failure as well.


Thx for the pointer ! For the record it is
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107021

> The question is what to do with 511.povray_r as we want to support 
SPECs with -Ofast.


I'm curious about Martin's comment above.
So SPEC with -Ofast is just not supported anymore unless
-fno-finite-math-only is added?

-Vineet



ideas on PR/109279

2023-03-31 Thread Vineet Gupta

Hi Jeff, Kito,

I need some ideas on how to proceed with PR/109279, pertaining to both
the longer-term direction and a short-term fix.

First the executive summary:

long long f(void)
{
  return 0x0101010101010101ull;
}

Up until gcc 12 this used to generate const pool type access.

    lui    a5,%hi(.LANCHOR0)
    ld    a0,%lo(.LANCHOR0)(a5)
    ret
.LC0:
    .dword    0x101010101010101

After commit 2e886eef7f2b ("RISC-V: Produce better code with complex 
constants [PR95632] [PR106602] ") it gets synthesized to following


    li      a0,0x0101
    addi    a0,a0,0x0101
    slli    a0,a0,16
    addi    a0,a0,0x0101
    slli    a0,a0,16
    addi    a0,a0,0x0101
    ret

Granted, the const pool may or may not be preferred by a specific uarch;
will the long-term approach be to have a cost model for const pool vs.
synthesizing?


The second aspect is to improve the horror above. Per chat on IRC, 
pinskia suggested we relax the in_splitter constraint in 
riscv_move_integer, as the combine issue holding it back is now fixed - 
after commit 61bee6aed26eb30.


That brings it down to the somewhat reasonable:

    li      a5,0x0101
    addi    a5,a5,0x0101
    mv      a0,a5
    slli    a5,a5,32
    add     a0,a5,a0
    ret

I can spin a minimal patch; will that be acceptable for gcc 13.1 if it
is testsuite clean?


Thx,
-Vineet


Followup on PR/109279: large constants on RISCV

2023-06-01 Thread Vineet Gupta

Hi Jeff,

I finally got around to collecting various observations on PR/109279 - 
more importantly the state of large constants in RV backend, apologies 
in advance for the long email.


It seems the various commits in this area have improved the original
test case of 0x1010101_01010101:


 Before 2e886eef7f2b        | With 2e886eef7f2b      | With 0530254413f8   | With c104ef4b5eb1
   (const pool)             | define_insn_and_split  | "splitter relaxed   |
                            | "*mvconst_internal"    |  new pseudos"       |
----------------------------+------------------------+---------------------+------------------
 lui  a5,%hi(.LANCHOR0)     | li   a0,0x0101         | li   a5,0x0101      | li   a5,0x0101
 ld   a0,%lo(.LANCHOR0)(a5) | addi a0,0x0101         | addi a5,a5,0x0101   | addi a5,a5,0x0101
 ret                        | slli a0,a0,16          | mv   a0,a5          | slli a0,a5,32
                            | addi a0,a0,0x0101      | slli a5,a5,32       | add  a0,a0,a5
                            | slli a0,a0,16          | add  a0,a5,a0       | ret
                            | addi a0,a0,0x0101      | ret                 |
                            | ret                    |                     |

But the same commits seem to have regressed Andrew's test from the same
PR (which is the theme of this email).

The seemingly contrived test turned out to be much more than I'd hoped for.

   long long f(void)
   {
 unsigned t = 0x101_0101;
 long long t1 = t;
 long long t2 = ((unsigned long long )t) << 32;
 asm("":"+r"(t1));
 return t1 | t2;
   }

  Before 2e886eef7f2b  |   With 2e886eef7f2b    | With 0530254413f8
    (ideal code)   | define_insn_and_split  | "splitter relaxed new
   |    |  pseudos"
   li   a0,0x101   |    li   a5,0x101   |    li a0,0x101_
   addi a0,a0,0x101    |    addi a5,a5,0x101    |    addi a0,a0,0x101
   slli a5,a0,32   |    mv   a0,a5  |    li a5,0x101_
   or   a0,a0,a5   |    slli a5,a5,32   |    slli a0,a0,32
   ret |    or   a0,a0,a5   |    addi a5,a5,0x101
   |    ret |    or   a0,a5,a0
    |    ret

As a baseline, RTL just before cse1 (in 260r.dfinit) in all of above is:

   # lower word

   (insn 6 2 7 2 (set (reg:DI 138)
    (const_int [0x101]))  {*movdi_64bit}

   (insn 7 6 8 2 (set (reg:DI 137)
    (plus:DI (reg:DI 138)
    (const_int [0x101]))) {adddi3}
 (expr_list:REG_EQUAL (const_int [0x1010101]) )

   (insn 5 8 9 2 (set (reg/v:DI 134 [ t1 ])
    (reg:DI 136 [ t1 ])) {*movdi_64bit}

   # upper word (created independently)

   (insn 9 5 10 2 (set (reg:DI 141)
    (const_int [0x101]))  {*movdi_64bit}

   (insn 10 9 11 2 (set (reg:DI 142)
    (plus:DI (reg:DI 141)
    (const_int [0x101]))) {adddi3}

   (insn 11 10 12 2 (set (reg:DI 140)
    (ashift:DI (reg:DI 142)
    (const_int 32 [0x20]))) {ashldi3}
   (expr_list:REG_EQUAL (const_int [0x1010101]))

   # stitch them
   (insn 12 11 13 2 (set (reg:DI 139)
    (ior:DI (reg/v:DI 134 [ t1 ])
    (reg:DI 140))) "const2.c":7:13 99 {iordi3}


Prior to 2e886eef7f2b, cse1 could do its job: finding oldest equivalent 
registers for the fragments of const and reusing the reg.


   (insn 7 6 8 2 (set (reg:DI 137)
    (plus:DI (reg:DI 138)
    (const_int [0x101]))) {adddi3}
    (expr_list:REG_EQUAL (const_int [0x1010101])))
   [...]

   (insn 11 10 12 2 (set (reg:DI 140)
    (ashift:DI (reg:DI 137)
 ^   OLD EQUIV REG
    (const_int 32 [0x20]))) {ashldi3}
    (expr_list:REG_EQUAL (const_int [0x1010101_])))


With 2e886eef7f2b, define_insn_and_split "*mvconst_internal" recog() 
kicks in during cse1, eliding insns for a const_int.


   (insn 7 6 8 2 (set (reg:DI 137)
    (const_int [0x1010101])) {*mvconst_internal}
    (expr_list:REG_EQUAL (const_int [0x1010101])))
   [...]

   (insn 11 10 12 2 (set (reg:DI 140)
    (const_int [0x1010101_])) {*mvconst_internal}
    (expr_list:REG_EQUAL (const_int  [0x1010101_]) ))

Eventually split1 breaks it up using the same mvconst_internal splitter,
but the cse opportunity has been lost.
*This is now a baseline for large constant handling in the RV backend
which we all need to be aware of.*



(2) Now on to the nuances as to why things get progressively worse after 
commit 0530254413f8.


It all seems to get down to register allocation passes:

sched1 before 0530254413f8

   ;; 0--> b  0: i  22 r140=0x101    :alu
   ;; 1--> b  0: i  20 r137=0x101    :alu
   ;; 2--> b  0: i  23 r140=r140+0x101   :alu
   ;; 3--> b  0: i  21 r137=r137+0x101   :alu
   ;; 4--> b  0: i  24 r140=r140<<0x20   :alu
   ;; 5--> b  0: i  25 r136=r137 :alu
   ;; 6--> b  0: i   8 r136=asm_operands :nothing
   ;; 7--> b  0: i  17 a0=r136|r140  :alu
   ;; 8--> b  0: i  18 use a0    :nothing

sched1 with 053

Re: Followup on PR/109279: large constants on RISCV

2023-06-12 Thread Vineet Gupta

Hi Jeff,

Thx for the detailed explanation and insight.

On 6/7/23 16:44, Jeff Law wrote:
With 2e886eef7f2b, define_insn_and_split "*mvconst_internal" recog() 
kicks in during cse1, eliding insns for a const_int.


    (insn 7 6 8 2 (set (reg:DI 137)
 (const_int [0x1010101])) {*mvconst_internal}
 (expr_list:REG_EQUAL (const_int [0x1010101])))
    [...]

    (insn 11 10 12 2 (set (reg:DI 140)
 (const_int [0x1010101_])) {*mvconst_internal}
 (expr_list:REG_EQUAL (const_int  [0x1010101_]) ))
Understood.  Not ideal, but we generally don't have good ways to limit 
patterns to being available at different times during the optimization 
phase.  One thing you might want to try (which I thought we used at 
one point) was make the pattern conditional on cse_not_expected.  The 
goal would be to avoid exposing the pattern until a later point in the 
optimizer pipeline.  It may have been the case that we dropped that 
over time during development.  It's all getting fuzzy at this point.


Gave this a try and it seems to fix Andrew's test, but it then regresses
the actual large const case 0x1010101_01010101: the mem-to-const_int
transformation was being done in cse1, which no longer happens, and the
const pool from the initial expand remains all the way into the
generated asm. I don't think we want to go back to that state.






Eventually split1 breaks it up using same mvconst_internal splitter, 
but the cse opportunity has been lost.
Right.  I'd have to look at the pass definitions, but I suspect the 
splitting pass where this happens is after the last standard CSE pass. 
So we don't get a chance to CSE the constant synthesis.


Yep, split1 and friends happen after cse1 and cse2. At -O2 gcse doesn't
kick in, and even if forced to, it is currently limited in what it can
do, more so given this is post-reload.




*This is a now a baseline for large consts handling for RV backend 
which we all need to be aware of*.
Understood.  Though it's not as bad as you might think :-)  You can 
spend an inordinate amount of time improving constant synthesis, 
generate code that looks really good, but in the end it may not make a 
bit of different in real performance.  Been there, done that.  I'm not 
saying we give up, but we need to keep in mind that we're often better 
off trading a bit on the constant synthesis if doing so helps code 
where those constants get used.


Understood :-) I was coming to the same realization and this seems like
a good segue into switching topics and investigating post-reload gcse
for const rematerialization, another pesky issue with RV and likely to
have a bigger impact across a whole bunch of workloads.


FWIW, IRA for latter case only, emits additional REG_EQUIV notes 
which could also be playing a role.
REG_EQUAL notes get promoted to REG_EQUIV notes in some cases. And 
when other equivalences are discovered it may create a REG_EQUIV note 
out of thin air.


The REG_EQUIV note essentially means that everywhere the register 
occurs you can validly (from a program semantics standpoint) replace 
the register with the value.  It might require reloading, but it's a 
valid semantic transformation which may reduce register pressure -- 
especially for constants that were subject to LICM.


Contrast to REG_EQUAL which creates an equivalence at a particular 
point in the IL, but the equivalence may not hold elsewhere in the IL.


Ok. From reading gccint, it seems REG_EQUIV is a stronger form of
equivalence and seems to be preferred by post-reload passes, while
REG_EQUAL is more of use pre-reload.



I would also look at reload_cse_regs, which should give us some 
chance at seeing the value reuse if/when IRA/LRA muck things up.


I'll be out of office for the rest of week, will look into this once I'm 
back.


Thx,
-Vineet


Possible issue with ARC gcc 4.8

2015-07-03 Thread Vineet Gupta
Hi,

I have the following test case (reduced from Linux kernel sources) and it seems
gcc is optimizing away the first loop iteration.

arc-linux-gcc -c -O2 star-9000857057.c -fno-branch-count-reg --save-temps -mA7

--->8-
static inline int __test_bit(unsigned int nr, const volatile unsigned long *addr)
{
 unsigned long mask;

 addr += nr >> 5;
#if 0
nr &= 0x1f;
#endif
 mask = 1UL << nr;
 return ((mask & *addr) != 0);
}

int foo (int a, unsigned long *p)
{
  int i;
  for (i = 63; i>=0; i--)
  {
  if (!(__test_bit(i, p)))
   continue;
  a += i;
  }
  return a;
}
--->8-

gcc generates following

--->8-
.global foo
.type   foo, @function
foo:
ld_s r2,[r1,4]  < dead code
mov_s r2,63 
.align 4
.L2:
sub r2,r2,1<-SUB first
cmp r2,-1
jeq.d [blink]
lsr r3,r2,5   <- BUG: first @mask is (1 << 62) NOT (1 << 63)
.align 2
.L4:
ld.as r3,[r1,r3]
bbit0.nd r3,r2,@.L2
add_s r0,r0,r2
sub r2,r2,1
cmp r2,-1
bne.d @.L4
lsr r3,r2,5
j_s [blink]
.size   foo, .-foo
.ident  "GCC: (ARCv2 ISA Linux uClibc toolchain 
arc-2015.06-rc1-21-g21b2c4b83dfa)
4.8.4"
--->8-

For the initial 32 loop iterations, this test is effectively doing a 64-bit
operation, e.g. (1 << 63) in a 32-bit regime. Is this supposed to be undefined,
truncated to zero, or port specific?

If it is truncated to zero, then the generated code above is not correct, as it
needs to elide not just the first iteration (corresponding to i = 63) but
iterations 63..32.

Further, the ARCompact ISA provides that instructions involving bitpos operands
(BSET, BCLR, LSL) can take any number whatsoever, but the core will only use the
lower 5 bits (so the bitpos is clamped to 0..31 without needing to do that in code).

So is this a gcc bug, or some spec misinterpretation?

TIA,
-Vineet


Re: Possible issue with ARC gcc 4.8

2015-07-05 Thread Vineet Gupta
On Friday 03 July 2015 07:15 PM, Richard Biener wrote:
> On Fri, Jul 3, 2015 at 3:10 PM, Vineet Gupta  
> wrote:
>> Hi,
>>
>> I have the following test case (reduced from Linux kernel sources) and it 
>> seems
>> gcc is optimizing away the first loop iteration.
>>
>> arc-linux-gcc -c -O2 star-9000857057.c -fno-branch-count-reg --save-temps 
>> -mA7
>>
>> --->8-
>> static inline int __test_bit(unsigned int nr, const volatile unsigned long 
>> *addr)
>> {
>>  unsigned long mask;
>>
>>  addr += nr >> 5;
>> #if 0
>> nr &= 0x1f;
>> #endif
>>  mask = 1UL << nr;
>>  return ((mask & *addr) != 0);
>> }
>>
>> int foo (int a, unsigned long *p)
>> {
>>   int i;
>>   for (i = 63; i>=0; i--)
>>   {
>>   if (!(__test_bit(i, p)))
>>continue;
>>   a += i;
>>   }
>>   return a;
>> }
>> --->8-
>>
>> gcc generates following
>>
>> --->8-
>> .global foo
>> .type   foo, @function
>> foo:
>> ld_s r2,[r1,4]  < dead code
>> mov_s r2,63
>> .align 4
>> .L2:
>> sub r2,r2,1<-SUB first
>> cmp r2,-1
>> jeq.d [blink]
>> lsr r3,r2,5   <- BUG: first @mask is (1 << 62) NOT (1 << 63)
>> .align 2
>> .L4:
>> ld.as r3,[r1,r3]
>> bbit0.nd r3,r2,@.L2
>> add_s r0,r0,r2
>> sub r2,r2,1
>> cmp r2,-1
>> bne.d @.L4
>> lsr r3,r2,5
>> j_s [blink]
>> .size   foo, .-foo
>> .ident  "GCC: (ARCv2 ISA Linux uClibc toolchain 
>> arc-2015.06-rc1-21-g21b2c4b83dfa)
>> 4.8.4"
>> --->8-
>>
>> For initial 32 loop operations, this test is effectively doing 64 bit 
>> operation,
>> e.g. (1 << 63) in 32 bit regime. Is this supposed to be undefined, truncated 
>> to
>> zero or port specific.
>>
>> If it is truncate to zero then generated code below is not correct as it 
>> needs to
>> elide not just the first iteration (corresponding to i = 63) but 63..32
>>
>> Further ARCompact ISA provides that instructions involving bitpos operands 
>> BSET,
>> BCLR, LSL can any number whatsoever, but core will only use the lower 5 bits 
>> (so
>> clamping the bitpos to 0..31 w/o need for doing that in code.
>>
>> So is this a gcc bug, or some spec misinterpretation,.
> It is the C language standard that says that shifts like this invoke
> undefined behavior.

Right, but the compiler is a program nevertheless and it knows what to do when
it sees 1 << 62.
It's not like there is an uninitialized variable or something which will provide
unexpected behaviour.
More importantly, the question is: can ports define a specific behaviour for such
cases, and would that be sufficient to guarantee the semantics?

The point being the ARC ISA provides a neat feature where the core only considers
the lower 5 bits of bitpos operands. Thus we can make such behaviour not only
deterministic in the context of ARC, but also optimal, eliding the need for
explicit masking/clamping to 5 bits.

-Vineet


Re: Possible issue with ARC gcc 4.8

2015-07-06 Thread Vineet Gupta
On Monday 06 July 2015 12:04 PM, Richard Biener wrote:
>> The point being ARC ISA provides a neat feature where core only considers 
>> lower 5
>> > bits of bitpos operands. Thus we can make such behaviour not only 
>> > deterministic in
>> > the context of ARC, but also optimal, eliding the need for doing specific
>> > masking/clamping to 5 bits.
> There is SHIFT_COUNT_TRUNCATED which allows you to combine
> b & 31 with the shift value if you instead write a << (b & 31).

Awesome, this is what we need for ARC then, along with the user-code change to
add the masking, so the combiner will nullify the mask operation.

>
> Of course a << 63 is still undefined behavior regardless of target behavior.

Sure

Thx,
-Vineet



.debug_frame not generated by ARC gas

2016-06-16 Thread Vineet Gupta
Hi,

ARC Linux has an in-kernel dwarf unwinder which has historically consumed
.debug_frame. The kernel is built with -gdwarf-2 and 
-fasynchronous-unwind-tables
toggles which until recently used to generate both .debug_frame and .eh_frame -
latter being discarded by the kernel linker script.

With a recent ARC gas change to support asm cfi pseudo-ops, dwarf info is now
generated by gas and NOT gcc. But more importantly, we no longer get .debug_frame
with the above 2 toggles. If we drop -fasynchronous-unwind-tables then .debug_frame
is generated.

So here are the questions:
1. Is it OK to drop the -fasynchronous-unwind-tables toggle and still assume that
"precise" unwind info will be generated (as mentioned in the GNU documentation)?
2. If not, how do we coax gas into generating .debug_frame even in the presence of
-fasynchronous-unwind-tables?

Thx,
-Vineet


Re: .debug_frame not generated by ARC gas

2016-06-22 Thread Vineet Gupta

On Thursday 16 June 2016 09:44 PM, Vineet Gupta wrote:
> Hi,
>
> ARC Linux has an in-kernel dwarf unwinder which has historically consumed
> .debug_frame. The kernel is built with -gdwarf-2 and 
> -fasynchronous-unwind-tables
> toggles which until recently used to generate both .debug_frame and .eh_frame 
> -
> latter being discarded by the kernel linker script.
>
> With a recent ARC gas change to support asm cfi psuedo-ops - dwarf info is now
> generated by gas and NOT gcc. But more importantly, we no longer get 
> .debug_frame
> with above 2 toggles. If we drop -fasynchronous-unwind-tables then 
> .debug_frame is
> generated.
>
> So here are the questions:
> 1. Is it OK to drop -fasynchronous-unwind-tables toggle and still assume that
> "precise" unwind info will be generated (as mentioned in gnu documentation)
> 2. If not, how do we coax gas to generate .debug_frame even in presence of
> -fasynchronous-unwind-tables

Ping! Apologies for the explicit CC. Any response would be much appreciated!

-Vineet


Interpretation of DWARF FDE->CIE_pointer field for .debug_frame

2013-06-23 Thread Vineet Gupta
Hi,

I had a question about interpretation of FDE's CIE_pointer field (for 
.debug_frame)

The spec says (from the dwarf4 version, although it really doesn't matter):

"2. CIE_pointer (4 or 8 bytes, see Section 7.4)
A constant offset into the .debug_frame section that denotes the CIE that is
associated with this FDE."

Does "offset" above mean the offset from the current location (in the FDE) to the
CIE, or does it mean the offset from the start of .debug_frame to the CIE? Per
Ian Lance Taylor's blog (http://www.airs.com/blog/archives/460), if I'm
interpreting it correctly, it seems to be the latter.

However, the example given in the spec seems to have a direct reference to the CIE:

"Address   Value   Comment
---------------------------
fde       40      ...
fde+4     cie     CIE_pointer"   --> direct reference to the CIE (not relative)
The context is the ARC GNU toolchain from Synopsys. ARC gcc 4.8 currently
generates something like this:

        .section        .debug_frame,"",@progbits
.Lframe0:
        .4byte  @.LECIE0-@.LSCIE0       --> CIE
.LSCIE0:
        .4byte  0x

.LECIE0:

...
.LSFDE0:
        .4byte  @.LEFDE0-@.LASFDE0      --> FDE
.LASFDE0:
        .4byte  @.Lframe0               --> CIE_pointer - direct reference to the
                                            CIE (not an offset from the start of
                                            .debug_frame)


This direct reference to the start of the CIE causes objdump to reference an
invalid CIE and hence print invalid CFI - even though the CFI itself is valid -
since the code_factor and such get seeded from a bogus CIE.

0060 0014 80e0c000 FDE cie=48b25ff8   pc=80a680d4..80a6810a
...  ^^


Looking at the gcc 4.8 source (gcc/dwarf2out.c), it seems to hint that the
CIE_pointer needs to be relative to .debug_frame (just as I thought):

+  if (for_eh)
+    dw2_asm_output_delta (4, l1, section_start_label, "FDE CIE offset");
+  else
+    dw2_asm_output_offset (DWARF_OFFSET_SIZE, section_start_label,
+                           debug_frame_section, "FDE CIE offset");

However, to avoid generating a direct reference, a target needs to implement
ASM_OUTPUT_DWARF_OFFSET, which most targets do not.

So the questions are:
1. Is the current encoding of the CIE_pointer in the FDE correct?
2. If yes, are objdump/readelf at fault?
3. If not, why do most targets not implement the above macro?

-Vineet


Re: Interpretation of DWARF FDE->CIE_pointer field for .debug_frame

2013-06-24 Thread Vineet Gupta
On 06/24/2013 12:33 PM, Jakub Jelinek wrote:
> On Mon, Jun 24, 2013 at 12:06:27PM +0530, Vineet Gupta wrote:
>> I had a question about interpretation of FDE's CIE_pointer field (for 
>> .debug_frame)
>>
>> The spec say (from dwarf4 version although it really doesn't matter):
>>
>> "2. CIE_pointer (4 or 8 bytes, see Section 7.4)
>> A constant offset into the .debug_frame section that denotes the CIE that is
>> associated with this FDE."
>>
>> Does "offset" above mean offset from current location (in FDE) to CIE or 
>> does it
>> mean offset from start of .debug_frame to the CIE. Per Ian Lance Taylor's 
>> blog,
>> and if I'm interpreting it correctly, 
>> (http://www.airs.com/blog/archives/460) it
>> seems to be latter.
> CIE_pointer in .debug_frame is relative to the start of the .debug_frame
> section.  In .eh_frame section it is encoded based on the selected encoding,
> often relative to the start of the CIE_pointer.
>
>> ...
>> .LSFDE0:
>> .4byte@.LEFDE0-@.LASFDE0   --> FDE
>> .LASFDE0:
>> .4byte@.Lframe0--> CIE pointer - direct reference to CI 
>> (not
>> offset from start of .debug_frame)
> This looks fine.

Pardon me if I sound dense (not really my area of expertise); however, the 2nd
word in the FDE above (@.Lframe0) is a direct reference to the start of
.debug_frame. Shouldn't it be something like

@.Lframe0 - @.Lframe0

i.e. zero?

Or do you think the assembler/linker needs to "interpret it" like that?

>> 
>>
>> This direct reference to start of CIE is causing objdump to reference 
>> invalid CIE
>> and hence print invalid CFI - although the CFI itself is valid since the
>> code_factor and such are seeded from a bogus CIE.
>>
>> 0060 0014 80e0c000 FDE cie=48b25ff8   pc=80a680d4..80a6810a
>> ...  ^^
> The 48b25ff8 looks wrong though, it would really surprise me if .debug_frame
> section was more than 1GB big.  So perhaps your assembler or linker don't
> handle it properly?

Exactly, although this is a Linux kernel image which is linked at the start of
the untranslated address space, i.e. 0x8000_ onwards. The point however is that
the cie value above should read zero - not 0x8abcdefg - since it is relative to
the start of .debug_frame.

-Vineet


Re: Interpretation of DWARF FDE->CIE_pointer field for .debug_frame

2013-06-24 Thread Vineet Gupta
On 06/24/2013 12:58 PM, Jakub Jelinek wrote:
> On most targets, .debug_* sections are placed at address 0, so absolute
> relocations are the same as relocations relative to the start of the
> section.
> [snipped]
>
> So, either .debug_* sections are placed at address 0 and then absolute
> relocations will do the trick, or you need some kind of section relative
> relocation (e.g. ia64 has it I think).  This isn't specific just to
> .debug_frame, e.g. DW_FORM_strp/DW_FORM_sec_offset encoded values in 
> .debug_info
> or .debug_abbrev offsets in .debug_info CU header, DW_OP_call_ref arguments,
> .debug_aranges/.debug_pubtypes/.debug_pubnames offsets to .debug_info all
> have these requirements.
>
>   Jakub

Aha, I see what's happening. For historical reasons, the ARC Linux kernel stack
unwinder relies on .debug_frame (vs. .eh_frame) for stack unwinding. Being
non-allocatable, the section would default to address zero, hence the original
absolute relocations work for most ports. However, in our case we force it
allocatable in kernel builds, so it is relocated to 0x8abcdefg, and thus the use
of absolute relocations ends up generating the invalid reference.

Thus it seems we do need the special section-relative reference.

Thanks a bunch for clarifying.

--Vineet


Re: Interpretation of DWARF FDE->CIE_pointer field for .debug_frame

2013-06-24 Thread Vineet Gupta
On 06/25/2013 12:35 AM, Richard Henderson wrote:
> On 06/24/2013 12:37 AM, Vineet Gupta wrote:
>> Aha, I see what's happening. For historical reasons, ARC Linux kernel stack
>> unwinder relies on .debug_frame (vs. .eh_frame) for stack unwinding. Being 
>> non
>> allocatable it would default to address zero hence the orig absolute 
>> relocations
>> would work for most ports. However in our case we force it allocatable in 
>> kernel
>> builds and thus it is relocated to 0x8abcdefg, thus the usage of absolute
>> relocations ends up generating the invalid reference.
>>
>> Thus it seems we do need the special section relative reference.
>>
>> Thanks a bunch for clarifying.
> It seems like it would be easier to change the kernel to use .eh_frame
> rather than adding relocation types and changing the tool chain...

Yes, that is what we concluded as well. Although the current issue manifests only
as an objdump/readelf splat on the kernel binary - the unwinder itself in the
kernel works OK - .eh_frame is what others use, so we might as well follow suit
given that ARC gcc 4.8 has now started generating one.

-Vineet


Subreg promotion not happening for the test case

2023-10-31 Thread Vineet Gupta

Hi Jeff, Robin,

This morning I was talking about the following test, where a promoted 
subreg is not being generated during expand, whereas for the RISC-V 
ABI/ISA it should be.


gcc.c-torture/compile/20070121.c -Os -march=rv64gc -mabi=lp64d

expand_gimple_basic_block
  expand_gimple_stmt
    ...
    expand_expr_real
      ...
      case SSA_NAME:
        if (REG_P (decl_rtl)
            && dmode != BLKmode
            && GET_MODE (decl_rtl) != dmode)

Here it seems to be taking the promote_ssa_mode () codepath and not 
promote_function_mode.


Do note this only happens for -Os. The tree dump is the same for both -Os and 
-O2; expand for -Os misses setting the promoted subreg.


Thx,
-Vineet