[Bug debug/118837] Interpretation of DW_FORM_data*

2025-11-21 Thread tromey at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118837

--- Comment #10 from Tom Tromey  ---
(In reply to Simon Marchi from comment #9)

> > The main problem with this approach is that the answer doesn't just depend
> > on the tag or the attribute.  It can depend on other DIEs as well, for
> > instance I believe a variant part's discriminant value is sign-extended,
> > or not, depending on the type of the relevant field.  This of course is
> > difficult to implement, test, etc.
> 
> Regarding variants, DWARF 5 already says:
[...]

Well that maybe wasn't the best example, since it's more clear than
many other places in the standard.

At the same time, this would be simpler for DWARF readers if it were
simply spelled out by the form.  Then the reader wouldn't have to
examine some other DIE to try to find out whether sign extension is
needed.

> Maybe another solution would be to introduced fixed-size forms with explicit
> signedness, DW_FORM_udata{1,2,4,8} and DW_FORM_sdata{1,2,4,8}.

This would be a distinct improvement.
The main point would be to simplify reading, at least at some
distant time when we can drop support for older versions of DWARF.

[Bug debug/118837] Interpretation of DW_FORM_data*

2025-11-20 Thread simon.marchi at polymtl dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118837

--- Comment #9 from Simon Marchi  ---
(In reply to Tom Tromey from comment #8)
> (In reply to Simon Marchi from comment #7)
> 
> > ... so I am afraid that any attempt at addressing this problem will be met
> > with a "it's a quality of implementation issue", but I think it's worth
> > trying.  I think it would be better if DWARF didn't allow being ambiguous.
> 
> Even just making this a hard rule would be an improvement.
> 
> IIRC in one of the old threads the answer was that producers and consumers
> should agree, but to me this is clearly a bad answer, since the DWARF
> standard is precisely the mechanism by which they agree.

Agreed :).

> > So far my understanding is the problem is: you could have an attribute,
> > let's say DW_AT_const_value, with form DW_AT_data1, and value 0x80.  As a
> > consumer, how do you know if that 0x80 means -1 or 128?  You could have
> > compiler-1 people saying "it should obviously be interpreted as a signed
> > constant" and compiler-2 people saying "it should obviously be interpreted
> > as an unsigned constant".  And then, as a consumer, you are in a pickle.
> 
> Correct.  And the decision varies based on context.
> 
> > 1. The easy way: remove the DW_FORM_data forms from the constant class. 
> > This only leaves DW_FORM_udata and DW_FORM_sdata, which are define the
> > signedness explicitly.  The advantages: it's an easy change for everybody
> > (in the spec, in producers, in consumers).  How many ways of describing a
> > constant does DWARF really need?  The downside is obviously a possible
> > increase in debug info size.  But would it be significant?  I would like to
> > prototype it an see how many values in a real-world DWARF file would now
> > take an extra byte because of this.
> 
> Unfortunately DWARF seems to really love these space-saving
> micro-optimizations.
> Personally I think sleb/uleb is enough for nearly everything (basically all
> values not involving relocations).  But, e.g.,  DWARF added DW_FORM_strx3,
> I guess to save one byte sometimes?
> 
> Anyway one problem with this approach is that it provides no guidance
> for DWARF 3-5.  Still, it would be hugely better, be easy to implement, etc.
> But I guess would be a pretty big change from existing practice.

I don't think we can really fix DWARF 3-5.  Even if we did provide some
retroactive guidance on what the producers should have done with DWARF 3-5, the
compilers are out there already producing DWARF.

> > 2. A more complicated way: for each attribute that can be of the constant
> > class, define a default signedness (I imagine an extra column in Table 7.5:
> > Attribute encodings).  If the form does not specify the signedness (i.e.
> > DW_FORM_data), then the consumer would refer to that table to know if the
> > value should be treated as signed or unsigned.
> 
> This is more or less the approach I took to fixing this in gdb: I went
> through every spot and tried to determine the correct answer.  I don't
> think I quite finished.
> 
> And there are spots that are "confused".  That is, compilers in practice
> will emit a DW_FORM_sdata if the value in question is signed, but will
> emit DW_FORM_data1 and expect this to be zero-extended.  This, to me,
> undermines the idea that the value or the context is "signed" or "unsigned".
> 
> The main problem with this approach is that the answer doesn't just depend
> on the tag or the attribute.  It can depend on other DIEs as well, for
> instance I believe a variant part's discriminant value is sign-extended,
> or not, depending on the type of the relevant field.  This of course is
> difficult to implement, test, etc.

Regarding variants, DWARF 5 already says:

  The value that selects a given variant may be represented in one of
  three ways. The variant entry may have a DW_AT_discr_value attribute whose
  value represents the discriminant value selecting this variant. The value of
this
  attribute is encoded as an LEB128 number. The number is signed if the tag
type
  for the variant part containing this variant is a signed type. The number is
  unsigned if the tag type is an unsigned type.

In other words, despite DW_AT_discr_value being of the class "constant",
according to the attribute encoding table (which in theory allows
DW_FORM_data*), the text forces the use of DW_FORM_udata/DW_FORM_sdata,
depending on the signedness of the discriminant.

But even if DW_FORM_data* forms were allowed here, I think that the "default
signedness" column I imagine could say "depends on the signedness of the
discriminant".

I didn't think about relocation needing fixed-size values.  Do values of the
"constant" class ever need to be relocated? I would guess not.

Regarding the space saving: I tried to modify gcc to always emit DW_FORM_udata
for unsigned constants and did a GDB build.  The .debug_info size increased
from 151,601,709 bytes to 152,605,488 bytes (+0.66%).  I would need to try to
force DW_FORM_udata only

[Bug debug/118837] Interpretation of DW_FORM_data*

2025-11-19 Thread tromey at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118837

--- Comment #8 from Tom Tromey  ---
(In reply to Simon Marchi from comment #7)

> ... so I am afraid that any attempt at addressing this problem will be met
> with a "it's a quality of implementation issue", but I think it's worth
> trying.  I think it would be better if DWARF didn't allow being ambiguous.

Even just making this a hard rule would be an improvement.

IIRC in one of the old threads the answer was that producers and consumers
should agree, but to me this is clearly a bad answer, since the DWARF
standard is precisely the mechanism by which they agree.

> So far my understanding is the problem is: you could have an attribute,
> let's say DW_AT_const_value, with form DW_AT_data1, and value 0x80.  As a
> consumer, how do you know if that 0x80 means -1 or 128?  You could have
> compiler-1 people saying "it should obviously be interpreted as a signed
> constant" and compiler-2 people saying "it should obviously be interpreted
> as an unsigned constant".  And then, as a consumer, you are in a pickle.

Correct.  And the decision varies based on context.

> 1. The easy way: remove the DW_FORM_data forms from the constant class. 
> This only leaves DW_FORM_udata and DW_FORM_sdata, which are define the
> signedness explicitly.  The advantages: it's an easy change for everybody
> (in the spec, in producers, in consumers).  How many ways of describing a
> constant does DWARF really need?  The downside is obviously a possible
> increase in debug info size.  But would it be significant?  I would like to
> prototype it an see how many values in a real-world DWARF file would now
> take an extra byte because of this.

Unfortunately DWARF seems to really love these space-saving
micro-optimizations.
Personally I think sleb/uleb is enough for nearly everything (basically all
values not involving relocations).  But, e.g.,  DWARF added DW_FORM_strx3,
I guess to save one byte sometimes?

Anyway one problem with this approach is that it provides no guidance
for DWARF 3-5.  Still, it would be hugely better, be easy to implement, etc.
But I guess would be a pretty big change from existing practice.

> 2. A more complicated way: for each attribute that can be of the constant
> class, define a default signedness (I imagine an extra column in Table 7.5:
> Attribute encodings).  If the form does not specify the signedness (i.e.
> DW_FORM_data), then the consumer would refer to that table to know if the
> value should be treated as signed or unsigned.

This is more or less the approach I took to fixing this in gdb: I went
through every spot and tried to determine the correct answer.  I don't
think I quite finished.

And there are spots that are "confused".  That is, compilers in practice
will emit a DW_FORM_sdata if the value in question is signed, but will
emit DW_FORM_data1 and expect this to be zero-extended.  This, to me,
undermines the idea that the value or the context is "signed" or "unsigned".

The main problem with this approach is that the answer doesn't just depend
on the tag or the attribute.  It can depend on other DIEs as well, for
instance I believe a variant part's discriminant value is sign-extended,
or not, depending on the type of the relevant field.  This of course is
difficult to implement, test, etc.

[Bug debug/118837] Interpretation of DW_FORM_data*

2025-11-18 Thread simon.marchi at polymtl dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118837

Simon Marchi  changed:

   What|Removed |Added

 CC||simon.marchi at polymtl dot ca

--- Comment #7 from Simon Marchi  ---
I volunteer to bring an issue to the DWARF committee to clarify this.  I'd like
to gather feedback about how you think it should be fixed.  There is already
the non-normative text:

If one of the DW_FORM_dataforms is used to represent a signed or
unsigned
integer, it can be hard for a consumer to discover the context necessary to
determine which interpretation is intended. Producers are therefore
strongly
encouraged to use DW_FORM_sdata or DW_FORM_udata for signed and
unsigned integers respectively, rather than DW_FORM_data.

... so I am afraid that any attempt at addressing this problem will be met with
a "it's a quality of implementation issue", but I think it's worth trying.  I
think it would be better if DWARF didn't allow being ambiguous.

So far my understanding is the problem is: you could have an attribute, let's
say DW_AT_const_value, with form DW_AT_data1, and value 0x80.  As a consumer,
how do you know if that 0x80 means -1 or 128?  You could have compiler-1 people
saying "it should obviously be interpreted as a signed constant" and compiler-2
people saying "it should obviously be interpreted as an unsigned constant". 
And then, as a consumer, you are in a pickle.

Here are some ideas on how to fix it:

1. The easy way: remove the DW_FORM_data forms from the constant class. 
This only leaves DW_FORM_udata and DW_FORM_sdata, which are define the
signedness explicitly.  The advantages: it's an easy change for everybody (in
the spec, in producers, in consumers).  How many ways of describing a constant
does DWARF really need?  The downside is obviously a possible increase in debug
info size.  But would it be significant?  I would like to prototype it an see
how many values in a real-world DWARF file would now take an extra byte because
of this.

2. A more complicated way: for each attribute that can be of the constant
class, define a default signedness (I imagine an extra column in Table 7.5:
Attribute encodings).  If the form does not specify the signedness (i.e.
DW_FORM_data), then the consumer would refer to that table to know if the
value should be treated as signed or unsigned.

Any other ideas?

[Bug debug/118837] Interpretation of DW_FORM_data*

2025-06-06 Thread tromey at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118837

--- Comment #6 from Tom Tromey  ---
I also looked through the LLVM DWARF writer.

In part it is a bit nicer because it does:

void DwarfUnit::addConstantValue(DIE &Die, bool Unsigned, uint64_t Val) {
  // FIXME: This is a bit conservative/simple - it emits negative values always
  // sign extended to 64 bits rather than minimizing the number of bytes.
  addUInt(Die, dwarf::DW_AT_const_value,
  Unsigned ? dwarf::DW_FORM_udata : dwarf::DW_FORM_sdata, Val);
}


That is, for this attribute at least, it doesn't try to compute
the narrowest possible form.

Looking at the other "confused" attributes:

if (Offset < 0)
  addSInt(MemberDie, dwarf::DW_AT_bit_offset, dwarf::DW_FORM_sdata,
  Offset);
else
  addUInt(MemberDie, dwarf::DW_AT_bit_offset, std::nullopt,
  (uint64_t)Offset);

.. i.e., LLVM does the same as GCC here.

And finally:

  if (DD->getDwarfVersion() == 3)
addUInt(MemberDie, dwarf::DW_AT_data_member_location,
dwarf::DW_FORM_udata, OffsetInBytes);
  else
addUInt(MemberDie, dwarf::DW_AT_data_member_location, std::nullopt,
OffsetInBytes);

... which is fine since it at least agrees with GCC that narrow
forms should not sign-extend.

[Bug debug/118837] Interpretation of DW_FORM_data*

2025-06-06 Thread tromey at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118837

--- Comment #5 from Tom Tromey  ---
> If the 64-bit value is
> negative, we emit DW_FORM_sdata, otherwise we emit whatever is smallest
> representation of the positive value (DW_FORM_data{1,2,4,8}, could perhaps
> use DW_FORM_udata).

I think I didn't properly understand this earlier.
If GCC emits "128" as "DW_FORM_data1 0x80", even for a could-be-signed
attribute, then that seems like a real problem.

I did find a spot where GCC does this:

  if (bit_offset < 0)
add_AT_int (die, DW_AT_bit_offset, bit_offset);
  else
add_AT_unsigned (die, DW_AT_bit_offset, (unsigned HOST_WIDE_INT)
bit_offset);

I looked at all calls to add_AT_unsigned and also found this applies
to DW_AT_data_member_location, DW_AT_const_value.

There's also this comment in add_scalar_info:

  /* If HOST_WIDE_INT is big enough then represent the bound as
 a constant value.  We need to choose a form based on
 whether the type is signed or unsigned.  We cannot just
 call add_AT_unsigned if the value itself is positive
 (add_AT_unsigned might add the unsigned value encoded as
 DW_FORM_data[1248]).  Some DWARF consumers will lookup the
 bounds type and then sign extend any unsigned values found
 for signed types.  This is needed only for
 DW_AT_{lower,upper}_bound, since for most other attributes,
 consumers will treat DW_FORM_data[1248] as unsigned values,
 regardless of the underlying type.  */

... which to me is just confirmation that this whole area is a mess.

[Bug debug/118837] Interpretation of DW_FORM_data*

2025-02-21 Thread tromey at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118837

--- Comment #4 from Tom Tromey  ---
(In reply to Jakub Jelinek from comment #3)

> And perhaps we could also try to optimize the DW_FORM_sdata cases if there
> is no ambiguity (e.g. for 8-bit signed contexts with negative value
> DW_FORM_data1 could be unambiguous).

Ok.  I think we're in agreement about the interpretation then?
To sum up:

* If the context is clearly signed, then DW_FORM_data* should
  sign-extend.
* If the context is clearly unsigned, then DW_FORM_data* should
  not sign-extend.
* If the context is ambiguous, then DW_FORM_[su]data should be used.

Perhaps ambiguous contexts should be clarified in the DWARF standard,
but that's a separate issue.

FWIW I think this is also the LLVM interpretation.  There are
definitely places in the DWARF emitter that choose DW_FORM_sdata
rather than some default.

If you both agree with the above I think we can just close this
bug and I'll fix gdb.

[Bug debug/118837] Interpretation of DW_FORM_data*

2025-02-12 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118837

Jakub Jelinek  changed:

   What|Removed |Added

 CC||jakub at gcc dot gnu.org,
   ||jason at gcc dot gnu.org,
   ||mark at gcc dot gnu.org

--- Comment #3 from Jakub Jelinek  ---
What GCC does right now is that at least when emitting the attributes from RTL
we don't know the sign nor precision, all we know is the value as 64-bit signed
integer.
Now, normally in RTL values are sign extended.  If the 64-bit value is
negative, we emit DW_FORM_sdata, otherwise we emit whatever is smallest
representation of the positive value (DW_FORM_data{1,2,4,8}, could perhaps use
DW_FORM_udata).
So, if we want to change anything, we'd need to track at least for the
dw_val_class_unsigned_const case track not just the value but also precision
and whether it is signed or not.  Now, if precision is the same as
constant_size (AT_unsigned (a)) * BITS_PER_UNIT, we don't really care about
extension and we can keep doing what we are, similarly if the MSB bit in
whatever constant_size we choose is clear, similarly if we know that the
corresponding type is unsigned.  We know the precision and signedness when we
see trees, but not when we just see RTL.
And perhaps we could also try to optimize the DW_FORM_sdata cases if there is
no ambiguity (e.g. for 8-bit signed contexts with negative value DW_FORM_data1
could be unambiguous).

[Bug debug/118837] Interpretation of DW_FORM_data*

2025-02-12 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118837

--- Comment #2 from Richard Biener  ---
I don't read any ambiguity.  How DW_FORM_data1 255 is interpreted depends on
context - but if the context only says 'int', then this is not enough
context to decide if  char 255 is supposed to be sign or
zero-extended to 'int'.  A reasonable interpretation would be that the
producer would only truncate the data when there is no such ambiguity,
either resulting in DW_FORM_data2 255 or in the consumer assuming
zero-extension(?)

IMO LLVM in this case is the one to blame for any ambiguity and it should
not truncate using DW_FORM_dataN when the result has the sign bit not zero.

[Bug debug/118837] Interpretation of DW_FORM_data*

2025-02-11 Thread tromey at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118837

--- Comment #1 from Tom Tromey  ---
Also there was an old thread about this:

https://lists.dwarfstd.org/pipermail/dwarf-discuss/2010-July/000862.html