Problems with dwarf-getmacros test

2017-05-08 Thread Ulf Hermann

Hi,

I frequently get failures from the run-dwarf-getmacros.sh test, on 
testfile-macros:0xb. The test repeatedly outputs "(null)" instead of the 
actual macros and then runs into the assert at dwarf-getmacros.c:50. The 
failure is very nondeterministic, though. I haven't found a reliable way 
to trigger it.


Further examination reveals that the __libdw_in_section check in 
READ_AND_RELOCATE (libdwP.h:656), when called from __libdw_read_offset 
seems to be bogus. The "return -1" in there is what produces the null 
results and ultimately the assert. Experiments show that the address is 
frequently not in the section we're checking there, but still valid. 
Just dropping the check makes the test succeed.


I'm currently at a loss about why this happens. One thing that strikes 
me was that the additional dbg_ret mechamism was added in 2012 with 
commit 775375e3, but the check in READ_AND_RELOCATE was not adapted then.


However, the address is also not necessarily in dbg_ret at that point. 
Checking dbg_ret in addition to dbg still fails sometimes, and also that 
wouldn't explain the nondeterminism.


Ulf


Re: Problems with dwarf-getmacros test

2017-05-09 Thread Mark Wielaard
On Mon, 2017-05-08 at 18:22 +0200, Ulf Hermann wrote:
> I frequently get failures from the run-dwarf-getmacros.sh test, on 
> testfile-macros:0xb. The test repeatedly outputs "(null)" instead of the 
> actual macros and then runs into the assert at dwarf-getmacros.c:50. The 
> failure is very nondeterministic, though. I haven't found a reliable way 
> to trigger it.

Is it only with testfile-macros?
All other testfiles always run correctly?

> Further examination reveals that the __libdw_in_section check in 
> READ_AND_RELOCATE (libdwP.h:656), when called from __libdw_read_offset 
> seems to be bogus. The "return -1" in there is what produces the null 
> results and ultimately the assert.

Do you have the whole call stack of that failed __libdw_read_offset
call? Which source line in tests/dwarf-getmacros.c prints the "(null)"?

> Experiments show that the address is 
> frequently not in the section we're checking there, but still valid. 
> Just dropping the check makes the test succeed.

I think this might be related to our "fake" CU and attributes we invent
in libdw/dwarf_getmacros.c (read_macros). See around this comment:

  /* We pretend this is a DW_AT_GNU_macros attribute so that
 DW_FORM_sec_offset forms get correctly interpreted as
 offset into .debug_macro.  */

If that is the issue then we might need to somehow make
READ_AND_RELOCATE and/or __libdw_in_section aware that the CU is fake
and the check isn't needed. In which case we probably need to add some
flag "fake" to the CU and pass the CU to the various __libdw_read_*
functions to disable that sanity check in READ_AND_RELOCATE.

> I'm currently at a loss about why this happens. One thing that strikes 
> me was that the additional dbg_ret mechamism was added in 2012 with 
> commit 775375e3, but the check in READ_AND_RELOCATE was not adapted then.

The READ_AND_RELOCATE macro is hard to read because it captures the
names of some of the variables it uses instead of getting them passed as
arguments. It took me a couple of times to double check what it does
seems correct. The check in READ_AND_RELOCATE is against (dbg,
sec_index, addr) checking that the address is inside the section for the
dbg Dwarf. While the __libdw_read_offset (dbg_ret, sec_ret, *ret) is for
the returned valued in the returned section in the returned Dwarf.

> However, the address is also not necessarily in dbg_ret at that point. 
> Checking dbg_ret in addition to dbg still fails sometimes, and also that 
> wouldn't explain the nondeterminism.

The nondeterminism is weird indeed.

Cheers,

Mark


Re: Problems with dwarf-getmacros test

2017-05-09 Thread Ulf Hermann
> Is it only with testfile-macros?
> All other testfiles always run correctly?

Yes, it only happens with the testfile-macros check at 0xb.

> Do you have the whole call stack of that failed __libdw_read_offset
> call? Which source line in tests/dwarf-getmacros.c prints the "(null)"?

I'm having a hard time triggering the problem in a controlled environment. So 
far I didn't manage to create a stack trace. The (null) is the result of 
passing a value of null to dwarf-getmacros.c:88. I get an errno of 
DWARF_E_INVALID_OFFSET at the assert failure. So I put some debug messages into 
all places where that is generated and this is how I found the source of this. 
It seems I can't trigger the effect if I put the debug output directly into 
__libdw_in_section (but maybe I was just very unlucky). If I put it in 
READ_AND_RELOCATE I can see that the check always fails before it outputs 
(null) or hits the assertion.

>> Experiments show that the address is 
>> frequently not in the section we're checking there, but still valid. 
>> Just dropping the check makes the test succeed.
> 
> I think this might be related to our "fake" CU and attributes we invent
> in libdw/dwarf_getmacros.c (read_macros). See around this comment:
> 
>   /* We pretend this is a DW_AT_GNU_macros attribute so that
>  DW_FORM_sec_offset forms get correctly interpreted as
>  offset into .debug_macro.  */
> 
> If that is the issue then we might need to somehow make
> READ_AND_RELOCATE and/or __libdw_in_section aware that the CU is fake
> and the check isn't needed. In which case we probably need to add some
> flag "fake" to the CU and pass the CU to the various __libdw_read_*
> functions to disable that sanity check in READ_AND_RELOCATE.

We don't set type_offset on the fake_cu, but we read it via cu_sec_idx when 
calling __libdw_read_offset from dwarf_formstring. This seems wrong.

Ulf


Re: Problems with dwarf-getmacros test

2017-05-09 Thread Ulf Hermann

Do you have the whole call stack of that failed __libdw_read_offset
call? Which source line in tests/dwarf-getmacros.c prints the "(null)"?


Actually I just managed to catch a backtrace by inserting an infinite 
loop at the point where it would normally return -1. It turns out the 
address we are looking for is not in the IDX_debug_info section, but in 
the IDX_debug_macro section (which by itself isn't all that surprising, 
but if that is a rule, how can this code ever work??).


Thread 1 (Thread 220048.0x35af0):
#0  0x70ec311e in __libdw_read_offset (sec_ret=8, size=1, 
ret=, width=4, addr=0xe90089 ".\027", sec_index=0, 
dbg_ret=, dbg=) at 
D:/qtcreator-super-master/elfutils/libdw/libdwP.h:712

orig_addr = 
#1  dwarf_formstring (attrp=attrp@entry=0x6af950) at 
D:/qtcreator-super-master/elfutils/libdw/dwarf_formstring.c:69

dbg = 
dbg_ret = 
off = 
#2  0x70ecc709 in dwarf_macro_param2 
(macro=macro@entry=0x6afaa0, paramp=paramp@entry=0x0, 
strp=strp@entry=0x6af9f0) at 
D:/qtcreator-super-master/elfutils/libdw/dwarf_macro_param2.c:50
param = {code = 8473, form = 14, valp = 0xe90089 ".\027", cu = 
0x6afac0}
#3  0x00401784 in mac (macro=0x6afaa0, dbg=0xf926f0) at 
D:/qtcreator-super-master/elfutils/tests/dwarf-getmacros.c:87

value = 0xf927b8 "p?ù"
level = 0
opcode = 5
#4  0x70ecbf1d in read_macros (dbg=dbg@entry=0xf926f0, 
sec_index=sec_index@entry=10, macoff=macoff@entry=0, callback=out>, callback@entry=0x4015e0 , arg=, 
offset=, offset@entry=0, accept_0xff=, 
accept_0xff@entry=false, cudie=, cudie@entry=0x6afe10) at 
D:/qtcreator-super-master/elfutils/libdw/dwarf_getmacros.c:409

opcode = 
idx = 
attributesp = 
nattributes = {{code = 8473, form = 15, valp = 0xe90088 
"\001.\027", cu = 0x6afac0}, {code = 8473, form = 14, valp = 0xe90089 
".\027", cu = 0x6afac0}, {code = 0, form = 0, valp = 0x0, cu = 0x0}, 
{code = 0, form = 0, valp = 0x0, cu = 0x0}, {code = 0, form = 0, valp = 
0x0, cu = 0x0}, {code = 0, form = 0, valp = 0x0, cu = 0x0}, {code = 0, 
form = 0, valp = 0x0, cu = 0x0}, {code = 0, form = 0, valp = 0x0, cu = 0x0}}
fake_cu = {dbg = 0xf926f0, start = 0, end = 0, address_size = 0 
'\000', offset_size = 4 '\004', version = 4, type_offset = 0, type_sig8 
= 0, abbrev_hash = {size = 0, filled = 0, table = 0x0}, 
orig_abbrev_offset = 0, last_abbrev_offset = 0, lines = 0x0, files = 
0x0, locs = 0x0, startp = 0xe90087, endp = 0xe90ac4}

attributes = 
macro = {table = 0xf928b0, attributes = 0x6afb40, opcode = 5 
'\005'}

res = 
d = 
startp = 
endp = 
readp = 0xe9008d "\005\001î!"
#5  0x70ecc302 in gnu_macros_getmacros_off (dbg=0xf926f0, 
macoff=0, callback=callback@entry=0x4015e0 , 
arg=arg@entry=0xf926f0, offset=offset@entry=0, 
accept_0xff=accept_0xff@entry=false, cudie=cudie@entry=0x6afe10) at 
D:/qtcreator-super-master/elfutils/libdw/dwarf_getmacros.c:478

No locals.
#6  0x70ecc502 in dwarf_getmacros (cudie=cudie@entry=0x6afe10, 
callback=callback@entry=0x4015e0 , arg=arg@entry=0xf926f0, token=0) 
at D:/qtcreator-super-master/elfutils/libdw/dwarf_getmacros.c:564

macoff = 0
accept_0xff = false
offset = 
#7  0x00409a36 in main (argc=3, argv=) at 
D:/qtcreator-super-master/elfutils/tests/dwarf-getmacros.c:133

off = 
name = 
cuoff = 
new_style = 
fd = 
dbg = 0xf926f0
cudie_mem = {addr = 0xf93a5b, cu = 0xf92810, abbrev = 0xf92890, 
padding__ = 0}

cudie = 0x6afe10


Re: Problems with dwarf-getmacros test

2017-05-09 Thread Ulf Hermann

Do you have the whole call stack of that failed __libdw_read_offset
call? Which source line in tests/dwarf-getmacros.c prints the "(null)"?


Actually I just managed to catch a backtrace by inserting an infinite
loop at the point where it would normally return -1. It turns out the
address we are looking for is not in the IDX_debug_info section, but in
the IDX_debug_macro section (which by itself isn't all that surprising,
but if that is a rule, how can this code ever work??).


The likely solution of the mystery is this piece of code in 
__libdw_in_section:


if (unlikely (addr < data->d_buf)
  || unlikely (data->d_size - (addr - data->d_buf) < size))
{
  __libdw_seterrno (DWARF_E_INVALID_OFFSET);
  return false;
}

If addr < data->d_buf we fail immediately. That's what's happening to me 
sometimes. However, if addr > data->d_buf, then it's likely much bigger, 
as the allocations of the different sections are unrelated. data->d_size 
is always 159 in this test case. Therefore, data->d_size - (addr - 
data->d_buf) will produce something negative, which then gets compared 
to a size_t, forcing it to overflow. size is 4, so this case never gets 
detected.


Ulf