On Fri, Jul 5, 2024 at 11:41 AM lijiang <liji...@redhat.com> wrote:
> On Fri, Jul 5, 2024 at 9:25 AM lijiang <liji...@redhat.com> wrote: > >> On Fri, Jul 5, 2024 at 7:29 AM Tao Liu <l...@redhat.com> wrote: >> >>> Hi Lianbo, >>> >>> On Mon, Jul 1, 2024 at 9:15 PM lijiang <liji...@redhat.com> wrote: >>> > >>> > Hi, Tao >>> > On Thu, Jun 27, 2024 at 12:08 PM Tao Liu <l...@redhat.com> wrote: >>> >> >>> >> Hi Lianbo, >>> >> >>> >> Thanks for the patch. >>> >> >>> >> On Fri, Jun 14, 2024 at 3:00 PM Lianbo Jiang <liji...@redhat.com> >>> wrote: >>> >> > >>> >> > Sometimes, in a production environment, there are still some vmcores >>> >> > that are incomplete, such as partial header or the data is >>> corrupted. >>> >> > When crash tool attempts to parse such vmcores, it may fail as >>> below: >>> >> > >>> >> > $ ./crash --osrelease vmcore >>> >> > Bus error (core dumped) >>> >> > >>> >> > or >>> >> > >>> >> > $ crash vmlinux vmcore >>> >> > ... >>> >> > Bus error (core dumped) >>> >> > $ >>> >> > >>> >> > The gdb sees that crash tool reads out a null bitmap from the >>> header in >>> >> > this vmcore, when executing memcpy(), emits a SIGBUS error as below: >>> >> > >>> >> > $ gdb /home/lijiang/src/crash/crash /tmp/core.126301 >>> >> > Core was generated by `./crash --osrelease >>> /home/lijiang/src/39317/vmcore'. >>> >> > Program terminated with signal SIGBUS, Bus error. >>> >> > #0 __memcpy_evex_unaligned_erms () at >>> ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:831 >>> >> > 831 LOAD_ONE_SET((%rsi), PAGE_SIZE, %VMM(4), %VMM(5), >>> %VMM(6), %VMM(7)) >>> >> > (gdb) bt >>> >> > #0 __memcpy_evex_unaligned_erms () at >>> ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:831 >>> >> > #1 0x0000000000651096 in read_dump_header (file=0x7ffc59ddff5f >>> "/home/lijiang/src/39317/vmcore") at diskdump.c:820 >>> >> > #2 0x0000000000651cf3 in is_diskdump (file=0x7ffc59ddff5f >>> "/home/lijiang/src/39317/vmcore") at diskdump.c:1042 >>> >> > #3 0x0000000000502ac9 in get_osrelease (dumpfile=0x7ffc59ddff5f >>> "/home/lijiang/src/39317/vmcore") at main.c:1938 >>> >> > #4 0x00000000004fb2e8 in main (argc=3, argv=0x7ffc59dde3a8) at >>> main.c:271 >>> >> > (gdb) frame 1 >>> >> > #1 0x0000000000651096 in read_dump_header (file=0x7ffc59ddff5f >>> "/home/lijiang/src/39317/vmcore") at diskdump.c:820 >>> >> > 820 memcpy(dd->dumpable_bitmap, dd->bitmap + >>> bitmap_len/2, >>> >> > (gdb) p dd->dumpable_bitmap >>> >> > $1 = 0x7f8e89800010 "" >>> >> > (gdb) p dd->bitmap >>> >> > $2 = 0x7f8e87e09000 "" >>> >> > (gdb) p dd->bitmap + bitmap_len/2 >>> >> > $3 = 0x7f8e88a17000 "" >>> >> > (gdb) p *(char*)(dd->bitmap+bitmap_len/2) >>> >> > $4 = 0 '\000' >>> >> > (gdb) p bitmap_len/2 >>> >> > $5 = 12640256 >>> >> > >>> >> > Let's add a sanity check for such cases to avoid causing a SIGBUS >>> error. >>> >> > >>> >> I didn't really understand the reason why the SIGBUS error occurs. Is >>> >> it because bitmap_len/2 is too large(12M), which exceeded the >>> >> dd->bitmap memory range? >>> > >>> > >>> > I did not see any limitations on the bitmap memory range. >>> > >>> > Looks like an unaligned memory address was accessed, please see here: >>> > >>> > #0 __memcpy_evex_unaligned_erms () at >>> ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:831 >>> > 831 LOAD_ONE_SET((%rsi), PAGE_SIZE, %VMM(4), %VMM(5), >>> %VMM(6), %VMM(7)) >>> > >>> > Maybe the vmcore is corrupted, anyway I have never seen it before this >>> issue. >>> > >>> >> >>> >> Why *(char*)(dd->bitmap+bitmap_len/2) == '\0' will be considered as an >>> >> error, do we expect a value like "0xff" here? >>> > >>> > >>> > No. >>> > If it is a null string, no need to copy it(or it may be an exceptional >>> string because of corrupted data). >>> > >>> I still didn't understand the check here. Assume we have a bitmap like >>> the following. The above code only checks if the starting byte is 0. >>> >> >> No, as above mentioned, the *(char*)(dd->bitmap+bitmap_len/2) == '\0' is >> checking if it is a null string, not a 0 value. They are different. >> >> What about if there are 1s after the leading 0? Do these be considered >>> as valid bitmap and should they be copied? >>> ... 0 1 1 0 1 1 0 1 1 1 1 >>> ^ >>> >> >> This is not a null string. The value of checking >> *(char*)(dd->bitmap+bitmap_len/2) == '\0' is 'false', not 'true'. >> >> >>> ... len/2 >>> >>> I understand that you want to skip the unnecessary memcpy to bypass >>> >> >> As I mentioned in the last email, If it is a null string('\0'), no need >> to copy it. Let's differentiate between 0 and '\0', they are not the same. >> >> Thanks >> Lianbo >> >> the SIGBUS error by setting a check condition. I'm worrying if the >>> check may filter out those valid bitmaps. >>> >> > Think it over again, looks like your concern is correct, Tao. > > Let's see if there is a better way to fix the current issue. > If the header is partial, the original intention might be to copy the bitmap starting from dd->bitmap(instead of dd->bitmap + bitmap_len/2), and the copy length is bitmap_len/2, just like this: diff --git a/diskdump.c b/diskdump.c index 1f7118cacfc6..8b7ec6e19c96 100644 --- a/diskdump.c +++ b/diskdump.c @@ -815,8 +815,7 @@ restart: } if (dump_is_partial(header)) - memcpy(dd->dumpable_bitmap, dd->bitmap + bitmap_len/2, - bitmap_len/2); + memcpy(dd->dumpable_bitmap, dd->bitmap, bitmap_len/2); else memcpy(dd->dumpable_bitmap, dd->bitmap, bitmap_len); But, I can not track the history of these changes according to the github log. Just a guess. With the above patch, it can print the result as below: $ ./crash --osrelease ../vmcore 4.18.0-513.24.1.el8_9.x86_64 Thanks Lianbo > Thanks > Lianbo > > > >>> Thanks, >>> Tao Liu >>> >>> >> >>> >> >>> >> I guess what we want here is to handle SIGBUS errors nicely right? why >>> >> don't we add a SIGBUS handler and process it directly? >>> > >>> > >>> > This may prevent a core dump file from being generated. >>> > And users can easily report it with a core dump file, sometimes It's >>> hard to reproduce for us. >>> > >>> > Thanks >>> > Lianbo >>> > >>> >> >>> >> Thanks, >>> >> Tao Liu >>> >> >>> >> > With the patch: >>> >> > $ crash -s vmlinux vmcore >>> >> > crash: vmcore: not a supported file format >>> >> > ... >>> >> > Enter "crash -h" for details. >>> >> > >>> >> > $ crash --osrelease vmcore >>> >> > unknown >>> >> > >>> >> > Reported-by: Buland Kumar Singh <bsi...@redhat.com> >>> >> > Signed-off-by: Lianbo Jiang <liji...@redhat.com> >>> >> > --- >>> >> > diskdump.c | 6 ++++-- >>> >> > 1 file changed, 4 insertions(+), 2 deletions(-) >>> >> > >>> >> > diff --git a/diskdump.c b/diskdump.c >>> >> > index 1f7118cacfc6..c31eab01aa05 100644 >>> >> > --- a/diskdump.c >>> >> > +++ b/diskdump.c >>> >> > @@ -814,10 +814,12 @@ restart: >>> >> > madvise(dd->bitmap, bitmap_len, MADV_WILLNEED); >>> >> > } >>> >> > >>> >> > - if (dump_is_partial(header)) >>> >> > + if (dump_is_partial(header)) { >>> >> > + if (*(char*)(dd->bitmap + bitmap_len/2) == '\0') >>> >> > + goto err; >>> >> > memcpy(dd->dumpable_bitmap, dd->bitmap + >>> bitmap_len/2, >>> >> > bitmap_len/2); >>> >> > - else >>> >> > + } else >>> >> > memcpy(dd->dumpable_bitmap, dd->bitmap, bitmap_len); >>> >> > >>> >> > dd->data_offset >>> >> > -- >>> >> > 2.45.1 >>> >> > -- >>> >> > Crash-utility mailing list -- devel@lists.crash-utility.osci.io >>> >> > To unsubscribe send an email to >>> devel-le...@lists.crash-utility.osci.io >>> >> > https://${domain_name}/admin/lists/ >>> devel.lists.crash-utility.osci.io/ >>> >> > Contribution Guidelines: >>> https://github.com/crash-utility/crash/wiki >>> >> >>> >>>
-- Crash-utility mailing list -- devel@lists.crash-utility.osci.io To unsubscribe send an email to devel-le...@lists.crash-utility.osci.io https://${domain_name}/admin/lists/devel.lists.crash-utility.osci.io/ Contribution Guidelines: https://github.com/crash-utility/crash/wiki