https://bugs.kde.org/show_bug.cgi?id=339160
Anatolii <toliakpur...@gmail.com> changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |toliakpur...@gmail.com --- Comment #4 from Anatolii <toliakpur...@gmail.com> --- Hello. I encountered to the same problem while analyzing Nomad application via Callgrind tool. ## Bug description Valgrind versions tested: - valgrind-3.24.0 (Arch Linux, pacman) commit `2d09ef48e` from the master branch. Reproduction command and the resulting log: ``` # valgrind --tool=callgrind /tmp/nomad -h ... Callgrind: threads.c:247 (vgCallgrind_post_signal): Assertion 'sigNum == vgCallgrind_current_state.sig' failed. host stacktrace: ==486603== at 0x5801F79F: show_sched_status_wrk.lto_priv.0 (m_libcassert.c:407) ==486603== by 0x5801F7FF: report_and_quit (m_libcassert.c:478) ==486603== by 0x5801F992: vgPlain_assert_fail (m_libcassert.c:544) ==486603== by 0x58023021: vgCallgrind_post_signal (threads.c:247) ==486603== by 0x5807A256: vgSysWrap_amd64_linux_sys_rt_sigreturn_before (sigframe-amd64-linux.c:599) ==486603== by 0x581DFD54: handle_syscall.isra.0 (syswrap-main.c:2358) ==486603== by 0x5806956E: vgPlain_scheduler (scheduler.c:1582) ==486603== by 0x580B30A9: run_a_thread_NORETURN (syswrap-linux.c:102) ``` ## Localization I attempted to isolate the location of the bug and to determine the root cause. In my opinion, I achieved some noticeable results. I have inserted additional debug prints (using `VG_(printf)`) into the functions listed below. Therefore, I am able to capture more necessary context. See attached file [debug-prints-339160-commit-2d09ef48e.patch](https://gist.github.com/Toliak/dc9d49424f2ce27cbc7a00b7de3dc2eb#file-debug-prints-339160-commit-2d09ef48e-patch). - [unwind_thread inside the loop](https://sourceware.org/git/?p=valgrind.git;a=blob;f=callgrind/main.c;h=39aa230ed2716a41f1c49bf2f7cc44167bdd18ff;hb=2d09ef48e22061b143cd4765e8defb63f5d5024a#l1432) - [CLG_(pop_call_stack)](https://sourceware.org/git/?p=valgrind.git;a=blob;f=callgrind/callstack.c;h=fc1d6eeab130f94c382062c7071e15311cb953f9;hb=2d09ef48e22061b143cd4765e8defb63f5d5024a#l321) - [CLG_(pre_signal)](https://sourceware.org/git/?p=valgrind.git;a=blob;f=callgrind/threads.c;h=624588a2694927409feeba832335043d181b31b9;hb=2d09ef48e22061b143cd4765e8defb63f5d5024a#l201) - [CLG_(run_post_signal_on_call_stack_bottom)](https://sourceware.org/git/?p=valgrind.git;a=blob;f=callgrind/threads.c;h=624588a2694927409feeba832335043d181b31b9;hb=2d09ef48e22061b143cd4765e8defb63f5d5024a#l234) - [CLG_(post_signal)](https://sourceware.org/git/?p=valgrind.git;a=blob;f=callgrind/threads.c;h=624588a2694927409feeba832335043d181b31b9;hb=2d09ef48e22061b143cd4765e8defb63f5d5024a#l247) - [VG_(sigframe_destroy)](https://sourceware.org/git/?p=valgrind.git;a=blob;f=coregrind/m_sigframe/sigframe-amd64-linux.c;h=0e24306096c7577f959b5c38c4af143e0f77a93d;hb=2d09ef48e22061b143cd4765e8defb63f5d5024a#l598) The callgrind output with additional debug print: ``` # ./dist/bin/valgrind --tool=callgrind /tmp/nomad -h ==575331== Callgrind, a call-graph generating cache profiler ==575331== Copyright (C) 2002-2017, and GNU GPL'd, by Josef Weidendorfer et al. ==575331== Using Valgrind-3.25.0.GIT and LibVEX; rerun with -h for copyright info ==575331== Command: /tmp/nomad -h ==575331== ==575331== For interactive control, run 'callgrind_control -h'. BB# 4793874 -- pre_signal(TID 1, sig 23, alt_st yes) === pop_call_stack current_tid=1 CLG_(current_state).sig=23 === run_post_signal_on_call_stack_bottom -- post_signal(TID 1, sig 23) === post_signal after CLG_(switch_thread)(tid); current_tid=1 tid=1 sigNum=23 CLG_(current_state).sig=23 === post_signal before exec_state_restore es->sig=0 === sigframe_destroy before VG_TRACK( post_deliver_signal, tid, sigNo ) BB# 4793961 -- post_signal(TID 1, sig 23) === post_signal after CLG_(switch_thread)(tid); current_tid=1 tid=1 sigNum=23 CLG_(current_state).sig=0 0x000000000003dcd0 Callgrind: threads.c:262 (vgCallgrind_post_signal): Assertion 'sigNum == vgCallgrind_current_state.sig' failed. ``` Two points to consider: 1. There is the second `post_signal` call before the assertion fail. During the second call variable `CLG_(current_state)` appears reset, but the `sigNum` (post_signal argument) still retains old value "23". 2. `post_signal` is called twice. The first trace is `pop_call_stack -> run_post_signal_on_call_stack_bottom -> post_signal`. And the second: `sigframe_destroy -> track_post_deliver_signal -> post_signal`. I have not investigated this bug further, but based on the information above, I suppose that the signal is processed (called `post_signal`) twice, which is incorrect behavior. The first processing may be valid, but the second processing causes Callgrind to crash with an assertion error. ## Workaround I wrote a workaround in order to bypass the second `post_signal` (wrong one), that is being called from the `sigframe_destroy`. I added a new function, `CLG_(post_signal_delivered)`, which wraps `CLG_(post_signal)` and checks is `CLG_(current_state).sig` equal to zero. I have prepared a patch-file (tested only on commit `2d09ef48e`). See file [bug-339160-workaround-commit-2d09ef48e] (пер. https://gist.github.com/Toliak/dc9d49424f2ce27cbc7a00b7de3dc2eb#file-bug-339160-workaround-commit-2d09ef48e-patch). The results: ``` # ./dist/bin/valgrind --tool=callgrind /tmp/nomad -h ... Triggered workaround to bypass CLG_(post_signal), see https://bugs.kde.org/show_bug.cgi?id=339160 Triggered workaround to bypass CLG_(post_signal), see https://bugs.kde.org/show_bug.cgi?id=339160 Triggered workaround to bypass CLG_(post_signal), see https://bugs.kde.org/show_bug.cgi?id=339160 Usage: nomad [-version] [-help] [-autocomplete-(un)install] <command> [args] Common commands: run Run a new job or update an existing job ...... agent Runs a Nomad agent Other commands: acl Interact with ACL policies and tokens action Run a pre-defined command from a given context ...... version Prints the Nomad version volume Interact with volumes ==639712== ==639712== Events : Ir ==639712== Collected : 619290724 ==639712== ==639712== I refs: 619,290,724 ``` ## Validation and discussion I think my workaround cannot be treated as a bugfix. So, I provided patch-file here without making Merge Request. However, if you encounter the same problem, you can apply the patch and test the Callgrind behavior. To fully resolve the issue, the next steps should focus on identifying why `post_signal` is being called twice. -- You are receiving this mail because: You are watching all bug changes.