On Sat, 07 Jan 2023 at 22:07:31 +0000, Matthew Vernon wrote:
> I'm struggling a bit here; I wanted to try and bisect pcre2 upstream commits
> to see where this bug might have been introduced (or get to the bottom of
> what link-grammar's test is doing wrong, I see they've been troublesome in
> the past cf #975696).

I tried using AddressSanitizer to get more information, which might be
helpful. I used a bookworm podman container for this, and for simplicity
I'm using uid 0 in the container, but you could probably do the same in a
Docker container, a VM or a chroot, and as an unprivileged user.

    $ podman run -it --rm debian:bookworm-slim
    # echo "deb-src http://deb.debian.org/debian bookworm main" >> /etc/apt/sources.list
    # apt update
    # apt install --no-install-recommends devscripts build-essential
    # cd /root
    # apt source pcre2 link-grammar
    # echo "deb-src http://deb.debian.org/debian sid main" >> /etc/apt/sources.list
    # apt update
    # apt source pcre2
    # apt build-dep pcre2 link-grammar
    # cd /root/pcre2-10.40
    # debuild -eASAN_OPTIONS=detect_leaks=0 -eDEB_BUILD_OPTIONS="noopt sanitize=+address,+undefined" -us -uc -b
    # cd /root/pcre2-10.42
    # debuild -eASAN_OPTIONS=detect_leaks=0 -eDEB_BUILD_OPTIONS="noopt sanitize=+address,+undefined" -us -uc -b
    # cd /root/link-grammar-5.11.0~dfsg
    # debuild -eASAN_OPTIONS=detect_leaks=0 -eDEB_BUILD_OPTIONS="noopt sanitize=+address,+undefined" -eLD_PRELOAD=libasan.so.8 -eLD_LIBRARY_PATH=/root/pcre2-10.40/debian/tmp/usr/lib/x86_64-linux-gnu -us -uc -b
    ... tests pass ...
    # debuild -eASAN_OPTIONS=detect_leaks=0 -eDEB_BUILD_OPTIONS="noopt sanitize=+address,+undefined" -eLD_PRELOAD=libasan.so.8 -eLD_LIBRARY_PATH=/root/pcre2-10.42/debian/tmp/usr/lib/x86_64-linux-gnu -us -uc -b
    ... tests fail ...

(Note that the order of options to debuild is significant: the debuild
option -e must come before dpkg-buildpackage options like -us, -uc, -b.)

With those steps, I get AddressSanitizer reporting a heap buffer overflow
when the multi-thread test tries to match a regular expression, and a
similar error in the multi-java test.

A roughly equivalent setup with upstream pcre2 also "works" (by which I
mean, fails). AddressSanitizer output below.

    # apt install git
    # git clone https://github.com/PCRE2Project/pcre2
    # cd /root/pcre2
    # git checkout pcre2-10.42
    # ./autogen.sh
    # ./configure CFLAGS="-fsanitize=address -fno-omit-frame-pointer -fsanitize=undefined" --prefix=/usr
    # make
    # ( cd /root/link-grammar-5.11.0~dfsg && debuild -eASAN_OPTIONS=detect_leaks=0 -eDEB_BUILD_OPTIONS="noopt sanitize=+address,+undefined" -eLD_PRELOAD=libasan.so.8 -eLD_LIBRARY_PATH=/root/pcre2/.libs -us -uc -b )
    ... tests fail ...

The same setup with pcre2-10.40 passes tests.

The link-grammar build later fails when using the sanitizers, because it
uses d-shlibs which is basically an anti-pattern (#902605), but that's
enough to be able to bisect (in progress).

    smcv

===============================================
   link-grammar 5.11.0: tests/test-suite.log
===============================================

# TOTAL: 5
# PASS: 3
# SKIP: 0
# XFAIL: 0
# FAIL: 2
# XPASS: 0
# ERROR: 0

.. contents:: :depth: 2

FAIL: multi-thread
==================

=================================================================
==107299==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x60e000000708 at pc 0x7f8c46092bb4 bp 0x7f8c3cb87800 sp 0x7f8c3cb877f8
READ of size 1 at 0x60e000000708 thread T10
    #0 0x7f8c46092bb3 in match (/root/pcre2/.libs/libpcre2-8.so.0+0x492bb3)
    #1 0x7f8c460b4eac in pcre2_match_8 (/root/pcre2/.libs/libpcre2-8.so.0+0x4b4eac)
    #2 0x7f8c474930e3 in reg_match dict-common/regex-morph.c:219
    #3 0x7f8c47494753 in match_regex dict-common/regex-morph.c:405
    #4 0x7f8c475b2997 in regex_guess tokenize/tokenize.c:413
    #5 0x7f8c475ca258 in separate_word tokenize/tokenize.c:2709
    #6 0x7f8c475cdf73 in separate_sentence tokenize/tokenize.c:3116
    #7 0x7f8c4745fc07 in sentence_split link-grammar/api.c:494
    #8 0x55b9169165b8 in parse_one_sent tests/multi-thread.cc:34
    #9 0x55b9169177e9 in parse_sents tests/multi-thread.cc:119
    #10 0x55b9169209c2 in void std::__invoke_impl<void, void (*)(Dictionary_s*, Parse_Options_s*, int, int), Dictionary_s*, Parse_Options_s*, int, int>(std::__invoke_other, void (*&&)(Dictionary_s*, Parse_Options_s*, int, int), Dictionary_s*&&, Parse_Options_s*&&, int&&, int&&) /usr/include/c++/12/bits/invoke.h:61
    #11 0x55b91692056f in std::__invoke_result<void (*)(Dictionary_s*, Parse_Options_s*, int, int), Dictionary_s*, Parse_Options_s*, int, int>::type std::__invoke<void (*)(Dictionary_s*, Parse_Options_s*, int, int), Dictionary_s*, Parse_Options_s*, int, int>(void (*&&)(Dictionary_s*, Parse_Options_s*, int, int), Dictionary_s*&&, Parse_Options_s*&&, int&&, int&&) /usr/include/c++/12/bits/invoke.h:96
    #12 0x55b91691fffb in void std::thread::_Invoker<std::tuple<void (*)(Dictionary_s*, Parse_Options_s*, int, int), Dictionary_s*, Parse_Options_s*, int, int> >::_M_invoke<0ul, 1ul, 2ul, 3ul, 4ul>(std::_Index_tuple<0ul, 1ul, 2ul, 3ul, 4ul>) /usr/include/c++/12/bits/std_thread.h:252
    #13 0x55b91691fc59 in std::thread::_Invoker<std::tuple<void (*)(Dictionary_s*, Parse_Options_s*, int, int), Dictionary_s*, Parse_Options_s*, int, int> >::operator()() /usr/include/c++/12/bits/std_thread.h:259
    #14 0x55b91691fc11 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (*)(Dictionary_s*, Parse_Options_s*, int, int), Dictionary_s*, Parse_Options_s*, int, int> > >::_M_run() /usr/include/c++/12/bits/std_thread.h:210
    #15 0x7f8c46ed44a2 (/lib/x86_64-linux-gnu/libstdc++.so.6+0xd44a2)
    #16 0x7f8c470a7fd3 (/lib/x86_64-linux-gnu/libc.so.6+0x88fd3)
    #17 0x7f8c4712866b (/lib/x86_64-linux-gnu/libc.so.6+0x10966b)

0x60e000000708 is located 12 bytes to the right of 156-byte region [0x60e000000660,0x60e0000006fc)
allocated by thread T0 here:
    #0 0x7f8c47ab89cf in __interceptor_malloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:69
    #1 0x7f8c45f46659 in default_malloc (/root/pcre2/.libs/libpcre2-8.so.0+0x346659)
    #2 0x7f8c45f40985 in pcre2_compile_8 (/root/pcre2/.libs/libpcre2-8.so.0+0x340985)
    #3 0x7f8c47492a23 in reg_comp dict-common/regex-morph.c:191
    #4 0x7f8c47494490 in compile_regexs dict-common/regex-morph.c:373
    #5 0x7f8c4749eae8 in load_regexes dict-file/dictionary.c:108
    #6 0x7f8c474a010f in dictionary_six_str dict-file/dictionary.c:232
    #7 0x7f8c474a0710 in dictionary_six dict-file/dictionary.c:281
    #8 0x7f8c474a084d in dictionary_create_from_file dict-file/dictionary.c:307
    #9 0x7f8c4746e9e5 in dictionary_create_lang dict-common/dict-common.c:134
    #10 0x55b916917b8a in main tests/multi-thread.cc:134
    #11 0x7f8c47046189 (/lib/x86_64-linux-gnu/libc.so.6+0x27189)

Thread T10 created by T0 here:
    #0 0x7f8c47a49726 in __interceptor_pthread_create ../../../../src/libsanitizer/asan/asan_interceptors.cpp:207
    #1 0x7f8c46ed4578 in std::thread::_M_start_thread(std::unique_ptr<std::thread::_State, std::default_delete<std::thread::_State> >, void (*)()) (/lib/x86_64-linux-gnu/libstdc++.so.6+0xd4578)
    #2 0x55b9169180df in main tests/multi-thread.cc:158
    #3 0x7f8c47046189 (/lib/x86_64-linux-gnu/libc.so.6+0x27189)

SUMMARY: AddressSanitizer: heap-buffer-overflow (/root/pcre2/.libs/libpcre2-8.so.0+0x492bb3) in match
Shadow bytes around the buggy address:
  0x0c1c7fff8090: fa fa fa fa 00 00 00 00 00 00 00 00 00 00 00 00
  0x0c1c7fff80a0: 00 00 00 00 00 00 00 00 fa fa fa fa fa fa fa fa
  0x0c1c7fff80b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0c1c7fff80c0: 00 00 00 00 fa fa fa fa fa fa fa fa 00 00 00 00
  0x0c1c7fff80d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 04
=>0x0c1c7fff80e0: fa[fa]fa fa fa fa fa fa 00 00 00 00 00 00 00 00
  0x0c1c7fff80f0: 00 00 00 00 00 00 00 00 00 00 00 04 fa fa fa fa
  0x0c1c7fff8100: fa fa fa fa 00 00 00 00 00 00 00 00 00 00 00 00
  0x0c1c7fff8110: 00 00 00 00 00 00 00 04 fa fa fa fa fa fa fa fa
  0x0c1c7fff8120: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0c1c7fff8130: 00 00 00 02 fa fa fa fa fa fa fa fa 00 00 00 00
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable: 00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone: fa
  Freed heap region: fd
  Stack left redzone: f1
  Stack mid redzone: f2
  Stack right redzone: f3
  Stack after return: f5
  Stack use after scope: f8
  Global redzone: f9
  Global init order: f6
  Poisoned by user: f7
  Container overflow: fc
  Array cookie: ac
  Intra object redzone: bb
  ASan internal: fe
  Left alloca redzone: ca
  Right alloca redzone: cb
==107299==ABORTING
FAIL multi-thread (exit status: 1)

FAIL: multi-java
================

link-grammar: Warning: JNI: locale ANSI_X3.4-1968 was not UTF-8; force-setting to en_US.UTF-8
link-grammar: Info: JNI: dictionary language 'en' version 5.11.0
=================================================================
==107369==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x611000004af8 at pc 0x7fbc4de92bb4 bp 0x7fbc47348b70 sp 0x7fbc47348b68
READ of size 1 at 0x611000004af8 thread T10
    #0 0x7fbc4de92bb3 in match (/root/pcre2/.libs/libpcre2-8.so.0+0x492bb3)
    #1 0x7fbc4deb4eac in pcre2_match_8 (/root/pcre2/.libs/libpcre2-8.so.0+0x4b4eac)
    #2 0x7fbc4f2930e3 in reg_match dict-common/regex-morph.c:219
    #3 0x7fbc4f294eed in matchspan_regex dict-common/regex-morph.c:437
    #4 0x7fbc4f3b263a in is_afdict_punc tokenize/tokenize.c:402
    #5 0x7fbc4f3b56b4 in issue_word_alternative tokenize/tokenize.c:651
    #6 0x7fbc4f3b97f4 in remqueue_gword tokenize/tokenize.c:1021
    #7 0x7fbc4f3cdf4e in separate_sentence tokenize/tokenize.c:3103
    #8 0x7fbc4f25fc07 in sentence_split link-grammar/api.c:494
    #9 0x7fbc4f261b87 in sentence_parse link-grammar/api.c:679
    #10 0x7fbc50098cfc (/root/link-grammar-5.11.0~dfsg/bindings/java-jni/.libs/liblink-grammar-java.so.5+0x3cfc)
    #11 0x7fbc50098eeb in unit_test_jparse (/root/link-grammar-5.11.0~dfsg/bindings/java-jni/.libs/liblink-grammar-java.so.5+0x3eeb)
    #12 0x5572ef94c3f6 in parse_one_sent tests/multi-java.cc:32
    #13 0x5572ef94c884 in parse_sents tests/multi-java.cc:76
    #14 0x5572ef954318 in void std::__invoke_impl<void, void (*)(int, int), int, int>(std::__invoke_other, void (*&&)(int, int), int&&, int&&) /usr/include/c++/12/bits/invoke.h:61
    #15 0x5572ef954083 in std::__invoke_result<void (*)(int, int), int, int>::type std::__invoke<void (*)(int, int), int, int>(void (*&&)(int, int), int&&, int&&) /usr/include/c++/12/bits/invoke.h:96
    #16 0x5572ef953d21 in void std::thread::_Invoker<std::tuple<void (*)(int, int), int, int> >::_M_invoke<0ul, 1ul, 2ul>(std::_Index_tuple<0ul, 1ul, 2ul>) /usr/include/c++/12/bits/std_thread.h:252
    #17 0x5572ef953aeb in std::thread::_Invoker<std::tuple<void (*)(int, int), int, int> >::operator()() /usr/include/c++/12/bits/std_thread.h:259
    #18 0x5572ef953aa3 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (*)(int, int), int, int> > >::_M_run() /usr/include/c++/12/bits/std_thread.h:210
    #19 0x7fbc4ecd44a2 (/lib/x86_64-linux-gnu/libstdc++.so.6+0xd44a2)
    #20 0x7fbc4eea7fd3 (/lib/x86_64-linux-gnu/libc.so.6+0x88fd3)
    #21 0x7fbc4ef2866b (/lib/x86_64-linux-gnu/libc.so.6+0x10966b)

0x611000004af8 is located 31 bytes to the right of 217-byte region [0x611000004a00,0x611000004ad9)
allocated by thread T0 here:
    #0 0x7fbc4f8b89cf in __interceptor_malloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:69
    #1 0x7fbc4dd46659 in default_malloc (/root/pcre2/.libs/libpcre2-8.so.0+0x346659)
    #2 0x7fbc4dd40985 in pcre2_compile_8 (/root/pcre2/.libs/libpcre2-8.so.0+0x340985)
    #3 0x7fbc4f292a23 in reg_comp dict-common/regex-morph.c:191
    #4 0x7fbc4f294490 in compile_regexs dict-common/regex-morph.c:373
    #5 0x7fbc4f278e9d in afdict_init dict-common/dict-impl.c:765
    #6 0x7fbc4f2a0253 in dictionary_six_str dict-file/dictionary.c:240
    #7 0x7fbc4f2a0710 in dictionary_six dict-file/dictionary.c:281
    #8 0x7fbc4f2a084d in dictionary_create_from_file dict-file/dictionary.c:307
    #9 0x7fbc4f26e9e5 in dictionary_create_lang dict-common/dict-common.c:134
    #10 0x7fbc50098930 (/root/link-grammar-5.11.0~dfsg/bindings/java-jni/.libs/liblink-grammar-java.so.5+0x3930)
    #11 0x7fbc50099218 in Java_org_linkgrammar_LinkGrammar_init (/root/link-grammar-5.11.0~dfsg/bindings/java-jni/.libs/liblink-grammar-java.so.5+0x4218)
    #12 0x5572ef94ca93 in main tests/multi-java.cc:85
    #13 0x7fbc4ee46189 (/lib/x86_64-linux-gnu/libc.so.6+0x27189)

Thread T10 created by T0 here:
    #0 0x7fbc4f849726 in __interceptor_pthread_create ../../../../src/libsanitizer/asan/asan_interceptors.cpp:207
    #1 0x7fbc4ecd4578 in std::thread::_M_start_thread(std::unique_ptr<std::thread::_State, std::default_delete<std::thread::_State> >, void (*)()) (/lib/x86_64-linux-gnu/libstdc++.so.6+0xd4578)
    #2 0x5572ef94cb80 in main tests/multi-java.cc:94
    #3 0x7fbc4ee46189 (/lib/x86_64-linux-gnu/libc.so.6+0x27189)

SUMMARY: AddressSanitizer: heap-buffer-overflow (/root/pcre2/.libs/libpcre2-8.so.0+0x492bb3) in match
Shadow bytes around the buggy address:
  0x0c227fff8900: 00 00 00 00 00 00 00 00 01 fa fa fa fa fa fa fa
  0x0c227fff8910: fa fa fa fa fa fa fa fa 00 00 00 00 00 00 00 00
  0x0c227fff8920: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0c227fff8930: 00 00 00 01 fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c227fff8940: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x0c227fff8950: 00 00 00 00 00 00 00 00 00 00 00 01 fa fa fa[fa]
  0x0c227fff8960: fa fa fa fa fa fa fa fa fd fd fd fd fd fd fd fd
  0x0c227fff8970: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c227fff8980: fd fd fd fd fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c227fff8990: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0c227fff89a0: 00 00 00 00 00 00 00 00 00 00 00 01 fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable: 00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone: fa
  Freed heap region: fd
  Stack left redzone: f1
  Stack mid redzone: f2
  Stack right redzone: f3
  Stack after return: f5
  Stack use after scope: f8
  Global redzone: f9
  Global init order: f6
  Poisoned by user: f7
  Container overflow: fc
  Array cookie: ac
  Intra object redzone: bb
  ASan internal: fe
  Left alloca redzone: ca
  Right alloca redzone: cb
==107369==ABORTING
FAIL multi-java (exit status: 1)

============================================================================
Testsuite summary for link-grammar 5.11.0
============================================================================
# TOTAL: 5
# PASS: 3
# SKIP: 0
# XFAIL: 0
# FAIL: 2
# XPASS: 0
# ERROR: 0
============================================================================
See tests/test-suite.log
Please report to https://github.com/opencog/link-grammar
============================================================================
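
For the bisect mentioned above, each step could probably be automated with
a small script for "git bisect run". What follows is only a sketch, not
necessarily what was actually used: the script name bisect-step.sh, the
function names and the LG_DIR/LG_LOG variables (including the assumed
location of test-suite.log) are made up for illustration, while the paths
and options are the ones from the upstream-git steps. It judges each commit
by looking for the AddressSanitizer report in the test log rather than by
debuild's exit status, because the link-grammar build fails later on either
way (the d-shlibs issue).

```shell
#!/bin/sh
# bisect-step.sh (hypothetical name): one step for "git bisect run".
# Exit status as understood by git bisect: 0 = good, 1 = bad, 125 = skip.

# Assumption: the test log ends up here after debuild; adjust if it does not.
LG_DIR=${LG_DIR:-/root/link-grammar-5.11.0~dfsg}
LG_LOG=${LG_LOG:-$LG_DIR/tests/test-suite.log}

# Turn a test-suite.log into a bisect verdict.
classify_log() {
    [ -r "$1" ] || return 125    # no log: the tests never ran, so skip
    if grep -q 'AddressSanitizer: heap-buffer-overflow' "$1"; then
        return 1                 # the overflow is reported: bad
    fi
    return 0                     # no overflow reported: good
}

# Build the checked-out pcre2 commit, then run link-grammar's tests against it.
bisect_step() {
    cd /root/pcre2 || return 125
    ./autogen.sh &&
        ./configure CFLAGS="-fsanitize=address -fno-omit-frame-pointer -fsanitize=undefined" --prefix=/usr &&
        make || return 125       # this pcre2 commit does not build: skip
    rm -f "$LG_LOG"              # never judge a stale log
    # debuild's exit status is ignored on purpose (d-shlibs failure later on).
    ( cd "$LG_DIR" &&
      debuild -eASAN_OPTIONS=detect_leaks=0 \
              -eDEB_BUILD_OPTIONS="noopt sanitize=+address,+undefined" \
              -eLD_PRELOAD=libasan.so.8 \
              -eLD_LIBRARY_PATH=/root/pcre2/.libs -us -uc -b ) || true
    classify_log "$LG_LOG"
}

# Only run the step when executed under the assumed script name.
case "${0##*/}" in
    bisect-step.sh) bisect_step ;;
esac
```

Saved as /root/bisect-step.sh and made executable, it would be driven with
something like:

    # cd /root/pcre2
    # git bisect start pcre2-10.42 pcre2-10.40
    # git bisect run /root/bisect-step.sh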