On Sat, 07 Jan 2023 at 22:07:31 +0000, Matthew Vernon wrote:
> I'm struggling a bit here; I wanted to try and bisect pcre2 upstream commits
> to see where this bug might have been introduced (or get to the bottom of
> what link-grammar's test is doing wrong, I see they've been troublesome in
> the past cf #975696).

I tried using AddressSanitizer to get more information, which might be
helpful. I used a bookworm podman container for this, and for simplicity
I ran as uid 0 in the container, but you could probably do the same in
a Docker container, a VM or a chroot, and as an unprivileged user.

$ podman run -it --rm debian:bookworm-slim
# apt update
# apt install --no-install-recommends devscripts build-essential
# echo "deb-src http://deb.debian.org/debian bookworm main" >> /etc/apt/sources.list
# cd /root
# apt update
# apt source pcre2 link-grammar
# echo "deb-src http://deb.debian.org/debian sid main" >> /etc/apt/sources.list
# apt update
# apt source pcre2
# apt build-dep pcre2 link-grammar
# cd /root/pcre2-10.40
# debuild -eASAN_OPTIONS=detect_leaks=0 \
    -eDEB_BUILD_OPTIONS="noopt sanitize=+address,+undefined" -us -uc -b
# cd /root/pcre2-10.42
# debuild -eASAN_OPTIONS=detect_leaks=0 \
    -eDEB_BUILD_OPTIONS="noopt sanitize=+address,+undefined" -us -uc -b
# cd /root/link-grammar-5.11.0~dfsg
# debuild -eASAN_OPTIONS=detect_leaks=0 \
    -eDEB_BUILD_OPTIONS="noopt sanitize=+address,+undefined" \
    -eLD_PRELOAD=libasan.so.8 \
    -eLD_LIBRARY_PATH=/root/pcre2-10.40/debian/tmp/usr/lib/x86_64-linux-gnu \
    -us -uc -b
  ... tests pass ...
# debuild -eASAN_OPTIONS=detect_leaks=0 \
    -eDEB_BUILD_OPTIONS="noopt sanitize=+address,+undefined" \
    -eLD_PRELOAD=libasan.so.8 \
    -eLD_LIBRARY_PATH=/root/pcre2-10.42/debian/tmp/usr/lib/x86_64-linux-gnu \
    -us -uc -b
  ... tests fail ...

(Note that the order of options to debuild is significant: debuild's own -e
options must come before dpkg-buildpackage options like -us, -uc and -b.)
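
Since the link-grammar build is repeated with only the library directory
changing, the command line could be generated by a small helper. This is
only a sketch: lg_debuild_cmd is a made-up name, not part of devscripts,
and it prints the command rather than running it so that the option order
can be checked first.

```shell
#!/bin/sh
# Hypothetical helper: print the debuild command line used above for a
# given directory containing the libpcre2 build to test against.
lg_debuild_cmd() {
    libdir=$1
    # debuild's own -e options come first, dpkg-buildpackage options last.
    printf '%s\n' "debuild -eASAN_OPTIONS=detect_leaks=0 -eDEB_BUILD_OPTIONS=\"noopt sanitize=+address,+undefined\" -eLD_PRELOAD=libasan.so.8 -eLD_LIBRARY_PATH=$libdir -us -uc -b"
}

# Print the two variants compared in this mail.
lg_debuild_cmd /root/pcre2-10.40/debian/tmp/usr/lib/x86_64-linux-gnu
lg_debuild_cmd /root/pcre2-10.42/debian/tmp/usr/lib/x86_64-linux-gnu
```

To actually run one of them, the output could be passed to eval from inside
the link-grammar source directory.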

With those steps, I get AddressSanitizer reporting a heap buffer overflow
when the multi-thread test tries to match a regular expression, and a
similar error in the multi-java test.

A roughly equivalent setup with the upstream pcre2 also "works" (by which
I mean, fails). AddressSanitizer output below.

# apt install git
# git clone https://github.com/PCRE2Project/pcre2
# cd /root/pcre2
# git checkout pcre2-10.42
# ./autogen.sh
# ./configure CFLAGS="-fsanitize=address -fno-omit-frame-pointer -fsanitize=undefined" \
    --prefix=/usr
# make
# ( cd /root/link-grammar-5.11.0~dfsg && debuild -eASAN_OPTIONS=detect_leaks=0 \
    -eDEB_BUILD_OPTIONS="noopt sanitize=+address,+undefined" \
    -eLD_PRELOAD=libasan.so.8 -eLD_LIBRARY_PATH=/root/pcre2/.libs -us -uc -b )
  ... tests fail ...

The same setup with pcre2-10.40 passes tests. The link-grammar build
later fails when using the sanitizers, because it uses d-shlibs which
is basically an anti-pattern (#902605), but that's enough to be able to
bisect (in progress).
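
For anyone wanting to repeat the bisection, one possible shape for a
"git bisect run" step is sketched below. This is an assumption about how
it could be automated, not a record of what I am actually running:
bisect_verdict is a hypothetical helper, and because debuild fails later
for unrelated reasons, the test result would have to be taken from
tests/test-suite.log rather than from debuild's exit status.

```shell
#!/bin/sh
# Hypothetical bisect step, intended to be started from /root/pcre2 with:
#   git bisect start pcre2-10.42 pcre2-10.40
#   git bisect run sh /root/bisect-step.sh
#
# Map (pcre2 build status, link-grammar test status) to the exit codes
# that "git bisect run" understands: 0 = good, 125 = skip, 1 = bad.
bisect_verdict() {
    build_status=$1
    test_status=$2
    if [ "$build_status" -ne 0 ]; then
        # Assumption: a commit that does not build tells us nothing.
        echo 125
    elif [ "$test_status" -ne 0 ]; then
        echo 1
    else
        echo 0
    fi
}

# Possible wiring (not executed here, paths as used earlier in this mail):
#   ./autogen.sh && ./configure ... && make; build=$?
#   ( cd /root/link-grammar-5.11.0~dfsg && debuild ... )
#   grep -q '^# FAIL:  0$' /root/link-grammar-5.11.0~dfsg/tests/test-suite.log
#   tests=$?
#   exit "$(bisect_verdict "$build" "$tests")"

# Show the three possible verdicts.
bisect_verdict 0 0
bisect_verdict 0 1
bisect_verdict 2 0
```

Skipping unbuildable commits with 125 keeps a broken intermediate commit
from being blamed for the regression.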

    smcv

===============================================
   link-grammar 5.11.0: tests/test-suite.log
===============================================

# TOTAL: 5
# PASS:  3
# SKIP:  0
# XFAIL: 0
# FAIL:  2
# XPASS: 0
# ERROR: 0

.. contents:: :depth: 2

FAIL: multi-thread
==================

=================================================================
==107299==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x60e000000708 at pc 0x7f8c46092bb4 bp 0x7f8c3cb87800 sp 0x7f8c3cb877f8
READ of size 1 at 0x60e000000708 thread T10
    #0 0x7f8c46092bb3 in match (/root/pcre2/.libs/libpcre2-8.so.0+0x492bb3)
    #1 0x7f8c460b4eac in pcre2_match_8 (/root/pcre2/.libs/libpcre2-8.so.0+0x4b4eac)
    #2 0x7f8c474930e3 in reg_match dict-common/regex-morph.c:219
    #3 0x7f8c47494753 in match_regex dict-common/regex-morph.c:405
    #4 0x7f8c475b2997 in regex_guess tokenize/tokenize.c:413
    #5 0x7f8c475ca258 in separate_word tokenize/tokenize.c:2709
    #6 0x7f8c475cdf73 in separate_sentence tokenize/tokenize.c:3116
    #7 0x7f8c4745fc07 in sentence_split link-grammar/api.c:494
    #8 0x55b9169165b8 in parse_one_sent tests/multi-thread.cc:34
    #9 0x55b9169177e9 in parse_sents tests/multi-thread.cc:119
    #10 0x55b9169209c2 in void std::__invoke_impl<void, void (*)(Dictionary_s*, Parse_Options_s*, int, int), Dictionary_s*, Parse_Options_s*, int, int>(std::__invoke_other, void (*&&)(Dictionary_s*, Parse_Options_s*, int, int), Dictionary_s*&&, Parse_Options_s*&&, int&&, int&&) /usr/include/c++/12/bits/invoke.h:61
    #11 0x55b91692056f in std::__invoke_result<void (*)(Dictionary_s*, Parse_Options_s*, int, int), Dictionary_s*, Parse_Options_s*, int, int>::type std::__invoke<void (*)(Dictionary_s*, Parse_Options_s*, int, int), Dictionary_s*, Parse_Options_s*, int, int>(void (*&&)(Dictionary_s*, Parse_Options_s*, int, int), Dictionary_s*&&, Parse_Options_s*&&, int&&, int&&) /usr/include/c++/12/bits/invoke.h:96
    #12 0x55b91691fffb in void std::thread::_Invoker<std::tuple<void (*)(Dictionary_s*, Parse_Options_s*, int, int), Dictionary_s*, Parse_Options_s*, int, int> >::_M_invoke<0ul, 1ul, 2ul, 3ul, 4ul>(std::_Index_tuple<0ul, 1ul, 2ul, 3ul, 4ul>) /usr/include/c++/12/bits/std_thread.h:252
    #13 0x55b91691fc59 in std::thread::_Invoker<std::tuple<void (*)(Dictionary_s*, Parse_Options_s*, int, int), Dictionary_s*, Parse_Options_s*, int, int> >::operator()() /usr/include/c++/12/bits/std_thread.h:259
    #14 0x55b91691fc11 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (*)(Dictionary_s*, Parse_Options_s*, int, int), Dictionary_s*, Parse_Options_s*, int, int> > >::_M_run() /usr/include/c++/12/bits/std_thread.h:210
    #15 0x7f8c46ed44a2  (/lib/x86_64-linux-gnu/libstdc++.so.6+0xd44a2)
    #16 0x7f8c470a7fd3  (/lib/x86_64-linux-gnu/libc.so.6+0x88fd3)
    #17 0x7f8c4712866b  (/lib/x86_64-linux-gnu/libc.so.6+0x10966b)

0x60e000000708 is located 12 bytes to the right of 156-byte region [0x60e000000660,0x60e0000006fc)
allocated by thread T0 here:
    #0 0x7f8c47ab89cf in __interceptor_malloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:69
    #1 0x7f8c45f46659 in default_malloc (/root/pcre2/.libs/libpcre2-8.so.0+0x346659)
    #2 0x7f8c45f40985 in pcre2_compile_8 (/root/pcre2/.libs/libpcre2-8.so.0+0x340985)
    #3 0x7f8c47492a23 in reg_comp dict-common/regex-morph.c:191
    #4 0x7f8c47494490 in compile_regexs dict-common/regex-morph.c:373
    #5 0x7f8c4749eae8 in load_regexes dict-file/dictionary.c:108
    #6 0x7f8c474a010f in dictionary_six_str dict-file/dictionary.c:232
    #7 0x7f8c474a0710 in dictionary_six dict-file/dictionary.c:281
    #8 0x7f8c474a084d in dictionary_create_from_file dict-file/dictionary.c:307
    #9 0x7f8c4746e9e5 in dictionary_create_lang dict-common/dict-common.c:134
    #10 0x55b916917b8a in main tests/multi-thread.cc:134
    #11 0x7f8c47046189  (/lib/x86_64-linux-gnu/libc.so.6+0x27189)

Thread T10 created by T0 here:
    #0 0x7f8c47a49726 in __interceptor_pthread_create ../../../../src/libsanitizer/asan/asan_interceptors.cpp:207
    #1 0x7f8c46ed4578 in std::thread::_M_start_thread(std::unique_ptr<std::thread::_State, std::default_delete<std::thread::_State> >, void (*)()) (/lib/x86_64-linux-gnu/libstdc++.so.6+0xd4578)
    #2 0x55b9169180df in main tests/multi-thread.cc:158
    #3 0x7f8c47046189  (/lib/x86_64-linux-gnu/libc.so.6+0x27189)

SUMMARY: AddressSanitizer: heap-buffer-overflow (/root/pcre2/.libs/libpcre2-8.so.0+0x492bb3) in match
Shadow bytes around the buggy address:
  0x0c1c7fff8090: fa fa fa fa 00 00 00 00 00 00 00 00 00 00 00 00
  0x0c1c7fff80a0: 00 00 00 00 00 00 00 00 fa fa fa fa fa fa fa fa
  0x0c1c7fff80b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0c1c7fff80c0: 00 00 00 00 fa fa fa fa fa fa fa fa 00 00 00 00
  0x0c1c7fff80d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 04
=>0x0c1c7fff80e0: fa[fa]fa fa fa fa fa fa 00 00 00 00 00 00 00 00
  0x0c1c7fff80f0: 00 00 00 00 00 00 00 00 00 00 00 04 fa fa fa fa
  0x0c1c7fff8100: fa fa fa fa 00 00 00 00 00 00 00 00 00 00 00 00
  0x0c1c7fff8110: 00 00 00 00 00 00 00 04 fa fa fa fa fa fa fa fa
  0x0c1c7fff8120: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0c1c7fff8130: 00 00 00 02 fa fa fa fa fa fa fa fa 00 00 00 00
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==107299==ABORTING
FAIL multi-thread (exit status: 1)

FAIL: multi-java
================

link-grammar: Warning: JNI: locale ANSI_X3.4-1968 was not UTF-8; force-setting to en_US.UTF-8
link-grammar: Info: JNI: dictionary language 'en' version 5.11.0
=================================================================
==107369==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x611000004af8 at pc 0x7fbc4de92bb4 bp 0x7fbc47348b70 sp 0x7fbc47348b68
READ of size 1 at 0x611000004af8 thread T10
    #0 0x7fbc4de92bb3 in match (/root/pcre2/.libs/libpcre2-8.so.0+0x492bb3)
    #1 0x7fbc4deb4eac in pcre2_match_8 (/root/pcre2/.libs/libpcre2-8.so.0+0x4b4eac)
    #2 0x7fbc4f2930e3 in reg_match dict-common/regex-morph.c:219
    #3 0x7fbc4f294eed in matchspan_regex dict-common/regex-morph.c:437
    #4 0x7fbc4f3b263a in is_afdict_punc tokenize/tokenize.c:402
    #5 0x7fbc4f3b56b4 in issue_word_alternative tokenize/tokenize.c:651
    #6 0x7fbc4f3b97f4 in remqueue_gword tokenize/tokenize.c:1021
    #7 0x7fbc4f3cdf4e in separate_sentence tokenize/tokenize.c:3103
    #8 0x7fbc4f25fc07 in sentence_split link-grammar/api.c:494
    #9 0x7fbc4f261b87 in sentence_parse link-grammar/api.c:679
    #10 0x7fbc50098cfc  (/root/link-grammar-5.11.0~dfsg/bindings/java-jni/.libs/liblink-grammar-java.so.5+0x3cfc)
    #11 0x7fbc50098eeb in unit_test_jparse (/root/link-grammar-5.11.0~dfsg/bindings/java-jni/.libs/liblink-grammar-java.so.5+0x3eeb)
    #12 0x5572ef94c3f6 in parse_one_sent tests/multi-java.cc:32
    #13 0x5572ef94c884 in parse_sents tests/multi-java.cc:76
    #14 0x5572ef954318 in void std::__invoke_impl<void, void (*)(int, int), int, int>(std::__invoke_other, void (*&&)(int, int), int&&, int&&) /usr/include/c++/12/bits/invoke.h:61
    #15 0x5572ef954083 in std::__invoke_result<void (*)(int, int), int, int>::type std::__invoke<void (*)(int, int), int, int>(void (*&&)(int, int), int&&, int&&) /usr/include/c++/12/bits/invoke.h:96
    #16 0x5572ef953d21 in void std::thread::_Invoker<std::tuple<void (*)(int, int), int, int> >::_M_invoke<0ul, 1ul, 2ul>(std::_Index_tuple<0ul, 1ul, 2ul>) /usr/include/c++/12/bits/std_thread.h:252
    #17 0x5572ef953aeb in std::thread::_Invoker<std::tuple<void (*)(int, int), int, int> >::operator()() /usr/include/c++/12/bits/std_thread.h:259
    #18 0x5572ef953aa3 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (*)(int, int), int, int> > >::_M_run() /usr/include/c++/12/bits/std_thread.h:210
    #19 0x7fbc4ecd44a2  (/lib/x86_64-linux-gnu/libstdc++.so.6+0xd44a2)
    #20 0x7fbc4eea7fd3  (/lib/x86_64-linux-gnu/libc.so.6+0x88fd3)
    #21 0x7fbc4ef2866b  (/lib/x86_64-linux-gnu/libc.so.6+0x10966b)

0x611000004af8 is located 31 bytes to the right of 217-byte region [0x611000004a00,0x611000004ad9)
allocated by thread T0 here:
    #0 0x7fbc4f8b89cf in __interceptor_malloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:69
    #1 0x7fbc4dd46659 in default_malloc (/root/pcre2/.libs/libpcre2-8.so.0+0x346659)
    #2 0x7fbc4dd40985 in pcre2_compile_8 (/root/pcre2/.libs/libpcre2-8.so.0+0x340985)
    #3 0x7fbc4f292a23 in reg_comp dict-common/regex-morph.c:191
    #4 0x7fbc4f294490 in compile_regexs dict-common/regex-morph.c:373
    #5 0x7fbc4f278e9d in afdict_init dict-common/dict-impl.c:765
    #6 0x7fbc4f2a0253 in dictionary_six_str dict-file/dictionary.c:240
    #7 0x7fbc4f2a0710 in dictionary_six dict-file/dictionary.c:281
    #8 0x7fbc4f2a084d in dictionary_create_from_file dict-file/dictionary.c:307
    #9 0x7fbc4f26e9e5 in dictionary_create_lang dict-common/dict-common.c:134
    #10 0x7fbc50098930  (/root/link-grammar-5.11.0~dfsg/bindings/java-jni/.libs/liblink-grammar-java.so.5+0x3930)
    #11 0x7fbc50099218 in Java_org_linkgrammar_LinkGrammar_init (/root/link-grammar-5.11.0~dfsg/bindings/java-jni/.libs/liblink-grammar-java.so.5+0x4218)
    #12 0x5572ef94ca93 in main tests/multi-java.cc:85
    #13 0x7fbc4ee46189  (/lib/x86_64-linux-gnu/libc.so.6+0x27189)

Thread T10 created by T0 here:
    #0 0x7fbc4f849726 in __interceptor_pthread_create ../../../../src/libsanitizer/asan/asan_interceptors.cpp:207
    #1 0x7fbc4ecd4578 in std::thread::_M_start_thread(std::unique_ptr<std::thread::_State, std::default_delete<std::thread::_State> >, void (*)()) (/lib/x86_64-linux-gnu/libstdc++.so.6+0xd4578)
    #2 0x5572ef94cb80 in main tests/multi-java.cc:94
    #3 0x7fbc4ee46189  (/lib/x86_64-linux-gnu/libc.so.6+0x27189)

SUMMARY: AddressSanitizer: heap-buffer-overflow (/root/pcre2/.libs/libpcre2-8.so.0+0x492bb3) in match
Shadow bytes around the buggy address:
  0x0c227fff8900: 00 00 00 00 00 00 00 00 01 fa fa fa fa fa fa fa
  0x0c227fff8910: fa fa fa fa fa fa fa fa 00 00 00 00 00 00 00 00
  0x0c227fff8920: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0c227fff8930: 00 00 00 01 fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c227fff8940: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x0c227fff8950: 00 00 00 00 00 00 00 00 00 00 00 01 fa fa fa[fa]
  0x0c227fff8960: fa fa fa fa fa fa fa fa fd fd fd fd fd fd fd fd
  0x0c227fff8970: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c227fff8980: fd fd fd fd fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c227fff8990: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0c227fff89a0: 00 00 00 00 00 00 00 00 00 00 00 01 fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==107369==ABORTING
FAIL multi-java (exit status: 1)

============================================================================
Testsuite summary for link-grammar 5.11.0
============================================================================
# TOTAL: 5
# PASS:  3
# SKIP:  0
# XFAIL: 0
# FAIL:  2
# XPASS: 0
# ERROR: 0
============================================================================
See tests/test-suite.log
Please report to https://github.com/opencog/link-grammar
============================================================================
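
The addresses in the two reports above can be cross-checked with shell
arithmetic, which may help when comparing runs. This sketch assumes the
default AddressSanitizer shadow mapping on x86_64 Linux, shadow =
(addr >> 3) + 0x7fff8000; shadow_addr and overrun are names made up for
this illustration.

```shell
#!/bin/bash
# Shadow byte address for an application address (assumed default
# x86_64 Linux mapping: one shadow byte per 8 application bytes).
shadow_addr() {
    printf '0x%012x\n' $(( ($1 >> 3) + 0x7fff8000 ))
}

# How far the faulting address lies past the end of the heap region.
overrun() {
    echo $(( $1 - $2 ))
}

# multi-thread report
shadow_addr 0x60e000000708                     # 0x0c1c7fff80e1
overrun 0x60e000000708 0x60e0000006fc          # 12
echo $(( 0x60e0000006fc - 0x60e000000660 ))    # 156 (region size)

# multi-java report
shadow_addr 0x611000004af8                     # 0x0c227fff895f
overrun 0x611000004af8 0x611000004ad9          # 31
echo $(( 0x611000004ad9 - 0x611000004a00 ))    # 217 (region size)
```

Both shadow addresses fall in the rows marked "=>" in the dumps, at the
bracketed [fa] byte, so the read landed in the heap redzone following the
buffer allocated by pcre2_compile_8.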
