‘regexp-exec’ sometimes gets match boundaries wrong when operating on a
Unicode string but in a C locale (this is with
af96820e072d18c49ac03e80c6f3466d568dc77d):
--8<---------------cut here---------------start------------->8---
scheme@(guile-user)> ,use(ice-9 regex)
scheme@(guile-user)> (setlocale LC_ALL "C")
$52 = "C"
scheme@(guile-user)> (string-match "start (.*)"
                                   (string-append "start "
                                                  (string (integer->char
                                                           1002))))
$53 = #("start \u03ea" (0 . 8) (6 . 8))
scheme@(guile-user)> (match:substring $53 1)
ice-9/boot-9.scm:1683:22: In procedure raise-exception:
Value out of range 6 to< 7: 8
Entering a new prompt. Type `,bt' for a backtrace or `,q' to continue.
--8<---------------cut here---------------end--------------->8---
The attached program triggers more such failures, using randomly chosen
characters. (The example above works correctly under a UTF-8 locale.)
So I believe ‘fixup_multibyte_match’ isn’t quite correct.
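One observation, offered as an assumption rather than a diagnosis: the
end offset 8 reported above equals the UTF-8 byte length of the subject
string, not its character count, which would be consistent with byte
offsets leaking through unconverted. A minimal sketch of that
comparison, using ‘string->utf8’ from (rnrs bytevectors); the name
‘str’ is only for illustration:
--8<---------------cut here---------------start------------->8---
(use-modules (rnrs bytevectors))

;; Same subject string as in the session above: "start " followed by
;; U+03EA (code point 1002).
(define str
  (string-append "start " (string (integer->char 1002))))

;; Length in characters: the valid upper bound for match offsets.
(display (string-length str))
(newline)

;; Length in bytes once encoded as UTF-8 (assumed here to be the
;; encoding the offsets were computed against; not verified).
(display (bytevector-length (string->utf8 str)))
(newline)
--8<---------------cut here---------------end--------------->8---
This prints 7 and then 8. If that reading is right, (6 . 8) is a byte
range, and the expected character range for the submatch would be
(6 . 7).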
Ludo’.
PS: This originates in <https://issues.guix.gnu.org/77283>.
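;; Match strings ending in a random character (code point 256..1279)
;; under the C locale until extracting the submatch fails.
(use-modules (ice-9 regex))

(define rx
  (make-regexp "^start (.*)"))

(setlocale LC_ALL "C")

(let loop ()
  (let* ((i (+ 256 (random (expt 2 10))))
         (str (string-append "start " (string (integer->char i)))))
    (with-exception-handler
      (lambda (exc)
        ;; Report the offending code point, print a backtrace, and stop.
        (pk 'exc exc '<-- i)
        (display-backtrace (make-stack #t) (current-error-port))
        (exit 1))
      (lambda ()
        ;; Raises an out-of-range error when the match offsets are wrong.
        (match:substring (regexp-exec rx str) 1)))
    (loop)))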