Hi,

On Saturday, 09.03.2013, at 11:52 -0400, Joey Hess wrote:
> Joachim Breitner wrote:
> > regex-compat is but a thin layer around regex-posix, which states
> >     Note that the posix library works with single byte characters,
> >     and does not understand Unicode. If you need Unicode support you
> >     will have to use a different backend.¹
>
> Right. However, ò is not actually unicode, I think it's ISO8859-15.
>
> Also I'm not trying to do anything that requires knowledge of unicode.
> Even if the library sees [byte, byte], "^.*$" should still match all
> the bytes.

The library should actually see "\242\0", and gdb verifies that this is
in the CString. Nevertheless, I cannot reproduce this behaviour in C:

    #include <sys/types.h>
    #include <regex.h>
    #include <stdio.h>

    int main(void)
    {
        regex_t r;
        regcomp(&r, ".", 0);
        /* C reads "\242" as an octal escape, i.e. the byte 0xA2
           (GHC reads the same literal as decimal 242, byte 0xF2) */
        const char *s = "\242";
        int i = regexec(&r, s, 0, NULL, 0);
        printf("%d\n", i);
        return 0;
    }

prints 0, i.e. the match succeeded.

But on the lowest layer above the FFI, the strange behaviour already
occurs:

    Prelude Foreign.C.String Text.Regex.Posix.Wrap> cs <- newCAString "\242"
    Prelude Foreign.C.String Text.Regex.Posix.Wrap> cp <- newCAString "."
    Prelude Foreign.C.String Text.Regex.Posix.Wrap> Right r2 <- wrapCompile 0 0 cp
    Prelude Foreign.C.String Text.Regex.Posix.Wrap> wrapTest r2 cs
    Right False

It is False for \128 and True for \127.

The code in question is here:
http://hackage.haskell.org/packages/archive/regex-posix/0.95.2/doc/html/src/Text-Regex-Posix-Wrap.html

Changing the regex to "^$" or "^.*$" does not make a difference, i.e.
the string is not just turned into the empty string. Clearly something
is broken here.

I can reproduce it from within ghc’s address space using gdb:

    (gdb) call malloc(32)
    $7 = 64943120
    (gdb) call regcomp(64943120, ".", 0)
    $8 = 0
    (gdb) call regexec(64943120,"\242",0,0,0)
    $9 = 1
    (gdb) call regexec(64943120,"only_ascii",0,0,0)
    $10 = 0

And even from gdb while debugging “sleep”.

So the behaviour is already there in regexec, but for some reason it is
not triggered from plain C code, only via some variants of FFI (GHC’s or
gdb’s). I’ll leave it at that, as this is not really related to GHC or
Haskell any more.

Greetings,
Joachim

-- 
Joachim "nomeata" Breitner
Debian Developer
  nome...@debian.org | ICQ# 74513189 | GPG-Keyid: 4743206C
  JID: nome...@joachim-breitner.de | http://people.debian.org/~nomeata
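
One difference between the standalone C program above and the GHC and gdb cases that may be worth checking is the locale. A plain C program runs in the "C" locale until it calls setlocale() itself, whereas many programs switch to the environment's locale at startup. Whether that is what happens here is an assumption and is not confirmed anywhere in this thread. The helper below is a hypothetical sketch (the name dot_matches_in_locale is made up for illustration) that runs the same "." match under a chosen locale:

```c
#include <locale.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper, not part of regex-posix or of the original mail.
   Switches LC_ALL to `locale`, then tests whether the POSIX basic
   regex "." matches the byte string `s`.
   Returns 1 on match, 0 on no match, and -1 if the locale cannot be
   selected or the pattern fails to compile.
   Note: the process locale stays changed after the call. */
int dot_matches_in_locale(const char *locale, const char *s)
{
    regex_t r;
    int rc;

    if (setlocale(LC_ALL, locale) == NULL)
        return -1;              /* locale not installed on this system */
    if (regcomp(&r, ".", 0) != 0)
        return -1;              /* pattern did not compile */

    rc = regexec(&r, s, 0, NULL, 0);
    regfree(&r);
    return rc == 0 ? 1 : 0;
}
```

Called from a main(), comparing dot_matches_in_locale("C", "\362") with dot_matches_in_locale("", "\362") would show whether the locale alone accounts for the difference. Here "\362" is the C octal spelling of byte 0xF2 (decimal 242), and "" selects whatever locale the environment specifies.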