Hi,

On Saturday, 09.03.2013 at 11:52 -0400, Joey Hess wrote:
> Joachim Breitner wrote:
> > regex-compat is but a thin layer around regex-posix, which states
> >         Note that the posix library works with single byte characters,
> >         and does not understand Unicode. If you need Unicode support you
> >         will have to use a different backend.¹
> 
> Right. However, ò is not actually unicode, I think it's ISO8859-15.
> 
> Also I'm not trying to do anything that requires knowledge of unicode.
> Even if the library sees [byte, byte], "^.*$" should still match all
> the bytes.

The library should actually see "\242\0", and gdb verifies that this is
in the CString. Nevertheless, I cannot reproduce this behavior in C:

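#include <sys/types.h>
#include <regex.h>
#include <stdio.h>

/* Note: in C, "\242" is an octal escape (byte 0xA2), whereas in Haskell
   "\242" is a decimal escape (byte 0xF2). Both bytes are above 127. */
int main(void) {
        regex_t r;
        if (regcomp(&r, ".", 0) != 0)   /* basic POSIX regex, no flags */
                return 1;
        const char *s = "\242";
        int i = regexec(&r, s, 0, NULL, 0);   /* 0 means the match succeeded */
        printf("%d\n", i);
        regfree(&r);
        return 0;
}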

prints 0, i.e. match succeeded.

But on the lowest layer above the FFI, the strange behaviour already
occurs:

Prelude Foreign.C.String Text.Regex.Posix.Wrap> cs <- newCAString "\242"
Prelude Foreign.C.String Text.Regex.Posix.Wrap> cp <- newCAString "."
Prelude Foreign.C.String Text.Regex.Posix.Wrap> 
Prelude Foreign.C.String Text.Regex.Posix.Wrap> Right r2 <- wrapCompile 0 0 cp
Prelude Foreign.C.String Text.Regex.Posix.Wrap> 
Prelude Foreign.C.String Text.Regex.Posix.Wrap> wrapTest r2 cs
Right False

It is False for "\128" and True for "\127".
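
To look for the same boundary from plain C, the per-byte check could be
sketched roughly as follows. This is only a sketch: dot_matches_byte is a
hypothetical helper name (not part of regex-posix), and it takes the byte
value as a number rather than as a string escape.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if the POSIX basic regex "." matches the
   one-byte string consisting of `byte`, 0 if it does not, and -1 if the
   pattern fails to compile. `byte` must be in 1..255, because 0 would
   yield an empty C string. */
int dot_matches_byte(int byte) {
        assert(byte >= 1 && byte <= 255);
        regex_t r;
        if (regcomp(&r, ".", 0) != 0)
                return -1;
        char s[2] = { (char)byte, '\0' };
        int rc = regexec(&r, s, 0, NULL, 0);
        regfree(&r);
        return rc == 0 ? 1 : 0;
}
```

Calling it in a loop over 1..255 and printing every byte for which it
returns 0 would show whether a given process sees the 127/128 boundary.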

The code in question is here:
http://hackage.haskell.org/packages/archive/regex-posix/0.95.2/doc/html/src/Text-Regex-Posix-Wrap.html

Changing the regex to "^$" or "^.*$" does not make a difference, i.e.
the string is not just turned to the empty string. Clearly something is
broken here.

I can reproduce it from within ghc’s address space using gdb:

(gdb) call malloc(32)
$7 = 64943120
(gdb) call regcomp(64943120, ".", 0)
$8 = 0
(gdb) call regexec(64943120,"\242",0,0,0)
$9 = 1
(gdb) call regexec(64943120,"only_ascii",0,0,0)
$10 = 0

The same happens from gdb while debugging “sleep”. So the behaviour is
already there in regexec, but for some reason it is not triggered from plain
C code, only via some variants of FFI (GHC’s or gdb’s).
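
One difference between these processes that might be worth ruling out is
the locale. This is purely an assumption, nothing above establishes it: the
C program above never calls setlocale, so it runs in the "C" locale, while
other processes may have switched to the environment's locale. A sketch for
running the same match under a chosen locale, with match_in_locale being a
hypothetical helper name:

```c
#include <assert.h>
#include <locale.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: switch LC_ALL to `locale`, then compile `pattern`
   as a POSIX basic regex and run it against `s`. Returns the regexec
   result (0 = match, REG_NOMATCH = no match), or -1 if the locale is not
   available or the pattern does not compile. Pass "" as `locale` to pick
   up the locale from the environment. */
int match_in_locale(const char *locale, const char *pattern, const char *s) {
        assert(locale != NULL && pattern != NULL && s != NULL);
        if (setlocale(LC_ALL, locale) == NULL)
                return -1;
        regex_t r;
        if (regcomp(&r, pattern, 0) != 0)
                return -1;
        int rc = regexec(&r, s, 0, NULL, 0);
        regfree(&r);
        return rc;
}
```

Comparing match_in_locale("C", ".", "\242") with
match_in_locale("", ".", "\242") on the affected machine would show whether
the locale plays any role here.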

I’ll leave it at that, as this is not really related to GHC or Haskell
any more.

Greetings,
Joachim

-- 
Joachim "nomeata" Breitner
Debian Developer
  nome...@debian.org | ICQ# 74513189 | GPG-Keyid: 4743206C
  JID: nome...@joachim-breitner.de | http://people.debian.org/~nomeata
