Hi Joey,

On Friday, 08.03.2013 at 21:43 -0400, Joey Hess wrote:
> Package: libghc-regex-compat-dev
> Version: 0.95.1-2+b1
> Severity: normal
>
> Prelude Text.Regex> matchRegex (mkRegex $ "^.*$") "o"
> Just []
> Prelude Text.Regex> let s = "ò"
> Prelude Text.Regex> s
> "\242"
> Prelude Text.Regex> matchRegex (mkRegex $ "^.*$") s
> Nothing
> Prelude Text.Regex> matchRegex (mkRegex $ ".") s
> Nothing
>
> I mentioned this to upstream and he said:
>
> | That looks like it is pushing the unicode text to your system C library for
> | matching. This translation is probably making a multibyte C-string and then
> | running a non-Unicode aware C-library call.
> |
> | You will need to check your setup.
> |
> | It is true there are some bugs in this, but they are in the translating of
> | indices, which does not apply here.
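
To make the "multibyte C-string" point concrete, here is a minimal sketch using only base. It shows that "o" is one byte in UTF-8 while "ò" (U+00F2, shown by GHCi as "\242") is two. The helper utf8Bytes is hypothetical and written just for this illustration; it is not part of regex-compat or regex-posix, and it says nothing about how regex-posix actually marshals a String.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Hypothetical helper (an assumption for illustration only, not an API of
-- regex-posix or regex-compat): encode one Char as its UTF-8 byte sequence.
-- Surrogate code points are not rejected; this is only a sketch.
utf8Bytes :: Char -> [Word8]
utf8Bytes c
  | n < 0x80    = [fromIntegral n]                                -- 1 byte (ASCII)
  | n < 0x800   = [0xC0 .|. hi 6, cont 0]                         -- 2 bytes
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]                -- 3 bytes
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]       -- 4 bytes
  where
    n = ord c
    -- Leading bits of the code point, placed in the first byte.
    hi s = fromIntegral (n `shiftR` s)
    -- Continuation byte: 10xxxxxx carrying six bits of the code point.
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

main :: IO ()
main = do
  print (utf8Bytes 'o')     -- [111]
  print (utf8Bytes '\242')  -- [195,178]
```

A matcher that works byte by byte sees two units for "ò", neither of which is the character itself. Whether that is what goes wrong in the session above depends on how the string is handed to the C library and on the locale.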

regex-compat is but a thin layer around regex-posix, which states

        Note that the posix library works with single byte characters,
        and does not understand Unicode. If you need Unicode support you
        will have to use a different backend.¹

It also makes suggestions for alternative regex libraries:

        Benchmarking shows the default regex library on many platforms
        is very inefficient. You might increase performance by an order
        of magnitude by obtaining libpcre and regex-pcre or libtre and
        regex-tre. If you do not need the captured substrings then you
        can also get great performance from regex-dfa. If you do need
        the capture substrings then you may be able to use regex-parsec
        to improve performance.

For arbtt, where speed is crucial, I made sure that all my strings are
UTF-8-encoded ByteStrings and got good results with pcre-light in
utf8-mode, but this required some manual plumbing.

It seems that only regex-tdfa supports Unicode natively:

        Depending on the text being searched this package supports
        Unicode. The [Char] and (Seq Char) text types support Unicode.
        The ByteString and ByteString.Lazy text types only support
        ASCII. It is possible to support utf8 encoded ByteString.Lazy by
        using regex-tdfa and regex-tdfa-utf8 packages together (required
        the utf8-string package).²

I don’t know how its speed compares to the others, but it is likely
better than regex-posix.

If you agree with this analysis, please close the bug.

Greetings,
Joachim

¹ http://hackage.haskell.org/packages/archive/regex-posix/0.95.2/doc/html/Text-Regex-Posix.html
² http://hackage.haskell.org/packages/archive/regex-tdfa/1.1.8/doc/html/Text-Regex-TDFA.html

-- 
Joachim "nomeata" Breitner
Debian Developer
  nome...@debian.org | ICQ# 74513189 | GPG-Keyid: 4743206C
  JID: nome...@joachim-breitner.de | http://people.debian.org/~nomeata