Bug#702617: [Pkg-haskell-maintainers] Bug#702617: . fails to match certain characters

Joachim Breitner Sat, 09 Mar 2013 00:54:17 -0800

Hi Joey,

On Friday, 08.03.2013 at 21:43 -0400, Joey Hess wrote:
> Package: libghc-regex-compat-dev
> Version: 0.95.1-2+b1
> Severity: normal
> 
> Prelude Text.Regex> matchRegex (mkRegex $ "^.*$") "o"
> Just []
> Prelude Text.Regex> let s = "ò"
> Prelude Text.Regex> s
> "\242"
> Prelude Text.Regex> matchRegex (mkRegex $ "^.*$") s
> Nothing
> Prelude Text.Regex> matchRegex (mkRegex $ ".") s
> Nothing
> 
> I mentioned this to upstream and he said:
> 
> | That looks like it is pushing the unicode text to your system C library for
> | matching.  This translation is probably making a multibyte C-string and then
> | running a non-Unicode aware C-library call.
> | 
> | You will need to check your setup.
> | 
> | It is true there are some bugs in this, but they are in the translating of
> | indices, which does not apply here.

regex-compat is but a thin layer around regex-posix, whose documentation
states:
    Note that the posix library works with single byte characters,
    and does not understand Unicode. If you need Unicode support you
    will have to use a different backend.¹

It also makes suggestions for alternative regex libraries:
    Benchmarking shows the default regex library on many platforms
    is very inefficient. You might increase performance by an order
    of magnitude by obtaining libpcre and regex-pcre or libtre and
    regex-tre. If you do not need the captured substrings then you
    can also get great performance from regex-dfa. If you do need
    the capture substrings then you may be able to use regex-parsec
    to improve performance.

For arbtt, where speed is crucial, I made sure that all my strings are
UTF-8-encoded ByteStrings and got good results with pcre-light in
UTF-8 mode, but this required some manual plumbing.
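
A minimal sketch of what such plumbing can look like, assuming the
pcre-light and text packages (and libpcre) are installed. The helper
names toUtf8 and matchUtf8 are made up for illustration and are not
taken from arbtt:

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Text.Regex.PCRE.Light (compile, match, utf8)

-- Hypothetical helper: encode a Haskell String as a UTF-8 ByteString.
toUtf8 :: String -> B.ByteString
toUtf8 = TE.encodeUtf8 . T.pack

-- Hypothetical helper: compile the pattern with PCRE's utf8 option and
-- match it against the UTF-8 encoded input. On success the result holds
-- the whole match followed by any captured groups, still UTF-8 encoded.
matchUtf8 :: String -> String -> Maybe [B.ByteString]
matchUtf8 pat input = match (compile (toUtf8 pat) [utf8]) (toUtf8 input) []

main :: IO ()
main = do
  let s = "\242"  -- the character from the bug report
  print (matchUtf8 "^.*$" s)
  print (matchUtf8 "." s)
```

Both the pattern and the subject have to go through the same encoding
step, which is the manual part: anything read as a String must be
converted before matching and decoded again afterwards.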

It seems that only regex-tdfa supports Unicode natively:
    Depending on the text being searched this package supports
    Unicode. The [Char] and (Seq Char) text types support Unicode.
    The ByteString and ByteString.Lazy text types only support
    ASCII. It is possible to support utf8 encoded ByteString.Lazy by
    using regex-tdfa and regex-tdfa-utf8 packages together (required
    the utf8-string package).²
I don’t know how its speed compares to the others, but it is likely
faster than regex-posix.
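
For illustration, the two failing patterns from the report can be tried
against regex-tdfa on a plain String. This is only a sketch, assuming
the regex-tdfa package is installed; the helper name matches is made up
here:

```haskell
import Text.Regex.TDFA ((=~))

-- Hypothetical helper: True when the pattern matches somewhere in the
-- input. Both arguments are plain String ([Char]), one of the text
-- types the regex-tdfa documentation lists as supporting Unicode.
matches :: String -> String -> Bool
matches pat input = input =~ pat

main :: IO ()
main = do
  let s = "\242"  -- the character from the bug report
  print (matches "^.*$" s)
  print (matches "." s)
```

Because the match runs on Char values inside Haskell, no conversion to
a C string is involved for this text type.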

If you agree with this analysis, please close the bug.

Greetings,
Joachim

¹ http://hackage.haskell.org/packages/archive/regex-posix/0.95.2/doc/html/Text-Regex-Posix.html
² http://hackage.haskell.org/packages/archive/regex-tdfa/1.1.8/doc/html/Text-Regex-TDFA.html

--
Joachim "nomeata" Breitner
Debian Developer
nome...@debian.org | ICQ# 74513189 | GPG-Keyid: 4743206C
JID: nome...@joachim-breitner.de | http://people.debian.org/~nomeata

