On Friday, October 08, 2010 14:13:36 [email protected] wrote:
> I've been running into a few problems with regular expressions in D. One
> of the issues I've had recently is matching strings with non ascii
> characters. As an example:
>
> auto re = regex( `(.*)\.txt`, "i" );
> re.printProgram();
> auto m = match( "bà.txt", re );
> writefln( "'%s'", m.captures[1] );
>
> When I run this I get the following error:
>
> dchar decode(in char[], ref size_t): Invalid UTF-8 sequence [160 46 116 120] around index 0
> printProgram()
> 0: REparen len=1 n=0, pc=>10
> 9: REanystar
> 10: REistring x4, '.txt'
> 19: REend
>
> While investigating the cause, I noticed that during execution of many
> of the regex instructions (e.g. REanystar), the source is advanced with:
>
> src++;
>
> However in other cases (REanychar), it is advanced with:
>
> src += std.utf.stride(input, src);
>
> I found that by replacing the code REanystar with stride, the code
> worked as expected. Although I can't claim to have a solid understanding
> of the code, it seems to me that most of the cases of src++ should be
> using stride instead.
>
> Is this correct, or have I made some silly mistake and got completely
> the wrong end of the stick?

Well, without looking at the code, I can't say for certain what's going on, but using ++ with chars or wchars is definitely wrong in virtually all cases. stride() will actually go to the next code point, while ++ will just go to the next code unit, which could be in the middle of a code point.

- Jonathan M Davis
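
To make the code unit versus code point difference concrete, here is a minimal sketch using the string from the example above. It is not the std.regex code; codePointStarts is a hypothetical helper written only for illustration. It relies on 'à' (U+00E0) being encoded in UTF-8 as the two code units 0xC3 0xA0, which would also explain the 160 (0xA0) at the start of the sequence in the error message: a plain ++ from index 1 lands on the second code unit of 'à'.

```d
import std.utf : stride;

// Hypothetical helper (not part of std.regex): returns the index of the
// first code unit of every code point in s.
size_t[] codePointStarts(string s)
{
    size_t[] starts;
    size_t i = 0;
    while (i < s.length)
    {
        starts ~= i;
        i += stride(s, i); // skip the whole UTF-8 sequence, not just one code unit
    }
    return starts;
}

void main()
{
    string input = "bà.txt"; // 'à' (U+00E0) is the two code units 0xC3 0xA0

    assert(input.length == 7);     // 7 code units, but only 6 code points
    assert(stride(input, 1) == 2); // 'à' occupies indices 1 and 2
    assert(input[2] == 0xA0);      // index 2 is a continuation code unit (160)

    // Index 2 never appears as a start: stepping with ++ from index 1
    // would stop there, in the middle of 'à'.
    assert(codePointStarts(input) == [0, 1, 3, 4, 5, 6]);
}
```

Decoding from index 2 is what fails, so any loop that walks a char[] and then decodes at the current position needs to advance by stride() rather than by one.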
