Re: UTF-8 support?

John Blumel Sat, 30 Apr 2005 15:48:51 -0700


On Apr 30, 2005, at 6:06pm, Sherm Pendley wrote:

> > OK. So does this mean that substr() just doesn't/can't handle wide characters as characters but only as bytes?
> No, it doesn't mean that. Substr() handles wide characters just fine - the bug in the code you posted had nothing to do with encoding. When I tested it, it (mis)behaved identically with both ASCII and UTF8-encoded Japanese text.

OK, here's the code without the bug I had written into the example. It sits inside a foreach loop that iterates over a long list of keywords:

    ...
    while ($articleWorkText =~ m/\b$kWord\b/igs) {
        $position = pos($articleWorkText) - length($kWord);
        $matchedText = substr($articleWorkText, $position, length($kWord));
        $matchedText =~ s/ /_/g;
        substr($patternSpace, $position, length($matchedText)) = $matchedText;
    }
    ...

This works fine in most cases. However, if there is a wide character in $articleWorkText before the matched text, the position that substr() actually uses ends up in front of the $position calculated from pos(). If I open the file in TextEdit, the pos()-derived position appears to be correct, while the position substr() seems to use is one character earlier. This happens only when a wide character precedes the match.
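For anyone trying to reproduce this, here is a minimal sketch of one way an off-by-one like the one described can arise. It is only an assumption that this applies to the code above: it supposes that one string holds decoded characters while another holds the same text as raw UTF-8 bytes. The sample text and the variable names $chars and $bytes are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical sample text: "cafe au lait" with e-acute written as \x{e9}.
my $chars = "caf\x{e9} au lait";          # decoded text: 12 characters
my $bytes = encode('UTF-8', $chars);      # same text as raw bytes: 13 bytes

my $kWord = 'au lait';
$chars =~ m/\b\Q$kWord\E\b/ig or die "no match";

# pos() counts characters in $chars, so this is a character offset.
my $position = pos($chars) - length($kWord);

# Applied to the character string, the offset lands on the match.
my $from_chars = substr($chars, $position, length($kWord));

# Applied to the byte string, the same number counts bytes instead,
# so the slice starts one byte before the match (e-acute is 2 bytes).
my $from_bytes = substr($bytes, $position, length($kWord));

print "[$from_chars]\n";   # [au lait]
print "[$from_bytes]\n";   # [ au lai]
```

If both strings are decoded the same way (or both left as bytes), pos() and substr() should agree with each other; the mismatch in this sketch comes purely from mixing the two representations.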

Now, maybe this would be better written using the $1, $2, ... variables, but I still don't understand the discrepancy between the pos() position and the substr() position.
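As a possible alternative to computing the start from pos() and length($kWord), the built-in @- and @+ arrays give the start and end offsets of the last successful match directly. The following is a sketch only, not a diagnosis of the discrepancy: the sample text is invented, $patternSpace is assumed to be a same-length copy of $articleWorkText in the same encoding, and the \Q...\E quoting is an addition that assumes the keywords are literal strings rather than patterns.

```perl
use strict;
use warnings;

# Hypothetical sample data; \x{e9} is a non-ASCII character before the matches.
my $articleWorkText = "caf\x{e9} New York, then new york again";
my $patternSpace    = $articleWorkText;   # assumed: same length, same encoding
my $kWord           = 'new york';

while ($articleWorkText =~ m/\b\Q$kWord\E\b/ig) {
    # $-[0] and $+[0] are the start and end offsets of the whole match.
    my $position = $-[0];
    my $length   = $+[0] - $-[0];

    my $matchedText = substr($articleWorkText, $position, $length);
    $matchedText =~ s/ /_/g;

    # Same-length replacement, so later offsets stay valid.
    substr($patternSpace, $position, $length) = $matchedText;
}

# $patternSpace now has underscores inside both matches:
# "caf\x{e9} New_York, then new_york again"
```

Taking the length from $+[0] - $-[0] rather than from length($kWord) also covers the case where the matched text and the keyword differ in length.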


John Blumel
