Re: UTF-8 support?

Sherm Pendley Sat, 30 Apr 2005 13:08:30 -0700

On Apr 30, 2005, at 2:10 PM, John Blumel wrote:

> use utf8;
> ...
> $blah =~ m/<wide_regex>/g;
> $position = pos $blah;
>
> seems to give the correct character position but,
>
> $matched = substr($blah,
>                   $position - length($blah),
>                   length($blah));
>
> doesn't put the matched text into $matched when there are wide characters in $blah

Nor should it, even if the text is plain old ASCII - there's a bug in the above code that has nothing to do with string encoding.

Take a ten-character ASCII string: 'abcdefghij'. Match it against 'fgh', and $position will be 8, as expected, since pos() returns the offset just past the end of the match.

The length of $blah is 10 and $position is 8, so the above call to substr amounts to:

    $matched = substr('abcdefghij', -2, 10);

Which returns 'ij': everything in the string *after* what was matched by the regex.

As the docs for substr() state, if the offset (second argument) is negative, it is counted from the end of the string. If the combination of offset and length falls partially outside the string, only the portion inside the string is returned. With an offset of -2, it's obviously impossible to take ten characters beginning two from the end, so only the remaining two are returned.
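
To see this for yourself, here's a minimal sketch of the ASCII case described above. The variable names follow the original snippet; the 'or die' is only there to make a failed match obvious.

```perl
use strict;
use warnings;

my $blah = 'abcdefghij';

# Scalar-context m//g sets pos() to the offset just past the match
$blah =~ m/fgh/g or die "no match";
my $position = pos $blah;

# 8 - 10 = -2, so this asks for up to 10 characters starting 2 from the end
my $matched = substr($blah, $position - length($blah), length($blah));

print "position: $position\n";   # position: 8
print "matched:  $matched\n";    # matched:  ij
```

No wide characters are involved, and the result is still the tail of the string rather than the matched text.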

But really, why bother with substr() at all? Just parenthesize the regex, and store the results in a list:

    my @matched = ($blah =~ m/(<regex>)/g);

That will return a list of all the strings matching the expression.
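
A quick sketch of that approach, with a made-up sample string and pattern standing in for your real $blah and <regex>:

```perl
use strict;
use warnings;

# Hypothetical sample data, for illustration only
my $blah = 'cat bat hat';

# List context plus /g returns every captured substring
my @matched = ($blah =~ m/(\wat)/g);

print join(', ', @matched), "\n";   # cat, bat, hat
```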

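If, after such a match, you need to know the positions of the matched strings in $blah, have a look at @- and @+. They describe only the most recent successful match, so to get the position of every match, loop with m//g in scalar context and read them on each pass.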
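
Here's a rough sketch of using @- and @+ to collect the position of each match, assuming you walk the string with m//g in scalar context so the arrays are refreshed on every pass. The sample string is made up, and its wide characters are written as \x{...} escapes so the example doesn't depend on how the file is saved.

```perl
use strict;
use warnings;

# Hypothetical sample: 'ab', two Greek letters, 'cd', the same two Greek letters
my $blah = "ab\x{3b1}\x{3b2}cd\x{3b1}\x{3b2}";
my @found;

while ($blah =~ m/\x{3b1}\x{3b2}/g) {
    # $-[0] is where this match starts, $+[0] is the offset just past its end
    my ($start, $end) = ($-[0], $+[0]);

    # If you do want substr(), this is the offset/length pair that fits
    my $text = substr($blah, $start, $end - $start);

    push @found, [$start, $end, $text];
}

printf "%d..%d\n", $_->[0], $_->[1] for @found;   # 2..4 then 6..8
```

The offsets are in characters, not bytes, so they can be passed straight back to substr() on the same string.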

sherm--

Cocoa programming in Perl: http://camelbones.sourceforge.net
Hire me! My resume: http://www.dot-app.org
