On Apr 30, 2005, at 6:48 PM, John Blumel wrote:
OK, here's the code without the bug written into the example (it sits inside a foreach loop that iterates over a long list of keywords):
...
while ($articleWorkText =~ m/\b$kWord\b/igs)
{
    $position = pos($articleWorkText) - length($kWord);
    $matchedText = substr($articleWorkText, $position, length($kWord));
    $matchedText =~ s/ /_/g;
    substr($patternSpace, $position, length($matchedText)) = $matchedText;
}
...
This works fine in most cases, but if there is a wide character in $articleWorkText before the matched text, then the $position used by substr() ends up ahead of the position calculated from pos().
How are $articleWorkText and $kWord being read into your app? Perl handles a variety of text encodings, but it does need to be told about the encoding to use.
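To illustrate why the encoding matters for offsets, here is a minimal sketch of one way such a mismatch might arise. It assumes the text arrives as raw UTF-8 octets; the sample string and variable names are made up for the illustration, and Encode's decode() stands in for whatever input path the app really uses.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical sample: raw UTF-8 octets in which "e-acute" takes two bytes (0xC3 0xA9).
my $octets = "caf\xC3\xA9 keyword";

# Without decoding, Perl counts every byte as one character.
my $byte_offset = index($octets, 'keyword');

# After decoding, the accented letter is a single character, so offsets count characters.
my $text        = decode('UTF-8', $octets);
my $char_offset = index($text, 'keyword');

print "byte offset: $byte_offset, character offset: $char_offset\n";
```

If one string in the loop is decoded and the other is not, the same numeric offset points at different places in each, which would look a lot like the symptom described.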
If you're reading them from a file, make certain to tell Perl that the file is UTF-8 (or whatever) encoded. You can use an I/O layer with Perl's three-argument open() for that:
open(my $fh, '<:utf8', '/path/to/file') or die "Cannot open file: $!";
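As a rough sketch of how the replacement loop might behave once the input is decoded on read: the example below uses an in-memory file as a stand-in for a real UTF-8 file, and the stricter ':encoding(UTF-8)' layer in place of ':utf8'. The sample text and keyword are invented, and taking offsets from the built-in @- and @+ arrays (plus \Q...\E around the keyword) is a substitution for the pos() arithmetic, not the original code.

```perl
use strict;
use warnings;

# Stand-in for a UTF-8 file on disk: an in-memory file holding raw octets.
# The "i-diaeresis" in the first word is two bytes (0xC3 0xAF) before decoding.
my $octets = "na\xC3\xAFve text with key word inside\n";

open(my $fh, '<:encoding(UTF-8)', \$octets) or die "Cannot open: $!";
my $articleWorkText = <$fh>;
close($fh);

my $patternSpace = $articleWorkText;   # same decoded characters as the source text
my $kWord        = 'key word';         # hypothetical keyword

while ($articleWorkText =~ m/\b\Q$kWord\E\b/ig) {
    my $position    = $-[0];           # start of the match, in characters
    my $length      = $+[0] - $-[0];   # length of what actually matched
    my $matchedText = substr($articleWorkText, $position, $length);
    $matchedText =~ s/ /_/g;
    substr($patternSpace, $position, $length) = $matchedText;
}

binmode(STDOUT, ':encoding(UTF-8)');
print $patternSpace;
```

Because both strings hold decoded characters, the offsets reported by the regex engine and those used by substr() refer to the same positions, even with a non-ASCII character ahead of the match.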
Have a look at perluniintro and perlunicode if you haven't already.
See also the -C switch in perlrun. You can use it to specify that STDIN and/or STDOUT should be treated as UTF-8, or to make UTF-8 the default encoding for all I/O streams.
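On the command line that would be something like `perl -CS script.pl` for the standard streams, or `perl -CSD script.pl` to also set the default for other handles. A roughly comparable effect can be had from inside the script with binmode(); the sketch below assumes the stricter ':encoding(UTF-8)' layer, and the sample string is made up.

```perl
use strict;
use warnings;

# Roughly what -CS arranges from the command line, done inside the script instead.
binmode(STDIN,  ':encoding(UTF-8)');
binmode(STDOUT, ':encoding(UTF-8)');
binmode(STDERR, ':encoding(UTF-8)');

# Hypothetical sample containing one Latin-1 range character and one wide character.
my $text = "caf\x{E9} \x{263A}";

# length() counts characters here, not bytes, and the output layer encodes on the way out.
print length($text), " characters: $text\n";
```

Setting the layers in the script keeps the behaviour the same no matter how the script is invoked.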
sherm--
Cocoa programming in Perl: http://camelbones.sourceforge.net Hire me! My resume: http://www.dot-app.org