Shigeki Moro <[EMAIL PROTECTED]> writes:
> Dear subscribers,
>
> I wrote a report in Japanese concerned with the management of Devanagari
> (one of the Indic scripts) characters on Perl 5.6.
>
> http://www.ya.sakura.ne.jp/~moro/resources/indic_on_perl5.6/index.html
>
> For example, using utf8, splitting a Devanagari word 'vij~naana' into
> character semantics results in 'va + (i) + ja + (viraama) + ~na + (aa) +
> na'.
>
> It seems to me that Perl divides a combined character into the base
> character and the combining character(s), and doesn't regard a combined
> character as one character.
Yes. After all, in some cases you do want to manipulate base and
combining chars separately, which would be impossible if they
were treated as a single character.
To split into (base char + combining chars) sequences,
    split /(?=\PM)/, $string;
should work.
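[
\pM matches combining chars (Unicode general category M, "Mark")
\PM matches non-combining (base) characters
(?=...) is a zero-width lookahead, so nothing is consumed by the match.
So this says: split the string at every position where a base
character begins, keeping each base character together with the
combining characters that follow it.
]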
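A minimal sketch of the split applied to the 'vij~naana' example, assuming a Perl recent enough to handle \x{...} escapes and \PM on character strings. The code points are my own transcription of the word, and the names $word and @clusters are only illustrative.

```perl
use strict;
use warnings;

# 'vij~naana' as code points (assumed transcription):
# va, vowel sign i, ja, virama, ~na, vowel sign aa, na
my $word = "\x{0935}\x{093F}\x{091C}\x{094D}\x{091E}\x{093E}\x{0928}";

# Split at every position where a non-mark (base) character begins.
my @clusters = split /(?=\PM)/, $word;

# Print each piece as a list of code points, one piece per line.
for my $cluster (@clusters) {
    print join(' ', map { sprintf 'U+%04X', ord($_) } split //, $cluster), "\n";
}
# U+0935 U+093F
# U+091C U+094D
# U+091E U+093E
# U+0928
```

Note that this groups only base + combining sequences: the virama stays attached to ja, so the conjunct j~na still comes out as two pieces ('ja + viraama' and '~na + aa').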
Regards,
Owen