Hello. Here is a regular expression for Unicode-aware Perl (i.e. Perl 5.8.0 or later) that matches a single Default Grapheme Cluster as specified by Draft Unicode Technical Report #29, Version 3 (please see $Grapheme below).
cf. Default Grapheme Cluster Boundaries
http://www.unicode.org/reports/tr29/tr29-3.html#Regular_Expressions

#!Perl

$Any     = qr/./s;
$CRLF    = qr/(?:\cM\cJ)/;
$Control = qr/[\p{Zl}\p{Zp}\p{Cc}\p{Cf}]/;
$Extend  = qr/[\p{Mn}\p{Me}\p{OtherGraphemeExtend}]/;
$HangL   = qr/[\x{1100}-\x{115F}]/; # Hangul Jamo Leading Consonant
$HangV   = qr/[\x{1160}-\x{11A2}]/; # Hangul Jamo Vowel
$HangT   = qr/[\x{11A8}-\x{11F9}]/; # Hangul Jamo Trailing Consonant
$HangS   = qr/[\x{AC00}-\x{D7A3}]/; # Hangul Syllable
$cHangLV = join '', map sprintf("\\x{%04X}", 0xAC00 + 28*$_), 0..19*21-1;
$HangLV  = qr/[$cHangLV]/; # Hangul Syllable LV
$HangLVT = qr/(?:(?!$HangLV)$HangS)/; # Hangul Syllable LVT
$Hangul  = qr/(?:$HangL*(?:$HangLV$HangV*|$HangV+|$HangLVT)$HangT*
            | $HangL+ | $HangT+ )/x;
$Grapheme = qr/(?:$CRLF|$Control|(?:$Hangul|$Any)$Extend*)/;

=begin

My humble String::Multibyte, originally developed for multiple-byte
characters with an old, byte-oriented Perl, now copes with
multiple-character graphemes powered by the newest Unicode support of Perl.

=cut

use 5.8.0;
use String::Multibyte;

$gop = String::Multibyte->new({
    charset => 'Grapheme-Oriented Perl',
    regexp  => $Grapheme, # as above
});

print "\x{AC00}\x{11A8}:\cM\cJ:\x{3042}:A\x{300}\x{301}:\cM:\0:\x{300}"
    eq join(':' => $gop->strsplit("",
        "\x{AC00}\x{11A8}\cM\cJ\x{3042}A\x{300}\x{301}\cM\0\x{300}"))
    ? "ok" : "not ok", " 1\n";

print "\x{300}\0\cMA\x{300}\x{301}\x{3042}\cM\cJ\x{AC00}\x{11A8}"
    eq $gop->strrev(
        "\x{AC00}\x{11A8}\cM\cJ\x{3042}A\x{300}\x{301}\cM\0\x{300}")
    ? "ok" : "not ok", " 2\n";

__END__

NOTE:
"\x{AC00}\x{11A8}" is a Hangul syllable cluster.
"\cM\cJ" is CRLF, which must be treated as a single grapheme.
"A\x{300}\x{301}" is a combining character sequence.

the newest String::Multibyte
http://search.cpan.org/author/SADAHIRO/String-Multibyte-1.01/

strsplit() works like split(), but takes a literal separator string rather than a pattern.
e.g. strsplit('*', 'a*bc**xyz') returns a list ('a', 'bc', '', 'xyz').
So $gop->strsplit("", $string) splits a string into graphemes.
$gop->strrev() works like scalar(reverse()), but reverses a string grapheme-wise.

SADAHIRO Tomoyuki
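
For comparison, here is a minimal sketch of grapheme-wise splitting and reversal in core Perl alone, without String::Multibyte. It is not part of the tested script above and rests on assumptions: it uses the built-in \X escape, which on Perl 5.12 or later matches an extended grapheme cluster (a later revision of the rules than the draft implemented by $Grapheme, so results may differ on some inputs), and the helper names graphemes() and grapheme_reverse() are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: split a string into grapheme clusters with \X.
# Assumes Perl 5.12 or later, where \X follows the extended grapheme
# cluster rules rather than the draft rules used in the script above.
sub graphemes {
    my ($string) = @_;
    my @clusters = $string =~ /\X/g;   # every match of \X, in order
    return @clusters;
}

# Hypothetical helper: reverse a string cluster by cluster, so that
# combining marks stay attached to their base characters.
sub grapheme_reverse {
    my ($string) = @_;
    return join '', reverse graphemes($string);
}

# Print each cluster of a sample string as a list of code points.
my $sample = "\x{AC00}\x{11A8}\cM\cJ\x{3042}A\x{300}\x{301}";
for my $cluster (graphemes($sample)) {
    print join(' ', map { sprintf('U+%04X', ord($_)) } split //, $cluster), "\n";
}
```

Each output line lists the code points of one cluster, which makes it easy to compare the result with what $gop->strsplit("", $string) returns for the same input.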