parameter (ignore_level2) will allow it.
(However, the behavior of ignore_level2 is quite different from
the so-called caseLevel in UCA etc.)
Regards,
SADAHIRO Tomoyuki
{
+ } elsif ($to_be_pushed) {
push @subWt, [ \...@wt ];
}
}
Regards,
SADAHIRO Tomoyuki
dear all,
most probably I'm missing something quite obvious and very simple,
but I am no expert with Perl and Unicode yet.
I'm making some string replacements with Unicode::Collate
for Unicode 5.2.0.
Thank you,
SADAHIRO Tomoyuki
U.C.D. 5.0.0.
Regards,
SADAHIRO Tomoyuki
[CD]) isn't treated as the sixth letter
following epsilon.
P.S. 11 is represented by iota-alpha, not by kappa,
in the Greek numeral system.
cf. http://en.wikipedia.org/wiki/Greek_numerals
Regards,
SADAHIRO Tomoyuki
CPAN:
http://rt.cpan.org/NoAuth/Bugs.html?Dist=Encode
Regards,
SADAHIRO Tomoyuki
.
regards,
SADAHIRO Tomoyuki
{ff}]
CUR = 2
LEN = 4
where PV stands for string and \303\277 is U+00FF in UTF-8.
In UTF-EBCDIC, the output should be different.
regards,
SADAHIRO Tomoyuki
in UTF-EBCDIC too.
If you want to convert an integer to a character according to
Unicode scalar values, you can use pack('U'), but not chr().
For example, pack('U', 0xFF) should correspond to U+00FF
(y with diaeresis), everywhere (both on ASCII and on EBCDIC).
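A minimal sketch of the pack('U') approach described above. The variable names are only illustrative, and the printed line is what an ASCII-based platform gives.

```perl
use strict;
use warnings;

# pack('U', N) builds the character whose Unicode scalar value is N,
# independent of the platform's native character set.
my $y_diaeresis = pack('U', 0xFF);

# unpack('U', ...) gives back the Unicode scalar value.
my $code_point = unpack('U', $y_diaeresis);

printf "U+%04X, length %d\n", $code_point, length $y_diaeresis;
# prints "U+00FF, length 1"
```

By contrast, chr(0xFF) takes its argument as a native code point, so it is only guaranteed to give the same character on ASCII-based platforms.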
Regards,
SADAHIRO Tomoyuki
Hi
in the case of EBCDIC.
Sastry, would you please run the following snippet on your EBCDIC machine?
($a = "\x89\x8a\x8b\x8c\x8d\x8f\x90\x91") =~ s/[\x89-\x91]/X/g;
is($a, );
Does that work similarly to yours?
($a = "\x89\x8a\x8b\x8c\x8d\x8f\x90\x91") =~ tr/\x89-\x91/X/;
is($a, );
Regards,
SADAHIRO
subroutine is defined or not.
So goto() causes AUTOLOAD to be called recursively and infinitely.
The following patch changes utf8.pm and utf8.t.
Regards,
SADAHIRO Tomoyuki
diff -ur perl~/lib/bytes.pm perl/lib/bytes.pm
--- perl~/lib/bytes.pm Wed Sep 03 18:39:15 2003
+++ perl/lib/bytes.pm Thu May 26 23
::Collate::Locale::_locale() does the same thing
as canonical_name(), but that function is internal and not public.
Regards,
SADAHIRO Tomoyuki
.
It may be enhanced sooner or later...
[prerelease] This will be released *after* Perl 5.8.4 (or its RC) is out.
http://homepage1.nifty.com/nomenclator/perl/Unicode-Collate-0.40.tar.gz
regards,
SADAHIRO Tomoyuki
TO) with '=' (EQUALS SIGN)
since a mathematical negation slash is encoded by U+0338
COMBINING LONG SOLIDUS OVERLAY, which is to be removed.
sub remove_accent {
use Unicode::Normalize;
my $s = NFD(shift);
$s =~ s/\pM//g;
return $s;
}
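A short usage sketch for the sub above, with the sub repeated so that the example is self-contained. The sample characters are my own choice, not from the original message.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Same approach as remove_accent above: decompose, then drop all marks.
sub remove_accent {
    my $s = NFD(shift);
    $s =~ s/\pM//g;
    return $s;
}

my $e_acute   = remove_accent("\x{E9}");    # LATIN SMALL LETTER E WITH ACUTE
my $not_equal = remove_accent("\x{2260}");  # NOT EQUAL TO
my $o_stroke  = remove_accent("\x{D8}");    # LATIN CAPITAL LETTER O WITH STROKE

# $e_acute is "e" and $not_equal is "=", because both characters have a
# canonical decomposition ending in a combining mark.
# $o_stroke is unchanged: U+00D8 has no canonical decomposition.
```

Characters without a canonical decomposition, such as O with stroke, pass through untouched, so this approach alone does not map them to a base letter.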
Regards,
SADAHIRO Tomoyuki
removal is provisional and its definition has not
been specified yet, but I suppose it will have mappings such as Ø to O, etc.
Regards,
SADAHIRO Tomoyuki
know whether Perl
supports multibyte file/path names or not.) in a Japanese Perlers'
mail list.
http://www.freeml.com/message/[EMAIL PROTECTED]/0004467 (in Japanese)
Here is a brief summary (in Japanese).
http://homepage1.nifty.com/nomenclator/perl/shiftjis.htm#file
Regards,
SADAHIRO Tomoyuki
On Tue, 23 Dec 2003 09:47:32 +0900
SADAHIRO Tomoyuki [EMAIL PROTECTED] wrote:
I had talked on this problem (well, I don't know whether Perl
supports multibyte file/path names or not.) in a Japanese Perlers'
mail list.
http://www.freeml.com/message/[EMAIL PROTECTED]/0004467 (in Japanese
(Perl native)
http://search.cpan.org/~pne/Lingua-Klingon-Collate-1.01/
Lingua::JA::Sort::JIS, Japanese, UTF-8
http://search.cpan.org/~sadahiro/Lingua-JA-Sort-JIS-0.04/
ShiftJIS::Collate, Japanese, Shift-JIS
http://search.cpan.org/~sadahiro/ShiftJIS-Collate-1.02/
Regards,
SADAHIRO
in advance for any insight or pointers you can contribute.
Regards,
--
Eric Cholet
Regards,
SADAHIRO Tomoyuki
] # CHOSEONG NIEUN
1113 ; [.18A2.0020.0002][.18A1.0020.0002] # CHOSEONG NIEUN-KIYEOK
Regards,
SADAHIRO Tomoyuki
-MacKorean.html
Regards,
SADAHIRO Tomoyuki
On Wed, 24 Sep 2003 12:12:37 +0200
udo [EMAIL PROTECTED] wrote:

hello you,

excuse me, please help!

precondition:
* linux redhat 9.0
* perl, v5.8.0
* dbms sybase 11.9.x (support iso-8859-1 )
* string contains german special chars "...
above + circumflex, o + dot above + grave,
and o + dot above + macron.
SADAHIRO Tomoyuki
-unicode/2003-04/msg00028.html
[3] http://www.unicode.org/Public/UNIDATA/HangulSyllableType.txt
[4] http://std.dkuug.dk/JTC1/SC22/WG20/docs/N954.PDF (full);
http://std.dkuug.dk/JTC1/SC22/WG20/docs/N953.PDF (summary)
regards,
SADAHIRO Tomoyuki
is always neglected.
- Fixed: according to S2.1 in UTS #10, a blocked combining character
should not be contracted.
One test in test.t was wrong, so it was removed.
- Added contract.t.
- (normalization => "prenormalized") can now be used.
Regards,
SADAHIRO Tomoyuki
.nifty.com/nomenclator/perl/Encode-UnicodeNormalization.html
Regards,
SADAHIRO Tomoyuki
Hello.
Well, Unicode::Normalize 0.23 is released.
http://search.cpan.org/author/SADAHIRO/Unicode-Normalize-0.23/
The documentation has also been revised in places.
Thank you,
SADAHIRO Tomoyuki
Call ID Numbers (via RT) [EMAIL PROTECTED] writes:
This is the closest I've been able to come
) {
#End of Patch
SADAHIRO Tomoyuki
. :-)
1946..194F; Nd # [10] LIMBU DIGIT ZERO..LIMBU DIGIT NINE
104A0..104A9 ; Nd # [10] OSMANYA DIGIT ZERO..OSMANYA DIGIT NINE
SADAHIRO Tomoyuki
On Sat, 14 Jun 2003 22:05:24 +0900
SADAHIRO Tomoyuki [EMAIL PROTECTED] wrote:
I write a module that parses a character class
including grouping, intersection, union, and removal (subtraction),
according to Unicode Regular Expression (e.g. [A B], [A-Z - XYZ])
and converts it into a regular
/unicode/reports/tr18/
Thank you,
SADAHIRO Tomoyuki
true, Unicode code points (less than 256)
should be translated into native code points for EBCDIC;
(3) else, any further work should be abandoned.
If a test fails on EBCDIC,
the code might be broken,
or the test itself might be broken.
Thank you,
SADAHIRO Tomoyuki
On Thu, 27 Mar 2003 10:02:28 +0900
Dan Kogai [EMAIL PROTECTED] wrote:
SADAHIRO-san and cp9?? experts,
On Thursday, Mar 27, 2003, at 00:44 Asia/Tokyo, SADAHIRO Tomoyuki wrote:
+U20AC \x80 |0 # EURO SIGN
Is this right? Yes, U20AC is indeed missing from cp936.ucm but see
this;
(snip
://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP936.TXT
Regards,
SADAHIRO Tomoyuki
|0 # DEGREE FAHRENHEIT
End of patch
Sigh, I made such a patch long ago.
cf.
http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2001-09/msg01568.html
Regards,
SADAHIRO Tomoyuki
\xB1 |0
U00A8 \xC6\xD8 |0
U00AF \xA1\xC2 |0
@@ -171,7 +164,6 @@
U00F9 \x88\x7B |0
U00FA \x88\x79 |0
U00FC \x88\xA2 |0
-U00FF \xFF |0 # LATIN SMALL LETTER Y WITH DIAERESIS
U0100 \x88\x56 |0
U0101 \x88\x67 |0
U0112 \x88\x5A |0
Regards,
SADAHIRO Tomoyuki
I often encounter lower-ascii codes
/bp581/lib/Encode.pm line 156.
The message is not 'big5-eten \x88\x71 does not map to Unicode..',
of course (big5-eten.ucm does not define \x88\x71
as a double-byte char), that may be what is expected, though.
Regards,
SADAHIRO Tomoyuki
On Tue, 25 Mar 2003 21:53:13 +0900
SADAHIRO Tomoyuki [EMAIL
due to the appearance of UDCs.
SADAHIRO Tomoyuki
On Fri, 21 Mar 2003 10:52:07 -0500
Mark Lewellen [EMAIL PROTECTED] wrote:
Hi-
I'm looking for recommendations on how to warn about and record
problems
with ill-formed data. Specifically, I'm reading in Big5 data from
multiple files
SADAHIRO Tomoyuki [EMAIL PROTECTED] said:
P.S. Another problem. How can it be determined whether that
user-defined character (UDC hereafter) is single-byte or double-byte?
The file big5-eten.ucm does not say how to determine the character
length in bytes for an unmapped UDC
})
? ok : not ok, 28\n;
### TESTS END
And many thanks for discovering the STRLEN-UV type mismatches, too.
SADAHIRO Tomoyuki
Thank you for your report.
I was careless about the trap that, on a non-ASCII platform,
("a" eq "\x61") is not true.
So the failed tests are fixed, and some tests
-0.20.tar.gz
http://homepage1.nifty.com/nomenclator/perl/Unicode-Transform.html
SADAHIRO Tomoyuki
I have run the Unicode-Transform module using perl 5.8.0 on z/OS,
where perl's internal Unicode format is UTF-EBCDIC, not UTF-8. The test
results are as follows:
/defects/brian/unicode
and will be also distributed from CPAN soon.
SADAHIRO Tomoyuki
)
specifications:
RFC 2045, 6.7
http://www.ietf.org/rfc/rfc2045.txt?number=2045
perl module:
MIME::QuotedPrint
http://search.cpan.org/author/GAAS/MIME-Base64-2.16/
SADAHIRO Tomoyuki
://www.vietstd.org/document/unicode.html
SADAHIRO Tomoyuki
in the version 2.4 nor in CVS
( http://oss.software.ibm.com/cvs/icu/charset/data/ ).
SADAHIRO Tomoyuki
WITH DOT BELOW
regards,
SADAHIRO Tomoyuki
A WITH CIRCUMFLEX AND HOOK ABOVE
U1EAD \xA7 |0 # LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW
U1EBD \xA8 |0 # LATIN SMALL LETTER E WITH TILDE
U1EB9 \xA9 |0 # LATIN SMALL LETTER E WITH DOT BELOW
regards,
SADAHIRO Tomoyuki
/perl/Lingua-FA-MacFarsi-0.02.tar.gz
http://homepage1.nifty.com/nomenclator/perl/Lingua-FA-MacFarsi.html
Lingua::HE::MacHebrew
http://homepage1.nifty.com/nomenclator/perl/Lingua-HE-MacHebrew-0.02.tar.gz
http://homepage1.nifty.com/nomenclator/perl/Lingua-HE-MacHebrew.html
SADAHIRO Tomoyuki
.nifty.com/nomenclator/perl/Lingua-AR-MacArabic-0.01.tar.gz
HTML-ized POD
http://homepage1.nifty.com/nomenclator/perl/Lingua-AR-MacArabic.html
SADAHIRO Tomoyuki
if something is wrong.
(at least, the version here doesn't support
embedding or nesting of direction.)
SADAHIRO Tomoyuki
Lingua-AR-MacArabic-0.00.tar.gz
Description: Binary data
'. It is 'dead'. -- Jack Cohen
SADAHIRO Tomoyuki
SADAHIRO Tomoyuki
transliteration
of Japanese, called ROMAJI, as long i. (Long i is usually
represented by ii or i-macron, though.)
If i-circumflex might be dotless-i with circumflex,
but not i with circumflex, i-circumflex should be
a long sound of dotless i, but not long i.
That is also surprising. :)
Regards,
SADAHIRO
Cruces, NM 88003
Regards,
SADAHIRO Tomoyuki
method ->change()
to change some tailoring parameters of the collator
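A minimal sketch of changing a tailoring parameter on an existing collator, assuming the Unicode::Collate module and its level parameter. The strings compared are only illustrative.

```perl
use strict;
use warnings;
use Unicode::Collate;

# At level 1 only primary differences count, so case is ignored.
my $collator = Unicode::Collate->new(level => 1);
my $eq_at_level1 = $collator->eq("perl", "PERL") ? 1 : 0;

# change() alters the tailoring of the same collator object.
$collator->change(level => 3);
my $eq_at_level3 = $collator->eq("perl", "PERL") ? 1 : 0;

print "level 1: $eq_at_level1, level 3: $eq_at_level3\n";
# prints "level 1: 1, level 3: 0"
```

This avoids constructing a second collator when only one parameter differs between two comparisons.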
Regards,
SADAHIRO Tomoyuki
Unicode intro that will be part of
the Perl 5.8.0 release, I have a copy online for easy access:
http://www.iki.fi/jhi/perluniintro.pod
Regards,
SADAHIRO Tomoyuki
return "ISO_8859_2";
}
elsif ($string !~ /\p{^InShift_JIS}/) {
return "Shift_JIS";
}
# Trial more ? Well, then add something.
# There is room to tune up in the order of trials.
return "Unicode"; # abandoned
}
__END__
Regards.
SADAHIRO Tomoyuki
of string) at bleedperl.pl
not ok
ok
ok
Regards,
SADAHIRO Tomoyuki
Hello.
I've prepared a Shift_JISX0213 to Unicode 3.2.0 mapping table...
http://homepage1.nifty.com/nomenclator/unicode/sjis0213.zip
But I think it'd be too early to implement anything on it.
Regards,
SADAHIRO Tomoyuki
-unicode/2002-03/msg00076.html
Dan the Encode Maintainer
Regards,
SADAHIRO Tomoyuki
).
Some functions are slower when working on UTF-8 encoded strings than
on byte encoded strings. All functions that need to hop over
Another possibility is, of course, that the demonstrated behaviour is
a vanilla bug and gets fixed before 5.8.0. :-/
--
andreas
Sincerely
SADAHIRO
On Thu, 21 Mar 2002 22:12:33 +0900
SADAHIRO Tomoyuki [EMAIL PROTECTED] wrote:
Nevertheless, we shouldn't distinguish Unicode-ness of hash keys;
otherwise we'd be upset more... :-)
#!perl
use charnames qw(:full);
my $alpha = "\N{GREEK SMALL LETTER ALPHA}";
# \x{3B1} = \xCE\xB1 UTF8
/sufficient/efficient/;
$NFC_string = NFC($string);
enjoy.
SADAHIRO Tomoyuki
Regards,
SADAHIRO Tomoyuki
\xA2\xA4\xA4\xA4\xA6\xA4\xA8\xA4\xAA ],
- [euc-jp-0212, han_kana, \x8E\xB1\x8E\xB2\x8E\xB3\x8E\xB4\x8E\xB5 ],
- [euc-jp-0212, macron,
- \x8F\xAA\xA7\x8F\xAA\xB7\x8F\xAA\xC5\x8F\xAA\xD7\x8F\xAA\xE9 ],
);
plan test => $n*@encodings + $n*@encodings*@greek
#End of Patch
sincerely,
SADAHIRO
';
my $class = ref($obj) . '::' . $subclass;
# carp Loading $file;
bless $obj,$class;
@@ -132,7 +131,6 @@
require Encode::Tcl::Table;
require Encode::Tcl::Escape;
require Encode::Tcl::Extended;
-require Encode::Tcl::HanZi;
1;
__END__
#End of Patch
Regards,
SADAHIRO Tomoyuki
://www.cl.cam.ac.uk/~mgk25/
Regards,
SADAHIRO Tomoyuki
;96FBN;
+F951;CJK COMPATIBILITY IDEOGRAPH-F951;Lo;0;L;964BN;
F952;CJK COMPATIBILITY IDEOGRAPH-F952;Lo;0;L;52D2N;
F953;CJK COMPATIBILITY IDEOGRAPH-F953;Lo;0;L;808BN;
F954;CJK COMPATIBILITY IDEOGRAPH-F954;Lo;0;L;51DCN;
End of patch
Regards,
SADAHIRO Tomoyuki
.
$str = pack 'U*', unpack 'U0U*', $str;
If you have some interest in Perl 5.7 (or later),
try the utf8::decode() function.
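A small sketch of the utf8::decode() route, assuming Perl 5.8 or later on an ASCII-based platform. The byte string is an illustrative value, not one from the original message.

```perl
use strict;
use warnings;

# Two bytes that form U+00FF when read as UTF-8.
my $str = "\xC3\xBF";
my $length_before = length $str;    # counted as bytes

# utf8::decode() converts the string in place and returns true
# if the bytes were well-formed UTF-8.
my $decoded_ok = utf8::decode($str);
my $length_after = length $str;     # counted as characters

printf "%d -> %d, U+%04X\n", $length_before, $length_after, ord $str;
# prints "2 -> 1, U+00FF"
```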
There are some other ways to flag a string as UTF8,
but one needs XS code;
one doesn't work on later versions
Regards, SADAHIRO Tomoyuki
-defined algorithms.
- maybe unfamiliar (not only to western people)
in comparison with LATIN, GREEK, HAN, etc. scripts.
would it be better to gather those functions into one module?
-
Regards, SADAHIRO Tomoyuki
On Mon, 13 Aug 2001 10:07:42 -0500
Jarkko Hietaniemi [EMAIL PROTECTED] wrote:
On Mon, Aug 13, 2001 at 10:35:32PM +0900, SADAHIRO Tomoyuki wrote:
Hello, everyone.
Sort::UCA 0.04 has been uploaded on CPAN.
snip
To Do: conformance tests of Unicode 3.1.1 Beta
(at present it's DRAFT
On Thu, 09 Aug 2001 22:30:16 +0200
Bjoern Hoehrmann [EMAIL PROTECTED] wrote:
* SADAHIRO Tomoyuki wrote:
How about the following interface?
| $normalized_string = normalize($raw_string)
|
| You can use this function only if the normalization form
| you require is specified in the Cuse
no Text::Unicode tree on
CPAN but there is a Unicode:: tree and it fits quite well
there. The normalize function increases readability and
looks nicer.
Can we expect that Perl will normalize everything it
outputs by itself, or will we have to use this module?
Regards, SADAHIRO Tomoyuki
::Util.pm (available via CPAN)
It also runs on Perl 5.6,
even if unicode/*.* are for unicode 3.0.1.
But NormalizationTest of unicode 3.1 requires
those for unicode 3.1.0 in the distribution of Perl 5.7.2.
Regards, SADAHIRO Tomoyuki
into Word, but all my other programs receive question marks).
How about Outlook Express?
Another choice is copying into a TEXTAREA field
of a FORM in the browser followed by writing it
in a file via CGI on your local machine.
Regards, SADAHIRO Tomoyuki
E-mail: [EMAIL PROTECTED]
$result = $uca->cmp($a, $b); # returns 1, 0, or -1.
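A rough sketch of how such a cmp method behaves. It uses Unicode::Collate as a stand-in for the alpha module described here, which is an assumption on my part, and the sample strings are illustrative.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $collator = Unicode::Collate->new;

# Collation order puts "a" before "B", unlike code point order.
my $by_collation  = $collator->cmp("a", "B");
my $by_code_point = "a" cmp "B";

# The same comparison drives sorting.
my @sorted = $collator->sort("banana", "Apple", "cherry");

print "$by_collation $by_code_point @sorted\n";
# prints "-1 1 Apple banana cherry"
```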
SEE ALSO
http://www.unicode.org/unicode/reports/tr10/
But this is an alpha version.
Any feature (including module name) may be changed.
Please comment on it.
regards, SADAHIRO Tomoyuki
passing a character outside Hangul syllable in
shouldn't be carped or croaked,
since it supposes the return value would be *always* checked.
regards,
SADAHIRO Tomoyuki
E-mail: [EMAIL PROTECTED]
, SADAHIRO Tomoyuki
)
parseHangulName($name, Short | Medium | Long)
* Short allows it to accept names like GA.
Medium, like HANGUL SYLLABLE GA. (or Default?)
Long, like HANGUL SYLLABLE KIYEOK-A.
The mode may be used in getHangulName.
BTW, are there any formats other than the above three?
Regards, SADAHIRO Tomoyuki
,
SADAHIRO Tomoyuki
E-mail: [EMAIL PROTECTED]
URL: http://homepage1.nifty.com/nomenclator/perl/
'
for 'shiftjis' and 'cp932' work.
(I found mapping of shiftjis.enc differs from
that of Unicode's EASTASIA/JIS/SHIFTJIS.txt
by 3 codepoints.)
Regards,
SADAHIRO Tomoyuki
E-mail: [EMAIL PROTECTED]
URL: http://homepage1.nifty.com/nomenclator/