Re: How to use Unicode::Collate in multilinguage apps?
Sadahiro Tomoyuki wrote:
> On Mon, 29 Mar 2004 23:44:00 +0100
> Rich <[EMAIL PROTECTED]> wrote:
>
>> Using the multi-lingual server scenario I was initially discussing, would
>> one of the following usages be correct (yes, it's just pseudocode and
>> exists in a world where no errors ever occur!):
>
> Though I have not worked with any multitasking application,
> I suppose a possible snag is the size of DUCET (the file named
> allkeys.txt), which would make construction of a collator slow
> and its storage memory-hungry.

Yes, the size of allkeys.txt is an issue - I did a Data dump of a
Unicode::Collate instance and it's pretty big!

>> 1)
>>
>> my %collators;
>>
>> for ( $server_loop )
>> {
>>     my $lang_tag = Server->requested_lang_tag;
>>
>>     my $collator = $collators{$lang_tag}
>>         ||= Unicode::Collate::Locale->new(locale => $lang_tag);
>>
>>     ...
>> }
>
> 1) creates a new collator whenever the $lang_tag value is new.
> Say the old one was 'en' (English) and the new one is 'it'
> (Italian): Unicode::Collate::Locale->new will return a default collator
> each time. I.e. $collators{en} and $collators{it} behave the same, but
> their memory is not shared.

Good point!

> When Unicode::Collate->new is called, all the data generated by parsing
> a table file are stored in the collator, which is a blessed hash.
> My reasoning was that if (part of) the newly created data were stored
> elsewhere, say in a cache in the package namespace (e.g. something
> like %Unicode::Collate::Cache), users outside the package might have
> trouble managing the memory held by that cache.
>
> I think perhaps a user needs a way to determine whether two (or more)
> $lang_tag values create the same collator or not.
>
>     my $lang_tag = Server->requested_lang_tag;
>     my $canonical = Unicode::Collate::Locale::canonical_name($lang_tag);
>
>     # If $canonical is the same as an old one, the collator for it should
>     # be the same. After seeing if $canonical is new, a collator can be
>     # created.
>     # The function name leaves room for reconsideration.

Yes, makes sense, but I'm starting to wonder if Unicode::Collate is too
heavyweight a solution. Perhaps something based around Sort::ArbBiLex might
produce good enough results for most languages.

Thanks for the reply
--
Rich
[EMAIL PROTECTED]
Re: How to use Unicode::Collate in multilinguage apps?
On Mon, 29 Mar 2004 23:44:00 +0100
Rich <[EMAIL PROTECTED]> wrote:

> I now realise that some per-language tailoring would be needed for sensible
> results. Unicode::Collate::Locale seems like the kind of thing I was
> looking for, and any tailoring is better than none :)
>
> Using the multi-lingual server scenario I was initially discussing, would
> one of the following usages be correct (yes, it's just pseudocode and
> exists in a world where no errors ever occur!):

Though I have not worked with any multitasking application,
I suppose a possible snag is the size of DUCET (the file named
allkeys.txt), which would make construction of a collator slow
and its storage memory-hungry.

> 1)
>
> my %collators;
>
> for ( $server_loop )
> {
>     my $lang_tag = Server->requested_lang_tag;
>
>     my $collator = $collators{$lang_tag}
>         ||= Unicode::Collate::Locale->new(locale => $lang_tag);
>
>     ...
> }

1) creates a new collator whenever the $lang_tag value is new.
Say the old one was 'en' (English) and the new one is 'it'
(Italian): Unicode::Collate::Locale->new will return a default collator
each time. I.e. $collators{en} and $collators{it} behave the same, but
their memory is not shared.

When Unicode::Collate->new is called, all the data generated by parsing
a table file are stored in the collator, which is a blessed hash.
My reasoning was that if (part of) the newly created data were stored
elsewhere, say in a cache in the package namespace (e.g. something
like %Unicode::Collate::Cache), users outside the package might have
trouble managing the memory held by that cache.

I think perhaps a user needs a way to determine whether two (or more)
$lang_tag values create the same collator or not.

    my $lang_tag = Server->requested_lang_tag;
    my $canonical = Unicode::Collate::Locale::canonical_name($lang_tag);

    # If $canonical is the same as an old one, the collator for it should
    # be the same. After seeing if $canonical is new, a collator can be
    # created.
    # The function name leaves room for reconsideration.

At present Unicode::Collate::Locale::_locale() does the same thing as
canonical_name() would, but that function is internal and not public.

Regards,
SADAHIRO Tomoyuki
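A minimal sketch of the caching idea described above, under stated
assumptions: it relies on a released Unicode::Collate::Locale that provides
the getlocale method (which reports the locale name actually used, or
'default' when no tailoring exists) rather than on a public
canonical_name(). The names collator_for, %locale_for_tag and
%collator_for_locale are hypothetical. A throwaway collator is still built
once per new tag, but tags that resolve to the same locale end up sharing a
single stored object.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

my %locale_for_tag;        # requested tag        => resolved locale name
my %collator_for_locale;   # resolved locale name => shared collator

# Hypothetical helper: return a collator for $lang_tag, reusing an
# existing one when the tag resolves to an already-seen locale.
sub collator_for {
    my ($lang_tag) = @_;

    my $name = $locale_for_tag{$lang_tag};
    unless (defined $name) {
        # Built once per new tag; dropped if its locale is already cached.
        my $candidate = Unicode::Collate::Locale->new(locale => $lang_tag);
        $name = $locale_for_tag{$lang_tag} = $candidate->getlocale;
        $collator_for_locale{$name} ||= $candidate;
    }
    return $collator_for_locale{$name};
}

my $first  = collator_for('sv');
my $second = collator_for('sv_SE');   # assumed to fall back to 'sv'
print "shared\n" if $first == $second;
```

The construction cost per new tag remains, so this only addresses the
memory duplication, not the slowness of building a collator.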
Re: How to use Unicode::Collate in multilinguage apps?
Sadahiro Tomoyuki wrote:
> I have written Unicode::Collate::Locale (tentatively) for linguistic
> tailoring of UCA. To use it, Unicode::Collate must be able to find
> allkeys.txt in any directory in @INC (at present it searches for table
> files only under the directory where it is installed).
> So Unicode::Collate::Locale requires Unicode::Collate 0.40 or later,
> which is not released yet, but a prerelease is available as shown below.
>
> [tarball]
> http://homepage1.nifty.com/nomenclator/perl/Unicode-Collate-Locale-0.01.tar.gz
> [doc]
> http://homepage1.nifty.com/nomenclator/perl/Unicode-Collate-Locale.html
>   Sorry, tailoring for only a few languages is implemented so far.
>   It may be extended sooner or later...
>
> [prerelease] This will be released *after* Perl 5.8.4 (or its RC) is out.
> http://homepage1.nifty.com/nomenclator/perl/Unicode-Collate-0.40.tar.gz

Thank you and Jarkko for your replies.

I now realise that some per-language tailoring would be needed for sensible
results. Unicode::Collate::Locale seems like the kind of thing I was
looking for, and any tailoring is better than none :)

Using the multi-lingual server scenario I was initially discussing, would
one of the following usages be correct (yes, it's just pseudocode and
exists in a world where no errors ever occur!):

1)

my %collators;

for ( $server_loop )
{
    my $lang_tag = Server->requested_lang_tag;

    my $collator = $collators{$lang_tag}
        ||= Unicode::Collate::Locale->new(locale => $lang_tag);

    ...
}

2)

my $prev_lang = '';
my $collator;

for ( $server_loop )
{
    my $lang_tag = Server->requested_lang_tag;

    unless ( $lang_tag eq $prev_lang )
    {
        $prev_lang = $lang_tag;
        $collator  = Unicode::Collate::Locale->new(locale => $lang_tag);
    }

    ...
}

Which would be the preferred way of handling this (or are both wrong)?

Again, thanks for your replies.
--
Rich
[EMAIL PROTECTED]
Re: How to use Unicode::Collate in multilinguage apps?
> I think, for a script representing usually one language,
> allkeys.txt defines fairly acceptable collation order.
> For example, order of hiragana and katakana is approximately
> compliant with the custom of the Japanese language.
>
> In contrast, for a script representing many languages
> (say, the Latin script), tailoring may be often necessary.
>
> E.g. 'Ä' is sorted as A-umlaut (sometimes as 'AE') in German,
> and as one of additional letters ordered after 'Z' in some
> northern European languages.

Yup, that is the case in Finnish and Swedish, and Danish and Norwegian
do similar things with their "a" and "o" equivalents. This means it
is logically impossible to sort a list containing both German and
Swedish names "right". Many European languages sort some consonant+h
after the base consonant as a separate "letter", and so forth. And I
believe many of the CJK languages have in fact several (and differing)
customary sorting orders.

Even when staying within a single language one must decide whether
one does things like "dictionary sorting" (spaces etc. removed),
how lowercase and uppercase sort (A < B < a, A < a < B, a < A < B,
or a == A < B), what one does with things like articles, etc.

So one must always either accept "a good enough" sorting, or one must
customize more or less heavily.

> But according to Unicode default collation, 'Ä' is ordered
> as a modified 'A' and equal to 'A' at the primary level.

--
Jarkko Hietaniemi <[EMAIL PROTECTED]> http://www.iki.fi/jhi/
"There is this special biologist word we use for 'stable'. It is 'dead'."
-- Jack Cohen
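To illustrate the 'Ä' example discussed above, here is a small sketch
comparing untailored collation with the Swedish tailoring. It assumes a
released Unicode::Collate::Locale that accepts the locale name 'sv'; the
sample names are made up for illustration.

```perl
use strict;
use warnings;
use Unicode::Collate;
use Unicode::Collate::Locale;

# "\x{C4}" is LATIN CAPITAL LETTER A WITH DIAERESIS, written as an escape
# so the example does not depend on the source file encoding.
my @names = ("Zorn", "\x{C4}kesson", "Adler");

my $untailored = Unicode::Collate->new;                         # default table
my $swedish    = Unicode::Collate::Locale->new(locale => 'sv'); # Swedish rules

# Untailored: A-diaeresis sorts alongside plain 'A'.
my @default_order = $untailored->sort(@names);

# Swedish: A-diaeresis is a separate letter ordered after 'Z'.
my @swedish_order = $swedish->sort(@names);
```

The same list therefore has two different "right" orders, which is why a
single collator cannot serve every language that shares the Latin script.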
Re: How to use Unicode::Collate in multilinguage apps?
On Thu, 25 Mar 2004 22:29:08 +
Rich <[EMAIL PROTECTED]> wrote:

> Hello
>
> How should collation be handled in multitasking, multilingual applications -
> in particular forking servers such as apache/mod_perl based web apps?
>
> I can assume the following:
>
> 1) I'll know the preferred language via an RFC 2616 language tag.
> 2) All data will be utf8 encoded Unicode.
> 3) The required language may differ for each request.
>
> I guess Unicode::Collate is the way to go, so can I simply have one
> Unicode::Collate instance per process using the default allkeys.txt table
> file?
>
> Will that give sensible results for most (all?) languages, or do I need to
> customise the collator on the fly when more 'exotic' (for want of a better
> word) languages are requested? Are there other reasons, such as size and/or
> performance issues, why the default allkeys.txt file may not be the way to
> go?

I think that for a script that usually represents one language,
allkeys.txt defines a fairly acceptable collation order.
For example, the order of hiragana and katakana approximately
follows the custom of the Japanese language.

In contrast, for a script representing many languages
(say, the Latin script), tailoring is often necessary.

E.g. 'Ä' is sorted as A-umlaut (sometimes as 'AE') in German,
and as one of the additional letters ordered after 'Z' in some
northern European languages.

But according to the Unicode default collation, 'Ä' is ordered
as a modified 'A' and is equal to 'A' at the primary level.

> I must stress that I'm ok with most aspects of i18n/l10n - it's specifically
> the correct use of Unicode::Collate in multitasking apps that I'm
> interested in.
>
> Suggestions would be welcome - even more so if they don't involve having to
> know the TR10 docs inside out!

I have written Unicode::Collate::Locale (tentatively) for linguistic
tailoring of UCA. To use it, Unicode::Collate must be able to find
allkeys.txt in any directory in @INC (at present it searches for table
files only under the directory where it is installed).
So Unicode::Collate::Locale requires Unicode::Collate 0.40 or later,
which is not released yet, but a prerelease is available as shown below.

[tarball]
http://homepage1.nifty.com/nomenclator/perl/Unicode-Collate-Locale-0.01.tar.gz
[doc]
http://homepage1.nifty.com/nomenclator/perl/Unicode-Collate-Locale.html
  Sorry, tailoring for only a few languages is implemented so far.
  It may be extended sooner or later...

[prerelease] This will be released *after* Perl 5.8.4 (or its RC) is out.
http://homepage1.nifty.com/nomenclator/perl/Unicode-Collate-0.40.tar.gz

regards,
SADAHIRO Tomoyuki
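The remark above that 'Ä' is equal to 'A' at the primary level can be
checked with a short sketch. It uses the level option and the eq and cmp
methods of Unicode::Collate; the variable names are only illustrative.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $a_umlaut = "\x{C4}";   # LATIN CAPITAL LETTER A WITH DIAERESIS

# level => 1 compares primary weights only, i.e. base letters.
my $primary = Unicode::Collate->new(level => 1);

# The default collator uses all levels, so the diaeresis breaks the tie.
my $full = Unicode::Collate->new;

my $same_at_primary = $primary->eq('A', $a_umlaut);   # true: same base letter
my $order_at_full   = $full->cmp('A', $a_umlaut);     # -1: 'A' sorts first
```

A language that treats 'Ä' as a separate letter needs a tailoring on top of
this default behaviour.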
How to use Unicode::Collate in multilinguage apps?
Hello

How should collation be handled in multitasking, multilingual applications -
in particular forking servers such as apache/mod_perl based web apps?

I can assume the following:

1) I'll know the preferred language via an RFC 2616 language tag.
2) All data will be utf8 encoded Unicode.
3) The required language may differ for each request.

I guess Unicode::Collate is the way to go, so can I simply have one
Unicode::Collate instance per process using the default allkeys.txt table
file?

Will that give sensible results for most (all?) languages, or do I need to
customise the collator on the fly when more 'exotic' (for want of a better
word) languages are requested? Are there other reasons, such as size and/or
performance issues, why the default allkeys.txt file may not be the way to
go?

I must stress that I'm ok with most aspects of i18n/l10n - it's specifically
the correct use of Unicode::Collate in multitasking apps that I'm
interested in.

Suggestions would be welcome - even more so if they don't involve having to
know the TR10 docs inside out!

Cheers,
--
Rich
[EMAIL PROTECTED]