Re: How to use Unicode::Collate in multilinguage apps?

2004-03-31 Thread Rich
Sadahiro Tomoyuki wrote:

> On Mon, 29 Mar 2004 23:44:00 +0100
> Rich <[EMAIL PROTECTED]> wrote:
> 
>> Using the multi-lingual server scenario I was initially discussing, would
>> one of the following usages be correct (yes, it's just pseudocode and
>> exists in a world where no errors ever occur!):
> 
> Though I have not worked with any multitasking application,
> I suppose a possible snag is the size of DUCET (the file named
> allkeys.txt), which would make construction of a collator slow
> and its storage memory-hungry.

Yes, the size of allkeys.txt is an issue - I did a Data dump of a
Unicode::Collate instance and it's pretty big!

>> 1)
>> 
>>   my %collators;
>> 
>>   for ( $server_loop )
>>   {
>>     my $lang_tag = Server->requested_lang_tag;
>> 
>>     my $collator = $collators{$lang_tag}
>>       ||= Unicode::Collate::Locale->new(locale => $lang_tag);
>> 
>>     ...
>>   }
> 
> 1) creates a new collator whenever the $lang_tag value is new.
> Say the old one was 'en' (English) and the new one is 'it'
> (Italian): Unicode::Collate::Locale->new will return a default collator
> each time. I.e. $collators{en} and $collators{it} behave identically,
> but their memory is not shared.

Good point!

> When Unicode::Collate->new is called, all the data generated by parsing
> a table file are stored in the collator, which is a blessed hash.
> The reason, as I thought, is that if (a part of) the newly created data
> were stored elsewhere, say, in a cache in the package namespace
> (e.g. something like %Unicode::Collate::Cache), users outside the
> package might have trouble managing the memory held in that cache.
> 
> I think perhaps a user needs to be able to determine
> whether two (or more) $lang_tag values create the same collator or not.
> 
> my $lang_tag = Server->requested_lang_tag;
> my $canonical = Unicode::Collate::Locale::canonical_name($lang_tag);
> 
> # If $canonical is the same as an earlier one, the collator for it should
> # be the same. Once $canonical is known to be new, a collator can be created.
> # The function name leaves room for reconsideration.

Yes, makes sense, but I'm starting to wonder if Unicode::Collate is too
heavyweight a solution. Perhaps something based around Sort::ArbBiLex might
produce good enough results for most languages.

Thanks for the reply
-- 
Rich
[EMAIL PROTECTED]


Re: How to use Unicode::Collate in multilinguage apps?

2004-03-30 Thread SADAHIRO Tomoyuki

On Mon, 29 Mar 2004 23:44:00 +0100
Rich <[EMAIL PROTECTED]> wrote:

> I now realise that some per-language tailoring would be needed for sensible
> results. Unicode::Collate::Locale seems like the kind of thing I was
> looking for, and any tailoring is better than none :)
> 
> Using the multi-lingual server scenario I was initially discussing, would
> one of the following usages be correct (yes, it's just pseudocode and
> exists in a world where no errors ever occur!):

Though I have not worked with any multitasking application,
I suppose a possible snag is the size of DUCET (the file named
allkeys.txt), which would make construction of a collator slow
and its storage memory-hungry.

> 1)
> 
>   my %collators;
> 
>   for ( $server_loop )
>   {
>     my $lang_tag = Server->requested_lang_tag;
> 
>     my $collator = $collators{$lang_tag}
>       ||= Unicode::Collate::Locale->new(locale => $lang_tag);
> 
>     ...
>   }

1) creates a new collator whenever the $lang_tag value is new.
Say the old one was 'en' (English) and the new one is 'it' (Italian):
Unicode::Collate::Locale->new will return a default collator each time.
I.e. $collators{en} and $collators{it} behave identically, but their
memory is not shared.

When Unicode::Collate->new is called, all the data generated by parsing
a table file are stored in the collator, which is a blessed hash.
The reason, as I thought, is that if (a part of) the newly created data
were stored elsewhere, say, in a cache in the package namespace
(e.g. something like %Unicode::Collate::Cache), users outside the
package might have trouble managing the memory held in that cache.

I think perhaps a user needs to be able to determine
whether two (or more) $lang_tag values create the same collator or not.

my $lang_tag = Server->requested_lang_tag;
my $canonical = Unicode::Collate::Locale::canonical_name($lang_tag);

# If $canonical is the same as an earlier one, the collator for it should
# be the same. Once $canonical is known to be new, a collator can be created.
# The function name leaves room for reconsideration.

At present Unicode::Collate::Locale::_locale() does the same thing
as canonical_name() would, but that function is internal and not public.
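
A minimal sketch of the "one collator per distinct tailoring" idea, assuming
a Unicode::Collate::Locale release that provides the getlocale method (later
releases document it as returning the locale name actually used, or 'default'
when no tailoring applies). Unlike the proposed canonical_name(), it can only
be asked after a collator has been built, so the first request for each new
tag still pays the construction cost. The helper name collator_for and both
cache hashes are hypothetical.

```perl
use strict;
use warnings;
use Unicode::Collate::Locale;

my %by_tag;     # requested language tag    -> collator
my %by_locale;  # locale name actually used -> the one shared collator

# Hypothetical helper: return a collator for $lang_tag, keeping only one
# collator per resolved locale, so that two tags share an object whenever
# they fall back to the same tailoring.
sub collator_for {
    my ($lang_tag) = @_;
    return $by_tag{$lang_tag} if $by_tag{$lang_tag};

    my $new  = Unicode::Collate::Locale->new(locale => $lang_tag);
    my $name = $new->getlocale;          # e.g. 'sv', or 'default'

    # Keep the first collator built for this locale; a duplicate $new is
    # simply dropped and its memory released.
    return $by_tag{$lang_tag} = ($by_locale{$name} ||= $new);
}

my $collator = collator_for('sv');
my @sorted   = $collator->sort(qw(Zorn Adam Berg));
print "@sorted\n";
```

In a forking server, calling collator_for for the expected languages before
the fork would let the children share the tables copy-on-write, though how
much memory that saves depends on the platform.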

Regards,
SADAHIRO Tomoyuki



Re: How to use Unicode::Collate in multilinguage apps?

2004-03-30 Thread Rich
Sadahiro Tomoyuki wrote:



> I have written Unicode::Collate::Locale (tentatively) for linguistic
> tailoring of UCA. To use it, Unicode::Collate should search for allkeys.txt
> in every directory in @INC (at present it searches for table files
> only under the directory where it is installed).
> So Unicode::Collate::Locale requires Unicode::Collate 0.40 or later,
> which is not released yet, but a prerelease is available as shown below.
> 
> [tarball]
> http://homepage1.nifty.com/nomenclator/perl/Unicode-Collate-Locale-0.01.tar.gz
> [doc]
> http://homepage1.nifty.com/nomenclator/perl/Unicode-Collate-Locale.html
>    Sorry, tailoring is implemented for only a few languages so far.
>    It may be enhanced sooner or later...
> 
> [prerelease] This will be released *after* Perl 5.8.4 (or its RC) is out.
> http://homepage1.nifty.com/nomenclator/perl/Unicode-Collate-0.40.tar.gz

Thank you and Jarkko for your replies.

I now realise that some per-language tailoring would be needed for sensible
results. Unicode::Collate::Locale seems like the kind of thing I was
looking for, and any tailoring is better than none :)

Using the multi-lingual server scenario I was initially discussing, would
one of the following usages be correct (yes, it's just pseudocode and
exists in a world where no errors ever occur!):

1)

 my %collators;

 for ( $server_loop )
 {
   my $lang_tag = Server->requested_lang_tag;

   my $collator = $collators{$lang_tag}
     ||= Unicode::Collate::Locale->new(locale => $lang_tag);

   ...
 }


2)

  my $prev_lang = '';
  my $collator;

  for ( $server_loop )
  {
    my $lang_tag = Server->requested_lang_tag;

    # Rebuild the collator only when the requested language changes.
    unless ( $lang_tag eq $prev_lang )
    {
      $prev_lang = $lang_tag;
      $collator  = Unicode::Collate::Locale->new(locale => $lang_tag);
    }

    ...
  }


Which would be the preferred way of handling this (or are both wrong)?

Again, thanks for your replies.
-- 
Rich
[EMAIL PROTECTED]


Re: How to use Unicode::Collate in multilinguage apps?

2004-03-28 Thread Jarkko Hietaniemi
> I think, for a script representing usually one language,
> allkeys.txt defines fairly acceptable collation order.
> For example, order of hiragana and katakana is approximately
> compliant with the custom of the Japanese language.
> 
> In contrast, for a script representing many languages
> (say, the Latin script), tailoring may be often necessary.
> 
> E.g. 'Ä' is sorted as A-umlaut (sometimes as 'AE') in German,
> and as one of additional letters ordered after 'Z' in some
> northern-european languages.

Yup, that is the case in Finnish and Swedish, and Danish and Norwegian
do similar things with their "a" and "o" equivalents.  This means it is
logically impossible to sort a list containing both German and Swedish
names "right". Many European languages sort some consonant+h after the
base consonant as a separate "letter", and so forth.  And I believe many
of the CJK languages in fact have several (and differing) customary sort
orders.

Even when staying within a single language one must decide whether to
do things like "dictionary sorting" (spaces etc. removed), how
lowercase and uppercase sort (A < B < a, A < a < B, a < A < B, or
a == A < B), what to do with things like articles, etc.

So one must always either accept "a good enough" sorting, or one must
customize more or less heavily.
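
Several of the case-ordering choices above can be tried with constructor
options of Unicode::Collate. This is only a sketch, assuming a reasonably
recent release where upper_before_lower and level behave as documented; the
word list is made up for illustration.

```perl
use strict;
use warnings;
use Unicode::Collate;

my @words = qw(b A a B);

# Plain sort compares code points: A < B < a < b.
my @codepoint = sort @words;

# Default UCA order: lowercase before uppercase, a < A < b < B.
my @lower_first = Unicode::Collate->new->sort(@words);

# upper_before_lower flips the case tie-break: A < a < B < b.
my @upper_first = Unicode::Collate->new(upper_before_lower => 1)->sort(@words);

# level => 1 compares base letters only, so 'a' and 'A' are equal.
my $primary = Unicode::Collate->new(level => 1);

print "@codepoint\n@lower_first\n@upper_first\n";
print "a eq A at level 1\n" if $primary->eq('a', 'A');
```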

> But according to Unicode default collation, 'Ä' is ordered
> as a modified 'A' and equal to 'A' at the primary level.
> 

-- 
Jarkko Hietaniemi <[EMAIL PROTECTED]> http://www.iki.fi/jhi/ "There is this special
biologist word we use for 'stable'.  It is 'dead'." -- Jack Cohen


Re: How to use Unicode::Collate in multilinguage apps?

2004-03-27 Thread SADAHIRO Tomoyuki

On Thu, 25 Mar 2004 22:29:08 +
Rich <[EMAIL PROTECTED]> wrote:

> Hello
> 
> How should collation be handled in multitasking, multilingual applications -
> in particular forking servers such as apache/mod_perl based web apps?
> 
> I can assume the following:
> 
> 1) I'll know the preferred language via an RFC 2616 language tag.
> 2) All data will be UTF-8 encoded Unicode.
> 3) The required language may differ for each request.
> 
> I guess Unicode::Collate is the way to go, so can I simply have one
> Unicode::Collate instance per process using the default allkeys.txt table
> file? 
> 
> Will that give sensible results for most (all?) languages, or do I need to
> customise the collator on the fly when more 'exotic' (for want of a better
> word) languages are requested? Are there other reasons, such as size and/or
> performance issues, why the default allkeys.txt file may not be the way to
> go?

I think, for a script representing usually one language,
allkeys.txt defines fairly acceptable collation order.
For example, order of hiragana and katakana is approximately
compliant with the custom of the Japanese language.

In contrast, for a script representing many languages
(say, the Latin script), tailoring may be often necessary.

E.g. 'Ä' is sorted as A-umlaut (sometimes as 'AE') in German,
and as one of additional letters ordered after 'Z' in some
northern-european languages.
But according to Unicode default collation, 'Ä' is ordered
as a modified 'A' and equal to 'A' at the primary level.
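
A small sketch of that difference, assuming a later Unicode::Collate release
that bundles Unicode::Collate::Locale with a Swedish ('sv') tailoring. The
names being sorted are made up for illustration.

```perl
use strict;
use warnings;
use Unicode::Collate;
use Unicode::Collate::Locale;

# 'Ä' (U+00C4), written as an escape to avoid source-encoding issues.
my $A_UML = "\x{C4}";
my @names = ("Zorn", "${A_UML}rlig", "Adam");

my $default = Unicode::Collate->new;
my $swedish = Unicode::Collate::Locale->new(locale => 'sv');

# Default table: Ä is an accented A, so it sorts next to 'A'.
my @by_default = $default->sort(@names);

# Swedish tailoring: Ä is a separate letter ordered after 'Z'.
my @by_swedish = $swedish->sort(@names);

# At the primary level the default table treats 'A' and 'Ä' as equal.
my $same_base = Unicode::Collate->new(level => 1)->eq('A', $A_UML);

binmode STDOUT, ':encoding(UTF-8)';
print "default: @by_default\n";
print "swedish: @by_swedish\n";
print "A eq $A_UML at level 1: ", ($same_base ? 'yes' : 'no'), "\n";
```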

> I must stress that I'm ok with most aspects of i18n/l10n - it's specifically
> the correct use of Unicode::Collate in multitasking apps that I'm
> interested in.
> 
> Suggestions would be welcome - even more so if they don't involve having to
> know the TR10 docs inside out!

I have written Unicode::Collate::Locale (tentatively) for linguistic
tailoring of UCA. To use it, Unicode::Collate should search for allkeys.txt
in every directory in @INC (at present it searches for table files
only under the directory where it is installed).
So Unicode::Collate::Locale requires Unicode::Collate 0.40 or later,
which is not released yet, but a prerelease is available as shown below.

[tarball]
http://homepage1.nifty.com/nomenclator/perl/Unicode-Collate-Locale-0.01.tar.gz
[doc]
http://homepage1.nifty.com/nomenclator/perl/Unicode-Collate-Locale.html
   Sorry, tailoring is implemented for only a few languages so far.
   It may be enhanced sooner or later...

[prerelease] This will be released *after* Perl 5.8.4 (or its RC) is out.
http://homepage1.nifty.com/nomenclator/perl/Unicode-Collate-0.40.tar.gz

regards,
SADAHIRO Tomoyuki



How to use Unicode::Collate in multilinguage apps?

2004-03-26 Thread Rich
Hello

How should collation be handled in multitasking, multilingual applications -
in particular forking servers such as apache/mod_perl based web apps?

I can assume the following:

1) I'll know the preferred language via an RFC 2616 language tag.
2) All data will be UTF-8 encoded Unicode.
3) The required language may differ for each request.

I guess Unicode::Collate is the way to go, so can I simply have one
Unicode::Collate instance per process using the default allkeys.txt table
file? 

Will that give sensible results for most (all?) languages, or do I need to
customise the collator on the fly when more 'exotic' (for want of a better
word) languages are requested? Are there other reasons, such as size and/or
performance issues, why the default allkeys.txt file may not be the way to
go?

I must stress that I'm ok with most aspects of i18n/l10n - it's specifically
the correct use of Unicode::Collate in multitasking apps that I'm
interested in.

Suggestions would be welcome - even more so if they don't involve having to
know the TR10 docs inside out!

Cheers,
-- 
Rich
[EMAIL PROTECTED]