https://bugzilla.wikimedia.org/show_bug.cgi?id=164

--- Comment #228 from Philippe Verdy <verd...@wanadoo.fr> 2011-08-26 21:49:23 UTC ---
I am currently working on a JavaScript implementation of UCA (i.e. with full
support of the DUCET, at least 4 collation levels, and support for
language-based tailorings). This is a very difficult task if we need good
performance. The trick is to use efficient lookup tables, but the difficulty is
being able to update them (for tailorings); performance will also depend
heavily on how the DUCET is represented in JavaScript, so that it can be
initialized fast enough that you won't notice the init time when a collator
object gets instantiated.
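
To give an idea of the kind of structure I have in mind, here is a rough
sketch; the packed format, the placeholder data and the function names are
only illustrative assumptions, not the final design:

  // Sketch only: assumes a hypothetical packed format "hhhh=p.s.t;..."
  // where hhhh is a hex code point and p/s/t are three collation weights.
  var PACKED_DUCET = '0061=1C47.20.2;0062=1C60.20.2';  // tiny placeholder, not the real DUCET

  function decodeDucet(packed) {
    var table = {};
    var entries = packed.split(';');
    for (var i = 0; i < entries.length; i++) {
      var pair = entries[i].split('=');
      var w = pair[1].split('.');
      // one collation element per entry: [primary, secondary, tertiary]
      table[parseInt(pair[0], 16)] =
        [[parseInt(w[0], 16), parseInt(w[1], 16), parseInt(w[2], 16)]];
    }
    return table;
  }

  var ducetTable = null;  // shared table, decoded lazily, only once

  function createCollator(lang, tailoring) {
    if (ducetTable === null) {
      ducetTable = decodeDucet(PACKED_DUCET);  // init cost paid only once
    }
    tailoring = tailoring || {};  // per-language overrides; the shared DUCET is never mutated
    return {
      lang: lang,
      lookup: function (cp) {
        return tailoring[cp] || ducetTable[cp];
      }
    };
  }

The point of the sketch is only that the shared DUCET is decoded once and
tailorings are kept as a small overlay, so instantiating a second collator
costs almost nothing.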

I'm still not sure that performance will be good enough (even with today's much
faster JavaScript engines). Maybe some optional cooperation with a server-side
script will be needed, or a client-side native plugin interfaced with
JavaScript (such as Flash) could be used to limit the number of JSON requests
made to a server to provide the tailored lookup tables. How to detect and use
such client-side plugins in an open way is still an open question.
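
For the server-assisted variant, a single JSON request per language could
fetch the whole tailored overlay at once. A minimal sketch, where the URL and
the JSON response shape are purely assumptions:

  // Sketch only: the URL and the JSON response shape are assumptions.
  function loadTailoring(lang, callback) {
    var xhr = new XMLHttpRequest();
    xhr.open('GET', '/collation/tailoring?lang=' + encodeURIComponent(lang), true);
    xhr.onreadystatechange = function () {
      if (xhr.readyState !== 4) return;
      if (xhr.status === 200) {
        // expected (assumed) shape: { "0061": [[p, s, t]], ... }
        callback(null, JSON.parse(xhr.responseText));
      } else {
        callback(new Error('tailoring request failed: ' + xhr.status));
      }
    };
    xhr.send(null);
  }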

I really hope that someday the JavaScript/ECMAScript standard will specify a
common standard interface to collators. I've already looked for help from the
authors of jQuery, so that such a common interface could become part of its
"tools" library (even though this has nothing in common with working on the
HTML DOM, which is jQuery's main focus).

For now there exists absolutely no reliable implementation of UCA in JavaScript
that works only on the client side (though there are already some JavaScript
libraries that in fact use JSON or XML requests to a server in order to build
or retrieve the necessary lookup tables).

For practical reasons, a server-side JSON/XML request could be optimized to
reduce the volume of data exchanged (for example by providing only some parts
of the collation elements, with the rest requested on demand even after the
collator has already been partly initialized, and then cached in the collator
object). But this is a design complication that I don't want to investigate too
early, before I get a good picture of the performance actually needed in
practice.
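
As a rough illustration of that on-demand caching (the block size of 256 code
points, the URL and the names are all assumptions, and a real implementation
would fetch asynchronously rather than block):

  // Sketch only: collation elements assumed to be served in blocks of
  // 256 code points and cached inside the collator; the URL is hypothetical.
  var elementCache = {};  // block index -> { codePoint: collation elements }

  function getBlock(blockIndex) {
    if (!elementCache[blockIndex]) {
      // synchronous for the sake of the sketch; a real collator would
      // prefetch asynchronously and use default weights in the meantime
      var xhr = new XMLHttpRequest();
      xhr.open('GET', '/collation/elements?block=' + blockIndex, false);
      xhr.send(null);
      elementCache[blockIndex] = JSON.parse(xhr.responseText);
    }
    return elementCache[blockIndex];
  }

  function lookupElements(cp) {
    return getBlock(cp >> 8)[cp];
  }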

Maybe it will be enough to initialize only the part of the DUCET needed for a
particular language and its specific tailoring, and to sort all the many other
characters with default weights (not necessarily taken from the large DUCET).
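
For those remaining characters, one possible default is the implicit-weight
rule that UTS #10 describes for code points absent from the table; whether my
implementation will use exactly this rule is still open:

  // Implicit weights computed from the code point itself, as described in
  // UTS #10 for characters not found in the (partial) table; 0xFBC0 is the
  // base used for unassigned code points (ideographs use other bases).
  function implicitElements(cp) {
    var aaaa = 0xFBC0 + (cp >> 15);
    var bbbb = (cp & 0x7FFF) | 0x8000;
    return [
      [aaaa, 0x0020, 0x0002],  // first computed collation element
      [bbbb, 0x0000, 0x0000]   // second computed collation element
    ];
  }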

My implementation will provide at least 3 interface functions (see the sketch
after this list):
- instantiating a collator for a given language and sort order;
- computing collation keys from strings, for more efficient sorts of large
tables (even with a QuickSort, the same strings are compared multiple times
against other strings, so it may help to compute their collation keys only
once); in addition, it may not be necessary to compute the full collation key,
but only key data up to the level that is sufficient to perform a specific
comparison; collation keys can then be computed partly, on demand, and cached,
instead of being fully computed for the whole string at all collation levels;
also, not all table sorts require the maximum supported collation level, so
collation keys don't necessarily have to be long (in memory);
- comparing strings directly with this collator, without necessarily having to
compute collation keys; the tradeoff between precomputing keys and comparing
strings on the fly depends highly on the table size: for a small number of rows
to be sorted, you gain little with precomputed keys, you just waste memory and
get lower performance.
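
Just to show the call surface I have in mind (the names are provisional and
the bodies are left out here):

  // Interface sketch only: provisional names, bodies elided.
  function newCollator(lang, options) {
    // ... load/build the tailored tables as discussed above ...
    var keyCache = {};  // partial collation keys, cached per string and level
    return {
      // collation key of str, computed only up to maxLevel (1..4) and cached
      getSortKey: function (str, maxLevel) { /* ... */ },
      // direct comparison without materializing keys; returns -1, 0 or 1
      compare: function (a, b) { /* ... */ }
    };
  }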

I estimate that the tradeoff limit between precomputed collation keys and
direct comparisons is around tables of about 100 rows, for which it may be
helpful to provide a very fast response time when sorting them, but I may be
wrong, because such small tables do not require precomputing and storing many
collation keys (so this first step before the QuickSort loop may be
insignificant). The real limit could be memory use in very large tables when
precomputing, storing or caching the collation keys; but such very large tables
will probably be very uncommon among tables generated from wiki pages (most of
these tables will not include more than 500 rows, and storing 500 collation
keys is not really a problem in today's browsers, except possibly on
smartphones with limited memory and slow processors compared to laptops,
notebooks and tablets).
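
A sketch of how a table sort could pick between the two strategies, assuming
the collator interface sketched above, sort keys that compare as ordinary
strings, and my (unverified) 100-row threshold:

  // Sketch only: assumes collator.compare() and collator.getSortKey()
  // from the interface sketch above, and keys comparable with < and >.
  function sortRows(rows, collator) {
    if (rows.length < 100) {
      // small table: compare directly, no keys stored
      return rows.sort(function (a, b) { return collator.compare(a, b); });
    }
    // larger table: precompute each key exactly once (decorate-sort-undecorate)
    var decorated = [];
    for (var i = 0; i < rows.length; i++) {
      decorated.push({ row: rows[i], key: collator.getSortKey(rows[i], 3) });
    }
    decorated.sort(function (a, b) {
      return a.key < b.key ? -1 : (a.key > b.key ? 1 : 0);
    });
    var sorted = [];
    for (var j = 0; j < decorated.length; j++) sorted.push(decorated[j].row);
    return sorted;
  }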
