Re: Tajik alphabet code

Philippe Verdy Mon, 01 Mar 2004 18:03:57 -0800

From: "Peter Kirk" <[EMAIL PROTECTED]>
> Windows 2000/XP and Office need no adaptation, just fonts and keyboards.
> Well, the menus do need localisation, and obviously that is a
> significant issue (although I guess most Tajiks know or can easily learn
> the Russian for "File", "View", "Help" etc).
>
> Issues of localisation of non-Unicode software are off topic for this
> list, surely.


Here again, localization of the interface is not the main issue. I do agree
that a program interface in Russian would work for most Tajik people. The main
problem is with the documents that people create themselves in their own
language, for their own use and for interchange with others.

This includes all the various tools used for creating personal webpages, sending
emails and instant messaging, but also for creating printed documents for snail
mail, publishing books, writing papers, feeding databases...

It also includes using the various databases that have been created with
encodings well adapted for Russian but not necessarily for Tajik, and the
difficulty of interchanging this legacy data and using it with the tools people
have. The main problem is not in the standard office programs but in
business-specific software, which may have been developed to Russian standards
or with legacy tools developed by lazy US programmers who only considered
English and a few Western European languages, and forgot the case of Cyrillic
alphabet variants.

To keep using this software, which is still needed but difficult or expensive to
adapt, there's a need to merge data from various sources, which may have used
several "personal" 8-bit encodings usable in some limited domain, and to
transcode them into a common and well-accepted 8-bit encoding. Suppose this
common 8-bit encoding is the ISO-8859 Cyrillic charset (ISO-8859-5): then some
Tajik characters present in this legacy data won't map well, and the data may
be altered (which may be a serious issue if this data has some legal value, or
is used for identification of persons, services or trademarks).
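
To make the mapping problem concrete, here is a minimal sketch in Python. It
assumes the legacy data has already been decoded to Unicode text, and it uses
Python's built-in "iso8859_5" codec as the common 8-bit target. The helper name
`unmappable` and the sample word are only illustrations, not part of any
existing tool.

```python
# The six letters Tajik adds to the basic Russian Cyrillic alphabet:
# U+0493, U+04E3, U+049B, U+04EF, U+04B3, U+04B7.
TAJIK_EXTRA = "\u0493\u04e3\u049b\u04ef\u04b3\u04b7"


def unmappable(text, encoding="iso8859_5"):
    """Return the characters of `text` that `encoding` cannot represent."""
    missing = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            missing.append(ch)
    return missing


# "Tajikistan" written in Tajik; the third letter is U+04B7.
sample = "\u0422\u043e\u04b7\u0438\u043a\u0438\u0441\u0442\u043e\u043d"

# Characters that would be lost when transcoding to ISO-8859-5.
lost = unmappable(sample)

# A lossy transcoding silently substitutes "?" for each unmappable letter.
lossy = sample.encode("iso8859_5", errors="replace")
```

Running a check like `unmappable` over the data before transcoding at least
shows which records would be altered, instead of discovering the damage later.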

Going to Unicode is of course a longer-term target, but for now a lot of 8-bit
processing will remain in software and devices before they are replaced with
more modern ones. In fact I do think that Western programmers will continue to
be lazy for a very long time, until classic C or C++ development is completely
deprecated, and will continue to produce software that processes only
single-byte coded characters, simply because the OSes they use themselves
process only 8-bit coded chars in their APIs, notably in POSIX services and
Linux/Unix kernels where a "char" is a byte, as well as in many open protocols
for the Internet. Using UTF-8 is a solution, but not the simplest one for
programmers: they are lazy in the code they produce and test, and they will too
often forget the code needed to handle multibyte sequences correctly, notably
where there are security issues such as possible buffer overruns.
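
As an illustration of the kind of multibyte handling that tends to be
forgotten, here is a small Python sketch of cutting UTF-8 data to a byte limit
(for example a fixed-size field). The function name `truncate_utf8` is
hypothetical, and the sketch assumes valid UTF-8 input and a non-negative
limit. A naive byte slice can stop in the middle of a character; this version
backs up to a character boundary first.

```python
def truncate_utf8(data: bytes, limit: int) -> bytes:
    """Cut `data` to at most `limit` bytes without splitting a UTF-8 sequence.

    Assumes `data` is valid UTF-8 and `limit` is non-negative.
    """
    if len(data) <= limit:
        return data
    cut = limit
    # Continuation bytes have the form 0b10xxxxxx. If the first excluded byte
    # is one of them, the cut falls inside a character, so move back until the
    # first excluded byte starts a new character.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut]


# Five Cyrillic letters, two bytes each in UTF-8 (10 bytes in total).
encoded = "\u0422\u043e\u04b7\u0438\u043a".encode("utf-8")

naive = encoded[:5]                # ends in the middle of the third letter
safe = truncate_utf8(encoded, 5)   # keeps only the first two whole letters
```

Here `naive` is no longer valid UTF-8, while `safe` decodes cleanly, at the
cost of being one byte shorter than the limit.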

I took the case of Tajik, but this may be true for every language that needs
more than just the ISO-8859-1 character subset. In many cases, a standardized
ISO-8859 variant may help solve the immediate problem found in many countries,
with the notable exception of China, Korea and Japan, which always need large
subsets and where programmers are used to not being lazy and to processing MBCS
sequences (including UTF-8 for Unicode) correctly.
