Multiple languages (was Re: Cyrillic characters)

Joel Rees Mon, 11 Mar 2002 20:20:02 -0800

Sort of off topic, but here goes:

Jim Philips wrote:


> I am trying to understand how to store both Latin and Cyrillic characters
> in a database. I built in support for koi-8 and win1251, but I don't
> seem to be getting real support for Cyrillic. Cyrillic characters are
> stored as ASCII

Uhmmm, unless some parsing and conversion goes on along with the storing,
"storing as ASCII" really doesn't mean anything. The byte 0x5c (decimal 92)
is, by itself, neither the backslash nor the yen (JPY) mark. If you are
reading text from an American source (but not EBCDIC), you take it as a
backslash; if you are reading text from a Japanese (shift-JIS) source, you
take it as the yen mark. How it is read in the Cyrillic encodings, I don't
recall off-hand.
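
To make the point concrete for Cyrillic, here is a minimal Python sketch
(the byte value 0xC1 and the variable names are just my picks for
illustration). The stored byte never changes; only the encoding you read it
with does.

```python
# One stored byte, read back under three different encodings.
raw = bytes([0xC1])

as_koi8 = raw.decode("koi8_r")     # Cyrillic small letter a
as_cp1251 = raw.decode("cp1251")   # Cyrillic capital letter be
as_latin1 = raw.decode("latin-1")  # Latin capital letter A with acute

for name, text in (("koi8_r", as_koi8), ("cp1251", as_cp1251), ("latin-1", as_latin1)):
    # Print the encoding name, the character, and its Unicode code point.
    print(name, text, "U+%04X" % ord(text))
```

So the database holding 0xC1 is not storing "ASCII" or "Cyrillic" as such;
it is storing a byte, and the reader supplies the meaning.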

> and are echoed back on the Web page that way.

At any rate, how those bytes are displayed is up to the browser, not MySQL.
You have to explicitly tell the browser to interpret the bits as one thing
or another. Most browsers won't let you display more than one character
set/encoding at a time within a single page.

This can be fudged a bit in certain cases. Shift-JIS, for instance, includes
most of US-ASCII in its original positions. (Besides 0x5c, the tilde at 0x7e
is the other exception, but without clear rules.) But you still get funny
treatment of English text when you're mixing Japanese and English. Word
breaking versus line breaking is one place where things often go wrong.

Unicode should shortly provide the ability to display characters from
multiple languages in a single web document (if your machine has the fonts),
but it will probably take a bit longer to enable multiple locales in a
single web document. Switching between multiple sets of parsing rules on the
fly is a bit of a pain. And, just what multiple locales in a single document
should mean is not yet really well understood, either. Many Japanese people
think it is wrong to respect word breaks for English words at the end of the
line, for instance.
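
As a rough illustration of the character side of this (not the locale
side), here is a small Python sketch showing one UTF-8 byte stream carrying
both Latin and Cyrillic text. The sample string is my own invention.

```python
# A single string mixing Latin and Cyrillic characters.
mixed = "Hello, мир"

# UTF-8 can represent both scripts in one byte stream.
utf8_bytes = mixed.encode("utf-8")

# ASCII-range characters take 1 byte each, these Cyrillic letters 2 bytes each.
lengths = [len(ch.encode("utf-8")) for ch in mixed]

# Decoding with the same encoding gives back the original text.
round_trip = utf8_bytes.decode("utf-8")
print(len(utf8_bytes), lengths, round_trip == mixed)
```

This only shows that the characters can coexist in one encoding. It says
nothing about sorting or line-breaking rules, which is the harder part.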

> Will MySQL
> support storing both Latin and Cyrillic characters in the same database?

As I understand it, the Cyrillic encodings you are looking at (koi-8 and
win1251) are 8-bit encodings. Both keep US-ASCII in the lower half, so plain
English letters are available, but neither has room for the accented Latin
letters of Latin-1. I think you might be able to use the charset parameter
of the HTTP Content-Type header to serve one frame as Cyrillic and another
as Latin, but you may find some browsers that choke on it.
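
A sketch of what I mean by one charset per frame, in Python. The
build_response helper and the CODECS table are hypothetical names of mine,
not part of any library; the idea is only that each frame's bytes and its
Content-Type charset label have to agree.

```python
# Hypothetical mapping from HTTP charset label to Python codec name.
CODECS = {
    "KOI8-R": "koi8_r",
    "windows-1251": "cp1251",
    "ISO-8859-1": "latin-1",
}

def build_response(text, charset):
    """Encode text in the given charset and label it in the header."""
    body = text.encode(CODECS[charset])
    headers = (
        "Content-Type: text/html; charset=%s\r\n"
        "Content-Length: %d\r\n"
        "\r\n" % (charset, len(body))
    )
    # Header lines are plain ASCII; the body uses the declared charset.
    return headers.encode("ascii") + body

# One frame served as Cyrillic, another as Latin.
cyrillic_frame = build_response("<p>Привет</p>", "KOI8-R")
latin_frame = build_response("<p>café</p>", "ISO-8859-1")
```

Whether every browser honors a different charset per frame is something you
would have to test.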

MySQL, from what I understand, requires the encoding and the parsing rules
to be specified together in the initializations, so you won't be able to
switch from one locale to another on the fly.

(Someone confirm that, please?)

So, for sorting, collation, some aspects of indexing, case
conversion/sensitivity, etc., you will have to live with what the rules for
the one do to the other, if you use a single database server. For instance,
if you need to sort on both English and Cyrillic at the same time, you'll
need to do some extra work, maybe use C/perl/PHP/whatever to sort some
things in RAM.
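
By "sort some things in RAM" I mean something like the following Python
sketch. It assumes the rows have already been fetched and decoded into
Unicode strings, and the word list is made up. Sorting on the case-folded
string is a crude stand-in for a real collation: it puts Latin before
Cyrillic and ignores case, but it orders strictly by code point, so a
letter like ё will not land where a Russian reader expects it.

```python
# Rows as they might come back from the database, already decoded.
words = ["Яблоко", "apple", "арбуз", "Banana"]

# Case-insensitive sort by code point: Latin first, then Cyrillic.
in_order = sorted(words, key=str.casefold)

print(in_order)  # ['apple', 'Banana', 'арбуз', 'Яблоко']
```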

For simple storage, however, you really don't care what the locale is. A
blob of English text in a record that is otherwise Cyrillic might not really
pose any problems. (But you'll want to think it out carefully beforehand.)
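
One quick experiment along those lines, sketched in Python with sample
strings of my own: plain English text comes out as the same bytes under
ASCII, KOI8-R, windows-1251 and Latin-1, while Cyrillic text does not, so
it is the Cyrillic part where you have to keep track of the encoding.

```python
english = "plain English text"
cyrillic = "текст"

# Encode the English text under each encoding and compare the bytes.
encodings = ("ascii", "koi8_r", "cp1251", "latin-1")
encoded = {name: english.encode(name) for name in encodings}
same_bytes = len(set(encoded.values())) == 1

# The two Cyrillic encodings disagree on the bytes for the same letters.
differs = cyrillic.encode("koi8_r") != cyrillic.encode("cp1251")

print(same_bytes, differs)  # True True
```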

In other words, it's going to require some experimenting. Let us know how it
goes.

Joel Rees
Alps Giken Kansai Systems Development
Suita, Osaka

