On Thu, 2 Aug 2001, Vuillemot, Ward W wrote:
> Date: Thu, 2 Aug 2001 07:58:28 -0700
> From: "Vuillemot, Ward W" <[EMAIL PROTECTED]>
> To: 'Andrew McNaughton' <[EMAIL PROTECTED]>,
> "'[EMAIL PROTECTED]'" <[EMAIL PROTECTED]>
> Subject: RE: DBD:mysql and UNICODE
>
> Just so I understand. . .and I think I understood UNICODE BEFORE I started reading
> all the literature that seemed to confuse the matter. :)
>
> UNICODE is a character encoding ...
Wrong. Unicode is not a character encoding. There are many different
character encodings which are used to encode unicode, notably utf-8,
utf-16 (with its older fixed-width relative ucs-2) and perl's own
utf-8-like encoding.

Unicode is firstly a character set, relating abstract characters to
numerical codes (code points), and it also embodies many formal rules for
problems like sorting text, text flow order, combining characters and so
forth.

Unicode defines the meaning of a sequence of numbers representing the
text. The character encoding defines how those numbers are represented as
a sequence of bytes.
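
To make the distinction concrete, here is a minimal sketch. It is in
Python rather than perl purely so that it runs with nothing installed,
and the sample string is my own choice. It takes one sequence of code
points and writes it out as bytes under three different encodings:

```python
# One piece of text: a sequence of Unicode code points ("naive" with a
# diaeresis on the i, a space, and a euro sign).
text = "na\u00efve \u20ac"

# The code points are the same no matter which encoding is chosen later.
code_points = [ord(ch) for ch in text]

# Each character encoding maps those numbers to a different byte sequence.
utf8_bytes = text.encode("utf-8")          # variable width, 1 to 4 bytes per character
utf16_bytes = text.encode("utf-16-be")     # 2 bytes per character for this text
latin9_bytes = text.encode("iso-8859-15")  # legacy 8-bit set that has the euro sign

print([hex(cp) for cp in code_points])
for name, data in (("utf-8", utf8_bytes),
                   ("utf-16-be", utf16_bytes),
                   ("iso-8859-15", latin9_bytes)):
    print(name, len(data), data.hex())
```

The list of code points is identical in all three cases; only the bytes
differ, and each byte string decodes back to the same text.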
> ... that can handle any character irrespective of language
> When I output to the web I will need to convert UNICODE to some appropriate
> character-set based upon the language selection.
You need to first handle any characters in the unicode text which cannot
be represented in your output character encoding. Typically you would
either replace difficult characters with some sort of placeholder
character, or fall back to something you can represent. Depending on the
software you are using, your text might be held in any of a number of
representations at this stage, with utf-8 or perl's approximation to it
being the most likely, but some CPAN code uses utf-16/ucs-2. Conversion to
the target encoding is a separate but related step.
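
The two steps can be sketched roughly as follows. This is Python rather
than perl so that it is self-contained, and the function name
to_target_encoding and the FALLBACKS table are hypothetical, invented for
this illustration. A real application would want a much fuller fallback
table.

```python
# Hypothetical fallback table: characters we would rather downgrade by
# hand than lose entirely. The entries are illustrative only.
FALLBACKS = {
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2026": "...",  # horizontal ellipsis
}

def to_target_encoding(text, encoding, placeholder="?"):
    """Return `text` as bytes in `encoding`, substituting what cannot be encoded."""
    out = []
    for ch in text:
        try:
            # Step 1: check whether this character is representable at all.
            ch.encode(encoding)
            out.append(ch)
        except UnicodeEncodeError:
            # Prefer a hand-picked fallback, otherwise use the placeholder.
            out.append(FALLBACKS.get(ch, placeholder))
    # Step 2: convert to the target encoding. errors="replace" is only a
    # safety net in case a fallback or the placeholder is itself unencodable.
    return "".join(out).encode(encoding, errors="replace")

sample = "\u201cT\u014dky\u014d\u201d\u2026"   # curly-quoted Tokyo with macrons, then an ellipsis
print(to_target_encoding(sample, "iso-8859-1"))
print(to_target_encoding(sample, "utf-8"))
```

With iso-8859-1 as the target, the curly quotes and ellipsis fall back to
ASCII look-alikes and the o-with-macron becomes the placeholder. With
utf-8 as the target nothing needs substituting.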
> Is this correct? Or can this be done automatically. . .or at least, can I just
> avoid it and send the UNICODE data directly to a web-browser and let the browser do
> whatever is necessary. As I intend to develop a system that can handle an arbitrary
> number of languages, I want to let the code handle any language without me necessarily
> having to add more and more code to support it -- I would love it if I could just
> choose one flavor -- UNICODE -- and that be it. But hey, I know I do not live in an
> ideal world. . . ;)
Take a look in CPAN.
perl -MCPAN -e 'i /Unicode::/'
Andrew McNaughton
> I do appreciate your help.
>
> Thanks,
> Ward
>
> -----Original Message-----
> From: Andrew McNaughton [mailto:[EMAIL PROTECTED]]
> Sent: Wednesday, August 01, 2001 9:27 PM
> To: Vuillemot, Ward W
> Subject: Re: DBD:mysql and UNICODE
>
>
>
>
> On Wed, 1 Aug 2001, Vuillemot, Ward W wrote:
>
> > Date: Wed, 1 Aug 2001 15:57:16 -0700
> > From: "Vuillemot, Ward W" <[EMAIL PROTECTED]>
> > To: "'[EMAIL PROTECTED]'" <[EMAIL PROTECTED]>
> > Subject: DBD:mysql and UNICODE
> >
> > I am looking to develop a set of databases that can handle
> > international character sets. For example, I want to have menu items
> > that can be changed on the fly from, say, English to Japanese to
> > German to Chinese.
> >
> > Should I create a table that correlates each language with a UNICODE
> > set? And then create a table where each row is for a specific
> > language and the columns being the individual entries? After that,
> > can I use a lookup into the first table based on the key of the second
> > table to determine what type of UNICODE character-set it is. (sorry,
> > I am typing out loud as it were ;) ).
>
> Your character set in the database *is* unicode. There's only one unicode
> character set. All other common to medium-rare character sets are subsets
> of that one big set. Keep things simple and store nothing in your
> database that's not in unicode.
>
> You could store your strings as you say, but I'd be inclined to have every
> string in its own row, and have a column which identifies the language.
>
> For a given language (eg english), there might be multiple possible
> character encodings (eg iso-8859-1, cp1252, utf-8), and you might choose
> to support more than one in your web output. You might store
> language/character encoding combinations in your database, but character
> encoding and character set are not to be confused.
>
>
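
As a rough illustration of the one-row-per-string layout suggested in the
quoted text, here is a minimal sketch. It uses Python with an in-memory
SQLite database only so that it runs as-is; the table name ui_string, its
column names, the lookup helper and the fall-back-to-English behaviour
are all assumptions made for the example, and MySQL DDL would differ in
detail.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One row per (string, language) pair; the text itself is always unicode.
conn.execute("""
    CREATE TABLE ui_string (
        string_key TEXT NOT NULL,   -- e.g. 'menu.open'
        lang       TEXT NOT NULL,   -- language tag, e.g. 'en', 'de', 'ja'
        value      TEXT NOT NULL,   -- the translated text
        PRIMARY KEY (string_key, lang)
    )
""")
conn.executemany(
    "INSERT INTO ui_string (string_key, lang, value) VALUES (?, ?, ?)",
    [
        ("menu.open", "en", "Open"),
        ("menu.open", "de", "\u00d6ffnen"),
        ("menu.open", "ja", "\u958b\u304f"),
    ],
)

def lookup(conn, key, lang, default_lang="en"):
    """Fetch the string for `lang`, falling back to `default_lang` (an assumption)."""
    for candidate in (lang, default_lang):
        row = conn.execute(
            "SELECT value FROM ui_string WHERE string_key = ? AND lang = ?",
            (key, candidate),
        ).fetchone()
        if row:
            return row[0]
    return None

print(lookup(conn, "menu.open", "de"))
print(lookup(conn, "menu.open", "zh"))   # no Chinese row, so the English one is returned
```

The language column only selects which text to show. Which character
encoding to send to the browser is a separate decision made at output
time.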
--
Andrew McNaughton
Scoop Media Ltd
[EMAIL PROTECTED]
"Every year the international financial system kills more people than the
second world war. But at least Hitler was mad ... "
-- Ken Livingstone