At 12:26 PM 6/06/2007, you wrote:
> > Which is exactly correct.  It's only confusing to those who don't
> > understand that not all characters/character sets encode characters
> > with single-byte encodings.
>
>It's confusing because the "n" represents a buffer size.

The "n" represents _a number_ which (by the 
standard) is the number of CHARACTERS (not a 
buffer size...SQL definitions are independent of any programming interface).

>Even allocating
>four bytes you cannot store 1 "character" because a character is such a
>confusing word to begin with. Is "é" a "character"?

Yes.  And it is a different character from "e", 
"ê" or "è".  Each has its own distinct encoding 
because it is a distinct character.  The fact 
that all of them use a basic image that looks to 
you like "e" isn't relevant (except in an accent-insensitive collation, natch!)
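
To make that concrete, here is a minimal Python sketch (plain Python, not Firebird or .NET provider code; the helper name `describe` is made up for illustration) that reports the code point and the UTF-8 bytes of each of those four characters:

```python
# Illustrative only: each letter below is a distinct Unicode code point
# with its own UTF-8 byte sequence.
CHARS = ["e", "\u00e9", "\u00ea", "\u00e8"]  # e, e-acute, e-circumflex, e-grave


def describe(ch):
    """Return (code point label, UTF-8 bytes as hex) for one character."""
    return (f"U+{ord(ch):04X}", ch.encode("utf-8").hex())


# Map each character to its code point and encoded bytes.
table = {ch: describe(ch) for ch in CHARS}

if __name__ == "__main__":
    for code_point, utf8_hex in table.values():
        print(code_point, utf8_hex)
```

Four characters, four code points, four different byte sequences, even though three of them take two bytes in UTF-8 and one takes a single byte.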

>What if I store it as
>U+0065 U+0301? That's two Unicode codepoints for one logical character.

What if?  You would simply be storing two Unicode 
codepoints that might have some meaning in some 
Unicode collation somewhere.  It wouldn't be 
meaningful otherwise.  Do you perhaps think a 
"logical character" has to do with the graphical 
representation? (it doesn't).  You can't, for 
example, supply a chain of Unicode codepoints and 
hope it is a hack to enable you to use a font 
that doesn't support the characters you want to represent...
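
As a rough illustration of the counting involved, the Python sketch below compares the single code point U+00E9 with the pair U+0065 U+0301. It assumes nothing about how Firebird or the .NET provider treat the pair; the variable names are invented for the example. Unicode's NFC normalization does map the pair to U+00E9, but that is a separate, explicit step and the sketch does not assume any engine performs it.

```python
import unicodedata

precomposed = "\u00e9"  # one code point: U+00E9
decomposed = "e\u0301"  # two code points: U+0065 followed by U+0301


def code_points(s):
    """List the code points of a string as U+XXXX labels."""
    return [f"U+{ord(c):04X}" for c in s]


# Counting code points: 1 for the precomposed form, 2 for the pair.
counts = (len(precomposed), len(decomposed))

# A plain comparison sees two different strings.
same_without_normalizing = precomposed == decomposed

# Only an explicit normalization step (NFC) folds the pair into U+00E9.
nfc = unicodedata.normalize("NFC", decomposed)

if __name__ == "__main__":
    print(code_points(precomposed), code_points(decomposed))
    print(counts, same_without_normalizing, code_points(nfc))
```

So a column whose length is counted in code points would see the pair as two entries, not one.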


> > The cause of the problem here is not the length of the data stored
> > but that the client is set to expect single-byte encoding (by using
> > character set NONE)
>
>I agree that's a problem, however I disagree that the server should be
>returning four bytes when the UTF-8 encoded value of "e" (or whatever ASCII
>character you like) is only one byte long.

If anything, it's a shortcoming of the 
implementation of strings.  Strings have to be 
stored "somehow".  What you have to work with are 
data types (char and varchar, each with its 
particular rules that the engine knows about and 
that application language interface interpreters 
like the .NET driver know about) and string size 
(maximum length, defined by a number which is the 
maximum number of characters allowed).

The assumption is that 1 byte==1 character unless 
you specify otherwise (by defining a character 
set in which a character can occupy more than 
1 byte).  Both the client and the server will refer to 
the character set mapping to perform 
transliteration;  but there is nothing about the 
way strings are implemented in SQL that can 
support identifying variable boundaries for the 
"characters" that are represented by the 
sequences of bytes within the string.  There are 
text editors that *can* do that, but a database 
engine is not a text editor....so....UTF8 
characters are stored as 4 bytes with 
left-to-right significance.  By convention, all 
character sets (including Unicode) that support 
the unaccented Roman characters (and the 
so-called "Arabic" numerals) use the same 7-bit 
encoding for the leftmost byte, viz. the hex ranges 30 to 39, 41 to 5A and 61 to 7A.
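
The arithmetic can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `worst_case_bytes` is a hypothetical helper, not an engine or driver API, and it assumes a slot sized at the maximum of 4 bytes per UTF-8 character, as described above. The sample characters are arbitrary.

```python
MAX_UTF8_BYTES_PER_CHAR = 4  # the longest UTF-8 sequence for one code point


def utf8_len(ch):
    """Number of bytes UTF-8 needs for a single character."""
    return len(ch.encode("utf-8"))


def worst_case_bytes(n_chars, max_bytes_per_char=MAX_UTF8_BYTES_PER_CHAR):
    """Bytes to reserve for n_chars characters if every one were maximal."""
    return n_chars * max_bytes_per_char


# Actual encoded lengths vary from 1 to 4 bytes per character.
samples = {ch: utf8_len(ch) for ch in ["e", "\u00e9", "\u20ac", "\U0001F600"]}

# Unaccented Roman letters and digits have the same single byte in
# ASCII, Latin-1 and UTF-8.
ascii_agrees = all(
    c.encode("ascii") == c.encode("latin-1") == c.encode("utf-8")
    for c in "09AZaz"
)

if __name__ == "__main__":
    print(sorted(samples.values()), worst_case_bytes(1), ascii_agrees)
```

Under that assumption a one-character UTF8 column reserves 4 bytes, whichever character actually ends up in it.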


>What I mean is, even if you changed the connection string character set to
>"UTF8", 0x65 0x00 0x00 0x00 represents four UTF-8 characters (that is,
>U+0065 U+0000 U+0000 U+0000).

According to what, I wonder?  A Unicode editor 
that understands that convention?  All a database 
engine can do is take a sequence of input or 
output codes and, if necessary, transliterate 
that sequence according to some rules that are 
packaged for it as "character sets" and "collate 
sequences".  (Firebird 2 supports two collate 
sequences for UTF8.)  It doesn't store 
"characters" and it doesn't return "characters".

Hmmm, this has turned into a magnum opus.  I must get on with work...

Helen


_______________________________________________
Firebird-net-provider mailing list
Firebird-net-provider@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/firebird-net-provider