On Thursday, February 4, 2016 at 5:33:35 AM UTC, Scott Jones wrote:
>
> SQLCHAR is for encodings with 8-bit code units.  It doesn't imply ASCII or 
> UTF-8 (probably one of the more common character sets used with that is 
> actually Microsoft's CP1252, which is often mistakenly described as ANSI 
> Latin-1 - of which it is a superset).
>

When I read that, I thought it could not be true: you can't have a 
superset (added letters) without dropping others (which would make it a 
superset of a subset). So I looked it up:

https://en.wikipedia.org/wiki/Windows-1252
"differs from the IANA's ISO-8859-1 by using displayable characters rather 
than control characters in the 80 to 9F (hex) range. Notable additional 
characters are curly quotation marks, the Euro sign, and all the printable 
characters that are in ISO 8859-15.
[..]
This is now standard behavior in the HTML 5 
<https://en.wikipedia.org/wiki/HTML_5> specification, which requires that 
documents advertised as ISO-8859-1 actually be parsed with the Windows-1252 
encoding. In January 2016 1.0% of all web sites use Windows-1252."
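
To check the quoted difference for myself, here is a minimal Python sketch. 
It relies only on Python's built-in "latin-1" and "cp1252" codecs; the 
helper name compare_high_range is made up for illustration, and other 
implementations may treat the unassigned CP1252 bytes differently.

```python
import unicodedata

def compare_high_range():
    """Decode each byte in 0x80-0x9F as Latin-1 and as CP1252."""
    rows = []
    for b in range(0x80, 0xA0):
        raw = bytes([b])
        latin1 = raw.decode("latin-1")      # always succeeds: byte N -> U+00NN
        try:
            cp1252 = raw.decode("cp1252")   # a printable character, if assigned
        except UnicodeDecodeError:
            cp1252 = None                   # unassigned in Python's cp1252 codec
        rows.append((b, latin1, cp1252))
    return rows

rows = compare_high_range()

# In Latin-1 every byte in this range decodes to a control character ("Cc").
all_controls = all(unicodedata.category(l) == "Cc" for _, l, _ in rows)

# Bytes that Python's cp1252 codec refuses to decode.
unassigned = [hex(b) for b, _, c in rows if c is None]

# The Euro sign lives at 0x80 in CP1252.
euro = bytes([0x80]).decode("cp1252")

# Outside 0x80-0x9F the two codecs agree byte for byte.
rest = bytes(list(range(0x00, 0x80)) + list(range(0xA0, 0x100)))
same_elsewhere = rest.decode("cp1252") == rest.decode("latin-1")
```

So at least in Python's tables, CP1252 only differs from Latin-1 in the 
80 to 9F range, where it swaps control characters for printable ones and 
leaves a few bytes unassigned.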

Still, despite this 1.0%, I think we should support this encoding in some 
way, if not as its own 8-bit-only type (I'm not sure we need to support any 
other 8-bit one). It's no longer just some Microsoft thing, as I had 
assumed, and it is ideal for most of Europe (and even the US/world because 
of the "curly quotation" marks). I've been thinking of doing a string type 
that does the same as Python: encode in 8-bit when possible, possibly 7-bit 
(then it can still say it's UTF-8 and fast indexing is known; note the 
strings are immutable).
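
Roughly what I have in mind, as a sketch only: pick the narrowest 
fixed-width storage that fits every code point in the string. The function 
name narrowest_storage and the labels it returns are my own invention. As 
far as I know, CPython (3.3 and later) uses Latin-1 rather than 
Windows-1252 for its one-byte form, so the Euro sign and curly quotes need 
two bytes per character there.

```python
def narrowest_storage(s):
    """Return (label, bytes_per_code_point) for the narrowest fixed-width form.

    Hypothetical helper: it only inspects the largest code point in `s`.
    """
    top = max(map(ord, s), default=0)
    if top < 0x80:
        return ("ascii", 1)      # 7-bit: the bytes are also valid UTF-8
    if top < 0x100:
        return ("latin-1", 1)    # 8-bit: byte N is code point U+00NN
    if top < 0x10000:
        return ("ucs-2", 2)      # BMP only, two bytes each
    return ("ucs-4", 4)          # anything else, four bytes each

examples = {
    "hello": narrowest_storage("hello"),
    "caf\u00e9": narrowest_storage("caf\u00e9"),
    "\u20ac": narrowest_storage("\u20ac"),          # Euro sign
    "\U0001F600": narrowest_storage("\U0001F600"),  # non-BMP
}
```

Because every code point has the same width within one string, indexing 
stays O(1), and immutability means the width never has to change after 
construction.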

It wouldn't surprise me if data labeled "UTF-8" sometimes, incorrectly, 
included this as its "Latin-1" subset..

I wonder if this screws up sorting.. It's not like the exact position of 
the Euro sign is too important in alphabetical sorting. I could argue it 
should be sorted with E e, but I assume just after A-Z a-z is ok for most..

I had never heard of "control characters in the 80 to 9F (hex) range", 
so I assume they are a very obscure/ancient thing that can be assumed to 
be never used anymore..


> Even when something says it is UTF-8, it frequently is not *really* valid 
> UTF-8, for example, there are two common variations of UTF-8, CESU-8, used 
> by MySQL and others, which encodes any non-BMP code point using the two 
> UTF-16 surrogate pairs, i.e. to 6 bytes instead of the correct 4-byte UTF-8 
> sequence, and Java's Modified UTF-8, which is the same as CESU-8, plus 
> embedded \0s are encoded in a "long" form (0xc0 0x80)
>
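
To make the quoted description concrete, a sketch in Python of how a 
non-BMP code point comes out in standard UTF-8 versus CESU-8, with the 
Modified UTF-8 treatment of \0 as an option. The function cesu8_encode is 
hypothetical (Python has no built-in CESU-8 codec that I know of); it 
follows the description above, not any particular library.

```python
def cesu8_encode(text, modified=False):
    """Encode `text` as CESU-8; with modified=True also encode U+0000 as C0 80."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            # Split into a UTF-16 surrogate pair, then encode each half
            # as its own 3-byte sequence (6 bytes total).
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            for surrogate in (high, low):
                out += chr(surrogate).encode("utf-8", "surrogatepass")
        elif cp == 0 and modified:
            out += b"\xc0\x80"      # the "long" form of an embedded \0
        else:
            out += ch.encode("utf-8", "surrogatepass")
    return bytes(out)

emoji = "\U0001F600"                # a code point outside the BMP
utf8 = emoji.encode("utf-8")        # 4 bytes
cesu8 = cesu8_encode(emoji)         # 6 bytes

# A strict UTF-8 decoder rejects the CESU-8 bytes.
try:
    cesu8.decode("utf-8")
    strict_accepts = True
except UnicodeDecodeError:
    strict_accepts = False
```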

Not only those..

I thought the WTF variant of UTF-8 (important for us, because of Windows 
filenames?) was a joke/vandalism at Wikipedia, until I read more closely. 
I just saw this:

https://en.wikipedia.org/wiki/UTF-8#WTF-8
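
As I understand it, the point of WTF-8 is that an unpaired surrogate (which 
can show up in Windows filenames) still gets a byte encoding. A rough Python 
illustration follows; note that the "surrogatepass" error handler is only 
an approximation I'm using here, not a WTF-8 implementation, since it is 
more permissive than WTF-8 about surrogates that do form a pair.

```python
lone = "\ud83d"   # an unpaired high surrogate, not a valid Unicode scalar value

# Strict UTF-8 refuses to encode it.
try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# The generalized form simply applies the normal 3-byte pattern.
generalized = lone.encode("utf-8", "surrogatepass")

# And it round-trips, so the original (ill-formed) string is not lost.
round_trip = generalized.decode("utf-8", "surrogatepass")
```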

-- 
Palli.
