On Thursday, February 4, 2016 at 5:33:35 AM UTC, Scott Jones wrote:
>
> SQLCHAR is for encodings with 8-bit code units. It doesn't imply ASCII or
> UTF-8 (probably one of the more common character sets used with that is
> actually Microsoft's CP1252, which is often mistakenly described as ANSI
> Latin-1 - of which it is a superset).
>
When I read that, I thought it could not be true: you can't have a superset (adding letters) without dropping others, which would make it a superset of a subset. So I looked it up:

https://en.wikipedia.org/wiki/Windows-1252

"differs from the IANA's ISO-8859-1 by using displayable characters rather than control characters in the 80 to 9F (hex) range. Notable additional characters are curly quotation marks, the Euro sign, and all the printable characters that are in ISO 8859-15. [..] This is now standard behavior in the HTML 5 <https://en.wikipedia.org/wiki/HTML_5> specification, which requires that documents advertised as ISO-8859-1 actually be parsed with the Windows-1252 encoding. In January 2016 1.0% of all web sites use Windows-1252."

Still, despite this 1.0%, I think we should support this encoding in some way, if not as its own 8-bit-only type (I'm not sure we need to support any other 8-bit one). It is no longer just some Microsoft thing, as I had assumed, and it is ideal for most of Europe (and even the US/world, because of the curly quotation marks).

I've been thinking of doing a string type that does the same as Python: encode in 8 bits when possible, possibly 7 bits (then it can still say it's UTF-8, and fast indexing is known; note that the strings are immutable). It wouldn't surprise me if "UTF-8" sometimes, incorrectly, included this as the "Latin-1" subset.

I wonder if this screws up sorting. It's not as if the exact position of the Euro sign is too important in alphabetical sorting. I could argue that it be sorted with E e, but I assume just after A-Z a-z is OK for most.

I had never heard of "control characters in the 80 to 9F (hex) range", so I assume it's a very obscure/ancient thing that is never used anymore.

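
To make the 80 to 9F difference concrete, here is a small Python sketch (Python only because its standard codecs make this easy to check; nothing here is tied to any particular string implementation). The helper names `code_points` and `same_in_both` are made up for illustration. It assumes Python's `cp1252` codec, which treats five bytes in that range as unassigned and raises an error for them, so "superset" only holds if you count printable characters and ignore the C1 controls.

```python
# Bytes 0x80, 0x93, 0x94: Euro sign and curly quotes in Windows-1252,
# C1 control characters in ISO-8859-1 (Latin-1).
raw = bytes([0x80, 0x93, 0x94])


def code_points(text):
    """Format each character as U+XXXX so the output stays plain ASCII."""
    return [f"U+{ord(ch):04X}" for ch in text]


as_cp1252 = code_points(raw.decode("cp1252"))
as_latin1 = code_points(raw.decode("latin-1"))
print(as_cp1252)  # ['U+20AC', 'U+201C', 'U+201D']
print(as_latin1)  # ['U+0080', 'U+0093', 'U+0094']


def same_in_both(byte):
    """True if the byte decodes to the same character in both encodings."""
    b = bytes([byte])
    try:
        return b.decode("cp1252") == b.decode("latin-1")
    except UnicodeDecodeError:
        return False  # byte is unassigned in Python's cp1252 codec


# Every byte outside 0x80..0x9F decodes identically in the two encodings.
differing = [b for b in range(256) if not same_in_both(b)]
print(f"{min(differing):#04x}..{max(differing):#04x}")  # 0x80..0x9f
```

So text that is labelled Latin-1 but contains bytes in that range is almost certainly CP1252, which is presumably why HTML 5 decodes it that way.
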
> Even when something says it is UTF-8, it frequently is not *really* valid
> UTF-8, for example, there are two common variations of UTF-8, CESU-8, used
> by MySQL and others, which encodes any non-BMP code point using the two
> UTF-16 surrogate pairs, i.e. to 6 bytes instead of the correct 4-byte UTF-8
> sequence, and Java's Modified UTF-8, which is the same as CESU-8, plus
> embedded \0s are encoded in a "long" form (0xc0 0x80)
>

Not only those. I thought the WTF variant of UTF-8 (important for us, because of Windows filenames?) was a joke or vandalism on Wikipedia, until I read more closely what I had just seen:

https://en.wikipedia.org/wiki/UTF-8#WTF-8

-- 
Palli.
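
P.S. A rough Python sketch of the two variants described in the quote above, in case it helps to see the bytes. `cesu8_encode` and `modified_utf8_encode` are hypothetical helpers written for this message, not functions from any library. The sketch assumes the input contains no lone surrogates, and it splits each non-BMP code point into its two UTF-16 surrogates and encodes each one as a 3-byte sequence.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode text as CESU-8: non-BMP code points become two 3-byte sequences."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            # Split into a UTF-16 high/low surrogate pair.
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            for surrogate in (high, low):
                # "surrogatepass" lets the UTF-8 codec emit the 3-byte form.
                out += chr(surrogate).encode("utf-8", "surrogatepass")
        else:
            out += ch.encode("utf-8")
    return bytes(out)


def modified_utf8_encode(text: str) -> bytes:
    """CESU-8 plus the two-byte "long" form 0xC0 0x80 for U+0000."""
    # A 0x00 byte can only come from U+0000, so a byte-level replace is safe.
    return cesu8_encode(text).replace(b"\x00", b"\xc0\x80")


emoji = "\U0001F600"  # a non-BMP code point
print(emoji.encode("utf-8").hex())      # f09f9880 (4 bytes)
print(cesu8_encode(emoji).hex())        # eda0bdedb880 (6 bytes)
print(modified_utf8_encode("a\x00b"))   # b'a\xc0\x80b'

# A strict UTF-8 decoder rejects the CESU-8 bytes.
try:
    cesu8_encode(emoji).decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
print(strict_ok)  # False
```

The last check is the practical point: data labelled "UTF-8" that actually comes from such a source fails strict validation as soon as it contains a non-BMP character or an embedded NUL.
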