I should have been clearer - CP1252 is a superset of only the printable 
characters of Latin-1; because it reuses assigned (even though pretty much 
never used) control character positions, it is not truly a superset.  ASCII is a 7-bit 
subset of ANSI Latin-1, which is an 8-bit subset of UCS-2, which is a 
16-bit subset that can represent only the BMP, which is a subset of the 
Unicode code points (which need 21 bits).
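
To make the difference concrete, here is a small sketch in Python (chosen 
only because it ships with both codecs; the helper name decode_or_none is 
made up for illustration). It decodes every byte value under Latin-1 and 
under CP1252 and collects the bytes where the two disagree, which should be 
exactly the 80 to 9F (hex) range:

```python
def decode_or_none(byte_value, encoding):
    """Decode one byte; return None if the byte is unassigned in that codec."""
    try:
        return bytes([byte_value]).decode(encoding)
    except UnicodeDecodeError:
        return None  # Python's cp1252 codec leaves a few positions undefined


# Bytes whose meaning differs between ISO-8859-1 (Latin-1) and CP1252.
differing = [
    b for b in range(256)
    if decode_or_none(b, "latin-1") != decode_or_none(b, "cp1252")
]

# Latin-1 maps byte N straight to code point N, so ASCII (0x00-0x7F) is a
# subset of Latin-1, and Latin-1 is the first 256 Unicode code points.
latin1_is_identity = all(
    ord(bytes([b]).decode("latin-1")) == b for b in range(256)
)

print(f"differing range: {differing[0]:#04x}..{differing[-1]:#04x}")
print("0x80 as Latin-1:", repr(decode_or_none(0x80, "latin-1")))  # C1 control
print("0x80 as CP1252: ", repr(decode_or_none(0x80, "cp1252")))   # Euro sign
```

Everything outside that range, including all the accented letters in A0 to 
FF, decodes identically under both.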

On Thursday, February 4, 2016 at 4:52:33 AM UTC-5, Páll Haraldsson wrote:
>
> On Thursday, February 4, 2016 at 5:33:35 AM UTC, Scott Jones wrote:
>>
>> SQLCHAR is for encodings with 8-bit code units.  It doesn't imply ASCII 
>> or UTF-8 (probably one of the more common character sets used with that is 
>> actually Microsoft's CP1252, which is often mistakenly described as ANSI 
>> Latin-1 - of which it is a superset).
>>
>
> When I read that, I thought, that must not be true.. You can't have a 
> superset (add letters) without dropping others (implying a superset of a 
> subset), so I looked up:
>
> https://en.wikipedia.org/wiki/Windows-1252
> "differs from the IANA's ISO-8859-1 by using displayable characters rather 
> than control characters in the 80 to 9F (hex) range. Notable additional 
> characters are curly quotation marks, the Euro sign, and all the printable 
> characters that are in ISO 8859-15.
> [..]
> This is now standard behavior in the HTML 5 
> <https://en.wikipedia.org/wiki/HTML_5> specification, which requires that 
> documents advertised as ISO-8859-1 actually be parsed with the Windows-1252 
> encoding.[1] <https://en.wikipedia.org/wiki/Windows-1252#cite_note-1> In 
> January 2016 1.0% of all web sites use Windows-1252."
>
> Still, despite this 1.0% I think we should support this encoding (in a 
> way, if not its own 8-bit-only type (I'm not sure we need to support any 
> other 8-bit one); it's no longer just some Microsoft thing as I assumed..), 
> as it is ideal for most of Europe (and even the US/world because of "curly 
> quotation"). I've been thinking of doing a string type that does the same 
> as Python, encodes in 8-bit when possible, possibly 7-bit (then it can 
> still say it's UTF-8 and fast indexing is known, note the strings are 
> immutable).
>
> It wouldn't surprise me that "UTF-8" would sometimes, incorrectly, include 
> this as the "Latin-1" subset..
>
> I wonder if this screws up sorting.. It's not like the exact position of 
> the Euro sign is too important in alphabetical sorting. I could argue it be 
> sorted with E e but I assume just after A-Z a-z is ok for most..
>
> I had never heard of "control characters in the 80 to 9F (hex) range", 
> assuming then it's a very obscure/ancient thing that can be assumed to be 
> never used anymore..
>
>
>> Even when something says it is UTF-8, it frequently is not *really* valid 
>> UTF-8. For example, there are two common variations of UTF-8: CESU-8, used 
>> by MySQL and others, which encodes any non-BMP code point as the two 
>> surrogates of its UTF-16 surrogate pair, each as a 3-byte sequence, i.e. 
>> 6 bytes instead of the correct 4-byte UTF-8 sequence; and Java's Modified 
>> UTF-8, which is the same as CESU-8, plus embedded \0s are encoded in a 
>> "long" form (0xc0 0x80)
>>
>
> Not only those..
>
> I thought the WTF variant (important for us, because of 
> Windows-filenames?) of UTF-8 was a joke/vandalism at Wikipedia until I read 
> more closely about it. I just saw:
>
> https://en.wikipedia.org/wiki/UTF-8#WTF-8
>

I had never run across that in my work, but my work was in databases, 
usually Unix (AIX, Solaris, etc) or Linux, not so much on Windows any 
longer, and I added Unicode support before surrogates even existed (they 
were added in Unicode 2.0, but not actually used until Unicode 3.0).
I'm not sure what you'd want to do to convert that for use in Julia? (btw, 
I think the initials of the "format" say it all!)
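
For what it's worth, the CESU-8 and Modified UTF-8 variations quoted above 
are easy to sketch. What follows is a minimal illustration in Python, not 
how MySQL or Java actually implement it; the function name cesu8_encode and 
its modified flag are hypothetical names of my own. It relies on Python's 
"surrogatepass" error handler, which lets the UTF-8 codec emit the 3-byte 
form of a surrogate code point:

```python
def cesu8_encode(text, modified=False):
    """Encode text as CESU-8, or as Java-style Modified UTF-8 if modified=True.

    Illustrative sketch only: non-BMP code points are split into their two
    UTF-16 surrogates, and each surrogate is written as a 3-byte sequence.
    """
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp == 0 and modified:
            # Modified UTF-8 writes NUL in an overlong two-byte form.
            out += b"\xc0\x80"
        elif cp >= 0x10000:
            # Split into a UTF-16 surrogate pair.
            offset = cp - 0x10000
            high = 0xD800 + (offset >> 10)
            low = 0xDC00 + (offset & 0x3FF)
            for surrogate in (high, low):
                out += chr(surrogate).encode("utf-8", "surrogatepass")
        else:
            # BMP code points are encoded the same as in regular UTF-8.
            out += ch.encode("utf-8", "surrogatepass")
    return bytes(out)


emoji = "\U0001F600"                # a non-BMP code point
utf8_bytes = emoji.encode("utf-8")  # correct 4-byte sequence
cesu8_bytes = cesu8_encode(emoji)   # 6 bytes: two 3-byte surrogates

print(utf8_bytes.hex(" "), "vs", cesu8_bytes.hex(" "))

# A strict UTF-8 decoder rejects the CESU-8 form.
try:
    cesu8_bytes.decode("utf-8")
    strict_decoder_accepts = True
except UnicodeDecodeError:
    strict_decoder_accepts = False
```

The point is that both forms look like UTF-8 at a glance, but a strict 
decoder refuses them, so data labelled "UTF-8" has to be validated or 
converted before it can be trusted.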
