Crystal Kolipe wrote:
|Currently it is not possible to use Unicode codepoints > 0xFF on the console,
|because our UTF-8 decoding logic is badly broken.
|
|The code in question is in wsemul_subr.c, wsemul_getchar().
|
|The problem is that we calculate the number of bytes in a multi-byte
|sequence by just looking at the high bits in turn:
...
|This is wrong, for several reasons.
Just to note that the encoding space also has holes: a UTF-8 byte
sequence is not necessarily well-formed (unless you control its
generation, of course), so a decoder has to reject overlong forms,
surrogates and anything above U+10FFFF.  It is actually a real mess:
   if(LIKELY(x <= 0x7Fu))
      c = x;
   /* 0xF8, but Unicode guarantees maximum of 0x10FFFFu -> F4 8F BF BF.
    * Unicode 9.0, 3.9, UTF-8, Table 3-7. Well-Formed UTF-8 Byte Sequences.
    * Lead bytes C0 and C1 only yield overlong forms, so the range is C2..F4 */
   else if(LIKELY(x >= 0xC2u && x <= 0xF4u)){
      if(LIKELY(x < 0xE0u)){
         if(UNLIKELY(l < 1))
            goto jenobuf;
         --l;
         c = (x &= 0x1Fu);
      }else if(LIKELY(x < 0xF0u)){
         if(UNLIKELY(l < 2))
            goto jenobuf;
         l -= 2;
         x1 = x;
         c = (x &= 0x0Fu);
         /* Second byte constraints */
         x = S(u8,*cp++);
         switch(x1){
         case 0xE0u:
            if(UNLIKELY(x < 0xA0u || x > 0xBFu))
               goto jerr;
            break;
         case 0xEDu:
            if(UNLIKELY(x < 0x80u || x > 0x9Fu))
               goto jerr;
            break;
         default:
            if(UNLIKELY((x & 0xC0u) != 0x80u))
               goto jerr;
            break;
         }
         c <<= 6;
         c |= (x &= 0x3Fu);
      }else{
         if(UNLIKELY(l < 3))
            goto jenobuf;
         l -= 3;
         x1 = x;
         c = (x &= 0x07u);
         /* Second byte constraints */
         x = S(u8,*cp++);
         switch(x1){
         case 0xF0u:
            if(UNLIKELY(x < 0x90u || x > 0xBFu))
               goto jerr;
            break;
         case 0xF4u:
            if(UNLIKELY((x & 0xF0u) != 0x80u)) /* 80..8F */
               goto jerr;
            break;
         default:
            if(UNLIKELY((x & 0xC0u) != 0x80u))
               goto jerr;
            break;
         }
         c <<= 6;
         c |= (x &= 0x3Fu);
         /* Third byte */
         x = S(u8,*cp++);
         if(UNLIKELY((x & 0xC0u) != 0x80u))
            goto jerr;
         c <<= 6;
         c |= (x &= 0x3Fu);
      }
      /* Last byte of any multi-byte sequence */
      x = S(u8,*cp++);
      if(UNLIKELY((x & 0xC0u) != 0x80u))
         goto jerr;
      c <<= 6;
      c |= x & 0x3Fu;
   }else
      goto jerr;
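
For anyone who wants to play with these checks outside of a larger
code base, here is a minimal self-contained sketch of the same Table 3-7
rules.  It is not what wsemul_subr.c does and not a drop-in for
wsemul_getchar(); the name utf8_decode() and its return convention
(bytes consumed, 0 for "need more input", -1 for ill-formed) are made up
for illustration only.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (name and return convention are assumptions):
 * decode one code point from buf[0..len).  Returns the number of bytes
 * consumed, 0 if the buffer ends in the middle of a sequence, or -1 if
 * the sequence is ill-formed per Unicode Table 3-7. */
static int
utf8_decode(const unsigned char *buf, size_t len, uint32_t *out)
{
	unsigned char x, lo = 0x80, hi = 0xBF; /* allowed range, 2nd byte */
	uint32_t c;
	int n, i;

	if (len == 0)
		return 0;
	x = buf[0];
	if (x <= 0x7F) {
		*out = x;
		return 1;
	}
	/* C0, C1 and F5..FF never start a well-formed sequence */
	if (x < 0xC2 || x > 0xF4)
		return -1;
	if (x < 0xE0) {
		n = 2;
		c = x & 0x1F;
	} else if (x < 0xF0) {
		n = 3;
		c = x & 0x0F;
		if (x == 0xE0)
			lo = 0xA0;	/* no overlong forms */
		else if (x == 0xED)
			hi = 0x9F;	/* no surrogates D800..DFFF */
	} else {
		n = 4;
		c = x & 0x07;
		if (x == 0xF0)
			lo = 0x90;	/* no overlong forms */
		else if (x == 0xF4)
			hi = 0x8F;	/* nothing above U+10FFFF */
	}
	for (i = 1; i < n; ++i) {
		if ((size_t)i >= len)
			return 0;	/* sequence continues past the buffer */
		x = buf[i];
		if (x < lo || x > hi)
			return -1;
		c = (c << 6) | (x & 0x3F);
		lo = 0x80;	/* later bytes are plain continuations */
		hi = 0xBF;
	}
	*out = c;
	return n;
}
```

With this, E2 82 AC decodes to U+20AC in three bytes, whereas C1 80,
ED A0 80 and F4 90 80 80 are all rejected.  Only the second byte ever
needs a narrowed range, which is why a single lo/hi pair is enough.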
--steffen
|
|Der Kragenbaer, The moon bear,
|der holt sich munter he cheerfully and one by one
|einen nach dem anderen runter wa.ks himself off
|(By Robert Gernhardt)