On 04/02/2013 18:19, Markus Armbruster wrote:
> +    /* 2  Boundary condition test cases */
> +    /* 2.1  First possible sequence of a certain length */
> +    /* 2.1.5  5 bytes U+200000 */
> +    {
> +        "\"\xF8\x88\x80\x80\x80\"",
> +        NULL,                   /* bug: rejected */
> +        "\"\\u8200\\uFFFF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */
> +        "\xF8\x88\x80\x80\x80",
> +    },
> +    /* 2.1.6  6 bytes U+4000000 */
> +    {
> +        "\"\xFC\x84\x80\x80\x80\x80\"",
> +        NULL,                   /* bug: rejected */
> +        "\"\\uC100\\uFFFF\\uFFFF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */
> +        "\xFC\x84\x80\x80\x80\x80",
> +    },
> +    },
> +    /* 2.2.4  4 bytes U+1FFFFF */
> +    {
> +        "\"\xF7\xBF\xBF\xBF\"",
> +        NULL,                   /* bug: rejected */
> +        "\"\\u7FFF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */
> +        "\xF7\xBF\xBF\xBF",
> +    },
> +    /* 2.2.5  5 bytes U+3FFFFFF */
> +    {
> +        "\"\xFB\xBF\xBF\xBF\xBF\"",
> +        NULL,                   /* bug: rejected */
> +        "\"\\uBFFF\\uFFFF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */
> +        "\xFB\xBF\xBF\xBF\xBF",
> +    },
> +    /* 2.2.6  6 bytes U+7FFFFFFF */
> +    {
> +        "\"\xFD\xBF\xBF\xBF\xBF\xBF\"",
> +        NULL,                   /* bug: rejected */
> +        "\"\\uDFFF\\uFFFF\\uFFFF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */
> +        "\xFD\xBF\xBF\xBF\xBF\xBF",
> +    },
> +    {
> +        /* \U+1FFFFF */
> +        "\"\xF8\x87\xBF\xBF\xBF\"",
> +        NULL,                   /* bug: rejected */
> +        "\"\\u81FF\\uFFFF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */
> +        "\xF8\x87\xBF\xBF\xBF",
> +    },
> +    {
> +        /* \U+3FFFFFF */
> +        "\"\xFC\x83\xBF\xBF\xBF\xBF\"",
> +        NULL,                   /* bug: rejected */
> +        "\"\\uC0FF\\uFFFF\\uFFFF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */
> +        "\xFC\x83\xBF\xBF\xBF\xBF",
> +    },
> +    {
> +        /* \U+0000 */
> +        "\"\xF8\x80\x80\x80\x80\"",
> +        NULL,                   /* bug: rejected */
> +        "\"\\u8000\\uFFFF\\uFFFF\"", /* bug: want "\"\\u0000\"" */
> +        "\xF8\x80\x80\x80\x80",
> +    },
> +    {
> +        /* \U+0000 */
> +        "\"\xFC\x80\x80\x80\x80\x80\"",
> +        NULL,                   /* bug: rejected */
> +        "\"\\uC000\\uFFFF\\uFFFF\\uFFFF\"", /* bug: want "\"\\u0000\"" */
> +        "\xFC\x80\x80\x80\x80\x80",
> +    },

Rejecting these is not a bug IMO.  Unicode is only defined up to
U+10FFFF.  Code points above that have no valid UTF-8 encoding at all,
and in particular 5- and 6-byte sequences are never valid UTF-8 (they
used to be, under the original definition of UTF-8).

But there are indeed other bugs...

> +    /* 2.1.4  4 bytes U+10000 */
> +    {
> +        "\"\xF0\x90\x80\x80\"",
> +        "\xF0\x90\x80\x80",
> +        "\"\\u0400\\uFFFF\"", /* bug: want "\"\\uD800\\uDC00\"" */
> +    },
> +        /* U+10FFFF */
> +        "\"\xF4\x8F\xBF\xBF\"",
> +        "\xF4\x8F\xBF\xBF",
> +        "\"\\u43FF\\uFFFF\"", /* bug: want "\"\\uDBFF\\uDFFF\"" */
> +    },
> +    {
> +        /* U+110000 */
> +        "\"\xF4\x90\x80\x80\"",
> +        "\xF4\x90\x80\x80",
> +        "\"\\u4400\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */
> +    },

...and also some good catches here!  In particular, U+110000 should be
rejected.

Paolo
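
For reference, the "want" values in the last three quoted test cases follow from the usual UTF-16 surrogate-pair arithmetic. The following is only a minimal sketch of that arithmetic, not the code in the patch: is_valid_code_point() and json_escape_code_point() are hypothetical names, escaping every code point as \u is done purely for illustration, and treating surrogates as invalid input is an assumption that the quoted test cases do not cover.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Longest output is "\uXXXX\uXXXX": 12 characters plus the NUL. */
#define ESCAPE_BUF_MIN 13

/* Hypothetical helper: is cp something valid UTF-8 can encode? */
static bool is_valid_code_point(uint32_t cp)
{
    if (cp > 0x10FFFF) {
        return false;           /* beyond the Unicode range */
    }
    if (cp >= 0xD800 && cp <= 0xDFFF) {
        return false;           /* assumption: surrogates are invalid too */
    }
    return true;
}

/*
 * Hypothetical helper: write the JSON \u escape for cp into buf.
 * Invalid code points become \uFFFF, as the "want" comments ask.
 */
static void json_escape_code_point(uint32_t cp, char *buf, size_t len)
{
    assert(len >= ESCAPE_BUF_MIN);

    if (!is_valid_code_point(cp)) {
        snprintf(buf, len, "\\uFFFF");
    } else if (cp >= 0x10000) {
        /* Split the 20-bit offset into high and low surrogates. */
        uint32_t v = cp - 0x10000;
        snprintf(buf, len, "\\u%04X\\u%04X",
                 (unsigned)(0xD800 + (v >> 10)),
                 (unsigned)(0xDC00 + (v & 0x3FF)));
    } else {
        snprintf(buf, len, "\\u%04X", (unsigned)cp);
    }
}
```

With this sketch, U+10000 yields \uD800\uDC00, U+10FFFF yields \uDBFF\uDFFF, and U+110000 yields \uFFFF, matching the three "want" comments.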