On 03/22/13 15:37, Markus Armbruster wrote: > Laszlo Ersek <ler...@redhat.com> writes: > >> On 03/14/13 18:49, Markus Armbruster wrote: >>> These are all broken, too. >> >> What are "these"? And how are they broken? And how does the patch fix them? > > "These" refers to the subject: noncharacters other than U+FFFE, U+FFFF. > > I agree that I should better explain how they're broken, and what the > patch does to fix them. Will fix on respin. > >>> >>> A few test cases use noncharacters U+FFFF and U+10FFFF. Risks testing >>> noncharacters some more instead of what they're supposed to test. Use >>> U+FFFD and U+10FFFD instead. >>> >>> Signed-off-by: Markus Armbruster <arm...@redhat.com> >>> --- >>> tests/check-qjson.c | 85 >>> +++++++++++++++++++++++++++++++++++++++++++++-------- >>> 1 file changed, 72 insertions(+), 13 deletions(-) >> >> I'm confused about the commit message. There are three paragraphs in it >> (the title, the first paragraph, and the 2nd paragraph). This patch >> modifies different tests: >> >>> diff --git a/tests/check-qjson.c b/tests/check-qjson.c >>> index 852124a..efec1b2 100644 >>> --- a/tests/check-qjson.c >>> +++ b/tests/check-qjson.c >>> @@ -158,7 +158,7 @@ static void utf8_string(void) >>> * consider using overlong encoding \xC0\x80 for U+0000 ("modified >>> * UTF-8"). >>> * >>> - * Test cases are scraped from Markus Kuhn's UTF-8 decoder >>> + * Most test cases are scraped from Markus Kuhn's UTF-8 decoder >>> * capability and stress test at >>> * http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt >>> */ >>> @@ -256,11 +256,11 @@ static void utf8_string(void) >>> "\xDF\xBF", >>> "\"\\u07FF\"", >>> }, >>> - /* 2.2.3 3 bytes U+FFFF */ >>> + /* 2.2.3 3 bytes U+FFFD */ >>> { >>> - "\"\xEF\xBF\xBF\"", >>> - "\xEF\xBF\xBF", >>> - "\"\\uFFFF\"", >>> + "\"\xEF\xBF\xBD\"", >>> + "\xEF\xBF\xBD", >>> + "\"\\uFFFD\"", >>> }, >> >> This is under "2.2 Last possible sequence of a certain length". I guess > > Which is in turn under "2 Boundary condition test cases". > >> this is where you say "last possible sequence of a certain length, >> encoding a character (= non-noncharacter)". OK, p#2. > > Yes. > > The test's purpose is testing the upper bound of 3-byte sequences is > decoded correctly. > > The upper bound is U+FFFF. Since that's a noncharacter, the parser > should reject it (or maybe replace), the formatter should replace it. > Trouble is it could be misdecoded and then rejected / replaced. > > Besides, U+FFFF already gets tested along with the other noncharacters > under "5.3 Other illegal code positions". > > Next in line is U+FFFE, also a noncharacter, also under 5.3. > > Next in line is U+FFFD, which I picked. > > But that gets tested under "2.3 Other boundary conditions"! I guess I > either drop it there, or make this one U+FFFC. > > I think testing U+FFFC here makes sense, because U+FFFD could be > misdecoded, then replaced by U+FFFD. > > What do you think?
I think that we're extending Markus Kuhn's test suite, basically taking random shots at where one specific parser's/formatter's weak spots might be :) That said, with intelligent fuzzing out of scope / capacity, U+FFFC could be a good pick. I also think I'm a quite a useless person to ask for thoughts in this area :) Thanks, Laszlo