[pdf-devel] Tokeniser Module - Unit Test: test cases and suggestions

Pierre FIlot Mon, 19 Oct 2009 22:09:29 -0700

/*COMMENTS ON TOKENISER MODULE*/ (PDF Reference, version 1.7 )


UNIT TEST:

/**************************/
/*function pdf_token_read            */
/**************************/

1- COMMENTS:

test1: comments are ignored (treated like white space) (already done)
test2: the macro "PDF_TOKEN_RET_COMMENT" has to be defined if we need comments 
to be returned
test3: two exceptions: -%PDF-n.m
               -%%EOF

Should we always return these two, whether or not the macro is defined? Or is 
that an issue left to the caller? 
test4: test long comments 

question 1: "The comment consists of all characters between the percent sign 
and the end of the line", but the code (handle_char()) stores the '%' itself 
when it detects one. Is that intended? 
question 2: Do we need to store every character when comments are ignored? 
Would it be more efficient to decide whether to ignore or keep them in the 
handle_char function rather than in flush_token?
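
To make test3 concrete, here is a minimal sketch of how the two special 
comments could be recognised before deciding whether to drop a comment. The 
names classify_comment and enum comment_kind are hypothetical and not part of 
the library; the sketch also assumes the body is passed without the leading 
'%' and without the end-of-line marker.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical classification, for illustration only. */
enum comment_kind { COMMENT_PLAIN, COMMENT_HEADER, COMMENT_EOF };

/* body points at the bytes after the leading '%'; len excludes the EOL. */
static enum comment_kind
classify_comment (const char *body, size_t len)
{
  assert (body != NULL || len == 0);

  /* "%%EOF": the body is exactly "%EOF". */
  if (len == 4 && memcmp (body, "%EOF", 4) == 0)
    return COMMENT_EOF;

  /* "%PDF-n.m": the body is "PDF-", a digit, '.', a digit. */
  if (len == 7 && memcmp (body, "PDF-", 4) == 0
      && isdigit ((unsigned char) body[4])
      && body[5] == '.'
      && isdigit ((unsigned char) body[6]))
    return COMMENT_HEADER;

  return COMMENT_PLAIN;
}
```

With something like this, the tokeniser (or the caller) could keep 
COMMENT_HEADER and COMMENT_EOF and drop COMMENT_PLAIN when comments are not 
requested; which side should make that decision is the open question above.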


2- BOOLEAN:

test1: keywords true and false


3- INTEGER: 

test 1: one or more digits with optional sign
test 2: Limits [-2^31 ; +2^31-1]
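
The two integer tests can be exercised against a small reference parser. The 
following is only a sketch: parse_pdf_integer is a hypothetical helper, not 
the library's function, and what the tokeniser should do with an out-of-range 
value is left to the caller here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Parse an optional sign followed by one or more decimal digits.
   Returns false on a syntax error or when the value lies outside
   [-2^31 ; +2^31-1]. */
static bool
parse_pdf_integer (const char *s, size_t len, int32_t *out)
{
  size_t i = 0;
  bool negative = false;
  int64_t value = 0;

  assert (s != NULL && out != NULL);

  if (i < len && (s[i] == '+' || s[i] == '-'))
    {
      negative = (s[i] == '-');
      i++;
    }
  if (i == len)
    return false;               /* a sign alone is not a number */

  for (; i < len; i++)
    {
      if (s[i] < '0' || s[i] > '9')
        return false;
      value = value * 10 + (s[i] - '0');
      if (value > (int64_t) INT32_MAX + 1)
        return false;           /* magnitude already beyond 2^31 */
    }

  if (negative)
    value = -value;
  else if (value > INT32_MAX)
    return false;               /* +2^31 itself does not fit */

  *out = (int32_t) value;
  return true;
}
```

The interesting test inputs are the four values around the limits: 
2147483647 and -2147483648 must be accepted, 2147483648 and -2147483649 must 
be rejected.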

4- REAL: 

test 1: one or more digits with optional sign and a leading, trailing or 
embedded decimal point
test 2: Limits [-3.403 x 10^38 ; +3.403 x 10^38]
test 3: 5 is the number of significant decimal digits of precision in the 
fractional part

5- STRING:

*literal characters enclosed with "()"
test 1: unbalanced parentheses forbidden
test 2: In a string, if the character immediately following a REVERSE SOLIDUS 
(\) is not one of n, r, t, b, f, (, ), \ or numbers specifying an octal value, 
the REVERSE SOLIDUS should be ignored. (already done)
test 3: In a string, an end-of-line marker appearing within a literal string 
without a preceding REVERSE SOLIDUS shall be treated as a byte value of (0Ah), 
irrespective of whether the end-of-line marker was a CARRIAGE RETURN (0Dh), a 
LINE FEED (0Ah), or both. (almost done; \n alone remains to be tested)
test 4: "\LF", "\CR" and "\CR+LF" are not considered part of the string ("\LF" 
and "\CR" remain to be tested) 
test 5: High-order overflow in an octal character representation \ddd in a 
string should be ignored by the tokeniser. (done)
test 6: In an octal character representation \ddd in a string, three octal 
digits shall be used, with leading zeros as needed, if the next character of 
the string is also a digit. Otherwise it can use one or two octal digits. (can 
only be tested on pdf-token-write())
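
As a reference for the expected values in tests 2 to 5, here is a rough sketch 
of the decoding rules listed above. decode_literal_string is a hypothetical 
helper (not the tokeniser's code); it assumes the input is the raw bytes 
between the outer parentheses and does not check parenthesis balance.

```c
#include <assert.h>
#include <stddef.h>

/* Decode the body of a literal string.  out must hold at least len bytes.
   Returns the number of bytes written. */
static size_t
decode_literal_string (const char *in, size_t len, unsigned char *out)
{
  size_t i = 0, n = 0;

  assert (in != NULL && out != NULL);

  while (i < len)
    {
      unsigned char c = (unsigned char) in[i++];

      if (c == '\r')
        {
          /* Test 3: bare CR or CR+LF becomes a single LF (0Ah). */
          if (i < len && in[i] == '\n')
            i++;
          out[n++] = '\n';
        }
      else if (c != '\\')
        out[n++] = c;           /* ordinary byte, including a bare LF */
      else if (i < len)
        {
          c = (unsigned char) in[i++];
          switch (c)
            {
            case 'n': out[n++] = '\n'; break;
            case 'r': out[n++] = '\r'; break;
            case 't': out[n++] = '\t'; break;
            case 'b': out[n++] = '\b'; break;
            case 'f': out[n++] = '\f'; break;
            case '\r':          /* Test 4: "\CR" and "\CR+LF" vanish. */
              if (i < len && in[i] == '\n')
                i++;
              break;
            case '\n':          /* Test 4: "\LF" vanishes. */
              break;
            default:
              if (c >= '0' && c <= '7')
                {
                  /* One to three octal digits. */
                  unsigned int v = (unsigned int) (c - '0');
                  int digits = 1;
                  while (digits < 3 && i < len
                         && in[i] >= '0' && in[i] <= '7')
                    {
                      v = v * 8 + (unsigned int) (in[i++] - '0');
                      digits++;
                    }
                  /* Test 5: high-order overflow is ignored. */
                  out[n++] = (unsigned char) (v & 0xFF);
                }
              else
                /* Test 2: "\(", "\)", "\\" and unknown escapes keep the
                   character and drop only the REVERSE SOLIDUS. */
                out[n++] = c;
            }
        }
      /* A lone REVERSE SOLIDUS at the very end is dropped. */
    }
  return n;
}
```

For instance, under these rules "\53" and "\053" both decode to '+', and 
"\0533" decodes to "+3", which is the reason for the three-digit rule in 
test 6.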

question 1: would it be useful to distinguish hexadecimal and literal strings 
as token types? That way we could check that there are no unbalanced 
parentheses in literal strings (test 1).
question 2: The limit of 32767 characters applies only inside content 
streams. Couldn't strings be longer elsewhere? (see Appendix C). Should we 
introduce a continuation as in comments?


*hexadecimal characters enclosed in "<>"
test 1: In a hexadecimal string, SPACE, HORIZONTAL TAB, CARRIAGE RETURN, LINE 
FEED and FORM FEED shall be ignored by the tokeniser.
test 2: In a hexadecimal string, if there is an odd number of digits, the final 
digit shall be assumed to be 0. (already done)
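
Both hexadecimal tests can be checked against a small decoder like the one 
sketched below. decode_hex_string and hex_value are hypothetical names; the 
white-space set is exactly the one listed in test 1, and returning 
(size_t) -1 for an invalid character is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Value of one hexadecimal digit, or -1 if c is not one. */
static int
hex_value (int c)
{
  if (c >= '0' && c <= '9') return c - '0';
  if (c >= 'a' && c <= 'f') return c - 'a' + 10;
  if (c >= 'A' && c <= 'F') return c - 'A' + 10;
  return -1;
}

/* Decode the bytes between '<' and '>'.  out must hold (len + 1) / 2 bytes.
   Returns the number of bytes written, or (size_t) -1 on a bad character. */
static size_t
decode_hex_string (const char *in, size_t len, unsigned char *out)
{
  size_t i, n = 0;
  int high = -1;                /* pending high nibble, -1 if none */

  assert (in != NULL && out != NULL);

  for (i = 0; i < len; i++)
    {
      int c = (unsigned char) in[i];
      int v;

      /* Test 1: white space inside the string is ignored. */
      if (c == ' ' || c == '\t' || c == '\r' || c == '\n' || c == '\f')
        continue;

      v = hex_value (c);
      if (v < 0)
        return (size_t) -1;

      if (high < 0)
        high = v;
      else
        {
          out[n++] = (unsigned char) ((high << 4) | v);
          high = -1;
        }
    }

  /* Test 2: an odd number of digits means a final 0 is assumed. */
  if (high >= 0)
    out[n++] = (unsigned char) (high << 4);

  return n;
}
```

So "901FA" and "90 1F\nA" should both produce the three bytes 90h, 1Fh, A0h.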


6- NAMES:

test 1: In a name, a NUMBER SIGN (#) shall (MUST) be written by using its 
2-digit hexadecimal code (23), preceded by a NUMBER SIGN.
test 2: In a name, any character that is a regular character (other than NUMBER 
SIGN) shall be written as itself or by using its 2-digit hexadecimal code, 
preceded by the NUMBER SIGN. (would be useful to automatically test for every 
possible regular character and its octal equivalence).

Do you mean checking that every 2-digit hexadecimal code gives the right 
regular character? Why do you talk about octal values?  

test 3: In a name, any character that is not a regular character shall (MUST) 
be written using its 2-digit hexadecimal code, preceded by the NUMBER SIGN 
only. (test negative cases with non-regular characters directly included in the 
name).

This test only concerns pdf-token-write, right? Because in pdf-token-read, any 
non-regular character (white space or delimiter) ends the NAME token.

test 4: In a name, regular characters that are outside the range EXCLAMATION 
MARK(21h) to TILDE (7Eh) should (RECOMMENDED) be written using the hexadecimal 
notation. (test negative cases)

I don't see what I should do here that is not already covered by test 2. 

test 5: The token SOLIDUS (a slash followed by no regular characters) 
introduces a unique valid name defined by the empty sequence of characters.
test 6: null character is forbidden (test pdf_token_name_new()) as well as #00 
(test pdf-token-read())
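
For the read side of tests 1, 2, 5 and 6, the expected results could be 
computed with a decoder along these lines. decode_name and hex_value are 
hypothetical helpers, not the library's functions; rejecting a '#' that is 
not followed by two hexadecimal digits is a choice made for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Value of one hexadecimal digit, or -1 if c is not one. */
static int
hex_value (int c)
{
  if (c >= '0' && c <= '9') return c - '0';
  if (c >= 'a' && c <= 'f') return c - 'a' + 10;
  if (c >= 'A' && c <= 'F') return c - 'A' + 10;
  return -1;
}

/* Decode the characters following the SOLIDUS of a name.
   out must hold at least len bytes.  Returns false for a malformed
   #xx sequence or for a null character (raw or written as #00). */
static bool
decode_name (const char *in, size_t len, unsigned char *out, size_t *out_len)
{
  size_t i = 0, n = 0;

  assert (out != NULL && out_len != NULL);

  while (i < len)
    {
      unsigned char c = (unsigned char) in[i++];

      if (c == '#')
        {
          int hi, lo;

          if (len - i < 2)
            return false;       /* '#' needs two digits after it */
          hi = hex_value (in[i]);
          lo = hex_value (in[i + 1]);
          if (hi < 0 || lo < 0)
            return false;
          c = (unsigned char) ((hi << 4) | lo);
          i += 2;
        }

      if (c == 0)
        return false;           /* test 6: null is forbidden */

      out[n++] = c;
    }

  *out_len = n;                 /* 0 for the empty name of test 5 */
  return true;
}
```

Test 2 could then loop over every regular character c, format it as "#xx", 
and check that the decoder gives back c, with "#23" as the special case of 
test 1.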

Question 1: The check that Name tokens don't contain null characters, done 
when the token is created in pdf_token_name_new, is redundant, since it is 
already done when reading the stream (pdf_token_read). We could instead let 
only the pdf_token_read and pdf_token_write functions verify that. This 
question also applies to COMMENT (eol characters) and KEYWORD tokens 
(non-regular characters).


GENERAL QUESTIONS:
question 1: in pdf-token.c, why do we add a null character at the end of some 
tokens (Names) and not others (Comments, Strings)? (see pdf_token_buffer_new 
and its pdf_bool_t nullterm)
question 2: should we create a test function (START_TEST) for each case (test1, 
test2...), or one per token type (COMMENTS, BOOLEAN...), or can we group 
them inside the same function, as has already been done in 
torture/unit/base/token/pdf-token-read.c?


Thanks in advance
/Pierre

