[pdf-devel] Re: Modifications on pdf_token_read to get token boundaries

Michael Gold Mon, 25 May 2009 16:59:21 -0700

On Tue, May 26, 2009 at 00:53:46 +0200, [email protected] wrote:
...
> We need to be able to determine the boundaries of a token that has
> been read, for error reporting. We cannot rely on the stm used by the
> token reader to determine the beginning position of a read token,
> since it skips whitespace characters.


This behaviour could be changed by either
 - adding a flag that causes token_read to return whitespace as a token;
   or,
 - adding a function/flag to advance to the beginning of the next token.
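A rough sketch of the second option, to show why it would be enough for
the caller to read the stream position afterwards. Everything here is
hypothetical: toy_stm_t and toy_skip_to_token are made-up names over an
in-memory buffer, not the library's stm or token reader API, and PDF
comments (which also separate tokens) are ignored for brevity.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical in-memory stream; stands in for the real stm type. */
typedef struct {
    const char *data;
    size_t len;
    size_t pos;   /* offset of the next unread byte */
} toy_stm_t;

/* The six PDF whitespace characters: NUL, HT, LF, FF, CR and SP. */
static int toy_is_pdf_whitespace(unsigned char c)
{
    return c == 0 || c == 9 || c == 10 || c == 12 || c == 13 || c == 32;
}

/* Advance past leading whitespace and return the new position, which is
 * where the next token begins (or len if only whitespace remained). */
static size_t toy_skip_to_token(toy_stm_t *stm)
{
    assert(stm != NULL);
    while (stm->pos < stm->len
           && toy_is_pdf_whitespace((unsigned char) stm->data[stm->pos]))
        stm->pos++;
    return stm->pos;
}
```

With something like this, a caller interested in error reporting could
skip first, note the position, and then read the token; callers that do
not care about positions would be unaffected.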

> We would need to expand the pdf_token_read to communicate both the
> beginning position and the end position in the stm of the last read
> token. It could be done using two extra parameters:
> 
> pdf_status_t pdf_token_read (pdf_token_reader_t reader,
>                              pdf_u32_t flags,
>                              pdf_size_t *beginning_pos,
>                              pdf_size_t *end_pos,
>                              pdf_token_t *token);
> 
> If NULLs are passed then the parameters are not filled.
> 
> An alternative would be to expand the pdf_token_t ADT to include such
> information, but I think it would not be quite appropriate, since it
> is not part of the semantics of the token.

True, I'd rather not include it in the token structure.

> Would this modification be ok with you?

I'm not sure about the API.  If the extra parameters will only be used
in the case of an error, maybe a new function could be added to access
the positions of the last token (to keep pdf_token_read simple); or the
stream methods could be called directly if the caller could manually
skip whitespace.
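To make the "new function" idea concrete, here is a minimal sketch in
which the reader remembers the bounds of the last token and a separate
accessor reports them. The names (toy_reader_t, toy_token_read,
toy_last_token_bounds) are invented for illustration and are not
existing library functions; tokens are treated as plain
whitespace-delimited runs, which is much simpler than real PDF
tokenisation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reader over an in-memory buffer. */
typedef struct {
    const char *data;
    size_t len;
    size_t pos;         /* offset of the next unread byte */
    size_t last_begin;  /* offset of the first byte of the last token */
    size_t last_end;    /* offset one past the last byte of the last token */
    int has_last;       /* non-zero once a token has been read */
} toy_reader_t;

static int toy_is_space(unsigned char c)
{
    return c == 0 || c == 9 || c == 10 || c == 12 || c == 13 || c == 32;
}

/* Read one whitespace-delimited token, recording its bounds.
 * Returns 1 on success and 0 at end of input. The read call itself
 * takes no position parameters. */
static int toy_token_read(toy_reader_t *r)
{
    assert(r != NULL);
    while (r->pos < r->len && toy_is_space((unsigned char) r->data[r->pos]))
        r->pos++;
    if (r->pos == r->len) {
        r->has_last = 0;
        return 0;
    }
    r->last_begin = r->pos;
    while (r->pos < r->len && !toy_is_space((unsigned char) r->data[r->pos]))
        r->pos++;
    r->last_end = r->pos;
    r->has_last = 1;
    return 1;
}

/* Separate accessor, called only when the caller needs positions
 * (e.g. when reporting an error). NULL outputs are not filled. */
static int toy_last_token_bounds(const toy_reader_t *r,
                                 size_t *begin, size_t *end)
{
    assert(r != NULL);
    if (!r->has_last)
        return 0;
    if (begin != NULL)
        *begin = r->last_begin;
    if (end != NULL)
        *end = r->last_end;
    return 1;
}
```

The cost of this design is two extra fields in the reader; the benefit
is that the signature of the read function stays as it is today.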

Also, what would beginning_pos and end_pos mean exactly?  Are they based
on the byte positions of the underlying stream before filtering, or on
the number of bytes actually seen by the tokeniser (after filtering)?
The physical stream position (e.g. as reported by ftell) might not be
useful; for example, if a decompression filter operates on blocks of
data, it could emit many tokens without advancing.
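A small illustration of that last point, using a toy run-length filter
rather than any real library filter. The encoding (pairs of count and
value bytes) and all names are assumptions made up for this sketch; it
is not PDF's RunLengthDecode format. The filter consumes one pair from
the source at a time and then hands out the decoded bytes one by one, so
the physical position stays put while the logical position advances.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical filtered stream over an in-memory encoded buffer. */
typedef struct {
    const unsigned char *src;  /* encoded bytes: (count, value) pairs */
    size_t src_len;
    size_t physical_pos;       /* bytes consumed from src (the ftell-like view) */
    size_t logical_pos;        /* decoded bytes handed to the caller */
    unsigned char value;       /* value of the run currently being emitted */
    unsigned char remaining;   /* decoded bytes still pending in that run */
} toy_rle_stm_t;

/* Return the next decoded byte, or -1 at end of input. */
static int toy_rle_read_byte(toy_rle_stm_t *s)
{
    assert(s != NULL);
    /* Refill from the source only when the current run is exhausted;
     * the loop also skips over runs with a zero count. */
    while (s->remaining == 0) {
        if (s->src_len - s->physical_pos < 2)
            return -1;
        s->remaining = s->src[s->physical_pos];
        s->value = s->src[s->physical_pos + 1];
        s->physical_pos += 2;
    }
    s->remaining--;
    s->logical_pos++;
    return s->value;
}
```

Reading the three bytes of a run of length 3 moves logical_pos from 0 to
3 while physical_pos is 2 the whole time, so two distinct tokens inside
one block would report the same physical position.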

-- Michael
