On Wednesday, 9 October 2013 at 07:49:55 UTC, Andrei Alexandrescu wrote:
On 10/8/13 11:11 PM, ilya-stromberg wrote:
On Tuesday, 8 October 2013 at 00:16:45 UTC, Andrei Alexandrescu wrote:
To put my money where my mouth is, I have a proof-of-concept tokenizer
for C++ in working state.

http://dpaste.dzfl.pl/d07dd46d

Why do you use "\0" as end-of-stream token:

/**
 * All token types include regular and reservedTokens, plus the null
 * token ("") and the end-of-stream token ("\0").
 */

There can be situations where "\0" is a valid token, for example in binary formats. Is it possible to indicate end-of-stream another way, maybe via an "empty" property for a range-based API?

I'm glad you asked. It's simply a decision by convention. I know no C++ source can contain a "\0", so I append it to the input and use it as a sentinel.

A general lexer should take the EOF symbol as a parameter.
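The sentinel idea above can be sketched in a few lines. This is a hypothetical illustration, not the dpaste code: the function name `splitWords` and the whitespace-splitting logic are invented for the example, but the mechanism is the one described, append a caller-chosen sentinel (a compile-time parameter, per the suggestion above) so the inner loop's end-of-input test collapses into an ordinary character comparison.

```cpp
#include <string>
#include <vector>

// Hypothetical sketch: the sentinel is a template parameter, so a lexer for
// a format where '\0' is legal data can pick a different end marker.
template <char Sentinel = '\0'>
std::vector<std::string> splitWords(std::string input) {
    input.push_back(Sentinel);  // by convention, input must not contain Sentinel
    std::vector<std::string> tokens;
    std::string current;
    for (char c : input) {
        if (c == Sentinel) break;  // one comparison doubles as the EOF check
        if (c == ' ') {
            if (!current.empty()) { tokens.push_back(current); current.clear(); }
        } else {
            current += c;
        }
    }
    if (!current.empty()) tokens.push_back(current);
    return tokens;
}
```

The payoff is that the hot loop never branches on a separate "are we at the end?" test; the sentinel comparison subsumes it.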

One more thing: the trie matcher knows a priori (statically) what the maximum lookahead is - it's the maximum of all symbols. That can be used to pre-fill the input buffer such that there's never an out-of-bounds access, even with input ranges.
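The pre-fill trick might look like the following sketch (names and the `maxLookahead` value of 3 are assumptions for illustration; in the real matcher the bound would be computed statically from the token table): pad the buffer with `maxLookahead` sentinels, and every trie step can then read ahead unconditionally.

```cpp
#include <cstddef>
#include <string>

// Assumed for illustration: the longest reserved token is 3 chars
// (e.g. the C++ operator "<<="), so 3 sentinels of padding suffice.
constexpr std::size_t maxLookahead = 3;

std::string padInput(std::string input, char sentinel = '\0') {
    input.append(maxLookahead, sentinel);  // guarantees in-bounds reads below
    return input;
}

// A trie-style match can now peek maxLookahead chars without a bounds check:
// even at the last real character, the reads land on sentinels, not past the
// buffer.
bool startsWithShiftAssign(const std::string& padded, std::size_t pos) {
    return padded[pos] == '<' && padded[pos + 1] == '<' && padded[pos + 2] == '=';
}
```

With an input range rather than an array, the same invariant would be maintained by refilling the buffer whenever fewer than `maxLookahead` characters remain unconsumed.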


Andrei

So, it would be interesting to see a new, improved API, because we need a truly generic lexer. I don't think it would be too difficult.
