On 04-Aug-12 14:02, Christophe Travert wrote:
Jonathan M Davis, in message (digitalmars.D:174191), wrote:
On Thursday, August 02, 2012 11:08:23 Walter Bright wrote:
The tokens are not kept, correct. But the identifier strings, and the string
literals, are kept, and if they are slices into the input buffer, then
everything I said applies.

String literals often _can't_ be slices unless you leave them in their
original state rather than giving the version that they translate to (e.g.
leaving \&copy; in the string rather than replacing it with its actual
Unicode value). And since you're not going to be able to create the literal
using whatever type the range is unless it's a string of some variety, that
means that the literals often can't be slices, which - depending on the
implementation - would make it so that they can't _ever_ be slices.
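
For illustration, a rough sketch of why (decodeStringLiteral is a hypothetical
name, not dmd's or the proposed lexer's API): once a literal contains an escape
such as \&copy;, its decoded value is not a substring of the source text, so it
has to be built in fresh memory rather than returned as a slice.

string decodeStringLiteral(const(char)[] sourceText)
{
    import std.array : appender;
    import std.algorithm.searching : startsWith;

    auto result = appender!string();
    for (size_t i = 0; i < sourceText.length; ++i)
    {
        // \&copy; is a named character entity escape; its decoded value ("©")
        // does not appear anywhere in the source buffer.
        if (sourceText[i] == '\\' && sourceText[i + 1 .. $].startsWith("&copy;"))
        {
            result.put("©");
            i += "&copy;".length; // skip the rest of the escape sequence
        }
        else
            result.put(sourceText[i]);
    }
    return result.data;
}

unittest
{
    // The source slice reads \&copy; 2012, but the literal's value is "© 2012".
    assert(decodeStringLiteral(`\&copy; 2012`) == "© 2012");
}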

Identifiers are a different story, since they don't have to be translated at
all, but regardless of whether keeping a slice would be better than creating a
new string, the identifier table will be far superior, since then you only need
one copy of each identifier. So, it ultimately doesn't make sense to use slices
in either case even without considering issues like them being spread across
memory.
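
For concreteness, a minimal sketch of such an identifier table (just the idea,
not dmd's actual implementation; the names are made up): each spelling is
stored exactly once, and every later occurrence of the same identifier gets the
same interned copy back.

string[string] identifierTable; // hypothetical: spelling -> the single stored copy

string intern(string spelling)
{
    if (auto existing = spelling in identifierTable)
        return *existing;           // later occurrences reuse this one copy
    auto canonical = spelling.idup; // copy once, so no token pins the source buffer
    identifierTable[canonical] = canonical;
    return canonical;
}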

The only place that I'd expect a slice in a token is in the string representing
the text that was lexed, and that won't normally be kept around.

- Jonathan M Davis

I thought it was not the lexer's job to process literals. Just split
the input into tokens, and provide minimal info: TokenType, line and col,
along with the representation from the input. That's enough for a syntax
highlighting tool, for example. Otherwise you'll end up doing complex
interpretation and the lexer will not be that efficient. Literal
interpretation can be done in a second step. Do you think doing literal
interpretation separately, when you need it, would be less efficient?
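
For example, a token along these lines (field names are hypothetical, not a
proposed API): the lexer records only the kind, the position, and the raw slice
of the input, and any literal decoding happens in a later pass.

enum TokenType { identifier, stringLiteral, intLiteral, /* ... */ eof }

struct Token
{
    TokenType type;
    uint line;
    uint col;
    const(char)[] text; // raw representation, straight from the input buffer
}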

Most likely - since you end up reading the same memory twice to do it.

--
Dmitry Olshansky
