Sergey Gromov wrote:
<snip>
Well, you can write a regexp to handle a simple C string.  That is, if
your regexp is matched against the whole file, which is usually not the
case.  Otherwise you'll have troubles with C string:

"foo\
bar"

or D string:

"foo
bar"

So there is a problem if the highlighter works by matching regexps on a line-by-line basis. But matching regexps over a whole file is no harder in principle than matching line by line and, as long as the maximal munch principle is never called into action, it can't be much less efficient. (The only bit of C or D strings that relies on maximal munch is octal escapes.)

Then you want to highlight string escapes and probably format
specifiers.  Therefore you need not simple regexps but hierarchies of
them, and also you need to know where *internals* of the string start
and end.

Let's just concentrate for the moment on the simple process of finding the beginning and end of a string. Here's a snippet of a TextPad syntax file:

StringsSpanLines = Yes
StringStart = "
StringEnd = "
StringEsc = \

A possible snippet of lexer code to handle this (which AFAIK might be near enough how TP does it):

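if (*c == StringStart) {
    beginHighlightString(c);
    /* Scan to the closing delimiter, stopping early at the end of the
       buffer or, if strings may not span lines, at a line break. */
    for (++c; *c != StringEnd && *c != '\0'
          && (StringsSpanLines || *c != '\n'); ++c) {
        if (*c == StringEsc) ++c;  /* skip the escaped character */
    }
    endHighlightString(c + 1);
}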

It's simple and it should work. (OK, there are two assumptions made for simplicity: that line breaks are normalised to LF, and that the file is terminated by at least two null bytes in memory, but you get the idea.)
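
To make the loop above concrete, here is a self-contained sketch of the same scan as a plain C function. The names (scan_string, STRING_START and so on) are made up for illustration and have nothing to do with TextPad's or N++'s actual code. Unlike the snippet above, it checks for the terminator after an escape character, so a single trailing null byte is enough.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical settings mirroring the syntax-file snippet above. */
enum { STRING_START = '"', STRING_END = '"', STRING_ESC = '\\' };

/* Returns the offset one past the closing delimiter of the string literal
   that begins at s[start]. If the literal is unterminated, returns the
   offset where scanning stopped: the end of the buffer, or a line break
   when span_lines is 0. */
size_t scan_string(const char *s, size_t start, int span_lines)
{
    assert(s[start] == STRING_START);
    size_t i = start + 1;
    while (s[i] != '\0' && s[i] != STRING_END
           && (span_lines || s[i] != '\n')) {
        if (s[i] == STRING_ESC && s[i + 1] != '\0')
            ++i;  /* skip the escaped character, whatever it is */
        ++i;
    }
    if (s[i] == STRING_END)
        ++i;      /* include the closing delimiter in the highlight */
    return i;
}
```

Note that the C string with a backslash-newline is scanned correctly even with span_lines set to 0, because the escape swallows the line break, whereas the D string with a bare line break needs span_lines set to 1.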

While it doesn't support highlighting of escapes, I can't see this fact as being the reason N++'s developers haven't implemented even this in the generic lexer module. I probably couldn't see it being the reason even if the C lexer did highlight escapes (which it doesn't).

Then you have r"foo" which probably can be handled with regexps.

Then you have q"/foo/" where "/" can be anything.  Still can be handled
by extended regexps, even though they won't be regular expressions in
scientific sense.

Then you have q"{foo}" where "{" and "}" can be any of ()[]<>{}.
Regexps cannot translate while substituting, so you must create regexps
for all possible parens.

Yes, these aspects are more complicated. Both TP and N++ (out of the box, anyway) are probably far from being able to lex D2 properly. But they certainly could do better in supporting D1. Still, once N++ gains access to Scintilla's D lexer, things will certainly be better.
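
To give an idea of what the hand-written route looks like for the D2 delimited strings, here is a rough sketch in C. It covers only the single-character forms: q"/foo/" and the four bracket pairs. The heredoc form with an identifier delimiter is left out. The function names are hypothetical, and the rule that bracket delimiters nest is an assumption about how D2 treats them, so take it as an illustration rather than a reference lexer.

```c
#include <assert.h>
#include <stddef.h>

/* Maps an opening bracket to its closing partner. Any other character
   closes itself, as in q"/foo/". */
static char closing_delimiter(char open)
{
    switch (open) {
    case '(': return ')';
    case '[': return ']';
    case '<': return '>';
    case '{': return '}';
    default:  return open;
    }
}

/* s[start] must begin a delimited string: q, a double quote, a delimiter.
   Returns the offset one past the closing quote, or the offset of the
   terminating null byte if the string is unterminated. */
size_t scan_delimited_string(const char *s, size_t start)
{
    assert(s[start] == 'q' && s[start + 1] == '"' && s[start + 2] != '\0');
    char open = s[start + 2];
    char close = closing_delimiter(open);
    int nests = (close != open);  /* only bracket pairs are assumed to nest */
    int depth = 1;
    size_t i;
    for (i = start + 3; s[i] != '\0'; ++i) {
        if (nests && s[i] == open) {
            ++depth;
        } else if (s[i] == close) {
            if (nests && depth > 0)
                --depth;
            /* the string ends at an unnested closer followed by a quote */
            if ((!nests || depth == 0) && s[i + 1] == '"')
                return i + 2;
        }
    }
    return i;
}
```

The translation from opening to closing bracket, which a regexp cannot do while substituting, is a five-line switch here.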

And of course q"BLAH
whatever BLAH here
BLAH", well, probably nice for help texts.

And these are only strings.  Try to write regexp which treats .__15 as
number(.__15), .__foo as operator(.), ident(__foo), and 2..3 as
number(2), operator(..), number(3).
<snip>

We'd need many regexps to handle all possible cases, but a possible set to cover these cases and a few others (listed in a possible order of priority) is:

\._*[0-9][0-9_]*
([1-9][0-9]*)(\.\.)
[0-9]+\.[0-9]*
[1-9][0-9]*
\.\.
\.
[a-zA-Z_][a-zA-Z0-9_]*

Note the use of capturing groups to handle the 2..3 case. Each capturing group would match a token, while in the other cases the whole regexp matches a token.
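
As a check on the idea, the list can be driven by a small loop that tries each regexp in order at the current position and takes the first one that matches. The sketch below uses POSIX regcomp/regexec from <regex.h>, which is an assumption on my part: any regexp engine with capturing groups would do. Each pattern is copied from the list above with a ^ added in front so that it can only match at the cursor. The Token and Rule types and the tokenize function are invented for the example.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

typedef enum { TOK_NUMBER, TOK_OPERATOR, TOK_IDENT } TokKind;

typedef struct {
    TokKind kind;
    size_t start, len;    /* position within the source text */
} Token;

/* One rule per regexp, in priority order. A rule with groups > 0 yields
   one token per capturing group; otherwise the whole match is one token. */
typedef struct {
    const char *pattern;  /* anchored with ^ so it matches at the cursor */
    int groups;
    TokKind kinds[2];
} Rule;

static const Rule rules[] = {
    { "^\\._*[0-9][0-9_]*",      0, { TOK_NUMBER,   TOK_NUMBER   } },
    { "^([1-9][0-9]*)(\\.\\.)",  2, { TOK_NUMBER,   TOK_OPERATOR } },
    { "^[0-9]+\\.[0-9]*",        0, { TOK_NUMBER,   TOK_NUMBER   } },
    { "^[1-9][0-9]*",            0, { TOK_NUMBER,   TOK_NUMBER   } },
    { "^\\.\\.",                 0, { TOK_OPERATOR, TOK_OPERATOR } },
    { "^\\.",                    0, { TOK_OPERATOR, TOK_OPERATOR } },
    { "^[a-zA-Z_][a-zA-Z0-9_]*", 0, { TOK_IDENT,    TOK_IDENT    } },
};
enum { NRULES = sizeof rules / sizeof rules[0] };

static Token make_token(TokKind kind, size_t base, regmatch_t m)
{
    Token t = { kind, base + (size_t)m.rm_so, (size_t)(m.rm_eo - m.rm_so) };
    return t;
}

/* Splits src into tokens, writing at most max of them to out. Characters
   that no rule matches (spaces, for instance) are skipped. Returns the
   number of tokens written. */
size_t tokenize(const char *src, Token *out, size_t max)
{
    regex_t re[NRULES];
    for (int r = 0; r < NRULES; ++r) {
        int rc = regcomp(&re[r], rules[r].pattern, REG_EXTENDED);
        assert(rc == 0);
        (void)rc;
    }
    size_t pos = 0, n = 0;
    while (src[pos] != '\0' && n < max) {
        regmatch_t m[3];
        int r;
        for (r = 0; r < NRULES; ++r)      /* first matching rule wins */
            if (regexec(&re[r], src + pos, 3, m, 0) == 0)
                break;
        if (r == NRULES) {                /* nothing matched: skip a char */
            ++pos;
            continue;
        }
        if (rules[r].groups == 0) {
            out[n++] = make_token(rules[r].kinds[0], pos, m[0]);
        } else {
            for (int g = 1; g <= rules[r].groups && n < max; ++g)
                out[n++] = make_token(rules[r].kinds[g - 1], pos, m[g]);
        }
        pos += (size_t)m[0].rm_eo;
    }
    for (int r = 0; r < NRULES; ++r)
        regfree(&re[r]);
    return n;
}
```

Fed the three inputs from the quoted challenge, this gives one number token for .__15, an operator followed by an identifier for .__foo, and number, operator, number for 2..3. Recompiling the regexps on every call is wasteful, of course; a real lexer would compile them once.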

Stewart.
