[scintilla] Re: Regular expression based generic lexer

Eugene Wed, 31 Aug 2005 10:48:39 -0700

Hello All,

First of all, thank you Neil for introducing my little lexer.  I have fixed 
some bugs in this lexer and will be submitting it to Neil.  I have included 
below a brief description of how this lexer can be used by any potential 
users.


Purpose of this Lexer
=====================
The main purpose of this lexer originates from my need to highlight very
specialised assembly code, such as that of DSPs.  It is meant to be used in
situations where you need to quickly get some reasonable syntax highlighting for
a very specialised language or where you do not wish to create your own
specialised lexer just for that particular language.  This lexer started from
humble beginnings where it only understood a very limited subset of regular
expression pattern matching, and it still does.  That's why I prefer to consider
the lexer as understanding Pseudo Regular Expressions (PRE) only and not
true-blue regular expressions.


How to Use
==========
To help in the usage of the lexer, a little explanation is required since there
is not much documentation.

Instead of trying to explain the types of syntax highlighting available in the
lexer, I will try to illustrate how to get the lexer to highlight various 
syntax elements of the C/C++ language.  Note that this is for illustration 
purposes only, and the highlighting will be limited compared to what LexCPP 
provides.  But for an exotic language that does not have any available 
support, this is better than nothing.  An example test.properties file is 
included below for illustration; it highlights files with the extension 
*.test using C-like syntax highlighting.

Highlighting of C Comments (Using Ranged Lists)
===============================================
C language comments start with the characters /* and end with */.  A regular
expression pattern that could pick out the comments bounded by /* and
*/ could be as follows:
/[*].*[*]/

The PRE lexer does not support .* pattern matching.  Therefore, to achieve such
highlighting, we make use of the Ranged lists to do the job.  The equivalent of
/[*].*[*]/ for the ranged list would be as follows:
keywords.$(file.patterns.User3)=/[*] [*]/
Here, we assign the above two patterns to the ranged list
"keywords.$(file.patterns.User3)".  This pair of patterns forms a set of
patterns.  For each set of patterns, the highlighting starts with the first
pattern ( i.e. /[*] ) and continues until it matches the second pattern
( i.e. [*]/ ).

We can have more than one set of patterns assigned to a given ranged list.  For
example,
keywords.$(file.patterns.User3)=/[*] [*]/ // $
style.User3.1=$(colour.code.comment.box),$(font.code.comment.box)

assigns highlighting equivalent to the patterns /[*].*[*]/ and //.*$
The second pair/set starts highlighting at // and ends at the end of the line.  
Both types of comments will get the same style of highlighting, as defined 
by style.User3.1 on the following line.
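
To make the start/end pairing concrete, here is a rough Python sketch of how a
ranged list could be scanned.  It is only an illustration: it uses Python's
re module in place of PRE, the names RANGE_PAIRS and find_ranges are made up
for this example, and letting an unterminated range run to the end of the
text is my assumption, not something taken from LexUser.cxx.

```python
import re

# Hypothetical stand-in for one ranged list: (start, end) pattern pairs.
# Python regexes are used here in place of PRE patterns.
RANGE_PAIRS = [
    (r"/[*]", r"[*]/"),  # /* ... */
    (r"//", r"$"),       # // ... end of line
]


def find_ranges(text, pairs=RANGE_PAIRS):
    """Return (start, end) offsets of every highlighted range in text."""
    compiled = [(re.compile(s), re.compile(e, re.MULTILINE)) for s, e in pairs]
    spans = []
    pos = 0
    while pos < len(text):
        # Pick the pair whose start pattern matches earliest from pos.
        best = None
        for start_re, end_re in compiled:
            m = start_re.search(text, pos)
            if m and (best is None or m.start() < best[0].start()):
                best = (m, end_re)
        if best is None:
            break
        start_match, end_re = best
        end_match = end_re.search(text, start_match.end())
        # Assumption: an unterminated range runs to the end of the text.
        end = end_match.end() if end_match else len(text)
        spans.append((start_match.start(), end))
        pos = max(end, start_match.end())
    return spans


sample = "int a; /* one */ int b; // two\nint c;"
for begin, end in find_ranges(sample):
    print(repr(sample[begin:end]))
# prints '/* one */' and then '// two'
```

Both ranges come from the same list, which is why they would share one style.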

There are altogether 6 available Ranged lists where pattern pairs can be
defined.  For the illustrated test.properties file, the variables
keywords.$(file.patterns.User3) thru keywords6.$(file.patterns.User3) are the
Ranged lists.

Highlighting of C keywords (Using Delimited Keyword Lists)
==========================================================
To highlight C language keywords such as if, else, for and while, we will make
use of the delimited keyword list.  For simplicity of illustration, let's assume
that the above is the full set of keywords to be highlighted.  Thus we assign
these keywords to "keywords13.$(file.patterns.User3)" as follows:

keywords13.$(file.patterns.User3)=if else for while

This will allow for syntax highlighting of individual words such as while.  The
matching criteria for keywords belonging to the delimited keyword list require
that syntax highlighting only occurs if the word is separated by at least 
one delimiter character on both ends.

The default list of delimiter characters is [^a-zA-Z0-9].  In C, words can
also contain the underscore character.  Therefore, to remove the underscore
character from the list of delimiter characters, we can define the
following:
keywords21.$(file.patterns.User3)=_
Thus the list of delimiter characters now becomes [^a-zA-Z0-9_].
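
The effect of the delimiter rule can be sketched in Python as below.  Again
this is only an illustration using Python's re module, not the lexer's code:
delimited_matches and extra_word_chars are hypothetical names, and treating
the start and end of the text as delimiters is an assumption on my part.

```python
import re

KEYWORDS = ["if", "else", "for", "while"]


def delimited_matches(text, keywords=KEYWORDS, extra_word_chars="_"):
    """Return (offset, keyword) for keywords bounded by delimiter characters.

    extra_word_chars plays the role of keywords21: characters removed from
    the default delimiter set [^a-zA-Z0-9].
    """
    word_chars = "a-zA-Z0-9" + re.escape(extra_word_chars)
    longest_first = sorted(keywords, key=len, reverse=True)
    alternatives = "|".join(re.escape(k) for k in longest_first)
    # Lookarounds: no word character directly before or after the keyword.
    # Assumption: start and end of text count as delimiters.
    pattern = re.compile(
        rf"(?<![{word_chars}])(?:{alternatives})(?![{word_chars}])"
    )
    return [(m.start(), m.group()) for m in pattern.finditer(text)]


code = "if (for_each) while_x = elif; while (1) for (;;)"

# With "_" treated as a word character, for_each and while_x are left alone.
print([k for _, k in delimited_matches(code)])
# ['if', 'while', 'for']

# With the default delimiters, "_" ends a word, so "for" and "while" match
# inside for_each and while_x as well.
print([k for _, k in delimited_matches(code, extra_word_chars="")])
# ['if', 'for', 'while', 'while', 'for']
```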

Highlighting of C operators (Using Keyword Lists)
=================================================
The syntax highlighting of C operators can be achieved with the help of the
keyword lists.  For illustration, we can assign the operators to
"keywords7.$(file.patterns.User3)".

keywords7.$(file.patterns.User3)=% \^ & \* ( ) \- \+ = | { } \[ \] : ; < > , / \? ! . ~
Strictly speaking, the operator symbols are not really keywords.  Nevertheless,
we use the term keyword loosely to refer to the patterns to match.  Note that
operators (keywords) found in keyword lists do not need to be delimited by any
delimiter characters in order to be highlighted.  Therefore both plus
operators in the following statement will be highlighted:
a+b + c

As a further optimization, we can specify the operators as follows:

keywords7.$(file.patterns.User3)=[%\^&\*()\-\+=|{}\[\]:;<>,/\?!.~]
Both definitions of keywords7.$(file.patterns.User3) will yield the same result,
but the second definition should be more efficient because there is only one
search pattern.
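
A small Python comparison of the two forms is sketched below.  It uses
Python's re module rather than PRE, and the helper names are made up for the
example; it only shows that one pattern per operator and one character class
find the same positions, with the second needing a single pass.

```python
import re

OPERATORS = "% ^ & * ( ) - + = | { } [ ] : ; < > , / ? ! . ~".split()

# Form 1: one pattern per operator, each searched separately.
separate = [re.compile(re.escape(op)) for op in OPERATORS]

# Form 2: a single character class covering all operators.
combined = re.compile("[" + "".join(re.escape(op) for op in OPERATORS) + "]")


def offsets_separate(text):
    """Offsets found by searching once per operator pattern."""
    return sorted(m.start() for p in separate for m in p.finditer(text))


def offsets_combined(text):
    """Offsets found by a single pass with the character class."""
    return [m.start() for m in combined.finditer(text)]


line = "a+b + c"
print(offsets_separate(line))  # [1, 4]
print(offsets_combined(line))  # [1, 4]
```

Neither plus sign needs a delimiter next to it, which matches the behaviour
described for keyword lists.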

Ambiguities in deciding which list to use
=========================================
Sometimes, syntax highlighting can be achieved in more than one way, or so it 
seems. For example the double quoted string can be highlighted by using the 
ranged lists.

keywords2.$(file.patterns.User3)=" "

It is also possible to highlight it using the keyword lists as below.

keywords12.$(file.patterns.User3)="[^"]*"

There are subtle differences between the above two approaches.  The following
string will be highlighted incorrectly using either of the above approaches.

"This is a string with an escaped double quote \" here"

Both approaches will highlight only the text "This is a string with an
escaped double quote \" while leaving the word "here" unhighlighted.  The
fault is not due to the lexer but rather the way we specify our regular
expression for strings.  A more appropriate way to match strings is to use
the ranged list defined below, whose end pattern only accepts a double quote
that is not preceded by a backslash.

keywords2.$(file.patterns.User3)=" [^\\]"

I don't think there is a suitable regular expression equivalent for matching
strings using the keyword lists.
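
The difference can be reproduced with the Python sketch below.  It uses
Python's re module, not PRE, and ranged_string is a hypothetical helper that
only mimics the start/end pairing of the ranged list.  The last pattern
relies on grouping and alternation, which a full regular expression engine
offers but which none of the PRE examples in this post show, so I do not
assume it works in a keyword list.

```python
import re

sample = r'x = "This is a string with an escaped double quote \" here";'

# Keyword-list style: a quote, any run of non-quote characters, a quote.
naive = re.compile(r'"[^"]*"')

# Ranged-list style end pattern: a quote preceded by a non-backslash.
range_end = re.compile(r'[^\\]"')

# Full regex engines only: grouping plus alternation handles escapes.
full = re.compile(r'"(?:[^"\\]|\\.)*"')


def ranged_string(text):
    """Start at the first quote; stop at the first quote that matches
    range_end, or at the end of the text if there is none (assumption)."""
    start = text.find('"')
    if start < 0:
        return None
    end = range_end.search(text, start + 1)
    return text[start:end.end()] if end else text[start:]


print(naive.search(sample).group())  # stops at the escaped quote
print(ranged_string(sample))         # covers the whole string literal
print(full.search(sample).group())   # covers the whole string literal
```

This sketch is simplistic: it does not handle an empty string "" or a string
whose last character is an escaped backslash.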

Sample Files
============
A set of sample files is provided:
1) test.properties
2) LexUser.test
The test.properties file is the LexUser definition that provides limited
C-like syntax highlighting for illustration purposes.  The LexUser.test file
is a sample C++ code file with extension *.test that will activate LexUser to
perform the C-like syntax highlighting based upon the definitions in 
test.properties.

A Peek into test.properties
===========================
The following is the content of test.properties, which illustrates how LexUser 
can be used to provide limited C syntax highlighting (when compared to LexCPP, 
of course).  In the new package, there should be a file SampleText/LexUser.test
that, when edited in SciTE, will make use of the syntax highlighting definitions
found in test.properties.  Also try opening scintilla/src/LexUser.cxx and make 
a comparison between the two.

--------------------------------------------------
#
# Define SciTE settings for User Customisable Syntax Highlighting with Pseudo
# Reg Exp
# Eugene Wong
#
# Pseudo Reg Exp (Pseudo for speed and simplicity of lexer)
# To understand pseudo regular expression better, please refer to regular 
# expression documentations from gawk, sed or python for a more complete 
# explanation
#
# Examples of valid Pseudo Reg Exp
# abc           ==> means exact strings match to abc
# a*            ==> matches zero or more 'a'
# a+            ==> matches one or more 'a'
# a?            ==> matches zero or one 'a'
# a\?           ==> matches the word "a?"
# [azT]         ==> a character match with characters a,z,T within the []
#                   brackets
# [a-f]         ==> matches with characters from a to f
# [a-f]+        ==> matches one or more characters from a to f
# [a-f]*        ==> matches zero or more characters from a to f
# [a-f]?        ==> matches zero or one character from a to f
# [\\]          ==> matches character \
# [^0-9]        ==> matches any characters not belonging to 0 to 9
# #if[_\t]0     ==> match two words "#if" and "0" separated by any amount of
#                   space and/or tabs.
# For more examples, please refer to this example properties file itself.
#
# Note:
# These are some of the pseudo quirks!
# _             ==> match either a space or tab character
# [_]           ==> match a space character only
# [\t]          ==> match a tab character
# [\r\n]        ==> match a \r or \n character
# There are more features and quirks that come along with it. Good Luck!

# C syntax highlighting as an illustration
file.patterns.User3=*.test
filter.User3=Test C highlighting (*.test)|$(file.patterns.User3)|

# User customisable lexers that can be used range from User, User2, ..., User10
lexer.$(file.patterns.User3)=User3

# Uncomment this if you need case insensitive syntax highlighting
#ignorecase.User3=y

# The characters in this non-delimiter string, plus alphanumeric characters,
# are considered valid keyword characters
# All non-keyword characters are considered delimiter characters
# Note: space can never be considered as a delimiter character
keywords21.$(file.patterns.User3)=_

# Ranged lists (keywords ~ keywords6)
# Ranged lists MUST come in pairs
# For each pair, the first word represents the starting keyword to turn on the
# syntax highlighting (eg: /* )
# the second word represents the ending keyword to turn off the highlighting
# (eg: */ )
# Note that the Pseudo Reg Exp does not support syntax like /[*].*[*]/.  The
# closest we can get is with Ranged lists.
keywords.$(file.patterns.User3)=/[*] [*]/ // $
style.User3.1=$(colour.code.comment.box),$(font.code.comment.box)
# Double quoted strings
keywords2.$(file.patterns.User3)=" [^\\]"
style.User3.2=$(colour.string)

#keywords3.$(file.patterns.User3)=
#style.User3.3=
#keywords4.$(file.patterns.User3)=
#style.User3.4=
#keywords5.$(file.patterns.User3)=
#style.User3.5=
#keywords6.$(file.patterns.User3)=
#style.User3.6=


#Keyword lists (keywords7 ~ keywords12)
#Each keyword in the list is as specified by the pseudo regular expression.
#Keyword lists have no requirements to the characters appearing immediately
#before and after each keyword.
keywords7.$(file.patterns.User3)=[%\^&\*()\-\+=|{}\[\]:;<>,/\?!.~] test
style.User3.7=$(colour.operator),bold
#Preprocessor
keywords8.$(file.patterns.User3)=#include[_\t]+["<]?[^">\n\t_]*[">\n\t_] \
#define[_\t]+[^_\t\n]* \
#if[n]?def[_\t]+[^_\t\n]* #if #else #elif #endif
style.User3.8=$(colour.preproc)
#keywords9.$(file.patterns.User3)=
#style.User3.9=
#keywords10.$(file.patterns.User3)=
#style.User3.10=
keywords11.$(file.patterns.User3)='[^']*'
style.User3.11=$(colour.char)
#keywords12.$(file.patterns.User3)="[^"]*"
#style.User3.12=$(colour.string)


#Delimited keyword lists (keywords13 ~ keywords20)
#Each keyword specified in the following list has to be delimited by the
delimiter before
#it will be highlighted.
keywords13.$(file.patterns.User3)=and and_eq asm auto bitand bitor bool break \
case catch char class compl const const_cast continue \
default delete do double dynamic_cast else enum explicit export extern false \
float for \
friend goto if inline int long mutable namespace new not not_eq \
operator or or_eq private protected public \
register reinterpret_cast return short signed sizeof static static_cast struct \
switch \
template this throw true try typedef typeid typename union unsigned using \
virtual void volatile wchar_t while xor xor_eq
style.User3.13=$(colour.keyword),bold
#Number
keywords14.$(file.patterns.User3)=0x[0-9a-fA-F]+ [0-9]+
style.User3.14=$(colour.number)
#keywords15.$(file.patterns.User3)=
#style.User3.15=
#keywords16.$(file.patterns.User3)=
#style.User3.16=
#keywords17.$(file.patterns.User3)=
#style.User3.17=
#keywords18.$(file.patterns.User3)=
#style.User3.18=
#keywords19.$(file.patterns.User3)=
#style.User3.19=
#keywords20.$(file.patterns.User3)=
#style.User3.20=

#Ignore case flag (keywords22)
keywords22.$(file.patterns.User3)=$(ignorecase.User3)

# Highlighting Styles
# style.User3.X corresponds to keywordsX
# Default
style.User3.0=fore:#000000
style.User3.32=fore:$(font.base)

comment.block.ch=//~
#comment.block.at.line.start.ch=1
comment.stream.start.ch=/*
comment.stream.end.ch=*/
comment.box.start.ch=/*
comment.box.middle.ch= *
comment.box.end.ch= */

statement.indent.$(file.patterns.c.like)=5 case catch class default do else finally \
for if private protected public struct try union while
statement.end.$(file.patterns.c.like)=10 ;
statement.lookback.$(file.patterns.c.like)=20
block.start.$(file.patterns.c.like)=10 {
block.end.$(file.patterns.c.like)=10 }
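--------------------------------------------------

For readers more used to ordinary regular expressions, the quirks listed in
the comments at the top of test.properties can be read roughly as in the
Python sketch below.  This is purely my illustration of that comment table,
not code from LexUser.cxx: pre_to_regex is a hypothetical helper, it covers
only the constructs shown in the table, and it makes no attempt to say how a
literal underscore would be written in a pattern.

```python
import re


def pre_to_regex(pre):
    """Translate a PRE pattern into a Python regex, following the comment
    table: '_' means space or tab, '[_]' means space only, '.' is literal."""
    out = []
    in_class = False
    i = 0
    while i < len(pre):
        ch = pre[i]
        if ch == "\\" and i + 1 < len(pre):
            out.append(pre[i:i + 2])  # keep escapes such as \t, \?, \\
            i += 2
            continue
        if ch == "_":
            # Inside brackets: a space.  Outside: a space or a tab.
            out.append(" " if in_class else "[ \\t]")
        elif in_class:
            if ch == "]":
                in_class = False
            out.append(ch)
        elif ch == "[":
            in_class = True
            out.append(ch)
        elif ch in "*+?$":
            out.append(ch)  # quantifiers and end-of-line keep their meaning
        else:
            out.append(re.escape(ch))  # everything else is literal
        i += 1
    return "".join(out)


include = pre_to_regex(r'#include[_\t]+["<]?[^">\n\t_]*[">\n\t_]')
print(bool(re.fullmatch(include, "#include <stdio.h>")))          # True
print(bool(re.fullmatch(pre_to_regex(r"#if[_\t]0"), "#if\t0")))   # True
print(bool(re.fullmatch(pre_to_regex("[_]"), "\t")))              # False
```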

