Re: Parsing a language with optional spaces

Christian Schoenebeck Mon, 06 Jul 2020 10:35:08 -0700

On Monday, 6 July 2020 18:25:56 CEST Maury Markowitz wrote:
> Moving to a new thread - I was surprised I could even post, previous efforts
> were bounced from the list server for no obvious reason. Someone helpfully
> posted for me in the past. And now everything is magically working, so I
> hope you don't all mind the duplicate.
> > On Jul 6, 2020, at 9:04 AM, Christian Schoenebeck
> > <[email protected]> wrote:
> > You would simply add a RegEx pattern & rule like this:
> Consider two snippets in home-computer-era BASIC:
> 
> FOREX=10
> 
> and:
> 
> FOREX=10TO20
> 
> Is the first one a broken FOR statement or a perfectly valid variable
> assignment? (Why not both?!) MS says the former, BBC the latter.


To keep the language design question separate from the actual parser
implementation: most programming languages allow 1..n spaces between keywords,
of course. Allowing 0..n spaces is usually only done by programming languages
where it cannot lead to ambiguities. From a language design perspective, your
example:

FOREX=...

would IMO clearly be a variable assignment, never a loop definition. BTW,
variable assignments in BASIC originally looked like this:

LET foo=10

That would resolve the ambiguity in your example, if the ambiguity actually
exists. I have never seen "FOREX" accepted as a valid BASIC loop definition
before; if it is, a source for that specification would be appreciated.

> In the lex/flex model of longest-match-wins, assuming any reasonable
> definition for your variable pattern, both statements are variable
> assignments and the second fails to parse.
> 
> To match the behaviour of BASIC, one has to complicate the variable pattern.
> "Complicate" varies between adding a tail pattern for every possible
> keyword, or artificially limiting the variables in length, or...
> 
> Is there a better way?
> 
> Ideally, I would love if there was an optional #keyword which is similar to
> #token but has a "higher priority" so they would match first. I suspect
> this would be valuable in a wide variety of tasks, but I'm completely noob
> so I can't say.

On the (F)lex side, more complex white space handling is commonly done with
scanner states (start conditions, written as <CONDITIONNAME> in front of
patterns). States are pushed and popped in the actions of other scanner rules
by calling yy_push_state() and yy_pop_state() accordingly, which requires
"%option stack" in the scanner definition.
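
To illustrate the idea of state-dependent rules, here is a minimal sketch in
Python rather than flex, so that it can be run standalone. It is not how flex
works internally: each state has an ordered rule list and the first matching
rule wins, whereas flex picks the longest match. The state names (START, BODY,
FOR), the token names and the tiny keyword set are all made up for this
example, and it follows the MS interpretation, where FOR at the start of a
statement is always a keyword.

```python
import re

WS = (None, re.compile(r"\s+"))      # skipped, so 0..n spaces are accepted
NUM = ("NUM", re.compile(r"\d+"))
OP = ("OP", re.compile(r"="))
SEP = ("SEP", re.compile(r":"))      # BASIC statement separator
GREEDY_ID = ("ID", re.compile(r"[A-Z][A-Z0-9]*"))
# Inside a FOR header an identifier stops right before an embedded TO.
LAZY_ID = ("ID", re.compile(r"[A-Z][A-Z0-9]*?(?=TO|[^A-Z0-9]|$)"))

# Hypothetical rule tables: one ordered list per scanner state.
RULES = {
    "START": [WS, ("FOR", re.compile(r"FOR")), GREEDY_ID],
    "BODY": [WS, SEP, NUM, OP, GREEDY_ID],
    "FOR": [WS, SEP, ("TO", re.compile(r"TO")), NUM, OP, LAZY_ID],
}

def scan(line):
    """Tokenize one BASIC line using a stack of scanner states."""
    stack = ["START"]                # plays the role of flex's state stack
    tokens, pos = [], 0
    while pos < len(line):
        for kind, rx in RULES[stack[-1]]:
            m = rx.match(line, pos)
            if m:
                break
        else:
            raise SyntaxError(f"unexpected character at column {pos}")
        pos = m.end()
        if kind is None:
            continue                 # white space produces no token
        tokens.append((kind, m.group()))
        if kind == "FOR":
            stack.append("FOR")      # comparable to yy_push_state(FOR)
        elif kind == "SEP":
            stack.pop()              # comparable to yy_pop_state()
        elif stack[-1] == "START":
            stack.append("BODY")     # any other statement start
    return tokens
```

With these rules scan("FOREX=10TO20") yields FOR, EX, =, 10, TO, 20, while
scan("TOTAL=10") still yields the single identifier TOTAL, because the TO
keyword rule is only active inside the FOR state.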

For instance, in the scanner of this programming language I needed more
complicated white space filtering, since there are preprocessor statements
(which contain variable white space as well) that have to be processed by the
scanner before the input reaches the language parser:
http://svn.linuxsampler.org/cgi-bin/viewvc.cgi/linuxsampler/trunk/src/scriptvm/scanner.l?view=markup

In your particular example, it might also be worth considering simply working
with the line start anchor (^) instead:

^ACTUALPATTERN

Best regards,
Christian Schoenebeck
