Yes - it is not too difficult but takes some thinking through. I have a commercially available C# 3.x lexer/parser/tree walker that includes pre-processing, if you were interested in that.
Jim > -----Original Message----- > From: antlr-interest-boun...@antlr.org [mailto:antlr-interest- > boun...@antlr.org] On Behalf Of Sam Harwell > Sent: Thursday, July 28, 2011 7:26 PM > To: 'chris king' > Cc: antlr-interest@antlr.org > Subject: Re: [antlr-interest] Have I found an Antlr CSharp3 lexer bug > if... > > Fortunately the C# preprocessor is extremely basic, so the task > shouldn't be hard at all. To start with, it's important to understand > that the preprocessor *must* be implemented with the lexer, because the > following is > valid: > > > > #if false > > @" > > #endif > > > > If the @" is processed by the lexer, it will consume the #endif as part > of the verbatim string and mess everything up. Here's what I would do: > > > > . Implement a basic lexer rule to consume the characters > following > the #directive, up to but not including a single line comment marker // > > . Use a separate expression grammar to parse preprocessor > expressions. > > . Set a flag in the lexer if the next code is excluded code. > > . Override NextToken for the lexer, and if the flag is set to > true, > call out to a rule other than mTokens (a basic implementation of lexer > modes). > > > > When I release version 3.4 of the runtime, the Lexer class has a new > method ParseNextToken which can be overridden to help perform this > task. I haven't tested the following, but it's what I would start with > if I wanted to make a C# preprocessor. > > > > fragment PP_DEFINE:; > > fragment PP_UNDEF:; > > fragment PP_IF:; > > fragment PP_ELSE:; > > fragment PP_ENDIF:; > > > > PP_TOKEN > > : {input.CharPositionInLine == 0}? => > > WS? '#' WS? > > ( 'define' {$type=PP_DEFINE;} > > | 'undef' {$type=PP_UNDEF;} > > | 'if' {$type=PP_IF;} > > | 'else' {$type=PP_ELSE;} > > | 'endif' {$type=PP_ENDIF;} > > ) > > ~('\r' | '\n' | '/')* > > ; > > > > fragment > > EXCLUDED_CODE > > : PP_TOKEN > > | ( WS > > | ~(' ' | '\t' | '#')+ > > ) > > {state.type = EXCLUDED_CODE; state.channel = Hidden;} > > ; > > > > WS > > : (' ' | '\t')+ > > ; > > > > > > > > > > partial class CSharpLexer { > > > > private readonly HashSet<string> _definitions = new HashSet<string>(new > string[] { "true" }); > > private readonly Stack<IncludedCodeState> _includedCode = new > Stack<IncludedCodeState>(new IncludedCodeState[] { new > IncludedCodeState(true, true) }); > > private bool _foundToken = false; > > > > public override IToken NextToken() { > > while (true) { > > IToken token = base.NextToken(); > > > > switch (token.Type) { > > case PP_DEFINE: > > if (_includedCode.Peek().IsIncluded) > > { > > if (_foundToken) > > throw new RecognitionException("Cannot > define/undefine preprocessor symbols after first token in file"); > > > > string name = token.Text; > > name = name.Substring(name.IndexOf("define") + > 6).Trim(); > > if (name == "true" || !Regex.IsMatch(name, > @"^[A-Za-z_][\w]*$")) > > throw new RecognitionException("Expected identifier > in preprocessor."); > > > > _definitions.Add(name); > > } > > > > continue; > > > > case PP_UNDEF: > > if (_includedCode.Peek().IsIncluded) > > { > > if (_foundToken) > > throw new RecognitionException("Cannot > define/undefine preprocessor symbols after first token in file"); > > > > string name = token.Text; > > name = name.Substring(name.IndexOf("undef") + > 5).Trim(); > > if (name == "true" || !Regex.IsMatch(name, > @"^[A-Za-z_][\w]*$")) > > throw new RecognitionException("Expected identifier > in preprocessor."); > > > > _definitions.Remove(name); > > } > > > > continue; > > > > case PP_IF: > > { > > string expression = token.Text; > > expression = > expression.Substring(expression.IndexOf("if") + 2).Trim(); > > _includedCode.Push(new > IncludedCodeState(EvaluatePreprocessorExpression(expression), false)); > > } > > continue; > > > > case PP_ENDIF: > > if (_includedCode.Count == 1) > > throw new RecognitionException("Mismatched #endif in > preprocessor."); > > _includedCode.Pop(); > > continue; > > > > case PP_ELSE: > > if (_includedCode.Peek().FoundElseDirective) > > throw new RecognitionException("Mismatched #else in > preprocessor."); > > _includedCode.Push(_includedCode.Pop().ElseState); > > continue; > > > > default: > > if (token.Channel == TokenChannels.Default) > > _foundToken = true; > > return token; > > } > > } > > } > > > > private bool? EvaluatePreprocessorExpression(string expression) { > > if (!_includedCode.Peek().IsIncluded) > > return null; > > throw new NotImplementedException("Use a very simple expression > parser here to parse evaluate the Boolean expression."); > > } > > > > protected override void ParseNextToken() { > > if (!_includedCode.Peek().IsIncluded) > > mEXCLUDED_CODE(); > > else > > base.ParseNextToken(); > > } > > > > public struct IncludedCodeState { > > public readonly bool FoundElseDirective; > > private readonly bool? _isIncluded; > > > > public IncludedCodeState(bool? isIncluded, bool foundElseDirective) > { > > _isIncluded = isIncluded; > > FoundElseDirective = foundElseDirective; > > } > > > > public bool IsIncluded { get { return _isIncluded ?? false; } } > > > > public IncludedCodeState ElseState { > > get { > > if (_isIncluded == null) > > return new IncludedCodeState(_isIncluded, true); > > return new IncludedCodeState(!_isIncluded, true); > > } > > } > > } > > } > > > > Sam > > > > From: chris king [mailto:kingce...@gmail.com] > Sent: Thursday, July 28, 2011 7:05 PM > To: Sam Harwell > Cc: antlr-interest@antlr.org > Subject: Re: Have I found an Antlr CSharp3 lexer bug if... > > > > Sam, thanks so much for taking the time to look at that. If I could, > let me try and explain what I'm trying to do and tell me if you think > it's possible. For my own edification, I'm trying to implement a C# > grammar. I'd like to implement the pre-processor at the moment. > Implementations I've seen generally using only a lexer and use some > type of trick to maintain a stack (e.g. for nested ifdefs and simple > if/elif expressions). I figure why not use a parser to maintain the > stack -- isn't that the reason for existence for parsers anyway? So > that's what I'm trying to do -- use a lexer and parser to implement the > pre-processor. > > > > The big difficulty is changing the lexer rules depending on whether I'm > in a #if def block that is active or not. I figured with ANTLR I'd be > able to compute if the #ifdef block is active and then throw a switch > to either parse tokens and hand those tokens off to the C# parser or > consume and ignore all input up to the next pre-processor instruction > thereby disabling that chunk of code. If I can do this then I could put > the pre-processor and parser in the same file and construct the AST in > one pass! Would that be cool? And clean? And maybe worth making a goal > for ANTLR to be able to do? > :) > > > > To be a bit more concrete: Here is the production for matching newline > at the end of pre-processor instructions. The idea would be to enable > PP_SKIPPED_CHARACTERS only if inside a disabling #ifdef block which > would consume all characters till the next pre-processing instruction. > > > > pp_new_line > > : SINGLE_LINE_COMMENT? ((NEW_LINE! PP_SKIPPED_CHARACTERS) | EOF!) > > ; > > > > Here is what I was hoping would work as PP_SKIPPED_CHARACTERS. > Unfortunately I don't seem to understand how to flip lexer rules on and > off well enough to make this work... > > > > PP_SKIPPED_CHARACTERS > > : { IfDefedOut }? ( ~(F_NEW_LINE_CHARACTER | F_POUND_SIGN) > F_INPUT_CHARACTER* F_NEW_LINE )* > > ; > > > > I hope that is enough to give you an idea of what I'm trying to do. > This approach just seems so elegant to me (by which I mean almost all > declarative > -- no need to sprinkle procedural logic in among my productions to > maintain a stack or whatever) that I'd hope that it would be do able in > ANTLR. What do you think? Is it a worthy goal? Does it feel possible to > you? If not, is a goal worth trying to achieve? > > > > Thanks, > Chris > > > > > > > > On Thu, Jul 28, 2011 at 2:37 PM, Sam Harwell > <sharw...@pixelminegames.com> > wrote: > > Hi Chris, > > > > Lookahead prediction occurs before predicates are evaluated. If fixed > lookahead uniquely determines the alternative with a semantic > predicate, the predicate will not be evaluated as part of the decision > process. I'm guessing (but not 100% sure) if you use a gated semantic > predicate, then it will not be entering the rule: > > > > PP_SKIPPED_CHARACTERS > > : {false}? => ( ~(F_NEW_LINE_CHARACTER | '#') F_INPUT_CHARACTER* > F_NEW_LINE )* > > ; > > > > Also, a word of warning: this lexer rule can match a zero-length > character span, which could result in an infinite loop. You should > always ensure that every path through any lexer rule that's not marked > "fragment" will consume at least 1 character. There's also a bug with > certain exceptions in the lexer that can cause infinite loops - this > has been resolved for release 3.4 but I haven't released it yet. > > > > Sam > > > > From: chris king [mailto:kingce...@gmail.com] > Sent: Thursday, July 28, 2011 4:19 PM > To: antlr-interest@antlr.org; Sam Harwell > Subject: Have I found an Antlr CSharp3 lexer bug if... > > > > Have I found an Antlr lexer CSharp3 bug if I can alter program > execution (exception instead of no exception) by introducing a lexer > production with a predicate that is always false? For example > > > > PP_SKIPPED_CHARACTERS > > : { false }? ( ~(F_NEW_LINE_CHARACTER | '#') F_INPUT_CHARACTER* > F_NEW_LINE > )* > > ; > > > > I would think that such a production should always be ignored because > it's predicate is always false and therefore would never alter program > execution. > Yet I'm seeing a change in the execution of my program. I'm seeing it > enter this function and throw a FailedPredicateException. I wouldn't > have expected that this function should ever even have been executed > because the predicate is always false. > > > > [GrammarRule("PP_SKIPPED_CHARACTERS")] > > private void mPP_SKIPPED_CHARACTERS() > > { > > EnterRule_PP_SKIPPED_CHARACTERS(); > > EnterRule("PP_SKIPPED_CHARACTERS", 31); > > TraceIn("PP_SKIPPED_CHARACTERS", 31); > > try > > { > > int _type = PP_SKIPPED_CHARACTERS; > > int _channel = DefaultTokenChannel; > > // CSharp\\CSharpPreProcessor.g:197:3: ({...}? (~ ( > F_NEW_LINE_CHARACTER | F_POUND_SIGN ) ( F_INPUT_CHARACTER ) > > DebugEnterAlt(1); > > // CSharp\\CSharpPreProcessor.g:197:5: {...}? (~ ( > F_NEW_LINE_CHARACTER | F_POUND_SIGN ) ( F_INPUT_CHARACTER ) > > { > > DebugLocation(197, 5); > > if (!(( false ))) > > { > > throw new FailedPredicateException(input, > "PP_SKIPPED_CHARACTERS", " False() "); > > } > > > > Sam, I'm on an all CSharp stack v3.3.1.7705. I'm using your VS plugin > (which is wonderful) and build integration to generate the lexer/parser > (also > wonderful) and then running on top of your CSharp port of the runtime. > If you think this is a bug and you'd like to have a look at the repro > please let me know. The project is open source up on CodePlex. > > > > Thanks, > Chris > > > > > List: http://www.antlr.org/mailman/listinfo/antlr-interest > Unsubscribe: http://www.antlr.org/mailman/options/antlr-interest/your- > email-address List: http://www.antlr.org/mailman/listinfo/antlr-interest Unsubscribe: http://www.antlr.org/mailman/options/antlr-interest/your-email-address -- You received this message because you are subscribed to the Google Groups "il-antlr-interest" group. To post to this group, send email to il-antlr-inter...@googlegroups.com. To unsubscribe from this group, send email to il-antlr-interest+unsubscr...@googlegroups.com. For more options, visit this group at http://groups.google.com/group/il-antlr-interest?hl=en.