[il-antlr-interest: 34203] Re: [antlr-interest] [C] code to change Token type, use char* and loose data when buffer destroyed

Jim Idle Wed, 28 Sep 2011 08:49:24 -0700

You can of course process things anywhere that does not cause ambiguity,
but the best approach is to defer any processing until the last possible
point, so that you never process anything you turn out not to need. The
second 'rule' is that you only want to process things once, so process
and cache the result for later.
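
As a rough illustration of "defer, then process once and cache", here is a
minimal sketch in plain C. The LazyString struct, lazy_text() and
process_count are hypothetical names made up for this example; they are not
part of the ANTLR C runtime. The raw pointer is assumed to point just inside
the quotes of a string token.

```c
#include <stdlib.h>

/* Hypothetical wrapper, not an ANTLR type: raw points into the input
   buffer, text stays NULL until somebody actually asks for it. */
typedef struct {
    const char *raw;      /* first character inside the quotes */
    size_t      raw_len;  /* number of raw characters */
    char       *text;     /* cached processed text, NULL until needed */
} LazyString;

static int process_count = 0;  /* only here to show the work happens once */

/* Return the processed text, computing it on the first call only.
   Returns NULL if the allocation fails. */
const char *lazy_text(LazyString *s)
{
    if (s->text == NULL) {
        char  *out = malloc(s->raw_len + 1);
        size_t i, n = 0;

        if (out == NULL)
            return NULL;
        for (i = 0; i < s->raw_len; i++) {
            /* collapse a doubled quote '' to a single ' */
            if (s->raw[i] == '\'' && i + 1 < s->raw_len
                                  && s->raw[i + 1] == '\'')
                i++;
            out[n++] = s->raw[i];
        }
        out[n] = '\0';
        s->text = out;
        process_count++;
    }
    return s->text;
}
```

If nothing ever calls lazy_text() for a given token, no work is done and
nothing is allocated for it; a second call returns the cached pointer. The
caller owns s->text and frees it when the token goes away.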


If you can modify the input stream, then you don't need to copy anything
here, just move the start and end pointers in the token and overwrite the
few bytes that you are moving. That way there is no malloc and nothing to
free. If you cannot modify the input stream, then you will need to copy
from the token pointers of course.
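
A minimal sketch of the in-place variant, assuming a mutable input buffer
and a well-formed token that still carries its opening and closing quotes.
The Span struct and unquote_in_place() are hypothetical names for
illustration; a real ANTLR C token keeps its start and stop positions in
its own fields.

```c
#include <stddef.h>

/* Hypothetical span over a mutable input buffer; not an ANTLR type. */
typedef struct {
    char *start;  /* first character of the token text */
    char *end;    /* one past the last character */
} Span;

/* Step past the opening quote, drop the closing quote, and collapse
   '' and \' to a single ', writing over the token's own bytes.
   The output is never longer than the input, so the write pointer
   never overtakes the read pointer. No malloc, nothing to free. */
void unquote_in_place(Span *tok)
{
    char *rd  = tok->start + 1;  /* skip the opening quote */
    char *lim = tok->end - 1;    /* stop before the closing quote */
    char *wr  = rd;

    tok->start = rd;
    while (rd < lim) {
        if ((rd[0] == '\'' || rd[0] == '\\') && rd + 1 < lim
                                             && rd[1] == '\'')
            rd++;                /* drop the escape, keep the quote */
        *wr++ = *rd++;
    }
    tok->end = wr;               /* text is now [start, end) */
}
```

The result is not NUL-terminated; it is still just a pair of pointers into
the input buffer, so it is only valid for as long as that buffer lives.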

So, here you should lex the escape characters and the embedded '' into
STRING_LITERAL, but do not try to process the WS* there; return two or
more tokens instead. Then the parser or tree parser can process the
strings. If you are going to do multiple walks, then probably do it in the
parser, but if there is just one walk (or only one walk where you care
about the text represented by the tokens), then process in the tree parser
when you hit the STRING_LITERAL+.
Jim

> -----Original Message-----
> From: Ruslan Zasukhin [mailto:ruslan_zasuk...@valentina-db.com]
> Sent: Tuesday, September 27, 2011 11:41 PM
> To: antlr-interest@antlr.org; Jim Idle
> Subject: Re: [antlr-interest] [C] code to change Token type, use char*
> and loose data when buffer destroyed
>
> Hi Jim,
>
> What do you think about this idea to resolve everything at the LEXER
> level?
>
> So we must resolve tokens as
>
> * STRING_LITERAL          'aa'
> * STRING_LITERAL          'aa' ws* 'bb'     => Token( "aabb" )
>
> * STRING_LITERAL          'aa\'bb'          => Token( "aa'bb" )
> * STRING_LITERAL          'aa''bb'           => Token( "aa'bb" )
> * STRING_LITERAL          'aa''bb''cc'      => Token( "aa'bb'cc" )
>
> * HEX_LITERAL              x'aa'                  => Token( "aa" )
> * HEX_LITERAL              x'aa' ws* 'bb'     => Token( "aabb" )
>
>
> Do you think we can do this in [C] without copying buffers?
> I think not.
>
> Then the question is:
>     how can this be solved using minimal copies?
>
> Or do you think it is really better to use a
>     Lexer -> Parser -> TreeParser combination ?
>
>
> On 9/28/11 1:34 AM, "Ruslan Zasukhin" <ruslan_zasukhin@valentina-db.com> wrote:
>
> > On 9/28/11 12:46 AM, "Douglas Godfrey" <douglasgodf...@gmail.com> wrote:
> >
> > Hi Douglas,
> >
> > Yes, I have thought about this way also.
> >
> > But in your solution you use helper functions such as
> >     RemoveQuotePairs()
> >
> > which, I guess, do some copying into additional RAM buffers.
> > This is fine for Java guys, but in C code, as Jim likes to underline
> > each time, we tend to use only pointers into the input buffer, as long
> > as possible.
> >
> >
> >> You need to modify your string lexing rules to use sub-rules for the
> >> elementary strings and return the concatenated string as the lexer
> >> token value.
> >>
> >> The value of
> >>
> >> StringConstant: QuotedString
> >> {RemoveQuotePairs($QuotedString);};
> >>
> >> fragment
> >> QuotedString:  ( StringTerm )+;
> >>
> >> fragment
> >> StringTerm:  Dquote ( Character )* Dquote;
> >>
> >> fragment
> >> Character: ( ' ' | AlphaChar | Punctuation | Digit );
>
> --
> Best regards,
>
> Ruslan Zasukhin
> VP Engineering and New Technology
> Paradigma Software, Inc
>
> Valentina - Joining Worlds of Information http://www.paradigmasoft.com
>
> [I feel the need: the need for speed]
>
