----- Original Message -----
> From: Andreas Perstinger <andiper...@gmail.com>
> To: "tutor@python.org" <tutor@python.org>
> Cc:
> Sent: Friday, June 14, 2013 2:23 PM
> Subject: Re: [Tutor] regex grouping/capturing
>
> On 14.06.2013 10:48, Albert-Jan Roskam wrote:
>> I am trying to create a pygments regex lexer.
>
> Well, writing a lexer is a little bit more complex than your original
> example suggested.

Hi Andreas, sorry for the late reply. It is true that creating a lexer is not that simple. I did indeed oversimplify my original example.

<snip>

> I'm not sure if a single regex can capture this.
> But looking at the pygments docs I think you need something along the
> lines of (adapt the token names to your need):
>
> class ExampleLexer(RegexLexer):
>     tokens = {
>         'root': [
>             (r'\s+', Text),
>             (r'set', Keyword),
>             (r'workspace|header', Name),
>             (r'\S+', Text),
>         ]
>     }
>
> Does this help?

In my original regex example I used groups because I wanted to use pygments.lexer.bygroups (see below) to disentangle commands, subcommands, keywords and values. Finding a command is relatively easy, but the other three elements are not. A command is always preceded by a newline and a dot. A subcommand is preceded by a forward slash. A value is *optionally* preceded by an equals sign. A keyword precedes a value.

Each command has its own subset of subcommands, keywords and values (e.g. a 'median' keyword is valid only in an 'aggregate' command, not in, say, a 'set' command). I have an xml representation of each of the 1000+ commands. My plan is to parse these and regexify them (one regex per command, all to be stored in a dictionary/shelve).

Oh, and if that's not challenging enough: regexes in pygments lexers may not contain nested groups (not sure if that also applies to non-capturing groups). I think I have to get the xml part right first and then see if this can be done.

Thanks again Andreas!
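
P.S. To make the plan a bit more concrete, here is a rough sketch of the "one regex per command, stored in a dictionary" idea, using only the re module. The COMMAND_SPECS dictionary is a made-up stand-in for whatever the xml parsing would produce, and build_regex and tokenize are names I invented for this illustration. Each regex is a flat alternation of named groups, so there are no nested groups at all. It does not handle command boundaries (the newline and dot rule), and values containing a dot would be split, so treat it as a starting point only.

```python
import re

# Hypothetical stand-in for the data that would come from the xml files.
COMMAND_SPECS = {
    "set": {"subcommands": [], "keywords": ["workspace", "header"]},
    "aggregate": {"subcommands": ["outfile", "break"], "keywords": ["median"]},
}


def build_regex(command, spec):
    """Return one compiled regex for a command, using only flat named groups."""
    groups = [("command", r"\b%s\b" % re.escape(command))]
    if spec["subcommands"]:
        # A subcommand is preceded by a forward slash.
        groups.append(("subcommand",
                       "|".join(r"/%s\b" % re.escape(s)
                                for s in spec["subcommands"])))
    if spec["keywords"]:
        groups.append(("keyword",
                       "|".join(r"\b%s\b" % re.escape(k)
                                for k in spec["keywords"])))
    groups += [
        ("operator", r"="),        # optional equals sign before a value
        ("terminator", r"\."),     # dot that ends a command
        ("space", r"\s+"),
        ("value", r"[^\s=/.]+"),   # fallback: anything not matched above
    ]
    return re.compile("|".join("(?P<%s>%s)" % g for g in groups))


# One regex per command, all stored in a dictionary.
REGEXES = {name: build_regex(name, spec)
           for name, spec in COMMAND_SPECS.items()}


def tokenize(text, regex):
    """Return (token_type, text) pairs, skipping whitespace."""
    return [(m.lastgroup, m.group())
            for m in regex.finditer(text)
            if m.lastgroup != "space"]


print(tokenize("set workspace=24576.", REGEXES["set"]))
# [('command', 'set'), ('keyword', 'workspace'), ('operator', '='),
#  ('value', '24576'), ('terminator', '.')]
```

Because the keyword group is built per command, 'median' comes out as a keyword with REGEXES["aggregate"] but only as a plain value with REGEXES["set"].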

For reference, the bygroups example I referred to above:

from pygments.lexer import RegexLexer, bygroups
from pygments.token import *

class IniLexer(RegexLexer):
    name = 'INI'
    aliases = ['ini', 'cfg']
    filenames = ['*.ini', '*.cfg']

    tokens = {
        'root': [
            (r'\s+', Text),
            (r';.*?$', Comment),
            (r'\[.*?\]$', Keyword),
            (r'(.*?)(\s*)(=)(\s*)(.*?)$',
             bygroups(Name.Attribute, Text, Operator, Text, String))
        ]
    }

_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor