On Mon, 18 Jul 2005, Liam Clarke wrote:
> country = {
>     tag = ENG
>     ai = {
>         flags = { }
>         combat = { DAU FRA ORL PRO }
>         continent = { }
>         area = { }
>         region = { "British Isles" "NorthSeaSea" "ECAtlanticSea" "NAtlanticSea"
>                    "TagoSea" "WCAtlanticSea" }
>         war = 60
>         ferocity = no
>     }
> }

[Long message ahead; skip if you're not interested.]

Kent mentioned PyParsing,

    http://pyparsing.sourceforge.net/

which is a really excellent system. Here's a demo of what it can do, just so you have a better idea what pyparsing is capable of.

(For the purposes of this demo, I'm doing 'import pyparsing', but in real usage, I'd probably use 'from pyparsing import ...' just to make things less verbose.)

Let's say that we want to recognize a simpler subset of the data that you have there, something like:

    { fee fie foo fum }

And let's imagine that we have a function parse() that can take a string like:

######
>>> testString = """
... { fee fie foo fum }
... """
######

This imaginary parse() function could turn that into something that looks like a Python value, like this:

######
>>> parse(testString)
(["fee", "fie", "foo", "fum"])
######

That's our goal; does this make sense so far? So how do we start?

Instead of going at the big goal of doing:

    country = { fee fie foo fum }

let's start small by teaching our system how to recognize the innermost parts, the small things like fee or foo. Let's start there:

######
>>> Symbol = pyparsing.Word(pyparsing.alphas)
######

We want a Symbol to be able to recognize a "Word" made up of alphabetic letters. Does this work?

######
>>> Symbol.parseString("fee")
(['fee'], {})
######

Symbol is now a thing that can parse a string, and return a list of results in a pyparsing.ParseResults object.

Ok, if we can recognize Symbols, let's go for the jugular:

    { fee fie foo fum }

Let's call this a Sequence.

######
>>> Sequence = "{" + pyparsing.ZeroOrMore(Symbol) + "}"
######

A Sequence is made up of zero or more Symbols.
Wait, let's change that, for a moment, to "A Sequence is made up of zero or more Values." (You'll see why in a moment. *grin*)

If we turn toward this strange way, then we need a definition for a Value:

######
>>> Value = Symbol
######

and now we can say that a Sequence is a bunch of Values:

######
>>> Sequence = "{" + pyparsing.ZeroOrMore(Value) + "}"
######

Let's try this out:

######
>>> Sequence.parseString('{ fee fie foo fum}')
(['{', 'fee', 'fie', 'foo', 'fum', '}'], {})
######

This is close, but it's not quite right: the problem is that we'd like to somehow group the results all together in a list, and without the braces. That is, we actually want to see:

    [['fee', 'fie', 'foo', 'fum']]

in some form. (Remember, we want a list of a single result, and that result should be our Sequence.)

How do we get this working? We have to tell pyparsing to "Group" the middle elements together in a collection, and to "suppress" the braces from the result.

Here we go:

######
>>> Sequence = (pyparsing.Suppress("{") +
...             pyparsing.Group(pyparsing.ZeroOrMore(Value)) +
...             pyparsing.Suppress("}"))
######

Does this work?

######
>>> Sequence.parseString('{ fee fie foo fum}')
([(['fee', 'fie', 'foo', 'fum'], {})], {})
######

That looks a little messy and more nested than expected.

Actually, what's happening is that we're looking at that pyparsing.ParseResults object, so there's more nesting in the string representation than what's really there. We can use the ParseResults asList() method to make it a little easier to see what the real result value looks like:

######
>>> Sequence.parseString('{ fee fie foo fum}').asList()
[['fee', 'fie', 'foo', 'fum']]
######

That's better.

Out of curiosity, wouldn't it be neat if we could parse out something like this?

    { fee fie {foo "fum"} }

*cough* *cough*

What we'd like to do is make Sequence itself a possible value.
The problem is that then there's a little circularity involved:

### Illegal PyParsing pseudocode ###
Value = Symbol | Sequence
Sequence = (pyparsing.Suppress("{") +
            pyparsing.Group(pyparsing.ZeroOrMore(Value)) +
            pyparsing.Suppress("}"))
######

The problem is that Value can't be defined before Sequence is, and vice-versa. We break this problem by telling PyParsing "ok, the following rules will come up soon" and "forward" define them:

######
>>> Value = pyparsing.Forward()
>>> Sequence = pyparsing.Forward()
######

and once we have these forward declarations, we can then reconnect them to their real definitions by using '<<'. (This looks bizarre, but it applies just to rules that are Forward()ed.)

######
Value << (Symbol | Sequence)
Sequence << (pyparsing.Suppress("{") +
             pyparsing.Group(pyparsing.ZeroOrMore(Value)) +
             pyparsing.Suppress("}"))
######

Let's try it:

######
>>> Value.parseString(' { fee fie {foo fum} } ').asList()
[['fee', 'fie', ['foo', 'fum']]]
######

Cool.

Ok, that was a little artificial, but oh well. The idea is we now know how to say:

    A Value is either a Symbol or Sequence

and

    A Sequence is a bunch of Values

without getting into trouble with pyparsing, and that's important whenever we're dealing with things that have recursive structure... like:

    country = {
        tag = ENG
        ai = {
            flags = { }
            combat = { DAU FRA ORL PRO }
            continent = { }
            area = { }
            region = { "British Isles" "NorthSeaSea" "ECAtlanticSea" "NAtlanticSea"
                       "TagoSea" "WCAtlanticSea" }
            war = 60
            ferocity = no
        }
    }

Anyway, this is a really fast whirlwind tour of pyparsing, with some intentional glossing-over of hard stuff, just so you get a better idea of the core of parsing. Sorry if it went fast. *grin*

If you have questions, please feel free to ask!
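P.S. For the curious, here's a rough sketch of how the Symbol / Value / Sequence rules above might be stretched toward the full country = { ... } data. Treat it as one possible direction, not the definitive grammar: the Number, Quoted, and Assignment rules are names I'm making up for illustration, the sample text is a trimmed copy of the original data, and I'm assuming a pyparsing recent enough to support '<<=' on Forward() rules (it does the same job as '<<' without the operator-precedence surprises).

```python
import pprint

import pyparsing as pp

# Punctuation we want to match but keep out of the results.
LBRACE, RBRACE, EQ = map(pp.Suppress, "{}=")

Symbol = pp.Word(pp.alphas)      # fee, ENG, no, ...
Number = pp.Word(pp.nums)        # 60 (left as a string; hypothetical rule)
Quoted = pp.QuotedString('"')    # "British Isles" -> British Isles (hypothetical rule)

# Forward-declare the mutually recursive rules, as in the demo above.
Value = pp.Forward()
Sequence = pp.Forward()

# Hypothetical rule: name = value, returned as a two-item group.
Assignment = pp.Group(Symbol + EQ + Value)

Value <<= Quoted | Number | Sequence | Symbol
# Try Assignment first so "tag = ENG" isn't read as the bare Symbol "tag".
Sequence <<= LBRACE + pp.Group(pp.ZeroOrMore(Assignment | Value)) + RBRACE

text = """
country = {
    tag = ENG
    ai = {
        flags = { }
        combat = { DAU FRA ORL PRO }
        region = { "British Isles" "NorthSeaSea" }
        war = 60
        ferocity = no
    }
}
"""

result = Assignment.parseString(text).asList()
pprint.pprint(result)
```

Each "name = value" comes back as a two-item list, and each { ... } as a nested list, so the result mirrors the shape of the data. Turning those pairs into dictionaries, or '60' into an int, is left out on purpose to keep the sketch small.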