On Oct 17, 10:42 am, Luis Zarrabeitia <[EMAIL PROTECTED]> wrote:
> I need to parse a file, text file. The format is something like that:
>
> TYPE1 metadata
> data line 1
> data line 2
> ...
> data line N
> TYPE2 metadata
> data line 1
> ...
> TYPE3 metadata
> ...
>
> And so on. The type and metadata determine how to parse the following data
> lines. When the parser fails to parse one of the lines, the next parser is
> chosen (or if there is no 'TYPE metadata' line there, an exception is thrown).
> <snip>

Pyparsing will take care of this for you, if you define a set of
alternatives and then parse/search for them. Here is an annotated
example. Note the ability to attach names to different fields of the
parser, and then how those fields are accessed after parsing.

"""
TYPE1 metadata
data line 1
data line 2
...
data line N
TYPE2 metadata
data line 1
...
TYPE3 metadata
...
"""

from pyparsing import *

# define basic element types to be used in data formats
integer = Word(nums)
ident = Word(alphas) | quotedString.setParseAction(removeQuotes)
zipcode = Combine(Word(nums,exact=5) + Optional("-" + Word(nums,exact=4)))
stateAbbreviation = oneOf("""AA AE AK AL AP AR AS AZ CA CO CT DC DE
    FL FM GA GU HI IA ID IL IN KS KY LA MA MD ME MH MI MN MO MP MS
    MT NC ND NE NH NJ NM NV NY OH OK OR PA PR PW RI SC SD TN TX UT
    VA VI VT WA WI WV WY""".split())

# define data format for each type
DATA = Suppress("data")
type1dataline = Group(DATA + OneOrMore(integer))
type2dataline = Group(DATA + delimitedList(ident))
type3dataline = DATA + countedArray(ident)

# define complete expressions for each type - note different types
# may have different metadata
type1data = "TYPE1" + ident("name") + \
            OneOrMore(type1dataline)("data")
type2data = "TYPE2" + ident("name") + zipcode("zip") + \
            OneOrMore(type2dataline)("data")
type3data = "TYPE3" + ident("name") + stateAbbreviation("state") + \
            OneOrMore(type3dataline)("data")

# expression containing all different type alternatives
data = type1data | type2data | type3data

# search a test input string and dump the matched tokens by name
testInput = """
TYPE1 Abercrombie
data 400 26 42 66
data 1 1 2 3 5 8 13 21
data 1 4 9 16 25 36
data 1 2 4 8 16 32 64
TYPE2 Benjamin 78704
data Larry, Curly, Moe
data Hewey,Dewey ,Louie
data Tom , Dick, Harry, Fred
data Thelma,Louise
TYPE3 Christopher WA
data 3 "Raspberry Red" "Lemon Yellow" "Orange Orange"
data 7 Grumpy Sneezy Happy Dopey Bashful Sleepy Doc
"""

for tokens in data.searchString(testInput):
    print tokens.dump()
    print tokens.name
    if tokens.state: print tokens.state
    for d in tokens.data:
        print "  ",d
    print

Prints:

['TYPE1', 'Abercrombie', ['400', '26', '42', '66'], ['1', '1', '2', '3', '5', '8', '13', '21'], ['1', '4', '9', '16', '25', '36'], ['1', '2', '4', '8', '16', '32', '64']]
- data: [['400', '26', '42', '66'], ['1', '1', '2', '3', '5', '8', '13', '21'], ['1', '4', '9', '16', '25', '36'], ['1', '2', '4', '8', '16', '32', '64']]
- name: Abercrombie
Abercrombie
   ['400', '26', '42', '66']
   ['1', '1', '2', '3', '5', '8', '13', '21']
   ['1', '4', '9', '16', '25', '36']
   ['1', '2', '4', '8', '16', '32', '64']

['TYPE2', 'Benjamin', '78704', ['Larry', 'Curly', 'Moe'], ['Hewey', 'Dewey', 'Louie'], ['Tom', 'Dick', 'Harry', 'Fred'], ['Thelma', 'Louise']]
- data: [['Larry', 'Curly', 'Moe'], ['Hewey', 'Dewey', 'Louie'], ['Tom', 'Dick', 'Harry', 'Fred'], ['Thelma', 'Louise']]
- name: Benjamin
- zip: 78704
Benjamin
   ['Larry', 'Curly', 'Moe']
   ['Hewey', 'Dewey', 'Louie']
   ['Tom', 'Dick', 'Harry', 'Fred']
   ['Thelma', 'Louise']

['TYPE3', 'Christopher', 'WA', ['Raspberry Red', 'Lemon Yellow', 'Orange Orange'], ['Grumpy', 'Sneezy', 'Happy', 'Dopey', 'Bashful', 'Sleepy', 'Doc']]
- data: [['Raspberry Red', 'Lemon Yellow', 'Orange Orange'], ['Grumpy', 'Sneezy', 'Happy', 'Dopey', 'Bashful', 'Sleepy', 'Doc']]
- name: Christopher
- state: WA
Christopher
WA
   ['Raspberry Red', 'Lemon Yellow', 'Orange Orange']
   ['Grumpy', 'Sneezy', 'Happy', 'Dopey', 'Bashful', 'Sleepy', 'Doc']

More info on pyparsing at http://pyparsing.wikispaces.com.

-- Paul
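
For comparison, the dispatch rule described in the quoted question (try the
current type's line parser, fall back to looking for a new 'TYPE metadata'
header, raise otherwise) could be sketched in plain Python without pyparsing.
This is only a rough illustration under assumptions: the names parse_records,
ParseError and LINE_PARSERS are made up for this sketch, only TYPE1 and TYPE2
are handled, and metadata is kept as an unvalidated list of words.

```python
class ParseError(Exception):
    """Hypothetical error: a line is neither valid data nor a TYPE header."""


def _data_body(line):
    # Every data line in the sample format starts with the word "data".
    keyword, _, body = line.partition(" ")
    if keyword != "data":
        raise ValueError(line)
    return body


def parse_integers(line):
    # TYPE1 data line: whitespace-separated integers.
    fields = _data_body(line).split()
    if not fields or not all(f.isdigit() for f in fields):
        raise ValueError(line)
    return fields


def parse_names(line):
    # TYPE2 data line: comma-separated alphabetic names.
    fields = [f.strip() for f in _data_body(line).split(",")]
    if not all(f.isalpha() for f in fields):
        raise ValueError(line)
    return fields


# Assumed mapping from header keyword to the parser for its data lines.
LINE_PARSERS = {"TYPE1": parse_integers, "TYPE2": parse_names}


def parse_records(text):
    records = []
    current = None
    for lineno, raw in enumerate(text.splitlines(), 1):
        line = raw.strip()
        if not line:
            continue
        if current is not None:
            try:
                current["data"].append(LINE_PARSERS[current["type"]](line))
                continue
            except ValueError:
                pass  # not data for this type; see if it starts a new record
        keyword, _, rest = line.partition(" ")
        if keyword not in LINE_PARSERS:
            raise ParseError("line %d: expected a TYPE header: %r" % (lineno, line))
        current = {"type": keyword, "metadata": rest.split(), "data": []}
        records.append(current)
    return records


sample = """
TYPE1 Abercrombie
data 400 26 42 66
data 1 1 2 3
TYPE2 Benjamin 78704
data Larry, Curly, Moe
"""
records = parse_records(sample)
```

Unlike searchString in the example above, which skips over text that matches
none of the alternatives, this sketch raises on the first line that is neither
data for the current type nor a recognized header.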