Re: [Tutor] Parsing text file
On Sun, May 13, 2007 at 03:04:36PM -0700, Alan wrote:

> I'm looking for a more elegant way to parse sections of text files that
> are bordered by BEGIN/END delimiting phrases, like this:
>
> some text
> some more text
> BEGIN_INTERESTING_BIT
> someline1
> someline2
> someline3
> END_INTERESTING_BIT
> more text
> more text
>
> What I have been doing is clumsy, involving converting to a string and
> slicing out the required section using split('DELIMITER'):
>
> import sys
> infile = open(sys.argv[1], 'r')
> # join list elements with @ character into a string
> fileStr = '@'.join(infile.readlines())
> # Slice out the interesting section with split, then split again into
> # lines using @
> resultLine = fileStr.split('BEGIN_INTERESTING_BIT')[1].split('END_INTERESTING_BIT')[0].split('@')
> for line in resultLine:
>     do things
>
> Can anyone point me at a better way to do this?

Possibly overkill, but ... how much fun are you interested in having?
Others have given you the "low fun" easy way. Now ask yourself whether
this task is likely to become more complex (the interesting parts more
hidden in a more complex grammar), and perhaps you also can't wait to
have some fun. If so, consider this suggestion:

1. Write grammar rules that describe your input text. In your case,
   those rules might look something like the following:

       Seq ::= {InterestingChunk | UninterestingChunk}*
       InterestingChunk ::= BeginToken InterestingSeq EndToken
       InterestingSeq ::= InterestingChunk*

2. For each rule, write a Python function that tries to recognize what
   the rule describes. To do its job, each function might call other
   functions that implement other grammar rules, and might call a
   tokenizer function (see below) when it needs another token from the
   input stream. Example:

       def InterestingChunk_reco(self):
           if self.token_type == Tok_Begin:
               self.get_token()
               if self.InterestingSeq_reco():
                   if self.token_type == Tok_End:
                       self.get_token()
                       return True
                   else:
                       self.Error('bad interesting sequence')

3. Write a tokenizer function. Each time this function is called, it
   returns the next "token" (probably a word) from the input stream
   and a code that indicates the token type. Recognizer functions call
   this tokenizer function each time another token is needed. In your
   case there might be three token types: (1) plain word, (2)
   BeginTok, and (3) EndTok.

If you do the above, you have just written your first recursive
descent parser. Then, the next time you are at a party, beer bar, or
wedding, any time the conversation comes even remotely close to the
subject of parsing text, you can say, "Well, for that kind of problem
I usually write a recursive descent parser. It's the most powerful way
and the only way to go. ..." Now, that's how to impress your friends
and relations.

But, seriously, recursive descent parsers are quite easy to write and
are a useful technique to have in your tool bag. And, like I said
above: it's fun. Besides, if your problem becomes more complex and,
for example, the input is not quite so line oriented, you will need a
more powerful approach. Wikipedia has a better explanation than mine,
plus an example and links:

    http://en.wikipedia.org/wiki/Recursive_descent_parser

I've attached a sample solution and sample input. Also, be aware that
there are parser generators for Python.

Dave

-- 
Dave Kuhlman
http://www.rexx.com/~dkuhlman

#!/usr/bin/env python
# -*- mode: pymode; coding: latin1; -*-
"""
Recognize and print out interesting parts of input.
A recursive descent parser is used to scan the input.

Usage:
    python recursive_descent_parser.py [options]
Options:
    -h, --help      Display this help message.
Example:
    python recursive_descent_parser.py infile

Grammar:
    Seq ::= {InterestingChunk | UninterestingChunk}*
    InterestingChunk ::= BeginToken InterestingSeq EndToken
    InterestingSeq ::= InterestingChunk*
"""

#
# Imports
import sys
import getopt

#
# Globals and constants
# Token types:
Tok_EOF, Tok_Begin, Tok_End, Tok_Word = range(1, 5)

#
# Classes
class InterestingParser(object):
    def __init__(self, infilename=None):
        self.current_token = None
        if infilename:
            self.infilename = infilename
            self.read_input()
            #print self.input
            self.get_token()

    def read_input(self):
        self.infile = open(self.infilename, 'r')
        self.input = []
        for line in self.infile:
            self.input.extend(line.rstrip('\n').split(' '))
        self.infile.close()
        self.input_iterator = iter(self.input)

    def parse(self):
        return self.Seq_reco()

    def get_token(self):
        try:
            token = self.input_iterator.next()
        except StopIteration, e:
            token = None
        self.token = token
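[Editor's note: the attachment above is cut off before the remaining recognizer methods. As a hedged sketch only, here is a complete, minimal recursive-descent recognizer for the same grammar in modern Python 3; the class and method names (Parser, parse_seq, parse_interesting_chunk) are my own, not from Dave's attachment.]

```python
# Minimal recursive descent parser for the grammar in the post above:
#   Seq ::= {InterestingChunk | UninterestingChunk}*
#   InterestingChunk ::= BeginToken InterestingSeq EndToken
# Each grammar rule becomes a method; tokens are plain words.

BEGIN, END = 'BEGIN_INTERESTING_BIT', 'END_INTERESTING_BIT'

class Parser:
    def __init__(self, tokens):
        self.tokens = iter(tokens)
        self.token = None
        self.interesting = []      # words collected inside BEGIN/END
        self.advance()

    def advance(self):
        # The tokenizer: hand back the next word, or None at end of input.
        self.token = next(self.tokens, None)

    def parse_seq(self):
        # Seq: consume chunks until end of input.
        while self.token is not None:
            if self.token == BEGIN:
                self.parse_interesting_chunk()
            else:
                self.advance()               # uninteresting word
        return self.interesting

    def parse_interesting_chunk(self):
        self.advance()                       # consume BEGIN
        while self.token not in (END, None):
            if self.token == BEGIN:
                self.parse_interesting_chunk()   # nested chunk
            else:
                self.interesting.append(self.token)
                self.advance()
        if self.token is None:
            raise SyntaxError('missing END_INTERESTING_BIT')
        self.advance()                       # consume END

tokens = ('some text BEGIN_INTERESTING_BIT someline1 '
          'someline2 END_INTERESTING_BIT more text').split()
print(Parser(tokens).parse_seq())   # -> ['someline1', 'someline2']
```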
Re: [Tutor] Parsing text file
"Alan" <[EMAIL PROTECTED]> wrote

> I'm looking for a more elegant way to parse sections of text files that
> are bordered by BEGIN/END delimiting phrases, like this:
>
> some text
> BEGIN_INTERESTING_BIT
> someline1
> someline3
> END_INTERESTING_BIT
> more text
>
> What I have been doing is clumsy, involving converting to a string and
> slicing out the required section using split('DELIMITER'):

The method I usually use is only slightly less clunky - or maybe just
as clunky! I iterate over the lines, setting a flag at the start and
unsetting it at the end. Pseudo code:

    amInterested = False
    for line in textfile:
        if amInterested:
            if isEndPattern(line):
                amInterested = False
            else:
                storeLine(line)
        elif isBeginPattern(line):
            amInterested = True

Whether that's any better than joining/splitting is debatable.
(Obviously you need to write the isEndPattern and isBeginPattern
helper functions too.)

Alan G.

___
Tutor maillist - Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
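[Editor's note: a runnable version of the flag idea above might look like the following sketch; the function and helper names are mine, not Alan G's.]

```python
# Flag-based scan, as in the pseudocode above: set a flag on the BEGIN
# line, unset it on the END line, and store lines seen while it is set.

def extract_interesting(lines,
                        begin='BEGIN_INTERESTING_BIT',
                        end='END_INTERESTING_BIT'):
    am_interested = False
    stored = []
    for line in lines:
        line = line.rstrip('\n')
        if line == end:
            am_interested = False     # end pattern: unset the flag
        elif am_interested:
            stored.append(line)       # storeLine
        elif line == begin:
            am_interested = True      # begin pattern: set the flag
    return stored

text = ['some text\n', 'BEGIN_INTERESTING_BIT\n', 'someline1\n',
        'someline3\n', 'END_INTERESTING_BIT\n', 'more text\n']
print(extract_interesting(text))   # -> ['someline1', 'someline3']
```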
Re: [Tutor] Parsing text file
On 14/05/07, Alan <[EMAIL PROTECTED]> wrote:

> I'm looking for a more elegant way to parse sections of text files that
> are bordered by BEGIN/END delimiting phrases, like this:
>
> some text
> some more text
> BEGIN_INTERESTING_BIT
> someline1
> someline2
> someline3
> END_INTERESTING_BIT
> more text
> more text

If the structure is pretty simple, you could use a state machine
approach. eg:

    import sys

    infile = open(sys.argv[1], 'r')
    INTERESTING, BORING = 'interesting', 'boring'
    state = BORING
    interestingLines = []
    for line in infile:
        line = line.strip()
        if line == 'BEGIN_INTERESTING_BIT':
            state = INTERESTING
        elif line == 'END_INTERESTING_BIT':
            state = BORING
        elif state == INTERESTING:
            interestingLines.append(line)
    print interestingLines

If you want to put each group of interesting lines into its own
section, you could do a bit of extra work (append a new empty list to
interestingLines on 'BEGIN', then append to the list at position -1 on
state == INTERESTING).

HTH!

-- 
John.
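[Editor's note: the grouping John describes in his last paragraph could be sketched as follows (modern Python 3; the function name is my own invention).]

```python
# Collect each BEGIN/END block as its own sublist, as suggested above:
# append a fresh empty list on BEGIN, then append each interesting line
# to the list at position -1.

def collect_sections(lines):
    INTERESTING, BORING = 'interesting', 'boring'
    state = BORING
    sections = []
    for line in lines:
        line = line.strip()
        if line == 'BEGIN_INTERESTING_BIT':
            state = INTERESTING
            sections.append([])           # start a new section
        elif line == 'END_INTERESTING_BIT':
            state = BORING
        elif state == INTERESTING:
            sections[-1].append(line)     # add to the current section
    return sections

text = ['x\n', 'BEGIN_INTERESTING_BIT\n', 'a\n', 'END_INTERESTING_BIT\n',
        'BEGIN_INTERESTING_BIT\n', 'b\n', 'c\n', 'END_INTERESTING_BIT\n']
print(collect_sections(text))   # -> [['a'], ['b', 'c']]
```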
[Tutor] Parsing text file
I'm looking for a more elegant way to parse sections of text files that
are bordered by BEGIN/END delimiting phrases, like this:

some text
some more text
BEGIN_INTERESTING_BIT
someline1
someline2
someline3
END_INTERESTING_BIT
more text
more text

What I have been doing is clumsy, involving converting to a string and
slicing out the required section using split('DELIMITER'):

    import sys
    infile = open(sys.argv[1], 'r')
    # join list elements with @ character into a string
    fileStr = '@'.join(infile.readlines())
    # Slice out the interesting section with split, then split again
    # into lines using @
    resultLine = fileStr.split('BEGIN_INTERESTING_BIT')[1].split('END_INTERESTING_BIT')[0].split('@')
    for line in resultLine:
        do things

Can anyone point me at a better way to do this?

Thanks

-- 
Alan Wardroper
[EMAIL PROTECTED]
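[Editor's note: for comparison, the same slice-out-the-middle idea from the question can be written without the '@' sentinel by keeping real newlines; a sketch using str.partition, which also avoids the IndexError that split()[1] raises when the delimiter is absent. The function name is illustrative.]

```python
# Same approach as the question (cut out the text between the two
# delimiters), but using partition() on the whole string so no '@'
# join/split round-trip is needed.

def interesting_section(text):
    _, found, rest = text.partition('BEGIN_INTERESTING_BIT')
    if not found:
        return []                     # no BEGIN delimiter at all
    middle, _, _ = rest.partition('END_INTERESTING_BIT')
    return [ln for ln in middle.splitlines() if ln.strip()]

sample = """some text
BEGIN_INTERESTING_BIT
someline1
someline2
END_INTERESTING_BIT
more text
"""
print(interesting_section(sample))   # -> ['someline1', 'someline2']
```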
Re: [Tutor] Parsing text file with Python
"Jay Mutter III" <[EMAIL PROTECTED]> wrote

> 1.) Are there better ways to write this?

There are always other ways; as to which is better depends on your
judgement criteria. Your way works.

> 2.) As it writes out the one group to the new file for companies it
> is as if it leaves blank lines behind, for if I don't have the
> elif len(line) > 1 the inventor's file has blank lines in it.

I'm not sure what you mean here; can you elaborate, maybe with some
sample data?

> 3.) I reopened the inventor's file to get a count of lines but is
> there a better way to do this?

You could track the numbers of items being written as you go. The only
disadvantage of your technique is the time involved in opening the
file and rereading the data then counting it. On a really big file
that could take a long time. But it has the big advantage of
simplicity.

A couple of points:

> in_filename = raw_input('What is the COMPLETE name of the file you
> would like to process?')
> in_file = open(in_filename, 'rU')

You might want to put your file opening code inside a try/except in
case the file isn't there or is locked.

> text = in_file.readlines()
> count = len(text)
> print "There are ", count, 'lines to process in this file'

Unless this is really useful info, you could simplify by omitting the
readlines and count and just iterating over the file. If you use
enumerate you even get the final count for free at the end:

    for count, line in enumerate(in_file):
        # count is the line number, line the data

> for line in text:
>     if line.endswith(')\n'):
>         companies.write(line)
>     elif line.endswith(') \n'):
>         companies.write(line)

You could use a boolean or to combine these:

    if line.endswith(')\n') or line.endswith(') \n'):
        companies.write(line)

> in_filename2 = raw_input('What was the name of the inventor\'s
> file ?')

Given you opened it, surely you already know? It should be stored in
patentdata so you don't need to ask again?
Also, you could use flush() and then seek(0) and then readlines()
before closing the file to get the count, but frankly that's being
picky.

> in_file2 = open(in_filename2, 'rU')
> text2 = in_file2.readlines()
> count = len(text2)

Well done,

-- 
Alan Gauld
Author of the Learn to Program web site
http://www.freenetpages.co.uk/hp/alan.gauld
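[Editor's note: putting Alan's suggestions together (try/except around open(), enumerate instead of readlines(), the two endswith tests merged) might look like this sketch in modern Python 3. The function name and filenames are placeholders, not from the thread.]

```python
# Sketch combining the advice above: guard the open() with try/except,
# count lines with enumerate while iterating, and merge the two
# endswith() checks into one test.

def split_records(in_filename, companies_name, inventors_name):
    try:
        in_file = open(in_filename, 'r')
    except IOError as e:                  # file missing or locked
        print('Could not open %s: %s' % (in_filename, e))
        return 0
    count = 0
    with in_file, open(companies_name, 'a') as companies, \
            open(inventors_name, 'a') as patentdata:
        for count, line in enumerate(in_file, 1):
            if line.endswith(')\n') or line.endswith(') \n'):
                companies.write(line)     # company lines end with ')'
            elif len(line) > 1:           # skip blank lines
                patentdata.write(line)
    print('Processed %d lines' % count)
    return count
```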
[Tutor] Parsing text file with Python
Script I have to date is below, and thanks to your help I can see some
daylight, but I still have a few questions:

1.) Are there better ways to write this?

2.) As it writes out the one group to the new file for companies it is
as if it leaves blank lines behind, for if I don't have the
elif len(line) > 1 the inventor's file has blank lines in it.

3.) I reopened the inventor's file to get a count of lines but is
there a better way to do this?

Thanks

    in_filename = raw_input('What is the COMPLETE name of the file you would like to process?')
    in_file = open(in_filename, 'rU')

    text = in_file.readlines()
    count = len(text)
    print "There are ", count, 'lines to process in this file'

    out_filename1 = raw_input('What is the COMPLETE name of the file in which you would like to save Companies?')
    companies = open(out_filename1, 'a')
    out_filename2 = raw_input('What is the COMPLETE name of the file in which you would like to save Inventors?')
    patentdata = open(out_filename2, 'a')

    for line in text:
        if line.endswith(')\n'):
            companies.write(line)
        elif line.endswith(') \n'):
            companies.write(line)
        elif len(line) > 1:
            patentdata.write(line)

    in_file.close()
    companies.close()
    patentdata.close()

    in_filename2 = raw_input('What was the name of the inventor\'s file ?')
    in_file2 = open(in_filename2, 'rU')
    text2 = in_file2.readlines()
    count = len(text2)
    print "There are - well until we clean up more - approximately ", count, 'inventors in this file'

___
Tutor maillist - Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor