Re: [Tutor] Parsing text file
On Sun, May 13, 2007 at 03:04:36PM -0700, Alan wrote:
> I'm looking for a more elegant way to parse sections of text files that
> are bordered by BEGIN/END delimiting phrases, like this:
>
> some text
> some more text
> BEGIN_INTERESTING_BIT
> someline1
> someline2
> someline3
> END_INTERESTING_BIT
> more text
> more text
>
> What I have been doing is clumsy, involving converting to a string and
> slicing out the required section using split('DELIMITER'):
>
> import sys
> infile = open(sys.argv[1], 'r')
> #join list elements with @ character into a string
> fileStr = '@'.join(infile.readlines())
> #Slice out the interesting section with split, then split again into lines using @
> resultLine = fileStr.split('BEGIN_INTERESTING_BIT')[1].split('END_INTERESTING_BIT')[0].split('@')
> for line in resultLine:
>     do things
>
> Can anyone point me at a better way to do this?

Possibly overkill, but ...

How much fun are you interested in having? Others have given you the
low-fun, easy way. Now ask yourself whether this task is likely to become
more complex (the interesting parts more hidden in a more complex grammar)
and whether you perhaps also can't wait to have some fun. If so, consider
this suggestion:

1. Write grammar rules that describe your input text. In your case, those
   rules might look something like the following:

       Seq ::= {InterestingChunk | UninterestingChunk}*
       InterestingChunk ::= BeginToken InterestingSeq EndToken
       InterestingSeq ::= InterestingChunk*

2. For each rule, write a Python function that tries to recognize what the
   rule describes. To do its job, each function might call other functions
   that implement other grammar rules and might call a tokenizer function
   (see below) when it needs another token from the input stream. Example:

       def InterestingChunk_reco(self):
           if self.token_type == Tok_Begin:
               self.get_token()
               if self.InterestingSeq_reco():
                   if self.token_type == Tok_End:
                       self.get_token()
                       return True
               else:
                   self.Error('bad interesting sequence')

3. Write a tokenizer function.
   Each time this function is called, it returns the next token (probably
   a word) from the input stream and a code that indicates the token type.
   Recognizer functions call this tokenizer function each time another
   token is needed. In your case there might be 3 token types: (1) plain
   word, (2) BeginTok, and (3) EndTok.

If you do the above, you have just written your first recursive descent
parser.

Then, the next time you are at a party, beer bar, or wedding, any time the
conversation comes even remotely close to the subject of parsing text, you
say, "Well, for that kind of problem I usually write a recursive descent
parser. It's the most powerful way and the only way to go. ..."

Now, that's how to impress your friends and relations.

But, seriously, recursive descent parsers are quite easy and are a useful
technique to have in your tool bag. And, like I said above: It's fun.

Besides, if your problem becomes more complex and, for example, the input
is not quite so line oriented, you will need a more powerful approach.

Wikipedia has a better explanation than mine plus an example and links:

    http://en.wikipedia.org/wiki/Recursive_descent_parser

I've attached a sample solution and sample input.

Also, be aware that there are parser generators for Python.

Dave

--
Dave Kuhlman
http://www.rexx.com/~dkuhlman

#!/usr/bin/env python
# -*- mode: pymode; coding: latin1; -*-
"""
Recognize and print out interesting parts of input.
A recursive descent parser is used to scan the input.

Usage:
    python recursive_descent_parser.py [options] infile
Options:
    -h, --help      Display this help message.
Example:
    python recursive_descent_parser.py infile

Grammar:
    Seq ::= {InterestingChunk | UninterestingChunk}*
    InterestingChunk ::= BeginToken InterestingSeq EndToken
    InterestingSeq ::= InterestingChunk*
"""

#
# Imports

import sys
import getopt

#
# Globals and constants

# Token types:
Tok_EOF, Tok_Begin, Tok_End, Tok_Word = range(1, 5)

#
# Classes

class InterestingParser(object):
    def __init__(self, infilename=None):
        self.current_token = None
        if infilename:
            self.infilename = infilename
            self.read_input()
            #print self.input
            self.get_token()

    def read_input(self):
        self.infile = open(self.infilename, 'r')
        self.input = []
        for line in self.infile:
            self.input.extend(line.rstrip('\n').split(' '))
        self.infile.close()
        self.input_iterator = iter(self.input)

    def parse(self):
        return self.Seq_reco()

    def get_token(self):
        try:
            token = self.input_iterator.next()
        except StopIteration, e:
            token = None
        self.token = token
        if token is None:
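
The attachment breaks off at this point in the archive. As a stand-in, here is a minimal, self-contained sketch of the same three steps (grammar, recognizer functions, tokenizer). It is not Dave's script: it is written for Python 3, the names `tokenize`, `ParseError`, `interesting_chunk` and `interesting_seq` are made up for illustration, and it assumes the rule `InterestingSeq ::= {Word | InterestingChunk}*` so that plain words are allowed inside a chunk.

```python
# Token types (hypothetical constants, modelled on the Tok_* names above).
TOK_EOF, TOK_BEGIN, TOK_END, TOK_WORD = range(4)

BEGIN_MARK = "BEGIN_INTERESTING_BIT"
END_MARK = "END_INTERESTING_BIT"


class ParseError(Exception):
    """Raised when BEGIN/END markers do not pair up."""


def tokenize(text):
    """Yield (token_type, word) pairs, finishing with one EOF token."""
    for word in text.split():
        if word == BEGIN_MARK:
            yield TOK_BEGIN, word
        elif word == END_MARK:
            yield TOK_END, word
        else:
            yield TOK_WORD, word
    yield TOK_EOF, None


class InterestingParser:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.found = []          # one list of words per top-level chunk
        self.get_token()

    def get_token(self):
        self.token_type, self.token = next(self.tokens)

    def parse(self):
        # Seq ::= {InterestingChunk | UninterestingChunk}*
        while self.token_type != TOK_EOF:
            if self.token_type == TOK_BEGIN:
                words = []
                self.interesting_chunk(words)
                self.found.append(words)
            elif self.token_type == TOK_WORD:
                self.get_token()             # uninteresting word: skip it
            else:
                raise ParseError("END without matching BEGIN")
        return self.found

    def interesting_chunk(self, words):
        # InterestingChunk ::= BeginToken InterestingSeq EndToken
        self.get_token()                     # consume BEGIN
        self.interesting_seq(words)
        if self.token_type != TOK_END:
            raise ParseError("BEGIN without matching END")
        self.get_token()                     # consume END

    def interesting_seq(self, words):
        # InterestingSeq ::= {Word | InterestingChunk}*   (assumed rule)
        while self.token_type in (TOK_WORD, TOK_BEGIN):
            if self.token_type == TOK_WORD:
                words.append(self.token)
                self.get_token()
            else:
                self.interesting_chunk(words)  # nested chunk: recurse
```

Calling `InterestingParser(open(name).read()).parse()` would return one list of words per top-level chunk. Nested chunks are flattened into their parent here, which is only one of several reasonable choices.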
[Tutor] Parsing text file
I'm looking for a more elegant way to parse sections of text files that
are bordered by BEGIN/END delimiting phrases, like this:

some text
some more text
BEGIN_INTERESTING_BIT
someline1
someline2
someline3
END_INTERESTING_BIT
more text
more text

What I have been doing is clumsy, involving converting to a string and
slicing out the required section using split('DELIMITER'):

import sys
infile = open(sys.argv[1], 'r')
#join list elements with @ character into a string
fileStr = '@'.join(infile.readlines())
#Slice out the interesting section with split, then split again into lines using @
resultLine = fileStr.split('BEGIN_INTERESTING_BIT')[1].split('END_INTERESTING_BIT')[0].split('@')
for line in resultLine:
    do things

Can anyone point me at a better way to do this?

Thanks

--
Alan Wardroper

___
Tutor maillist - Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] Parsing text file
Alan wrote:
> I'm looking for a more elegant way to parse sections of text files that
> are bordered by BEGIN/END delimiting phrases, like this:
>
> some text
> BEGIN_INTERESTING_BIT
> someline1
> someline3
> END_INTERESTING_BIT
> more text
>
> What I have been doing is clumsy, involving converting to a string and
> slicing out the required section using split('DELIMITER'):

The method I usually use is only slightly less clunky - or maybe just as
clunky! I iterate over the lines, setting a flag at the start and unsetting
it at the end.

Pseudo code:

    amInterested = False
    for line in textfile:
        if amInterested:
            if isEndPattern(line):
                amInterested = False
            else:
                storeLine(line)
        elif begin_pattern in line:
            amInterested = True

Whether that's any better than joining/splitting is debatable.
(Obviously you need to write the isEndPattern and storeLine helper
functions too.)

Alan G.
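
The flag approach can be sketched as a small generator. This is an illustrative Python 3 version, not Alan G.'s own code: the function name `interesting_lines` and the default marker strings are assumptions taken from the sample data in the question.

```python
BEGIN = "BEGIN_INTERESTING_BIT"
END = "END_INTERESTING_BIT"


def interesting_lines(lines, begin=BEGIN, end=END):
    """Yield the lines that sit between a begin marker and the next end marker."""
    inside = False                      # the "amInterested" flag
    for line in lines:
        if inside:
            if end in line:
                inside = False          # end marker: stop collecting
            else:
                yield line.rstrip("\n")
        elif begin in line:
            inside = True               # begin marker: collect from the next line on
```

Because a file object is itself an iterable of lines, something like `for line in interesting_lines(open(sys.argv[1])):` should work without reading the whole file into memory first.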
[Tutor] parsing text
Kent,

Thanks for this, as I was clearly confused with regard to a string versus a
list of strings. I am, however, still having difficulty with how to solve
a problem involving a related issue. I have the following text:

Barnett, John B., assignor of one-half to R. N. Tutt, Kansas City, Mo.Automatic display-sign.No. 1,330 411-Apr. 13 ; v. 273 ; p. 193. Barnett, John II.. Tettenhall, England. Seat of motorcars.No. 1.353,708; Sept. 21 ; v. 278; p. 487. Barnett, Otto R.(See Scott, John M., assignor.) Barnett. Otto R. (See Sponenburg, Hiram H., assignor) Barnett, William A., Lincoln. Nebr.Attachment for garment- turning machines. No. 1,342,937; June 8 ? v 270 ; p. 313. Barnhart, Clarence D., Brooklyn, assignor to W. S. Rockwell Company, New York. N. Y.Conveyer for furnaces No. 1.333.371 ; Mar. 9 ; v. 272 ; p. 278. Barnhart, Clarence v., Waynesboro, Pa., assignor to J. K. Hoffman and W. M. Raeclitel. Hagerstowu, Md. Seed-planter.No. 1,357.43S: Nov. 2; v. 280: p. 45. Barnhart, John E.(See Haves, J. P.. and Barnhart ) Barnhart,-Mollie E.(See Freeman. Alpheus J., assignor) Barnhill, E. B., and J. Stone, Indianapolis, Ind.Auto-tire 477513

1.) When I do readlines and create a list and then print the list, it adds
a blank line between every line of text.

2.) In the second line, after p. 487, there is the beginning of a new line
of data, only it isn't on a new line. I tried string.replace(s,'p.','\n')
in an attempt to put a CR in, but it just put the characters \n in the
string.

Ideas?

Thanks again

jay

> Jay Mutter III wrote:
> > Thanks for the response
> > Actually the number of lines this returns is the same number of lines
> > given when i put it in a text editor (TextWrangler).
> > Luke had mentioned the same thing earlier but when I do change read to
> > readlines i get the following
> >
> > Traceback (most recent call last):
> >   File "extract_companies.py", line 17, in ?
> >     count = len(text.splitlines())
> > AttributeError: 'list' object has no attribute 'splitlines'
>
> I think maybe you are confused about the difference between "all the text
> of a file in a single string" and "all the lines of a file in a list of
> strings."
>
> When you open() a file and read() the contents, you get all the text of a
> file in a single string. len() will give you the length of the string (the
> total file size) and iterating over the string gives you one character at
> a time.
>
> Here is an example of a string:
>
> In [1]: s = 'This is text'
>
> In [2]: len(s)
> Out[2]: 12
>
> In [3]: for i in s:
>    ...:     print i
>    ...:
>    ...:
> T
> h
> i
> s
>  
> i
> s
>  
> t
> e
> x
> t
>
> On the other hand, if you open() the file and then readlines() from the
> file, the result is a list of strings, each of which is the contents of
> one line of the file, up to and including the newline. len() of the list
> is the number of lines in the list, and iterating the list gives each line
> in turn.
>
> Here is an example of a list of strings:
>
> In [4]: l = [ 'line1', 'line2' ]
>
> In [5]: len(l)
> Out[5]: 2
>
> In [6]: for i in l:
>    ...:     print i
>    ...:
>    ...:
> line1
> line2
>
> Notice that s and l are *used* exactly the same way with len() and for,
> but the results are different.
>
> As a further wrinkle, there are two easy ways to get all the lines in a
> file and they give slightly different results.
>
> open(...).readlines() returns a list of lines in the file and each line
> includes the final newline if it was in the file. (The last line will not
> include a newline if the last line of the file did not.)
>
> open(...).read().splitlines() also gives a list of lines in the file, but
> the newlines are not included.
>
> HTH,
> Kent
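
Kent's three cases can be compared side by side. The sketch below is an illustration in Python 3 rather than the Python 2 of the thread, and it uses `io.StringIO` as a stand-in for an opened file so that it runs without any file on disk.

```python
import io

text = "line1\nline2\n"

# read(): the whole file as ONE string; len() counts characters.
as_string = io.StringIO(text).read()

# readlines(): a LIST of strings, each still ending in its newline.
as_lines = io.StringIO(text).readlines()

# read().splitlines(): a list of strings with the newlines removed.
stripped = io.StringIO(text).read().splitlines()

print(len(as_string))   # 12
print(as_lines)         # ['line1\n', 'line2\n']
print(stripped)         # ['line1', 'line2']
```

The AttributeError in the traceback follows directly from this: `splitlines()` is a string method, so it is available on the result of `read()` but not on the list that `readlines()` returns.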
Re: [Tutor] parsing text
Alan Gauld wrote:
> > 1.) when i do readlines and create a list and then print the list it
> > adds a blank line between every line of text
>
> I suspect that's because you are reading a newline character from the
> file and print adds a newline of its own. You need to use rstrip() to
> take out the newline from the file.

Or use sys.stdout.write() instead of print; it doesn't add a newline.

Kent
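
Both fixes can be seen next to the original symptom. This is a Python 3 sketch (where `print` is a function, unlike the print statement used in the thread); the three function names are made up for the comparison, and `out` defaults to `sys.stdout` only so that the output can be redirected.

```python
import sys


def show_doubled(lines, out=sys.stdout):
    """The symptom: each line keeps its own newline and print adds another."""
    for line in lines:
        print(line, file=out)


def show_stripped(lines, out=sys.stdout):
    """Alan's fix: strip the newline read from the file before printing."""
    for line in lines:
        print(line.rstrip("\n"), file=out)


def show_written(lines, out=sys.stdout):
    """Kent's fix: write() emits the line exactly as read, adding nothing."""
    for line in lines:
        out.write(line)
```

With `lines = open(name).readlines()`, the first function shows the blank line between entries and the other two do not.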
[Tutor] Parsing text file with Python
The script I have to date is below. Thanks to your help I can see some
daylight, but I still have a few questions:

1.) Are there better ways to write this?

2.) As it writes out the one group to the new file for companies, it is as
if it leaves blank lines behind, for if I don't have the
elif len(line) > 1 the inventor's file has blank lines in it.

3.) I reopened the inventor's file to get a count of lines, but is there a
better way to do this?

Thanks

in_filename = raw_input('What is the COMPLETE name of the file you would like to process?')
in_file = open(in_filename, 'rU')
text = in_file.readlines()
count = len(text)
print "There are ", count, 'lines to process in this file'
out_filename1 = raw_input('What is the COMPLETE name of the file in which you would like to save Companies?')
companies = open(out_filename1, 'aU')
out_filename2 = raw_input('What is the COMPLETE name of the file in which you would like to save Inventors?')
patentdata = open(out_filename2, 'aU')
for line in text:
    if line.endswith(')\n'):
        companies.write(line)
    elif line.endswith(') \n'):
        companies.write(line)
    elif len(line) > 1:
        patentdata.write(line)
in_file.close()
companies.close()
patentdata.close()
in_filename2 = raw_input('What was the name of the inventor\'s file ?')
in_file2 = open(in_filename2, 'rU')
text2 = in_file2.readlines()
count = len(text2)
print "There are - well until we clean up more - approximately ", count, 'inventor\s in this file'
Re: [Tutor] Parsing text file with Python
Jay Mutter III wrote:
> 1.) Are there better ways to write this?

There are always other ways; which is better depends on your judgement
criteria. Your way works.

> 2.) As it writes out the one group to the new file for companies it is
> as if it leaves blank lines behind for if I don't have the
> elif len(line) > 1 the inventor's file has blank lines in it.

I'm not sure what you mean here. Can you elaborate, maybe with some
sample data?

> 3.) I reopened the inventor's file to get a count of lines but is there
> a better way to do this?

You could track the number of items being written as you go. The only
disadvantage of your technique is the time involved in opening the file,
rereading the data and then counting it. On a really big file that could
take a long time. But it has the big advantage of simplicity.

A couple of points:

> in_filename = raw_input('What is the COMPLETE name of the file you
> would like to process?')
> in_file = open(in_filename, 'rU')

You might want to put your file opening code inside a try/except in case
the file isn't there or is locked.

> text = in_file.readlines()
> count = len(text)
> print "There are ", count, 'lines to process in this file'

Unless this is really useful info you could simplify by omitting the
readlines and count and just iterating over the file. If you use
enumerate you even get the final count for free at the end (as count + 1,
since enumerate starts at 0).

    for count,line in enumerate(in_file):
        # count is the zero-based line number, line the data

> for line in text:
>     if line.endswith(')\n'):
>         companies.write(line)
>     elif line.endswith(') \n'):
>         companies.write(line)

You could use a boolean or to combine these:

    if line.endswith(')\n') or line.endswith(') \n'):
        companies.write(line)

> in_filename2 = raw_input('What was the name of the inventor\'s file ?')

Given you opened it, surely you already know? It should be stored in
patentdata so you don't need to ask again?

Also you could use flush() and then seek(0) and then readlines() before
closing the file to get the count, but frankly that's being picky.

> in_file2 = open(in_filename2, 'rU')
> text2 = in_file2.readlines()
> count = len(text2)

Well done,

--
Alan Gauld
Author of the Learn to Program web site
http://www.freenetpages.co.uk/hp/alan.gauld
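
Alan's suggestions (try/except around the open, enumerate for the count, and `or` to merge the two endswith tests) fit together roughly as follows. This is a hedged Python 3 sketch, not a rewrite of Jay's script: `split_records` and `open_or_none` are hypothetical helper names, and the results are collected in lists instead of being written to files so that the logic can be seen on its own.

```python
def split_records(lines):
    """Return (line_count, company_lines, inventor_lines) for an iterable of lines."""
    companies, inventors = [], []
    count = 0                                   # stays 0 if there are no lines
    for count, line in enumerate(lines, start=1):
        if line.endswith(")\n") or line.endswith(") \n"):
            companies.append(line)              # "(See ..., assignor)" cross-references
        elif len(line) > 1:
            inventors.append(line)              # anything else that is not blank
    return count, companies, inventors


def open_or_none(path, mode="r"):
    """Open a file, reporting the problem instead of crashing if it can't be opened."""
    try:
        return open(path, mode)
    except OSError as err:
        print("Could not open", path, "-", err)
        return None
```

Starting enumerate at 1 makes `count` the number of lines read, and `len(inventors)` gives the inventor count without reopening the output file.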