Re: [Tutor] Parsing text file

2007-05-14 Thread Dave Kuhlman
On Sun, May 13, 2007 at 03:04:36PM -0700, Alan wrote:
> I'm looking for a more elegant way to parse sections of text files that
> are bordered by BEGIN/END delimiting phrases, like this:
>
> some text
> some more text
> BEGIN_INTERESTING_BIT
> someline1
> someline2
> someline3
> END_INTERESTING_BIT
> more text
> more text
>
> What I have been doing is clumsy, involving converting to a string and
> slicing out the required section using split('DELIMITER'):
>
> import sys
> infile = open(sys.argv[1], 'r')
> #join list elements with @ character into a string
> fileStr = '@'.join(infile.readlines())
> #Slice out the interesting section with split, then split again into
> #lines using @
> resultLine = fileStr.split('BEGIN_INTERESTING_BIT')[1].split('END_INTERESTING_BIT')[0].split('@')
> for line in resultLine:
>     do things
>
> Can anyone point me at a better way to do this?
 

Possibly over-kill, but ...

How much fun are you interested in having?  Others have given you
the low fun easy way.  Now ask yourself whether this task is
likely to become more complex (the interesting parts more hidden in
a more complex grammar) and perhaps you also can't wait to have
some fun.  If so, consider this suggestion:

1. Write grammar rules that describe your input text.  In your
   case, those rules might look something like the following:

   Seq ::= {InterestingChunk | UninterestingChunk}*
   InterestingChunk ::= BeginToken InterestingSeq EndToken
   InterestingSeq ::= InterestingChunk*


2. For each rule, write a Python function that tries to recognize
   what the rule describes.  To do its job, each function might
   call other functions that implement other grammar rules and
   might call a tokenizer function (see below) when it needs
   another token from the input stream.  Example:

   def InterestingChunk_reco(self):
       if self.token_type == Tok_Begin:
           self.get_token()
           if self.InterestingSeq_reco():
               if self.token_type == Tok_End:
                   self.get_token()
                   return True
               else:
                   self.Error('bad interesting sequence')

3. Write a tokenizer function.  Each time this function is called,
   it returns the next token (probably a word) from the input
   stream and a code that indicates the token type.  Recognizer
   functions call this tokenizer function each time another token
   is needed.  In your case there might be 3 token types: (1) plain
   word, (2) BeginTok, and (3) EndTok.

If you do the above, you have just written your first recursive
descent parser.
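
The three steps above can be sketched roughly as follows. This is a
minimal illustration in Python 3, not the attached solution: the names
ChunkParser, tokenize, seq_reco and so on are made up for the example,
the delimiter spellings are taken from the original question, and it
assumes InterestingSeq may contain plain words as well as nested chunks
(the grammar as posted does not spell that out).

```python
# Token types: the three kinds described above, plus end-of-input.
TOK_EOF, TOK_BEGIN, TOK_END, TOK_WORD = range(4)

BEGIN_MARK = "BEGIN_INTERESTING_BIT"   # assumed delimiter spellings
END_MARK = "END_INTERESTING_BIT"


def tokenize(text):
    """Yield (token_type, word) pairs for each whitespace-separated word."""
    for word in text.split():
        if word == BEGIN_MARK:
            yield TOK_BEGIN, word
        elif word == END_MARK:
            yield TOK_END, word
        else:
            yield TOK_WORD, word
    yield TOK_EOF, None


class ChunkParser:
    """Hypothetical recognizer: one method per grammar rule."""

    def __init__(self, text):
        self.tokens = tokenize(text)
        self.found = []              # words seen inside interesting chunks
        self.get_token()

    def get_token(self):
        self.token_type, self.token = next(self.tokens)

    def seq_reco(self):
        # Seq ::= {InterestingChunk | UninterestingChunk}*
        while self.token_type != TOK_EOF:
            if self.token_type == TOK_BEGIN:
                self.interesting_chunk_reco()
            elif self.token_type == TOK_WORD:
                self.get_token()     # uninteresting word: skip it
            else:
                raise SyntaxError("END without a matching BEGIN")
        return self.found

    def interesting_chunk_reco(self):
        # InterestingChunk ::= BeginToken InterestingSeq EndToken
        self.get_token()             # consume BeginToken
        self.interesting_seq_reco()
        if self.token_type != TOK_END:
            raise SyntaxError("BEGIN without a matching END")
        self.get_token()             # consume EndToken

    def interesting_seq_reco(self):
        # Assumed rule: InterestingSeq ::= {Word | InterestingChunk}*
        while self.token_type in (TOK_WORD, TOK_BEGIN):
            if self.token_type == TOK_WORD:
                self.found.append(self.token)
                self.get_token()
            else:
                self.interesting_chunk_reco()


sample = """some text
BEGIN_INTERESTING_BIT
someline1
someline2
END_INTERESTING_BIT
more text"""
interesting = ChunkParser(sample).seq_reco()
print(interesting)
```

Because interesting_seq_reco calls back into interesting_chunk_reco,
nested BEGIN/END pairs are handled by the same two functions, which is
the "recursive" part of recursive descent.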

Then, the next time you are at a party, beer bar, or wedding, any
time the conversation comes even remotely close to the subject of
parsing text, you say, "Well, for that kind of problem I usually
write a recursive descent parser.  It's the most powerful way and
the only way to go."  ... Now, that's how to impress your friends
and relations.

But, seriously, recursive descent parsers are quite easy and are a
useful technique to have in your tool bag.  And, like I said above:
It's fun.

Besides, if your problem becomes more complex, and, for example,
the input is not quite so line oriented, you will need a more
powerful approach.

Wikipedia has a better explanation than mine plus an example and
links: http://en.wikipedia.org/wiki/Recursive_descent_parser

I've attached a sample solution and sample input.

Also, be aware that there are parse generators for Python.

Dave


-- 
Dave Kuhlman
http://www.rexx.com/~dkuhlman
#!/usr/bin/env python
# -*- mode: pymode; coding: latin1; -*-

"""
Recognize and print out interesting parts of input.
A recursive descent parser is used to scan the input.

Usage:
    python recursive_descent_parser.py [options] infile
Options:
    -h, --help      Display this help message.
Example:
    python recursive_descent_parser.py infile

Grammar:
    Seq ::= {InterestingChunk | UninterestingChunk}*
    InterestingChunk ::= BeginToken InterestingSeq EndToken
    InterestingSeq ::= InterestingChunk*

"""


#
# Imports

import sys
import getopt


#
# Globals and constants

# Token types:
Tok_EOF, Tok_Begin, Tok_End, Tok_Word = range(1, 5)


#
# Classes

class InterestingParser(object):
    def __init__(self, infilename=None):
        self.current_token = None
        if infilename:
            self.infilename = infilename
            self.read_input()
            #print self.input
            self.get_token()
    def read_input(self):
        self.infile = open(self.infilename, 'r')
        self.input = []
        for line in self.infile:
            self.input.extend(line.rstrip('\n').split(' '))
        self.infile.close()
        self.input_iterator = iter(self.input)
    def parse(self):
        return self.Seq_reco()
    def get_token(self):
        try:
            token = self.input_iterator.next()
        except StopIteration, e:
            token = None
        self.token = token
        if token is None:

[Tutor] Parsing text file

2007-05-13 Thread Alan
I'm looking for a more elegant way to parse sections of text files that 
are bordered by BEGIN/END delimiting phrases, like this:

some text
some more text
BEGIN_INTERESTING_BIT
someline1
someline2
someline3
END_INTERESTING_BIT
more text
more text

What I have been doing is clumsy, involving converting to a string and 
slicing out the required section using split('DELIMITER'): 

import sys
infile = open(sys.argv[1], 'r')
#join list elements with @ character into a string
fileStr = '@'.join(infile.readlines())
#Slice out the interesting section with split, then split again into
#lines using @
resultLine = fileStr.split('BEGIN_INTERESTING_BIT')[1].split('END_INTERESTING_BIT')[0].split('@')
for line in resultLine:
    do things

Can anyone point me at a better way to do this?

Thanks

-- 
--
Alan Wardroper
[EMAIL PROTECTED]

___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] Parsing text file

2007-05-13 Thread Alan Gauld

Alan [EMAIL PROTECTED] wrote

> I'm looking for a more elegant way to parse sections of text files
> that are bordered by BEGIN/END delimiting phrases, like this:
>
> some text
> BEGIN_INTERESTING_BIT
> someline1
> someline3
> END_INTERESTING_BIT
> more text
>
> What I have been doing is clumsy, involving converting to a string
> and slicing out the required section using split('DELIMITER'):

The method I usually use is only slightly less clunky - or maybe
just as clunky!

I iterate over the lines setting a flag at the start and unsetting
it at the end. Pseudo code:

amInterested = False
for line in textfile:
    if isEndPattern(line):
        amInterested = False    # end marker: stop storing
    if amInterested:
        storeLine(line)
    if line.find(begin_pattern) != -1:
        amInterested = True     # begin marker: store from the next line

Whether that's any better than joining/splitting is debatable.
(Obviously you need to write the isEndPattern and storeLine helper
functions too.)
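
For anyone who wants to run the flag idea directly, here is a small
self-contained sketch in Python 3. The function name interesting_lines
and the exact-match test on the marker lines are assumptions made for
the example; real data may need a looser test such as find() or a
regular expression.

```python
BEGIN = "BEGIN_INTERESTING_BIT"
END = "END_INTERESTING_BIT"


def interesting_lines(lines, begin=BEGIN, end=END):
    """Return the lines found between begin and end marker lines."""
    kept = []
    am_interested = False
    for line in lines:
        stripped = line.strip()
        if stripped == end:
            am_interested = False        # end marker: stop collecting
        elif stripped == begin:
            am_interested = True         # begin marker: collect from next line
        elif am_interested:
            kept.append(line.rstrip("\n"))
    return kept


sample = [
    "some text\n",
    "BEGIN_INTERESTING_BIT\n",
    "someline1\n",
    "someline2\n",
    "END_INTERESTING_BIT\n",
    "more text\n",
]
print(interesting_lines(sample))
```

Since the function only iterates over its argument, an open file
object can be passed in place of the list.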

Alan G. 




[Tutor] parsing text

2007-03-24 Thread Jay Mutter III
Kent, thanks for this, as I was clearly confused about the difference
between a string and a list of strings.
I am, however, still having difficulty solving a related problem.


I have the following text:

Barnett, John B., assignor of one-half to R. N. Tutt, Kansas City,  
Mo.Automatic display-sign.No. 1,330 411-Apr. 13 ; v. 273 ; p.  
193.
Barnett,  John  II..  Tettenhall,  England. Seat  of   
motorcars.No. 1.353,708; Sept. 21 ; v. 278; p. 487. Barnett, Otto  
R.(See Scott, John M., assignor.)

Barnett. Otto R. (See Sponenburg, Hiram H., assignor)
Barnett, William A., Lincoln. Nebr.Attachment for garment- 
turning   machines. No.   1,342,937;   June   8 ?   v 270 ; p. 313.
Barnhart, Clarence D., Brooklyn, assignor to W. S. Rockwell Company,  
New York. N. Y.Conveyer for furnaces No. 1.333.371 ; Mar. 9 ; v.  
272 ; p. 278.
Barnhart, Clarence v., Waynesboro, Pa., assignor to J. K. Hoffman and  
W. M. Raeclitel.  Hagerstowu, Md. Seed-planter.No. 1,357.43S:  
Nov. 2; v. 280: p. 45.

Barnhart, John E.(See Haves, J. P.. and Barnhart )
Barnhart,-Mollie E.(See Freeman. Alpheus J., assignor) Barnhill,  
E. B., and J. Stone, Indianapolis, Ind.Auto-tire 477513


1.) When I do readlines and create a list and then print the list, it
adds a blank line between every line of text.
2.) In the second line, after p. 487, there is the beginning of a new
line of data, only it isn't on a new line.
I tried string.replace(s,'p.','\n') in an attempt to put a CR in, but
it just put the characters \n in the string.


Ideas?

Thanks again

jay



Jay Mutter III wrote:
> Thanks for the response
> Actually the number of lines this returns is the same number of lines
> given when i put it in a text editor (TextWrangler).
> Luke had mentioned the same thing earlier but when I do change read to
> readlines i get the following
>
> Traceback (most recent call last):
>   File "extract_companies.py", line 17, in ?
>     count = len(text.splitlines())
> AttributeError: 'list' object has no attribute 'splitlines'

I think maybe you are confused about the difference between all the
text of a file in a single string and all the lines of a file in a
list of strings.

When you open() a file and read() the contents, you get all the text of
a file in a single string. len() will give you the length of the string
(the total file size) and iterating over the string gives you one
character at a time.

Here is an example of a string:
In [1]: s = 'This is text'
In [2]: len(s)
Out[2]: 12
In [3]: for i in s:
   ...:     print i
   ...:
   ...:
T
h
i
s

i
s

t
e
x
t

On the other hand, if you open() the file and then readlines() from the
file, the result is a list of strings, each of which is the contents of
one line of the file, up to and including the newline. len() of the list
is the number of lines in the list, and iterating the list gives each
line in turn.

Here is an example of a list of strings:
In [4]: l = [ 'line1', 'line2' ]
In [5]: len(l)
Out[5]: 2
In [6]: for i in l:
   ...:     print i
   ...:
   ...:
line1
line2

Notice that s and l are *used* exactly the same way with len() and for,
but the results are different.

As a further wrinkle, there are two easy ways to get all the lines in a
file and they give slightly different results.

open(...).readlines() returns a list of lines in the file and each line
includes the final newline if it was in the file. (The last line will
not include a newline if the last line of the file did not.)

open(...).read().splitlines() also gives a list of lines in the file,
but the newlines are not included.
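
The three ways of reading can be compared side by side. This is a
small sketch in Python 3 that uses io.StringIO as a stand-in for a file
opened with open(), so it runs without any file on disk; the sample
content is made up.

```python
import io

content = "line1\nline2\nline3"   # note: no newline after the last line

whole = io.StringIO(content).read()                 # one string
as_lines = io.StringIO(content).readlines()         # list, newlines kept
split = io.StringIO(content).read().splitlines()    # list, newlines dropped

print(len(whole))      # number of characters in the file
print(as_lines)
print(split)
```

Here len(whole) counts characters while len(as_lines) and len(split)
both count lines; only as_lines keeps the trailing newline on each line
that had one.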

HTH,
Kent





Re: [Tutor] parsing text

2007-03-24 Thread Kent Johnson
Alan Gauld wrote:

>> 1.) when i do readlines and create a list and then print the list it
>> adds a blank line between every line of text
>
> I suspect that's because you are reading a newline character
> from the file and print adds a newline of its own. You need to
> use rstrip() to take out the newline from the file.

Or use sys.stdout.write() instead of print; it doesn't add a newline.
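
Both options, sketched in Python 3 (where print is a function) with
io.StringIO standing in for the real file and made-up content:

```python
import io
import sys

lines = io.StringIO("first\nsecond\n").readlines()

# Option 1: strip the trailing newline, then let print add its own.
stripped = [line.rstrip("\n") for line in lines]
for line in stripped:
    print(line)

# Option 2: write each line unchanged; write() adds nothing of its own.
for line in lines:
    sys.stdout.write(line)
```

Either loop prints the two lines with no blank lines in between.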

Kent


[Tutor] Parsing text file with Python

2007-03-23 Thread Jay Mutter III
The script I have to date is below. Thanks to your help I can see
some daylight, but I still have a few questions:

1.) Are there better ways to write this?
2.) As it writes out the one group to the new file for companies, it
is as if it leaves blank lines behind: if I don't have the
elif len(line) > 1 test, the inventor's file has blank lines in it.
3.) I reopened the inventor's file to get a count of lines but is  
there a better way to do this?

Thanks



in_filename = raw_input('What is the COMPLETE name of the file you would like to process?')
in_file = open(in_filename, 'rU')
text = in_file.readlines()
count = len(text)
print "There are ", count, 'lines to process in this file'
out_filename1 = raw_input('What is the COMPLETE name of the file in which you would like to save Companies?')
companies = open(out_filename1, 'aU')
out_filename2 = raw_input('What is the COMPLETE name of the file in which you would like to save Inventors?')
patentdata = open(out_filename2, 'aU')
for line in text:
    if line.endswith(')\n'):
        companies.write(line)
    elif line.endswith(') \n'):
        companies.write(line)
    elif len(line) > 1:
        patentdata.write(line)
in_file.close()
companies.close()
patentdata.close()
in_filename2 = raw_input('What was the name of the inventor\'s file ?')
in_file2 = open(in_filename2, 'rU')
text2 = in_file2.readlines()
count = len(text2)
print "There are - well until we clean up more - approximately ", count, 'inventor\s in this file'


Re: [Tutor] Parsing text file with Python

2007-03-23 Thread Alan Gauld
Jay Mutter III [EMAIL PROTECTED] wrote
> 1.)  Are there better ways to write this?

There are always other ways; which is better depends
on your criteria. Your way works.

> 2.) As it writes out the one group to the new file for companies it
> is as if it leaves blank lines behind for if I don't have the
> elif len(line) > 1 the inventor's file has blank lines in it.

I'm not sure what you mean here. Can you elaborate,
maybe with some sample data?

> 3.) I reopened the inventor's file to get a count of lines but is
> there a better way to do this?

You could track the number of items being written as you go.
The only disadvantage of your technique is the time involved
in opening the file, rereading the data and then counting it.
On a really big file that could take a long time. But it has
the big advantage of simplicity.
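
Counting as you go might look something like this. It is a sketch in
Python 3; the sample lines are made up, and io.StringIO objects stand
in for the two output files so that nothing is written to disk.

```python
import io

text = [
    "Barnett. Otto R. (See Sponenburg, Hiram H., assignor)\n",
    "\n",
    "Barnett, William A., Lincoln. Nebr.\n",
]
companies = io.StringIO()      # stand-ins for the two output files
patentdata = io.StringIO()

company_count = 0
inventor_count = 0
for line in text:
    if line.endswith(")\n") or line.endswith(") \n"):
        companies.write(line)
        company_count += 1
    elif len(line) > 1:        # skip lines that hold only a newline
        patentdata.write(line)
        inventor_count += 1

print("companies:", company_count, "inventors:", inventor_count)
```

The counters are ready as soon as the loop ends, so the inventor's
file never has to be reopened.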

A couple of points:

> in_filename = raw_input('What is the COMPLETE name of the file you
> would like to process?')
> in_file = open(in_filename, 'rU')

You might want to put your file opening code inside a try/except
in case the file isn't there or is locked.
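
A minimal sketch of that, in Python 3. The helper name open_input and
the file name used to trigger the error are made up for the example.

```python
def open_input(filename):
    """Return an open file object, or None if the file cannot be opened."""
    try:
        return open(filename, "r")
    except OSError as err:     # covers missing files and permission errors
        print("Could not open %r: %s" % (filename, err))
        return None


# This file is assumed not to exist, so the except branch runs.
in_file = open_input("no_such_file_for_this_example.txt")
```

In a real script you would probably ask for the name again, or exit,
when None comes back.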

> text = in_file.readlines()
> count = len(text)
> print "There are ", count, 'lines to process in this file'

Unless this is really useful info you could simplify by
omitting the readlines and count and just iterating over
the file. If you use enumerate you even get the final
count for free at the end.

for count,line in enumerate(in_file):
    # count is the zero-based line number (add 1 for the total), line the data
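
A runnable version of the same idea, in Python 3, with io.StringIO
standing in for the opened file. Passing 1 as the second argument to
enumerate makes the numbering start at 1, so the last value of count
is the number of lines read.

```python
import io

in_file = io.StringIO("alpha\nbeta\ngamma\n")   # stand-in for a real file

count = 0                                       # covers an empty file
for count, line in enumerate(in_file, 1):       # start numbering at 1
    pass                                        # process line here

print("There were", count, "lines")
```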

> for line in text:
>     if line.endswith(')\n'):
>         companies.write(line)
>     elif line.endswith(') \n'):
>         companies.write(line)

You could use a boolean or to combine these:

if line.endswith(')\n') or line.endswith(') \n'):
    companies.write(line)
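
Two other spellings of the same test, sketched in Python 3; the
function names are made up. The first relies on str.endswith accepting
a tuple of suffixes; the second strips trailing whitespace first, which
is looser than the original because it also accepts tabs or several
spaces before the newline.

```python
def is_company_line(line):
    # str.endswith accepts a tuple of suffixes to try.
    return line.endswith((")\n", ") \n"))


def is_company_line_loose(line):
    # Alternative: ignore any trailing whitespace before checking.
    return line.rstrip().endswith(")")


print(is_company_line("Barnhart, John E.(See Haves, J. P.. and Barnhart )\n"))
print(is_company_line_loose("Barnett, Otto R.(See Scott, John M., assignor.)  \n"))
```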

> in_filename2 = raw_input('What was the name of the inventor\'s
> file ?')

Given you opened it, surely you already know?
It is still stored in out_filename2 (and in patentdata.name),
so you don't need to ask again.

Also you could use flush() and then seek(0) and then readlines()
before closing the file to get the count (the file would have to be
opened in a mode that also allows reading, such as 'a+'). But frankly
that's being picky.
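
For completeness, a sketch of the flush/seek approach in Python 3. It
assumes the output file is opened in 'a+' mode, since a file opened
with plain 'a' cannot be read back; the temporary path and the sample
lines are made up for the example.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "inventors.txt")

# 'a+' appends like 'a' but also allows reading.
with open(path, "a+") as patentdata:
    patentdata.write("Barnett, John B.\n")
    patentdata.write("Barnhart, John E.\n")
    patentdata.flush()          # push buffered writes out to the file
    patentdata.seek(0)          # move the read position to the start
    count = len(patentdata.readlines())

print(count)
```

Note that this counts every line in the file, including any that were
already there before this run appended to it.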


> in_file2 = open(in_filename2, 'rU')
> text2 = in_file2.readlines()
> count = len(text2)

Well done,


-- 
Alan Gauld
Author of the Learn to Program web site
http://www.freenetpages.co.uk/hp/alan.gauld 

