Re: [Tutor] Parsing text file

2007-05-14 Thread Dave Kuhlman
On Sun, May 13, 2007 at 03:04:36PM -0700, Alan wrote:
> I'm looking for a more elegant way to parse sections of text files that 
> are bordered by BEGIN/END delimiting phrases, like this:
> 
> some text
> some more text
> BEGIN_INTERESTING_BIT
> someline1
> someline2
> someline3
> END_INTERESTING_BIT
> more text
> more text
> 
> What I have been doing is clumsy, involving converting to a string and 
> slicing out the required section using split('DELIMITER'): 
> 
> import sys
> infile = open(sys.argv[1], 'r')
> #join list elements with @ character into a string
> fileStr = '@'.join(infile.readlines())
> #Slice out the interesting section with split, then split again into lines using @
> resultLine = fileStr.split('BEGIN_INTERESTING_BIT')[1].split('END_INTERESTING_BIT')[0].split('@')
> for line in resultLine:
>     do things
> 
> Can anyone point me at a better way to do this?
> 

Possibly over-kill, but ...

How much fun are you interested in having?  Others have given you
the "low fun" easy way.  Now ask yourself whether this task is
likely to become more complex (the interesting parts more hidden in
a more complex grammar) and perhaps you also can't wait to have
some fun.  If so, consider this suggestion:

1. Write grammar rules that describe your input text.  In your
   case, those rules might look something like the following:

   Seq ::= {InterestingChunk | UninterestingChunk}*
   InterestingChunk ::= BeginToken InterestingSeq EndToken
   InterestingSeq ::= InterestingChunk*


2. For each rule, write a Python function that tries to recognize
   what the rule describes.  To do its job, each function might
   call other functions that implement other grammar rules and
   might call a tokenizer function (see below) when it needs
   another token from the input stream.  Example:

   def InterestingChunk_reco(self):
       if self.token_type == Tok_Begin:
           self.get_token()
           if self.InterestingSeq_reco():
               if self.token_type == Tok_End:
                   self.get_token()
                   return True
               else:
                   self.Error('bad interesting sequence')

3. Write a tokenizer function.  Each time this function is called,
   it returns the next "token" (probably a word) from the input
   stream and a code that indicates the token type.  Recognizer
   functions call this tokenizer function each time another token
   is needed.  In your case there might be 3 token types: (1) plain
   word, (2) BeginTok, and (3) EndTok.
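
   For example (a sketch only, not Dave's code; the delimiter strings,
   the self.input_iterator attribute, and the token-type constants are
   assumptions), a tokenizer that meshes with the recognizer in step 2
   might look roughly like this:

   def get_token(self):
       # Pull the next word from the input and classify it.
       try:
           word = self.input_iterator.next()
       except StopIteration:
           self.token, self.token_type = None, Tok_EOF
           return
       if word == 'BEGIN_INTERESTING_BIT':
           self.token, self.token_type = word, Tok_Begin
       elif word == 'END_INTERESTING_BIT':
           self.token, self.token_type = word, Tok_End
       else:
           self.token, self.token_type = word, Tok_Word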

If you do the above, you have just written your first recursive
descent parser.

Then, the next time you are at a party, beer bar, or wedding, any
time the conversation comes even remotely close to the subject of
parsing text, you say, "Well, for that kind of problem I usually
write a recursive descent parser.  It's the most powerful way and
the only way to go.  ..." Now, that's how to impress your friends
and relations.

But, seriously, recursive descent parsers are quite easy and are a
useful technique to have in your tool bag.  And, like I said above:
It's fun.

Besides, if your problem becomes more complex and, for example,
the input is not quite so line-oriented, you will need a more
powerful approach.

Wikipedia has a better explanation than mine plus an example and
links: http://en.wikipedia.org/wiki/Recursive_descent_parser

I've attached a sample solution and sample input.

Also, be aware that there are parser generators for Python.
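
For instance, with the third-party pyparsing library (a parsing
toolkit rather than a generator proper; assuming it is installed, and
the results name 'body' is only illustrative), the extraction might
look roughly like this:

import sys
from pyparsing import Literal, SkipTo

begin = Literal('BEGIN_INTERESTING_BIT')
end = Literal('END_INTERESTING_BIT')
section = begin + SkipTo(end)('body') + end

for match in section.searchString(open(sys.argv[1]).read()):
    print match.body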

Dave


-- 
Dave Kuhlman
http://www.rexx.com/~dkuhlman
#!/usr/bin/env python
# -*- mode: pymode; coding: latin1; -*-
"""
Recognize and print out interesting parts of input.
A recursive descent parser is used to scan the input.

Usage:
    python recursive_descent_parser.py [options] <infilename>
Options:
    -h, --help      Display this help message.
Example:
    python recursive_descent_parser.py infile

Grammar:
Seq ::= {InterestingChunk | UninterestingChunk}*
InterestingChunk ::= BeginToken InterestingSeq EndToken
InterestingSeq ::= InterestingChunk*
"""


#
# Imports

import sys
import getopt


#
# Globals and constants

# Token types:
Tok_EOF, Tok_Begin, Tok_End, Tok_Word = range(1, 5)


#
# Classes

class InterestingParser(object):
    def __init__(self, infilename=None):
        self.current_token = None
        if infilename:
            self.infilename = infilename
            self.read_input()
            #print self.input
            self.get_token()
    def read_input(self):
        self.infile = open(self.infilename, 'r')
        self.input = []
        for line in self.infile:
            self.input.extend(line.rstrip('\n').split(' '))
        self.infile.close()
        self.input_iterator = iter(self.input)
    def parse(self):
        return self.Seq_reco()
    def get_token(self):
        try:
            token = self.input_iterator.next()
        except StopIteration, e:
            token = None
        self.token = token
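    # --- NOTE: the original attachment appears to be truncated here. ---
    # The methods below are only a hedged sketch of how the missing
    # recognizers might look; they treat the interesting section as a
    # run of plain words rather than the nested chunks in the grammar,
    # and the classify() helper and the error handling are assumptions,
    # not part of Dave's original code.
    def classify(self):
        # Map the current word onto a token type.
        if self.token is None:
            return Tok_EOF
        elif self.token == 'BEGIN_INTERESTING_BIT':
            return Tok_Begin
        elif self.token == 'END_INTERESTING_BIT':
            return Tok_End
        else:
            return Tok_Word
    def Seq_reco(self):
        # Seq ::= {InterestingChunk | UninterestingChunk}*
        while self.classify() != Tok_EOF:
            if self.classify() == Tok_Begin:
                if not self.InterestingChunk_reco():
                    return False
            else:
                self.get_token()      # skip an uninteresting word
        return True
    def InterestingChunk_reco(self):
        # InterestingChunk ::= BeginToken InterestingSeq EndToken
        self.get_token()              # consume BEGIN_INTERESTING_BIT
        while self.classify() not in (Tok_End, Tok_EOF):
            print self.token          # print out the interesting parts
            self.get_token()
        if self.classify() == Tok_End:
            self.get_token()          # consume END_INTERESTING_BIT
            return True
        print 'Error: missing END_INTERESTING_BIT'
        return False


def main():
    # Minimal driver (also assumed): parse the file named on the command line.
    parser = InterestingParser(sys.argv[1])
    parser.parse()

if __name__ == '__main__':
    main()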

Re: [Tutor] Parsing text file

2007-05-13 Thread Alan Gauld

"Alan" <[EMAIL PROTECTED]> wrote

> I'm looking for a more elegant way to parse sections of text files that
> are bordered by BEGIN/END delimiting phrases, like this:
>
> some text
> BEGIN_INTERESTING_BIT
> someline1
> someline3
> END_INTERESTING_BIT
> more text
>
> What I have been doing is clumsy, involving converting to a string and
> slicing out the required section using split('DELIMITER'):

The method I usually use is only slightly less clunky - or maybe
just as clunky!

I iterate over the lines setting a flag at the start and unsetting
it at the end. Pseudo code:

amInterested = False
for line in textfile:
    if amInterested and not isEndPattern(line):
        storeLine(line)
    if isEndPattern(line):
        amInterested = False
    if line.find(begin_pattern) != -1:
        amInterested = True

Whether that's any better than joining/splitting is debatable.
(Obviously you need to write the isEndPattern helper
function too.)
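
A hedged sketch of those helpers (the delimiter strings and the
stored list are illustrative assumptions, not from the original post):

begin_pattern = 'BEGIN_INTERESTING_BIT'
end_pattern = 'END_INTERESTING_BIT'
stored = []

def isEndPattern(line):
    # True when the line is the END delimiter (ignoring whitespace)
    return line.strip() == end_pattern

def storeLine(line):
    stored.append(line)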

Alan G. 




Re: [Tutor] Parsing text file

2007-05-13 Thread John Fouhy
On 14/05/07, Alan <[EMAIL PROTECTED]> wrote:
> I'm looking for a more elegant way to parse sections of text files that
> are bordered by BEGIN/END delimiting phrases, like this:
>
> some text
> some more text
> BEGIN_INTERESTING_BIT
> someline1
> someline2
> someline3
> END_INTERESTING_BIT
> more text
> more text

If the structure is pretty simple, you could use a state machine approach, e.g.:

import sys
infile = open(sys.argv[1], 'r')

INTERESTING, BORING = 'interesting', 'boring'
state = BORING
interestingLines = []

for line in infile:
    line = line.strip()
    if line == 'BEGIN_INTERESTING_BIT':
        state = INTERESTING
    elif line == 'END_INTERESTING_BIT':
        state = BORING
    elif state == INTERESTING:
        interestingLines.append(line)

# interestingLines now holds the lines between the delimiters

If you want to put each group of interesting lines into its own
section, you could do a bit of extra work (append a new empty list to
interestingLines on 'BEGIN', then append to the list at position -1 on
state==INTERESTING).
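
A rough sketch of that grouping variant, building on the snippet above
(re-open the file first if the earlier loop has already consumed it):

sections = []          # one sub-list per BEGIN/END section
state = BORING

for line in infile:
    line = line.strip()
    if line == 'BEGIN_INTERESTING_BIT':
        state = INTERESTING
        sections.append([])
    elif line == 'END_INTERESTING_BIT':
        state = BORING
    elif state == INTERESTING:
        sections[-1].append(line)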

HTH!

-- 
John.


[Tutor] Parsing text file

2007-05-13 Thread Alan
I'm looking for a more elegant way to parse sections of text files that 
are bordered by BEGIN/END delimiting phrases, like this:

some text
some more text
BEGIN_INTERESTING_BIT
someline1
someline2
someline3
END_INTERESTING_BIT
more text
more text

What I have been doing is clumsy, involving converting to a string and 
slicing out the required section using split('DELIMITER'): 

import sys
infile = open(sys.argv[1], 'r')
#join list elements with @ character into a string
fileStr = '@'.join(infile.readlines())
#Slice out the interesting section with split, then split again into lines using @
resultLine = fileStr.split('BEGIN_INTERESTING_BIT')[1].split('END_INTERESTING_BIT')[0].split('@')
for line in resultLine:
    do things

Can anyone point me at a better way to do this?

Thanks

-- 
Alan Wardroper
[EMAIL PROTECTED]



Re: [Tutor] Parsing text file with Python

2007-03-23 Thread Alan Gauld
"Jay Mutter III" <[EMAIL PROTECTED]> wrote
> 1.)  Are there better ways to write this?

There are always other ways; which is better depends
on your judgement criteria. Your way works.

> 2.) As it writes out the one group to the new file for companies it
> is as if it leaves blank lines behind, for if I don't have the
> elif len(line) > 1 the inventor's file has blank lines in it.

I'm not sure what you mean here; can you elaborate,
maybe with some sample data?

> 3.) I reopened the inventor's file to get a count of lines but is
> there a better way to do this?

You could track the number of items being written as you go.
The only disadvantage of your technique is the time involved
in opening the file, rereading the data and then counting it.
On a really big file that could take a long time. But it has
the big advantage of simplicity.
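
A rough sketch of counting as you write, reusing the names from your
script (inventor_count is just an illustrative name):

inventor_count = 0
for line in text:
    if line.endswith(')\n') or line.endswith(') \n'):
        companies.write(line)
    elif len(line) > 1:
        patentdata.write(line)
        inventor_count += 1
print "There are", inventor_count, "inventors in this file"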

A couple of points:

> in_filename = raw_input('What is the COMPLETE name of the file you would like to process?')
> in_file = open(in_filename, 'rU')

You might want to put your file opening code inside a try/except
in case the file isn't there or is locked.
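
Something along these lines (a sketch only; the error message is
illustrative, and it reuses in_filename from your script):

import sys

try:
    in_file = open(in_filename, 'rU')
except IOError, e:
    print "Could not open", in_filename, "-", e
    sys.exit(1)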

> text = in_file.readlines()
> count = len(text)
> print "There are ", count, 'lines to process in this file'

Unless this is really useful info you could simplify by
omitting the readlines and count and just iterating over
the file. If you use enumerate you even get the final
count for free at the end.

for count, line in enumerate(in_file):
    # count is the line number, line the data

> for line in text:
>     if line.endswith(')\n'):
>         companies.write(line)
>     elif line.endswith(') \n'):
>         companies.write(line)

You could use a boolean or to combine these:

    if line.endswith(')\n') or line.endswith(') \n'):
        companies.write(line)

> in_filename2 = raw_input('What was the name of the inventor\'s file ?')

Given you opened it surely you already know?
It should be stored in patentdata so you don't need
to ask again?

Also, you could use flush() and then seek(0) and then readlines()
before closing the file to get the count, but frankly that's being
picky.
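
Roughly like this, though it only works if the file was opened in a
mode that allows reading back (e.g. 'a+', which your script does not
currently use):

patentdata.flush()
patentdata.seek(0)
count = len(patentdata.readlines())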


> in_file2 = open(in_filename2, 'rU')
> text2 = in_file2.readlines()
> count = len(text2)

Well done.


-- 
Alan Gauld
Author of the Learn to Program web site
http://www.freenetpages.co.uk/hp/alan.gauld 




[Tutor] Parsing text file with Python

2007-03-23 Thread Jay Mutter III
The script I have to date is below. Thanks to your help I can see some
daylight, but I still have a few questions:

1.) Are there better ways to write this?
2.) As it writes out the one group to the new file for companies, it
is as if it leaves blank lines behind, for if I don't have the
elif len(line) > 1 the inventor's file has blank lines in it.
3.) I reopened the inventor's file to get a count of lines, but is
there a better way to do this?

Thanks



in_filename = raw_input('What is the COMPLETE name of the file you would like to process?')
in_file = open(in_filename, 'rU')
text = in_file.readlines()
count = len(text)
print "There are ", count, 'lines to process in this file'
out_filename1 = raw_input('What is the COMPLETE name of the file in which you would like to save Companies?')
companies = open(out_filename1, 'a')
out_filename2 = raw_input('What is the COMPLETE name of the file in which you would like to save Inventors?')
patentdata = open(out_filename2, 'a')
for line in text:
    if line.endswith(')\n'):
        companies.write(line)
    elif line.endswith(') \n'):
        companies.write(line)
    elif len(line) > 1:
        patentdata.write(line)
in_file.close()
companies.close()
patentdata.close()
in_filename2 = raw_input('What was the name of the inventor\'s file ?')
in_file2 = open(in_filename2, 'rU')
text2 = in_file2.readlines()
count = len(text2)
print "There are - well until we clean up more - approximately ", count, 'inventors in this file'