[EMAIL PROTECTED] writes:
> I'm a total newbie to Python so any and all advice is greatly
> appreciated.

Well, I've got some for you.

> I'm trying to use regular expressions to process text in an SGML file
> but only in one section.

This is generally a bad idea. SGML family languages aren't easy to
parse, even the ones that were designed to be easy to parse, and
generally require very complex regular expressions to get right. It may
be that your SGML data can be parsed by the re you use, but there are
almost certainly valid SGML documents that your parser will not parse
properly. In general, it's better to use a parser for the language in
question.

> So the input would look like this:
>
> <ch-part no="I"><title>RESEARCH GUIDE
> <sec-main no="1.01"><title>content
> <para>content
>
> <sec-main no="2.01"><title>content
> <para>content
>
>
> <ch-part no="II"><title>FORMS
> <sec-main no="3.01"><title>content
>
> <sec-sub1 no="1"><title>content
> <para>content
>
> <sec-sub2 no="1"><title>content
> <para>content

This is funny-looking SGML. Are the end tags really optional for all
the tag types?

> But no matter what I try I end up changing the entire file rather than
> just one part.

Others have explained why you can't do that, so I'll skip it.

> Here's what I've come up with so far but I can't think of anything
> else.
>
> ***
>
> import os, re
> setpath = raw_input("Enter the path where the program should run: ")
> print
>
> for root, dirs, files in os.walk(setpath):
>      fname = files
>      for fname in files:
>          inputFile = file(os.path.join(root,fname), 'r')
>          line = inputFile.read()
>          inputFile.close()
>
>
>          chpart_pattern = re.compile(r'<ch-part
> no=\"[A-Z]{1,4}\"><title>(RESEARCH)', re.IGNORECASE)

This makes a number of assumptions that are invalid for SGML in general
but may hold for your sample text: how attributes are quoted, the
absence of line breaks (which can be added without changing the
content), and the format of the "no" attribute.

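
To make those assumptions concrete, here is a small sketch that runs
the ch-part pattern against a few variants of the opening tag. It is
written for a current Python 3, and it assumes the mail-wrapped line in
the pattern is rejoined with a single space. Only the first variant
comes from the sample; the other three strings are made up for
illustration.

```python
import re

# The posted pattern, with the mail-wrapped line rejoined (an assumption).
chpart_pattern = re.compile(
    r'<ch-part no="[A-Z]{1,4}"><title>(RESEARCH)', re.IGNORECASE)

# Hypothetical variants of the same opening tag. Only "as posted" is
# taken from the sample input; the rest are illustrative.
variants = {
    "as posted": '<ch-part no="I"><title>RESEARCH GUIDE',
    "single quotes": "<ch-part no='I'><title>RESEARCH GUIDE",
    "line break in tag": '<ch-part\n no="I"><title>RESEARCH GUIDE',
    "numeric attribute": '<ch-part no="1"><title>RESEARCH GUIDE',
}

# Record whether the pattern finds each variant.
results = {label: bool(chpart_pattern.search(text))
           for label, text in variants.items()}

for label, matched in results.items():
    print("%-18s matched=%s" % (label, matched))
```

Only the first variant matches. The pattern silently skips the other
three, which is the failure mode to expect once the input drifts from
the sample.
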
>          while 1:
>              if chpart_pattern.search(line):
>                  line = re.sub(r"<sec-main
> no=(\"[0-9]*.[0-9]*\")><title>(.*)", r"<sec-main
> no=\1><title>\2\n<biblio>", line)

Ditto.

Here's an sgmllib solution that does what you do above, except that it
writes the result to standard out:

#!/usr/bin/env python

import sys
from sgmllib import SGMLParser

datain = """
<ch-part no="I"><title>RESEARCH GUIDE
<sec-main no="1.01"><title>content
<para>content

<sec-main no="2.01"><title>content
<para>content

<ch-part no="II"><title>FORMS
<sec-main no="3.01"><title>content

<sec-sub1 no="1"><title>content
<para>content

<sec-sub2 no="1"><title>content
<para>content
"""

class Parser(SGMLParser):
    def __init__(self):
        # install the handlers with funny names
        setattr(self, "start_ch-part", self.handle_ch_part)
        # And start with chapter 0
        self.ch_num = 0
        SGMLParser.__init__(self)

    def format_attributes(self, attributes):
        return ['%s="%s"' % pair for pair in attributes]

    def unknown_starttag(self, tag, attributes):
        # Echo any tag without a dedicated handler unchanged
        taglist = self.format_attributes(attributes)
        taglist.insert(0, tag)
        sys.stdout.write('<%s>' % ' '.join(taglist))

    def handle_data(self, data):
        sys.stdout.write(data)

    def handle_ch_part(self, attributes):
        """This should be called start_ch-part, but, well, you know."""
        self.unknown_starttag('ch-part', attributes)
        # Remember which chapter we are in
        for name, value in attributes:
            if name == 'no':
                self.ch_num = value

    def start_para(self, attributes):
        # Only chapter I gets the extra <biblio> tag
        if self.ch_num == 'I':
            sys.stdout.write('<biblio>\n')
        self.unknown_starttag('para', attributes)

parser = Parser()
parser.feed(datain)
parser.close()

sgmllib isn't a very good SGML parser. It was written to support
htmllib, and really only handles that subset of SGML well. In
particular, it doesn't really understand DTDs, so it can't handle the
missing end tags in your example. You may be able to work around that.
If you can coerce this to XML, then the xml tools in the standard
library will work well.
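
As a rough sketch of the XML route, assuming the end tags have already
been added so that the data is well-formed: the wrapping <doc> element
and the closed-up sample below are hypothetical, and the code uses
xml.etree.ElementTree from a current Python 3 standard library. It adds
an empty <biblio> element after the title of each sec-main, but only
inside the ch-part whose "no" attribute is "I".

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed version of the sample: end tags added by
# hand and everything wrapped in a made-up <doc> root element.
xml_in = """<doc>
<ch-part no="I"><title>RESEARCH GUIDE</title>
<sec-main no="1.01"><title>content</title><para>content</para></sec-main>
<sec-main no="2.01"><title>content</title><para>content</para></sec-main>
</ch-part>
<ch-part no="II"><title>FORMS</title>
<sec-main no="3.01"><title>content</title><para>content</para></sec-main>
</ch-part>
</doc>"""

root = ET.fromstring(xml_in)

for part in root.findall("ch-part"):
    # Leave every chapter except chapter I untouched.
    if part.get("no") != "I":
        continue
    for sec in part.findall("sec-main"):
        # Index 1 puts <biblio> directly after the <title> child.
        sec.insert(1, ET.Element("biblio"))

xml_out = ET.tostring(root, encoding="unicode")
print(xml_out)
```

Because the parser tracks the element nesting, restricting the change to
one chapter is a plain attribute test on the enclosing element, with no
need to guess where the chapter ends in the text.
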
For HTML, I like BeautifulSoup, but that's mostly because it deals with
all the crud on the net that is passed off as HTML. For SGML, well, I
don't have a good answer. Last time I had to deal with real SGML, I
used a C parser that spat out a parse tree that could be parsed
properly.

        <mike
-- 
Mike Meyer <[EMAIL PROTECTED]>  http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.