http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52297
comp.lang.python can be found here: http://groups-beta.google.com/group/comp.lang.python?hl=en&lr=&ie=UTF-8&c2coff=1
Kent
Michael Powe wrote:
Hello,
I'm having erratic results with a regex. I'm hoping someone can pinpoint the problem.
This function removes HTML formatting codes from a text email that is poorly exported -- it is supposed to be a text version of an HTML mailing, but it's basically just a text version of the HTML page. I'm not after anything elaborate, but it has gotten to be a bit of an itch. ;-)
def parseFile(inFile) :
    import re
    bSpace = re.compile("^ ")
    multiSpace = re.compile(r"\s\s+")
    nbsp = re.compile(r"&nbsp;")
    HTMLRegEx = re.compile(r"(<|&lt;)/?((!--.*--)|(STYLE.*STYLE)|(P|BR|b|STRONG))/?(>|&gt;) ",re.I)

    f = open(inFile,"r")
    lines = f.readlines()
    newLines = []
    for line in lines :
        line = HTMLRegEx.sub(' ',line)
        line = bSpace.sub('',line)
        line = nbsp.sub(' ',line)
        line = multiSpace.sub(' ',line)
        newLines.append(line)
    f.close()
    return newLines
Now, the main issue I'm looking at is with the multiSpace regex. When applied, this removes some blank lines but not others. I don't want it to remove any blank lines, just contiguous multiple spaces in a line.
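One likely cause of the vanishing blank lines: \s also matches the newline that readlines() leaves at the end of each line, so \s\s+ can swallow a line ending whenever it directly follows another whitespace character. Below is a minimal sketch of one possible fix, assuming only spaces and tabs should be collapsed. The names multi_space and collapse_spaces are hypothetical and not part of the original post.

```python
import re

# The pattern from the post: \s matches "\n" too, so a space followed
# by the line ending counts as "two or more whitespace characters".
original_pattern = re.compile(r"\s\s+")

# Hypothetical replacement: only spaces and tabs, never newlines.
multi_space = re.compile(r"[ \t]{2,}")


def collapse_spaces(line):
    """Collapse runs of spaces/tabs to one space; keep the line ending."""
    return multi_space.sub(" ", line)


sample = ["first   line  \n", " \n", "\n", "second\t\tline\n"]

# With the original pattern the newline is lost on some lines only,
# which would explain the erratic behaviour.
lossy = [original_pattern.sub(" ", line) for line in sample]
# lossy == ["first line ", " ", "\n", "second line\n"]

cleaned = [collapse_spaces(line) for line in sample]
# cleaned == ["first line \n", " \n", "\n", "second line\n"]
```

A line consisting of a single "\n" survives either way because it holds only one whitespace character. That may be why some blank lines are kept and others are not.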
BTB, this also illustrates a difference between Python and Perl -- in Perl, I can change "line" and it automatically changes the entry in the array; this doesn't work in Python. A bit annoying, actually. ;-)
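For reference, the Python behaviour described here follows from the loop variable being just a name bound to each string in turn. Strings are immutable, so assigning to that name does not touch the list. Here is a short sketch of two common ways to get the updated values back into a list. The sample data and variable names are illustrative only.

```python
original = ["  a  \n", "b \n"]

# Rebinding the loop variable changes only the local name.
rebound = list(original)
for line in rebound:
    line = line.strip()
# rebound still equals original

# Option 1: assign back into the list by index.
updated = list(original)
for i, line in enumerate(updated):
    updated[i] = line.strip()
# updated == ["a", "b"]

# Option 2: build a new list with a list comprehension.
rebuilt = [line.strip() for line in original]
# rebuilt == ["a", "b"]
```

The newLines.append(line) approach in the function above is a third variant of the same idea: collect the changed strings in a separate list.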
Thanks for any help. If there's a better way to do this, I'm open to suggestions in that regard, too.
mp
_______________________________________________
Tutor maillist - Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor