If you search comp.lang.python for 'convert html text', the top four results all have solutions for this problem, including a reference to this cookbook recipe:
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52297


comp.lang.python can be found here:
http://groups-beta.google.com/group/comp.lang.python?hl=en&lr=&ie=UTF-8&c2coff=1
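
For comparison with the regex approach quoted below, here is a minimal sketch of the parser-based route. It is not the code from the cookbook recipe; it assumes Python 3's standard html.parser module and real "<...>" tags in the input (not "&lt;...&gt;" escapes). The names TextExtractor and html_to_text are made up for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes; drop tags, comments, and <style>/<script> bodies."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip = 0  # greater than 0 while inside <style> or <script>

    def handle_starttag(self, tag, attrs):
        if tag in ("style", "script"):
            self._skip += 1
        elif tag in ("p", "br"):
            # Keep paragraph and line breaks as newlines.
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in ("style", "script") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def html_to_text(markup):
    """Return the text content of markup (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    # &nbsp; is decoded to U+00A0; turn it into an ordinary space.
    return "".join(parser.parts).replace("\xa0", " ")
```

Calling html_to_text() on the whole file contents, rather than line by line, lets the parser cope with tags and comments that span several lines.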

Kent


Michael Powe wrote:
Hello,

I'm having erratic results with a regex.  I'm hoping someone can
pinpoint the problem.

This function removes HTML formatting codes from a text email that is
poorly exported -- it is supposed to be a text version of an HTML
mailing, but it's basically just a text version of the HTML page.  I'm
not after anything elaborate, but it has gotten to be a bit of an
itch.  ;-)

def parseFile(inFile) :
    import re
    bSpace = re.compile("^ ")
    multiSpace = re.compile(r"\s\s+")
    nbsp = re.compile(r"&nbsp;")
    HTMLRegEx = re.compile(
        r"(&lt;|<)/?((!--.*--)|(STYLE.*STYLE)|(P|BR|b|STRONG))/?(&gt;|>)", re.I)

    f = open(inFile,"r")
    lines = f.readlines()
    newLines = []
    for line in lines :
        line = HTMLRegEx.sub(' ',line)
        line = bSpace.sub('',line)
        line = nbsp.sub(' ',line)
        line = multiSpace.sub(' ',line)
        newLines.append(line)
    f.close()
    return newLines

Now, the main issue I'm looking at is with the multiSpace regex.  When
applied, this removes some blank lines but not others.  I don't want
it to remove any blank lines, just contiguous multiple spaces in a
line.
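
A likely cause, sketched under the assumption that the lines still carry their trailing "\n" (which readlines() keeps): \s matches newlines as well as spaces, so a "blank" line that contains a space or two is swallowed whole, while a truly empty line survives. The spaces_only pattern is a suggested replacement, not something from the original function.

```python
import re

# \s matches "\n" too, so whitespace followed by the newline is one match.
greedy = re.compile(r"\s\s+")

# Suggested alternative: collapse only runs of spaces and tabs.
spaces_only = re.compile(r"[ \t]{2,}")

sample = ["first  line\n", "  \n", "\n", "last   line\n"]

# The second entry loses its newline, so it merges with the next line on output.
greedy_result = [greedy.sub(" ", s) for s in sample]

# Every entry keeps its newline; only the inner runs of spaces shrink.
fixed_result = [spaces_only.sub(" ", s) for s in sample]

print(greedy_result)
print(fixed_result)
```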

BTB, this also illustrates a difference between Python and Perl -- in
Perl, I can change "line" and it automatically changes the entry in
the array; this doesn't work in Python.  A bit annoying, actually.
;-)
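
For what it's worth, the reason is that "line = ..." inside the loop only rebinds the name "line" to a new string; the list still holds the old one. A small sketch of two common ways around it (the sample data and variable names are made up for illustration):

```python
lines = ["a  b\n", "c   d\n"]

# Rebinding the loop variable does not touch the list.
for line in lines:
    line = line.replace("  ", " ")
unchanged = list(lines)

# Option 1: assign back through the index.
by_index = list(lines)
for i, line in enumerate(by_index):
    by_index[i] = line.upper()

# Option 2: build a new list with a list comprehension.
by_comprehension = [line.upper() for line in lines]
```

Appending to a separate newLines list, as parseFile() does, is a third way and works fine.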

Thanks for any help.  If there's a better way to do this, I'm open to
suggestions in that regard, too.

mp
_______________________________________________
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
