Tim Chase wrote:
>> I have some marked up text and would like to convert it to plain text,
>> by simply removing all the tags. Of course I can do it from first
>> principles but I felt that among all Python's markup tools there must
>> be something that would do this simply, without having to create an
>> XML parser etc.
>>
>> I've looked around a bit but failed to find anything, any tips?
>>
>> (e.g. convert "<B>Today</B> is <U>Friday</U>" to "Today is Friday")
>
> Well, if all you want to do is remove everything from a "<" to a
> ">", you can use
>
> >>> s = "<B>Today</B> is <U>Friday</U>"
> >>> import re
> >>> r = re.compile('<[^>]*>')
> >>> print r.sub('', s)
> Today is Friday
>
> it should even work for semi-pathological cases such as
>
> s = """You can find my <a
> href='http://example.com'>thesis</a
> > online"""
>
> where the tag contents are split across lines. There are more
> pathological cases where tags aren't well-formed, e.g.
>
> s ="This <tag>has a > sign in it and <odd<ly>-nested> tags"
>
> in which case you get what you deserve for making such
> pathological conditions ;-)
>
The real answer to this question is "learn how to use Beautiful Soup" --
see http://www.crummy.com/software/BeautifulSoup/
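
To make the difference concrete, here is a minimal sketch (Python 3, standard library only) that puts the regex from the quoted message next to a parser-based version. Note the assumptions: this is not Beautiful Soup itself; the stdlib html.parser module merely stands in for "a real parser", and the names strip_tags_regex, strip_tags_parser and TextCollector are made up for illustration.

```python
import re
from html.parser import HTMLParser

# The pattern from the quoted message: everything from "<" to the next ">".
TAG_RE = re.compile(r'<[^>]*>')


def strip_tags_regex(text):
    """Remove anything that looks like <...>, with no understanding of markup."""
    return TAG_RE.sub('', text)


class TextCollector(HTMLParser):
    """Hypothetical helper: keeps only the text found between tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called by the parser for each run of text outside a tag.
        self.chunks.append(data)


def strip_tags_parser(text):
    """Let a parser decide what is a tag and return only the text."""
    collector = TextCollector()
    collector.feed(text)
    collector.close()  # flush any text still buffered by the parser
    return ''.join(collector.chunks)


simple = "<B>Today</B> is <U>Friday</U>"
print(strip_tags_regex(simple))
print(strip_tags_parser(simple))

# The malformed example from the quoted message: the regex quietly
# leaves stray ">" characters and "-nested" behind.
odd = "This <tag>has a > sign in it and <odd<ly>-nested> tags"
print(strip_tags_regex(odd))
```

Both functions give the same result on the simple input. On the malformed string the regex output still contains leftover markup, which is the kind of case where a tolerant parser such as Beautiful Soup is likely to serve you better.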

regards
 Steve
--
Steve Holden       +1 571 484 6266   +1 800 494 3119
Holden Web LLC     http://www.holdenweb.com/