Hi Sunil,

Don't use regular expressions for this task. Use something that knows about HTML structure. As others have noted, the Beautiful Soup or lxml libraries are probably a much better choice here.

There are good reasons to avoid regular expressions for the task you're trying to do. For example, your regular expression:

    "<span style=\"(.*)\"

does not respect the string boundaries of attributes. You may think that ".*" matches just the content within a string attribute, but this is not true. Consider the following example:

######################################################
>>> import re
>>> m = re.match("'(.*)'", "'quoted' text, but note how it's greedy!")
>>> m.group(1)
"quoted' text, but note how it"
######################################################

Note how the match doesn't limit itself to "quoted", but goes as far as it can: ".*" is greedy, so it runs all the way to the last quote character in the string (here, the apostrophe in "it's").

This shows at least one of the problems that you're going to run into. Fixing this so it doesn't grab so much is doable, of course. But there are other issues, all of which are little headaches upon headaches (e.g. attribute values may be single or double quoted, may use HTML entity references, etc.).

So don't try to parse HTML by hand. Let a library do it for you. For example, with Beautiful Soup:

    http://www.crummy.com/software/BeautifulSoup/bs4/doc/

the code should be as straightforward as:

###########################
from bs4 import BeautifulSoup

# stmt is assumed to be the string holding your HTML
soup = BeautifulSoup(stmt, 'html.parser')
for span in soup.find_all('span'):
    print(span.get('style'))
###########################

where you deal with the _structure_ of your document, rather than with the low-level individual characters of that document.
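
To make the "headaches upon headaches" point a bit more concrete, here is a small sketch that uses only the standard library. The sample markup and the StyleCollector class are made up for illustration and are not from your code. It uses the stdlib html.parser module as a stand-in for Beautiful Soup or lxml, just to show what a structure-aware parser gives you. The non-greedy ".*?" fixes the over-matching, but the regex still misses the single-quoted attribute, while the parser handles both quoting styles and decodes the entity references:

```python
import re
from html.parser import HTMLParser

# Hypothetical sample markup: one double-quoted and one single-quoted
# style attribute, the second containing HTML entity references.
SAMPLE = (
    '<span style="color: red" class="a">one</span>'
    "<span style='font-family: &quot;Arial&quot;'>two</span>"
)

# Greedy: ".*" runs on to the LAST double quote it can find.
greedy = re.findall(r'<span style="(.*)"', SAMPLE)

# Non-greedy: ".*?" stops at the first closing double quote,
# but the pattern still never matches the single-quoted attribute.
lazy = re.findall(r'<span style="(.*?)"', SAMPLE)


class StyleCollector(HTMLParser):
    """Collect the style attribute of every <span> start tag."""

    def __init__(self):
        super().__init__()
        self.styles = []

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            # attrs is a list of (name, value) pairs with the quotes
            # stripped and entity references already decoded.
            self.styles.append(dict(attrs).get("style"))


collector = StyleCollector()
collector.feed(SAMPLE)
collector.close()

print(greedy)            # ['color: red" class="a']
print(lazy)              # ['color: red']
print(collector.styles)  # ['color: red', 'font-family: "Arial"']
```

You could keep patching the regular expression to cover each of these cases, but every patch is another chance to get it wrong, which is why a parser is the less painful route.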