Here's a construct with which BeautifulSoup has problems. It's from "http://support.microsoft.com/contactussupport/?ws=support".
This is the original:

    <a href="http://www.microsoft.com/usability/enroll.mspx" id="L_75998" title="<!--http://www.microsoft.com/usability/information.mspx->" onclick="return MS_HandleClick(this,'C_32179', true);">
    Help us improve our products
    </a>

And this is what comes back after parsing with BeautifulSoup and using "prettify":

    <a href="http://www.microsoft.com/usability/enroll.mspx" id="L_75998" title="<!--http://www.microsoft.com/usability/information.mspx->">
     <br clear="all" style="line-height: 1px; overflow: hidden" />
     <table id="msviFooter" width="100%" cellpadding="0" cellspacing="0">
      <tr valign="bottom">
       <td id="msviFooter2" style="filter:progid:DXImageTransform.Microsoft.Gradient(startColorStr='#FFFFFF', endColorStr='#3F8CDA', gradientType='1')">
        <div id="msviLocalFooter">
         <nobr>
         </nobr>
        </div>
       </td>
      </tr>
     </table>
    </a>

All that other stuff is in the neighborhood, but not in that <a> tag.

Strictly speaking, it's Microsoft's fault.

    title="<!--http://www.microsoft.com/usability/information.mspx->"

is supposed to be an HTML comment, but it's improperly terminated. It ends with "->" when it should end with "-->". So everything up to the next "-->" is taken as part of the comment, and the stuff that ended up inside the <a> tag is what follows that terminator. It's so Microsoft.

Unfortunately, even Firefox accepts bad comments like that.

Anyway, a BeautifulSoup question. "findAll(text=True)" collects comments, processing instructions, etc. as well as real text. What's the right way to collect ordinary text only?

John Nagle
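For comparison, here is a minimal sketch of what "ordinary text only" can mean. It does not use BeautifulSoup at all: it swaps in the standard library's html.parser (Python 3), which reports text, comments, processing instructions, and declarations through separate callbacks, so keeping only the text callback drops the rest. The names TextOnly and ordinary_text and the sample markup are made up for illustration. In BeautifulSoup itself the analogous move would presumably be to filter the strings that come back by their class, but that is an assumption and is not shown here.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect text nodes only (hypothetical helper, not a BeautifulSoup API)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for ordinary text between tags. Note that the contents of
        # <script> and <style> elements also arrive here.
        text = data.strip()
        if text:
            self.chunks.append(text)

    # handle_comment, handle_pi and handle_decl are left at their defaults,
    # which do nothing, so comments, processing instructions and
    # declarations are silently skipped.


def ordinary_text(html):
    """Return the non-blank text chunks of an HTML string, in document order."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return parser.chunks


if __name__ == "__main__":
    # Made-up sample containing a declaration, a comment and a
    # processing instruction alongside real text.
    sample = (
        "<!DOCTYPE html><p>Help us <!-- hidden note -->improve</p>"
        "<?php echo 1 ?><a href=\"x\">our products</a>"
    )
    print(ordinary_text(sample))
```

This only shows the text/comment distinction on well-formed input. It says nothing about how either parser recovers from a badly terminated comment like the one above.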