On Apr 2, 9:28 pm, [EMAIL PROTECTED] (Cameron Laird) wrote:
> In article <[EMAIL PROTECTED]>,
> anjesh <[EMAIL PROTECTED]> wrote:
> >On Apr 2, 12:54 am, "Dotan Cohen" <[EMAIL PROTECTED]> wrote:
> >> On 1 Apr 2007 07:56:04 -0700, Ulysse <[EMAIL PROTECTED]> wrote:
> >
> >> > I have seen the Beautiful Soup online help and tried to apply that to
> >> > my problem. But it seems to be a little bit hard. I would rather try
> >> > to do this with regular expressions...
> >
> >> If you think that Beautiful Soup is difficult, then wait till you try
> >> to do this with regexes. Granted, knowing the exact format of the HTML
> >> you are scraping will help, but if you ever need to parse HTML from an
> >> unknown source, then Beautiful Soup is the only way to go. Not all HTML
> >> authors close their td and tr tags, and sometimes there are attributes
> >> on those tags. If you plan on ever reusing the code, or the format of
> >> the HTML may change, then you are best off sticking with Beautiful
> >> Soup.
> >
> >> Dotan Cohen
> >
> >> http://lyricslist.com/
> >> http://what-is-what.com/
> >
> >Have you tried HTMLParser? It can do the task you want to perform:
> >http://docs.python.org/lib/module-HTMLParser.html
> >
> >-anjesh
>
> Yes, except that these last two follow-ups UNDERstate the difficulty--in
> fact, the impossibility--of achieving adequate results on this problem
> with regular expressions. We'll help with the documentation for HTMLParser
> and BeautifulSoup. REs are an invitation to madness.
>
> <URL:http://www.unixreview.com/documents/s=10121/ur0702e/> might amuse
> those who want to think more about REs.

r'(\d{2}\.\d{2}\.\d{4} - \d{2}:\d{2}:\d{2})</td>\W*?<td class="tdn">\W*?<a href="(.*?)">(.*?)</a>.*?</td>'

r'(\d{2}\.\d{2}\.\d{4} - \d{2}:\d{2}:\d{2}).*?player\.php.*?>(.*?)</a>.*?<textarea.*?>(.*?)</textarea>'

r'(\d{2}\.\d{2}\.\d{4} - \d{2}:\d{2}:\d{2})</td>\W*?<td class="tdn">\W*?Message au clan de :([a-zA-Z0-9_\-]+?)\W*<br>(.*?)</th>'

These three REs extract all the data I need. They do not apply exactly
to the given string.

I read the article, but I didn't understand why REs are an invitation to
madness...
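
To make the trade-off discussed above concrete, here is a minimal sketch
comparing the first pattern with the standard-library parser (the
HTMLParser module linked above is named html.parser in Python 3). The
sample rows, the names LinkCellParser and extract_links, and the extra
attributes are all assumptions for illustration; the real page is not
shown in this thread. The parser part only covers the link inside the
"tdn" cell, not the timestamp.

```python
import re
from html.parser import HTMLParser

# First pattern from the post, written without the line-wrap spaces.
ROW_RE = re.compile(
    r'(\d{2}\.\d{2}\.\d{4} - \d{2}:\d{2}:\d{2})</td>\W*?<td class="tdn">'
    r'\W*?<a href="(.*?)">(.*?)</a>.*?</td>'
)

# Hypothetical sample row; an assumption, not the real page.
SAMPLE = ('<tr><td>01.04.2007 - 07:56:04</td>\n'
          '<td class="tdn">\n<a href="player.php?id=1">Ulysse</a></td></tr>')

# Same data, but with one extra attribute on the <td> and on the <a>.
VARIANT = ('<tr><td>01.04.2007 - 07:56:04</td>\n'
           '<td align="left" class="tdn">\n'
           '<a title="x" href="player.php?id=1">Ulysse</a></td></tr>')


class LinkCellParser(HTMLParser):
    """Collect (href, text) pairs for links inside <td class="tdn"> cells."""

    def __init__(self):
        super().__init__()
        self.in_cell = False   # True while inside a <td class="tdn">
        self.href = None       # href of the link currently being read
        self.text = []         # text fragments of the current link
        self.links = []        # collected (href, text) results

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "td":
            # A new <td> also ends the previous cell if it was never closed.
            self.in_cell = attrs.get("class") == "tdn"
        elif tag == "a" and self.in_cell:
            self.href = attrs.get("href")
            self.text = []

    def handle_data(self, data):
        if self.href is not None:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.href is not None:
            self.links.append((self.href, "".join(self.text).strip()))
            self.href = None
        elif tag == "td":
            self.in_cell = False


def extract_links(html):
    """Return the (href, text) pairs found in "tdn" cells of html."""
    parser = LinkCellParser()
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    print(ROW_RE.search(SAMPLE))    # a match object
    print(ROW_RE.search(VARIANT))   # None: the extra attribute breaks the RE
    print(extract_links(SAMPLE))
    print(extract_links(VARIANT))
```

The RE is tied to the exact attribute layout of the tags, so it matches
the first row and silently finds nothing in the second, while the parser
looks attributes up by name and returns the same link for both. An RE can
be loosened to cope with each such variation, but every fix makes the
pattern longer and harder to read, which is arguably the "madness" the
earlier posters had in mind.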