Re: Stripping scripts from HTML with regular expressions

2008-04-11 Thread Stefan Behnel
Michel Bouwmans wrote:
> I don't think HTMLParser was doing anything wrong here. I needed to parse an
> HTML document, but it contained script-blocks with document.write's in
> them. I only care about the content outside these blocks, but HTMLParser will
> choke on such a block when it isn't encapsulat
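For readers on a current interpreter, here is a minimal sketch of collecting only the text outside script blocks with the standard library parser. It assumes Python 3's html.parser, which hands the body of a script element to handle_data as raw text; the 2008 HTMLParser discussed in this thread may have behaved differently. The class name TextOutsideScripts and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class TextOutsideScripts(HTMLParser):
    """Collect text content, skipping anything inside <script> elements."""

    def __init__(self):
        super().__init__()
        self.in_script = False   # True while we are between <script> and </script>
        self.chunks = []         # text pieces found outside script blocks

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Script bodies arrive here too, so filter them out explicitly.
        if not self.in_script:
            self.chunks.append(data)


# Hypothetical sample: a script block that uses document.write with markup in it.
sample = (
    '<p>before</p>'
    '<script>document.write("<b>ignored</b>");</script>'
    '<p>after</p>'
)

parser = TextOutsideScripts()
parser.feed(sample)
parser.close()
text = "".join(parser.chunks)
print(text)
```

The flag-based approach keeps the parser doing the tokenizing, so markup inside document.write strings is never mistaken for real tags.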

RE: Stripping scripts from HTML with regular expressions

2008-04-10 Thread Michel Bouwmans
>> Thanks! That did the trick. :) I was trying to use HTMLParser but that
>> choked on the script-blocks that didn't contain comment-indicators.
>> Guess I can now move on

RE: Stripping scripts from HTML with regular expressions

2008-04-10 Thread Reedick, Andrew
> -Original Message-
> From: [EMAIL PROTECTED] [mailto:python-
> [EMAIL PROTECTED] On Behalf Of Michel Bouwmans
> Sent: Wednesday, April 09, 2008 5:44 PM
> To: python-list@python.org
> Subject: RE: Stripping scripts from HTML with regular expressions

Re: Stripping scripts from HTML with regular expressions

2008-04-10 Thread Paul McGuire
On Apr 9, 2:38 pm, Michel Bouwmans <[EMAIL PROTECTED]> wrote:
> Hey everyone,
>
> I'm trying to strip all script-blocks from a HTML-file using regex.
>
> I tried the following in Python:
>
> testfile = open('testfile')
> testhtml = testfile.read()
> regex = re.compile(']*>(.*?)', re.DOTALL)
> resul

Re: Stripping scripts from HTML with regular expressions

2008-04-10 Thread Nikita the Spider
> > Hey everyone,
> >
> > I'm trying to strip all script-blocks from a HTML-file using regex.

[Insert obligatory comment abo

RE: Stripping scripts from HTML with regular expressions

2008-04-09 Thread Michel Bouwmans
Reedick, Andrew wrote:
>
>> -Original Message-
>> From: [EMAIL PROTECTED] [mailto:python-
>> [EMAIL PROTECTED] On Behalf Of Michel Bouwmans
>> Sent: Wednesday, April 09, 2008 3:38 PM
>> To: python-list@python.org
>> Subject: Stripping

RE: Stripping scripts from HTML with regular expressions

2008-04-09 Thread Reedick, Andrew
> -Original Message-
> From: [EMAIL PROTECTED] [mailto:python-
> [EMAIL PROTECTED] On Behalf Of Michel Bouwmans
> Sent: Wednesday, April 09, 2008 3:38 PM
> To: python-list@python.org
> Subject: Stripping scripts from HTML with regular expressions
>
> Hey ever

Re: Stripping scripts from HTML with regular expressions

2008-04-09 Thread Stefan Behnel
Michel Bouwmans wrote:
> I'm trying to strip all script-blocks from a HTML-file using regex.

You might want to take a look at lxml.html instead, which comes with an HTML cleaner module:

http://codespeak.net/lxml/lxmlhtml.html#cleaning-up-html

Stefan

RE: Stripping scripts from HTML with regular expressions

2008-04-09 Thread Reedick, Andrew
> -Original Message-
> From: [EMAIL PROTECTED] [mailto:python-
> [EMAIL PROTECTED] On Behalf Of Michel Bouwmans
> Sent: Wednesday, April 09, 2008 3:38 PM
> To: python-list@python.org
> Subject: Stripping scripts from HTML with regular expressions
>
> Hey ever

Stripping scripts from HTML with regular expressions

2008-04-09 Thread Michel Bouwmans
Hey everyone,

I'm trying to strip all script-blocks from an HTML file using regex.

I tried the following in Python:

testfile = open('testfile')
testhtml = testfile.read()
regex = re.compile(']*>(.*?)', re.DOTALL)
result = regex.sub('', blaat)
print result

This strips far more away than just the
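For illustration, a minimal sketch of the regex approach in Python 3. The pattern in the archived post appears to have lost its tag text, so the one below is an assumption and not a reconstruction of what was posted. Two things may be worth checking against the snippet above: it reads the file into testhtml but passes blaat to sub(), and a pattern containing \b needs a raw string, because in a plain string literal \b is a backspace character and not a word boundary. The names SCRIPT_RE and strip_scripts and the sample markup are hypothetical.

```python
import re

# Assumed pattern (not taken from the thread): an opening <script ...> tag,
# the shortest possible body, then the closing </script> tag.
# DOTALL lets "." span newlines; IGNORECASE also catches <SCRIPT>.
SCRIPT_RE = re.compile(r'<script\b[^>]*>.*?</script\s*>', re.DOTALL | re.IGNORECASE)


def strip_scripts(html):
    """Return html with every <script>...</script> block removed."""
    return SCRIPT_RE.sub('', html)


# Hypothetical input with two script blocks, one using document.write.
sample = (
    '<html><head><script type="text/javascript">'
    'document.write("<p>hi</p>");'
    '</script></head>'
    '<body><p>keep me</p><SCRIPT>var x = 1;</SCRIPT></body></html>'
)

cleaned = strip_scripts(sample)
print(cleaned)
```

A regex like this can still be fooled, for example by a script body that itself contains the text of a closing script tag inside a string, which is one reason replies in this thread point toward a real HTML parser.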