Re: BeautifulSoup vs. loose chars

2006-12-26 Thread placid
John Nagle wrote: I've been parsing existing HTML with BeautifulSoup, and occasionally hit content which has something like Design & Advertising, that is, an "&" instead of an "&amp;". Is there some way I can get BeautifulSoup to clean those up? There are various parsing options related to

Re: BeautifulSoup vs. loose chars

2006-12-26 Thread Felipe Almeida Lessa
On 26 Dec 2006 04:22:38 -0800, placid wrote: So do you want to remove "&" or replace them with "&amp;"? If you want to replace it try the following; I think he wants to replace them, but just the invalid ones. I.e., This & this &amp; that would become This &amp; this &amp; that
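A minimal sketch of the "replace only the invalid ones" idea, using nothing but the standard library. The names `escape_loose_ampersands`, `KNOWN_NAMES` and `AMP_RE` are made up for illustration and are not part of BeautifulSoup; the assumption is that the text is cleaned before it is handed to the parser. An "&" is left alone when it starts a numeric reference or a named entity from Python's HTML entity table (plus apos, which that table lacks), and is escaped otherwise.

```python
import re
from html.entities import name2codepoint

# Entity names treated as valid. "apos" is added by hand (an assumption of
# this sketch): it is an XML entity that Python's HTML 4 table does not list.
KNOWN_NAMES = set(name2codepoint) | {"apos"}

# An ampersand, optionally followed by something shaped like an entity body.
AMP_RE = re.compile(r"&(#[0-9]+;|#[xX][0-9a-fA-F]+;|[A-Za-z][A-Za-z0-9]*;)?")


def escape_loose_ampersands(text):
    """Replace every '&' that does not start a valid entity with '&amp;'."""

    def fix(match):
        body = match.group(1)
        if body is None:
            # Bare ampersand, e.g. "Design & Advertising".
            return "&amp;"
        if body.startswith("#") or body[:-1] in KNOWN_NAMES:
            # Numeric reference or known named entity: keep unchanged.
            return match.group(0)
        # Looks like an entity but the name is unknown: escape the ampersand.
        return "&amp;" + body

    return AMP_RE.sub(fix, text)


if __name__ == "__main__":
    print(escape_loose_ampersands("This & this &amp; that"))
```

Running the function twice gives the same result as running it once, since every "&amp;" it produces is itself a known entity.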

Re: BeautifulSoup vs. loose chars

2006-12-26 Thread Duncan Booth
Felipe Almeida Lessa wrote: On 26 Dec 2006 04:22:38 -0800, placid wrote: So do you want to remove "&" or replace them with "&amp;"? If you want to replace it try the following; I think he wants to replace them, but just the invalid ones. I.e., This & this

Re: BeautifulSoup vs. loose chars

2006-12-26 Thread John Nagle
Felipe Almeida Lessa wrote: On 26 Dec 2006 04:22:38 -0800, placid wrote: So do you want to remove "&" or replace them with "&amp;"? If you want to replace it try the following; I think he wants to replace them, but just the invalid ones. I.e., This & this &amp; that

Re: SPAM-LOW: Re: BeautifulSoup vs. loose chars

2006-12-26 Thread Andreas Lysdal
Duncan Booth wrote: Felipe Almeida Lessa wrote: On 26 Dec 2006 04:22:38 -0800, placid wrote: So do you want to remove "&" or replace them with "&amp;"? If you want to replace it try the following; I think he wants to replace them, but just

Re: SPAM-LOW: Re: BeautifulSoup vs. loose chars

2006-12-26 Thread Duncan Booth
Andreas Lysdal wrote: P.S. &apos; is handled specially as it isn't technically a valid HTML entity (and Python doesn't include it in its entity list), but it is an XML entity and recognised by many browsers, so some people might use it in HTML. Hey, I found this site:

Re: BeautifulSoup vs. loose chars

2006-12-26 Thread Frederic Rentsch
John Nagle wrote: Felipe Almeida Lessa wrote: On 26 Dec 2006 04:22:38 -0800, placid wrote: So do you want to remove "&" or replace them with "&amp;"? If you want to replace it try the following; I think he wants to replace them, but just the invalid ones. I.e.,

BeautifulSoup vs. loose chars

2006-12-25 Thread John Nagle
I've been parsing existing HTML with BeautifulSoup, and occasionally hit content which has something like Design & Advertising, that is, an "&" instead of an "&amp;". Is there some way I can get BeautifulSoup to clean those up? There are various parsing options related to "&" handling, but none of them