Re: Mutating an HTML file with BeautifulSoup

2022-08-23 Thread Peter J. Holzer
On 2022-08-22 19:27:28 -, Jon Ribbens via Python-list wrote: > On 2022-08-22, Peter J. Holzer wrote: > > On 2022-08-22 00:45:56 -, Jon Ribbens via Python-list wrote: > >> With the offset though, BeautifulSoup made an arbitrary decision to > >> use ISO-8859-1 encoding and so when you

Re: Mutating an HTML file with BeautifulSoup

2022-08-22 Thread Jon Ribbens via Python-list
On 2022-08-22, Peter J. Holzer wrote: > On 2022-08-22 00:45:56 -, Jon Ribbens via Python-list wrote: >> With the offset though, BeautifulSoup made an arbitrary decision to >> use ISO-8859-1 encoding and so when you chopped the bytestring at >> that offset it only worked because BeautifulSoup

Re: Mutating an HTML file with BeautifulSoup

2022-08-22 Thread Peter J. Holzer
On 2022-08-22 00:45:56 -, Jon Ribbens via Python-list wrote: > With the offset though, BeautifulSoup made an arbitrary decision to > use ISO-8859-1 encoding and so when you chopped the bytestring at > that offset it only worked because BeautifulSoup had happened to > choose a
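One way to read the point about offsets: a character offset into decoded text only equals a byte offset when every character is a single byte, as in ISO-8859-1. The sketch below illustrates this; the sample markup and all variable names are made up for illustration and do not come from the thread.

```python
# Hypothetical sample: 'é' is one byte in ISO-8859-1 but two bytes in UTF-8.
text = "<p>café</p><a href='x'>"
char_offset = text.index("<a")  # offset counted in characters, not bytes

latin1_bytes = text.encode("iso-8859-1")
utf8_bytes = text.encode("utf-8")

# Slicing the byte strings at the character offset:
latin1_tail = latin1_bytes[char_offset:]  # lands exactly on the <a tag
utf8_tail = utf8_bytes[char_offset:]      # lands one byte too early

print(latin1_tail)
print(utf8_tail)
```

With a single-byte encoding the slice starts at the tag; with UTF-8 it is off by one byte for every earlier multi-byte character.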

Re: Mutating an HTML file with BeautifulSoup

2022-08-22 Thread Peter J. Holzer
On 2022-08-22 00:09:01 -, Jon Ribbens via Python-list wrote: > On 2022-08-21, Peter J. Holzer wrote: > > On 2022-08-20 21:51:41 -, Jon Ribbens via Python-list wrote: > >> result = re.sub( > >> r"""(<\s*a\s+[^>]*href\s*=\s*)(['"])\s*OLD\s*\2""", > > > > This will fail on: > >
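As a rough illustration of what the quoted pattern does, the sketch below applies it to a few inputs. The replacement string, the sample inputs, and the variable names are assumptions added here, since the preview cuts off before the rest of the call.

```python
import re

# Pattern as quoted in the thread: group 1 is everything up to the
# attribute value, group 2 is the opening quote, and \2 requires the
# same quote character to close the value.
pattern = re.compile(r"""(<\s*a\s+[^>]*href\s*=\s*)(['"])\s*OLD\s*\2""")

# Assumed replacement: keep the prefix and the original quote style.
replacement = r"\1\2NEW\2"

# Hypothetical inputs, not taken from the thread.
single = "<a class='x' href='OLD'>text</a>"
double = '<a id="y" href="OLD">text</a>'
mismatched = """<a href="OLD'>text</a>"""

new_single = pattern.sub(replacement, single)
new_double = pattern.sub(replacement, double)
new_mismatched = pattern.sub(replacement, mismatched)  # no match, unchanged

print(new_single)
print(new_double)
print(new_mismatched)
```

The backreference keeps whichever quote style the source used, so the rest of the tag is left byte-for-byte as it was.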

Re: Mutating an HTML file with BeautifulSoup

2022-08-22 Thread Jon Ribbens via Python-list
On 2022-08-21, Chris Angelico wrote: > On Mon, 22 Aug 2022 at 05:43, Jon Ribbens via Python-list > wrote: >> On 2022-08-21, Chris Angelico wrote: >> > On Sun, 21 Aug 2022 at 09:31, Jon Ribbens via Python-list >> > wrote: >> >> On 2022-08-20, Chris Angelico wrote: >> >> > On Sun, 21 Aug 2022 at

Re: Mutating an HTML file with BeautifulSoup

2022-08-22 Thread Jon Ribbens via Python-list
On 2022-08-21, Peter J. Holzer wrote: > On 2022-08-20 21:51:41 -, Jon Ribbens via Python-list wrote: >> On 2022-08-20, Stefan Ram wrote: >> > Jon Ribbens writes: >> >>... or you could avoid all that faff and just do re.sub()? > >> > source = '' >> > >> > # Use Python to change the source,

Re: Mutating an HTML file with BeautifulSoup

2022-08-22 Thread Peter Otten
On 22/08/2022 05:30, Chris Angelico wrote: On Mon, 22 Aug 2022 at 10:04, Buck Evan wrote: I've had much success doing round trips through the lxml.html parser. https://lxml.de/lxmlhtml.html I ditched bs for lxml long ago and never regretted it. If you find that you have a bunch of invalid

Re: Mutating an HTML file with BeautifulSoup

2022-08-21 Thread Chris Angelico
On Mon, 22 Aug 2022 at 10:04, Buck Evan wrote: > > I've had much success doing round trips through the lxml.html parser. > > https://lxml.de/lxmlhtml.html > > I ditched bs for lxml long ago and never regretted it. > > If you find that you have a bunch of invalid html that lxml inadvertently >

Re: Mutating an HTML file with BeautifulSoup

2022-08-21 Thread Buck Evan
I've had much success doing round trips through the lxml.html parser. https://lxml.de/lxmlhtml.html I ditched bs for lxml long ago and never regretted it. If you find that you have a bunch of invalid html that lxml inadvertently "fixes", I would recommend adding a stutter-step to your project:

Re: Mutating an HTML file with BeautifulSoup

2022-08-21 Thread Chris Angelico
On Mon, 22 Aug 2022 at 05:43, Jon Ribbens via Python-list wrote: > > On 2022-08-21, Chris Angelico wrote: > > On Sun, 21 Aug 2022 at 09:31, Jon Ribbens via Python-list > > wrote: > >> On 2022-08-20, Chris Angelico wrote: > >> > On Sun, 21 Aug 2022 at 03:27, Stefan Ram wrote: > >> >>

Re: Mutating an HTML file with BeautifulSoup

2022-08-21 Thread Jon Ribbens via Python-list
On 2022-08-21, Chris Angelico wrote: > On Sun, 21 Aug 2022 at 09:31, Jon Ribbens via Python-list > wrote: >> On 2022-08-20, Chris Angelico wrote: >> > On Sun, 21 Aug 2022 at 03:27, Stefan Ram wrote: >> >> 2qdxy4rzwzuui...@potatochowder.com writes: >> >> >textual representations. That way, the

Re: Mutating an HTML file with BeautifulSoup

2022-08-21 Thread Peter J. Holzer
On 2022-08-20 21:51:41 -, Jon Ribbens via Python-list wrote: > On 2022-08-20, Stefan Ram wrote: > > Jon Ribbens writes: > >>... or you could avoid all that faff and just do re.sub()? > > source = '' > > > > # Use Python to change the source, keeping the order of attributes. > > > > result =

Re: Mutating an HTML file with BeautifulSoup

2022-08-21 Thread Barry
> On 21 Aug 2022, at 09:12, Chris Angelico wrote: > > On Sun, 21 Aug 2022 at 17:26, Barry wrote: >> >> >> On 19 Aug 2022, at 22:04, Chris Angelico wrote: >>> >>> On Sat, 20 Aug 2022 at 05:12, Barry wrote: >> On 19 Aug 2022, at 19:33, Chris Angelico wrote:

Re: Mutating an HTML file with BeautifulSoup

2022-08-21 Thread Chris Angelico
On Sun, 21 Aug 2022 at 17:26, Barry wrote: > > > > > On 19 Aug 2022, at 22:04, Chris Angelico wrote: > > > > On Sat, 20 Aug 2022 at 05:12, Barry wrote: > >> > >> > >> > On 19 Aug 2022, at 19:33, Chris Angelico wrote: > >>> > >>> What's the best way to precisely reconstruct an HTML file

Re: Mutating an HTML file with BeautifulSoup

2022-08-21 Thread Barry
> On 19 Aug 2022, at 22:04, Chris Angelico wrote: > > On Sat, 20 Aug 2022 at 05:12, Barry wrote: >> >> >> On 19 Aug 2022, at 19:33, Chris Angelico wrote: >>> >>> What's the best way to precisely reconstruct an HTML file after >>> parsing it with BeautifulSoup? >> >> I recall that

Re: Mutating an HTML file with BeautifulSoup

2022-08-20 Thread Chris Angelico
On Sun, 21 Aug 2022 at 13:41, dn wrote: > > On 21/08/2022 13.00, Chris Angelico wrote: > > Well, I don't like headaches, but I do appreciate what the G&S Archive > > has given me over the years, so I'm taking this on as a means of > > giving back to the community. > > This point will be picked-up

Re: Mutating an HTML file with BeautifulSoup

2022-08-20 Thread dn
On 21/08/2022 13.00, Chris Angelico wrote: > On Sun, 21 Aug 2022 at 09:48, dn wrote: >> On 20/08/2022 12.38, Chris Angelico wrote: >>> On Sat, 20 Aug 2022 at 10:19, dn wrote: On 20/08/2022 09.01, Chris Angelico wrote: > On Sat, 20 Aug 2022 at 05:12, Barry wrote: >>> On 19 Aug 2022,

Re: Mutating an HTML file with BeautifulSoup

2022-08-20 Thread Chris Angelico
On Sun, 21 Aug 2022 at 09:48, dn wrote: > > On 20/08/2022 12.38, Chris Angelico wrote: > > On Sat, 20 Aug 2022 at 10:19, dn wrote: > >> On 20/08/2022 09.01, Chris Angelico wrote: > >>> On Sat, 20 Aug 2022 at 05:12, Barry wrote: > > On 19 Aug 2022, at 19:33, Chris Angelico wrote: > > >

Re: Mutating an HTML file with BeautifulSoup

2022-08-20 Thread Chris Angelico
On Sun, 21 Aug 2022 at 09:31, Jon Ribbens via Python-list wrote: > > On 2022-08-20, Chris Angelico wrote: > > On Sun, 21 Aug 2022 at 03:27, Stefan Ram wrote: > >> 2qdxy4rzwzuui...@potatochowder.com writes: > >> >textual representations. That way, the following two elements are the > >> >same

Re: Mutating an HTML file with BeautifulSoup

2022-08-20 Thread dn
On 20/08/2022 12.38, Chris Angelico wrote: > On Sat, 20 Aug 2022 at 10:19, dn wrote: >> On 20/08/2022 09.01, Chris Angelico wrote: >>> On Sat, 20 Aug 2022 at 05:12, Barry wrote: > On 19 Aug 2022, at 19:33, Chris Angelico wrote: > > What's the best way to precisely reconstruct an

Re: Mutating an HTML file with BeautifulSoup

2022-08-20 Thread Jon Ribbens via Python-list
On 2022-08-20, Stefan Ram wrote: > Jon Ribbens writes: >>... or you could avoid all that faff and just do re.sub()? > > import bs4 > import re > > source = '' > > # Use Python to change the source, keeping the order of attributes. > > result = re.sub( r'href\s*=\s*"http"', r'href="https"',
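A minimal runnable version of the quoted re.sub() call might look like the following. The pattern and replacement are the ones shown in the quote; the source string is a hypothetical stand-in, because the original markup was stripped from this preview.

```python
import re

# Hypothetical source; the original value was lost in the archive preview.
source = '<a name="b" href="http" accesskey="c"></a>'

# Change only the href value, leaving attribute order and spacing alone.
result = re.sub(r'href\s*=\s*"http"', r'href="https"', source)

print(result)
```

Because the substitution works on the text directly, the attributes on either side of href stay exactly where they were.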

Re: Mutating an HTML file with BeautifulSoup

2022-08-20 Thread Jon Ribbens via Python-list
On 2022-08-20, Chris Angelico wrote: > On Sun, 21 Aug 2022 at 03:27, Stefan Ram wrote: >> 2qdxy4rzwzuui...@potatochowder.com writes: >> >textual representations. That way, the following two elements are the >> >same (and similar with a collection of sub-elements in a different order >> >in

Re: Mutating an HTML file with BeautifulSoup

2022-08-20 Thread Chris Angelico
On Sun, 21 Aug 2022 at 03:27, Stefan Ram wrote: > > 2qdxy4rzwzuui...@potatochowder.com writes: > >textual representations. That way, the following two elements are the > >same (and similar with a collection of sub-elements in a different order > >in another document): > > The /elements/

Re: Mutating an HTML file with BeautifulSoup

2022-08-20 Thread Jon Ribbens via Python-list
On 2022-08-19, Chris Angelico wrote: > What's the best way to precisely reconstruct an HTML file after > parsing it with BeautifulSoup? > > Using the Alice example from the BS4 docs: > html_doc = """The Dormouse's story > >The Dormouse's story > >Once upon a time there were three little

Re: Mutating an HTML file with BeautifulSoup

2022-08-19 Thread Chris Angelico
On Sat, 20 Aug 2022 at 10:19, dn wrote: > > On 20/08/2022 09.01, Chris Angelico wrote: > > On Sat, 20 Aug 2022 at 05:12, Barry wrote: > >> > >> > >> > >>> On 19 Aug 2022, at 19:33, Chris Angelico wrote: > >>> > >>> What's the best way to precisely reconstruct an HTML file after > >>> parsing

Re: Mutating an HTML file with BeautifulSoup

2022-08-19 Thread dn
On 20/08/2022 09.01, Chris Angelico wrote: > On Sat, 20 Aug 2022 at 05:12, Barry wrote: >> >> >> >>> On 19 Aug 2022, at 19:33, Chris Angelico wrote: >>> >>> What's the best way to precisely reconstruct an HTML file after >>> parsing it with BeautifulSoup? >> >> I recall that in bs4 it parses

Re: Mutating an HTML file with BeautifulSoup

2022-08-19 Thread Chris Angelico
On Sat, 20 Aug 2022 at 10:04, David wrote: > > On Sat, 20 Aug 2022 at 04:31, Chris Angelico wrote: > > > What's the best way to precisely reconstruct an HTML file after > > parsing it with BeautifulSoup? > > > Note two distinct changes: firstly, whitespace has been removed, and > > secondly,

Re: Mutating an HTML file with BeautifulSoup

2022-08-19 Thread David
On Sat, 20 Aug 2022 at 04:31, Chris Angelico wrote: > What's the best way to precisely reconstruct an HTML file after > parsing it with BeautifulSoup? > Note two distinct changes: firstly, whitespace has been removed, and > secondly, attributes are reordered (I think alphabetically). There are

Re: Mutating an HTML file with BeautifulSoup

2022-08-19 Thread Chris Angelico
On Sat, 20 Aug 2022 at 05:12, Barry wrote: > > > > > On 19 Aug 2022, at 19:33, Chris Angelico wrote: > > > > What's the best way to precisely reconstruct an HTML file after > > parsing it with BeautifulSoup? > > I recall that in bs4 it parses into an object tree and loses the detail of > the

Re: Mutating an HTML file with BeautifulSoup

2022-08-19 Thread 2QdxY4RzWzUUiLuE
On 2022-08-19 at 20:12:35 +0100, Barry wrote: > > On 19 Aug 2022, at 19:33, Chris Angelico wrote: > > > > What's the best way to precisely reconstruct an HTML file after > > parsing it with BeautifulSoup? > > I recall that in bs4 it parses into an object tree and loses the > detail of the

Re: Mutating an HTML file with BeautifulSoup

2022-08-19 Thread Barry
> On 19 Aug 2022, at 19:33, Chris Angelico wrote: > > What's the best way to precisely reconstruct an HTML file after > parsing it with BeautifulSoup? I recall that in bs4 it parses into an object tree and loses the detail of the input. I recently ported from very old bs to bs4 and hit the

Mutating an HTML file with BeautifulSoup

2022-08-19 Thread Chris Angelico
What's the best way to precisely reconstruct an HTML file after parsing it with BeautifulSoup? Using the Alice example from the BS4 docs: >>> html_doc = """<html><head><title>The Dormouse's story</title></head> <body> <p class="title"><b>The Dormouse's story</b></p> <p class="story">Once upon a time there were three little sisters; and their names were <a href="http://example.com/elsie"