On 2022-08-19, Chris Angelico <ros...@gmail.com> wrote:
> What's the best way to precisely reconstruct an HTML file after
> parsing it with BeautifulSoup?
>
> Using the Alice example from the BS4 docs:
>
> >>> html_doc = """<html><head><title>The Dormouse's story</title></head>
> <body>
> <p class="title"><b>The Dormouse's story</b></p>
>
> <p class="story">Once upon a time there were three little sisters; and
> their names were
> <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
> <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
> <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
> and they lived at the bottom of a well.</p>
>
> <p class="story">...</p>
> """
> >>> print(soup)
> <html><head><title>The Dormouse's story</title></head>
> <body>
> <p class="title"><b>The Dormouse's story</b></p>
> <p class="story">Once upon a time there were three little sisters; and
> their names were
> <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
> <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
> <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
> and they lived at the bottom of a well.</p>
> <p class="story">...</p>
> </body></html>
> >>>
>
> Note two distinct changes: firstly, whitespace has been removed, and
> secondly, attributes are reordered (I think alphabetically). There are
> other canonicalizations being done, too.
>
> I'm trying to make some automated changes to a huge number of HTML
> files, with minimal diffs so they're easy to validate. That means that
> spurious changes like these are very much unwanted. Is there a way to
> get BS4 to reconstruct the original precisely?
>
> The mutation itself would be things like finding an anchor tag and
> changing its href attribute. Fairly simple changes, but might alter
> the length of the file (eg changing "http://example.com/" into
> "https://example.com/"). I'd like to do them intelligently rather than
> falling back on element.sourceline and element.sourcepos, but worst
> case, that's what I'll have to do (which would be fiddly).

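
The position-based fallback mentioned above might look roughly like the
sketch below. It is only an illustration, and it swaps BS4 for the standard
library's html.parser so that it has no dependencies: the parser is used
solely to locate each <a> start tag, and the original text is then spliced
around those spans, so every other byte is left alone. AnchorLocator and
rewrite_hrefs are made-up names, and the sketch assumes double-quoted href
values and a document fed to the parser in one piece.

```python
from html.parser import HTMLParser


class AnchorLocator(HTMLParser):
    """Record the absolute offset and raw source text of every <a> start tag."""

    def __init__(self, text):
        super().__init__()
        # Offset at which each line begins, so that (line, column) from
        # getpos() can be turned into an index into the original string.
        self.line_starts = [0] + [i + 1 for i, ch in enumerate(text) if ch == "\n"]
        self.found = []  # list of (absolute offset, raw start-tag text)
        self.feed(text)
        self.close()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            line, col = self.getpos()  # line is 1-based, column is 0-based
            start = self.line_starts[line - 1] + col
            self.found.append((start, self.get_starttag_text()))


def rewrite_hrefs(text, old, new):
    """Replace an href prefix inside <a> start tags, copying all else verbatim."""
    out, last = [], 0
    for start, raw in AnchorLocator(text).found:
        out.append(text[last:start])  # untouched text before the tag
        # Assumption: the attribute is written as href="..." (double quotes).
        out.append(raw.replace('href="' + old, 'href="' + new, 1))
        last = start + len(raw)
    out.append(text[last:])  # untouched tail of the document
    return "".join(out)


src = (
    'line one\n'
    '<p>Hi <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,\n'
    '<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a></p>\n'
)
out = rewrite_hrefs(src, "http://example.com/", "https://example.com/")
print(out)
```

Because only the recorded tag spans are rewritten, attribute order and
whitespace elsewhere in the file come through unchanged, which should keep
the diffs down to the edited hrefs.
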
I'm tempting the Wrath of Zalgo by saying it, but ... regexp?
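
As a rough sketch of what that could mean for the http -> https case in the
question: substitute only inside the href of an <a> start tag and leave the
rest of the file untouched. A_HREF and upgrade_links are hypothetical names,
and the pattern is deliberately naive. It would also fire on anchor-like
text inside comments or scripts, and on attributes such as data-href, so the
resulting diffs would still want a look.

```python
import re

# Capture everything from "<a" up to and including the opening quote of the
# href value, then match the URL prefix that is to be replaced.
A_HREF = re.compile(
    r"""(<a\b[^>]*?\bhref\s*=\s*["'])http://example\.com/""",
    re.IGNORECASE,
)


def upgrade_links(html):
    """Rewrite http://example.com/ to https://example.com/ in anchor hrefs only."""
    return A_HREF.sub(r"\1https://example.com/", html)


html_doc = '''<p class="story">Once upon a time
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a></p>
<p>plain text http://example.com/ stays</p>
'''
result = upgrade_links(html_doc)
print(result)
```

Attribute order and whitespace are preserved because nothing is ever parsed
and re-serialized; the only difference is the extra "s" in each matched href.
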