What's the best way to precisely reconstruct an HTML file after parsing it with BeautifulSoup?
Using the Alice example from the BS4 docs:

>>> from bs4 import BeautifulSoup
>>> html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
>>> soup = BeautifulSoup(html_doc, "html.parser")
>>> print(soup)
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>

Note two distinct changes: firstly, whitespace has been removed, and
secondly, attributes are reordered (I think alphabetically). There are
other canonicalizations being done, too.

I'm trying to make some automated changes to a huge number of HTML
files, with minimal diffs so they're easy to validate. That means that
spurious changes like these are very much unwanted. Is there a way to
get BS4 to reconstruct the original precisely?

The mutation itself would be things like finding an anchor tag and
changing its href attribute. These are fairly simple changes, but they
might alter the length of the file (e.g. changing "http://example.com/"
into "https://example.com/"). I'd like to do them intelligently rather
than falling back on element.sourceline and element.sourcepos, but
worst case, that's what I'll have to do (which would be fiddly).

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
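
For illustration only, here is a rough sketch of what the position-based
fallback mentioned above could look like. It is not BS4 itself: it uses the
standard library's html.parser.HTMLParser as a stand-in, taking getpos() as
the analogue of element.sourceline and element.sourcepos, and
get_starttag_text() to recover the tag exactly as written. The names
AnchorLocator and upgrade_links are hypothetical, and the default "old" and
"new" prefixes are just the example URLs from the post. It only handles
double-quoted href values, so treat it as a starting point rather than a
complete solution.

```python
from html.parser import HTMLParser


class AnchorLocator(HTMLParser):
    """Record where each <a ...> start tag sits in the original text."""

    def __init__(self):
        super().__init__()
        self.hits = []  # (line, column, raw start-tag text)

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # getpos() reports a 1-based line and a 0-based column.
            line, col = self.getpos()
            self.hits.append((line, col, self.get_starttag_text()))


def upgrade_links(text, old="http://example.com/", new="https://example.com/"):
    """Rewrite matching href prefixes, leaving every other byte untouched."""
    locator = AnchorLocator()
    locator.feed(text)
    locator.close()

    # Absolute offset at which each line begins (lines split on "\n" only,
    # to match how the parser counts them).
    line_starts = [0] + [i + 1 for i, ch in enumerate(text) if ch == "\n"]

    # Patch from the end of the file backwards so that earlier offsets
    # remain valid even when the replacement changes the length.
    for line, col, raw in reversed(locator.hits):
        start = line_starts[line - 1] + col
        end = start + len(raw)
        patched = raw.replace('href="' + old, 'href="' + new)
        text = text[:start] + patched + text[end:]
    return text
```

Because only the start-tag text is spliced, whitespace and attribute order
elsewhere in the file are left exactly as they were, which is what keeps the
diff minimal.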