Re: Mutating an HTML file with BeautifulSoup
On 2022-08-22 19:27:28 -, Jon Ribbens via Python-list wrote:
> On 2022-08-22, Peter J. Holzer wrote:
> > On 2022-08-22 00:45:56 -, Jon Ribbens via Python-list wrote:
> >> With the offset though, BeautifulSoup made an arbitrary decision to
> >> use ISO-8859-1 encoding and so when you chopped the bytestring at
> >> that offset it only worked because BeautifulSoup had happened to
> >> choose a 1-byte-per-character encoding. Ironically, *without* the
> >> "\xed\xa0\x80\xed\xbc\x9f" it wouldn't have worked.
> >
> > Actually it would. The unit is bytes if you feed it with bytes, and
> > characters if you feed it with str.
>
> No it isn't. If you give BeautifulSoup's 'html.parser' bytes as input,
> it first chooses an encoding and decodes the bytes before sending that
> output to html.parser, which is what provides the offset. So the
> offsets it gives are in characters, and you've no simple way of
> converting that back to byte offsets.

Ah, I see. It "worked" for me because "\xed\xa0\x80\xed\xbc\x9f" isn't
valid UTF-8. So BeautifulSoup decided to ignore the meta charset
declaration I had inserted before and used ISO-8859-1, providing me
with correct byte offsets. If I replace that gibberish with a correct
UTF-8 sequence (e.g. "\x4B\xC3\xA4\x73\x65") the UTF-8 is decoded and I
get a character offset.

> >> It looks like BeautifulSoup is doing something like that, yes.
> >> Personally I would be nervous about some of my files being parsed
> >> as UTF-8 and some of them ISO-8859-1 (due to decoding errors rather
> >> than some of the files actually *being* ISO-8859-1 ;-) )
>
> > Since none of the syntactically meaningful characters have a code >=
> > 0x80, you can parse HTML at the byte level if you know that it's
> > encoded in a strict superset of ASCII (which all of the ISO-8859
> > family and UTF-8 are). Only if that's not true (e.g. if your files
> > might be UTF-16, or Shift-JIS or EUC, if I remember correctly) then
> > you have to know the character set.
> > (By parsing I mean only "create a syntax tree". Obviously you have
> > to know the encoding to know whether to display «c3 bc» as «ü» or
> > «Ã¼».)
>
> But the job here isn't to create a syntax tree. It's to change some of
> the content, which for all we know is not ASCII.

We know it's URLs, and the canonical form of a URL is ASCII. The URLs
in the files may not be, but if they aren't you'll have to deal with
variants anyway. And the start and end of the attribute can be
determined in any strict superset of ASCII including UTF-8.

hp

-- 
https://mail.python.org/mailman/listinfo/python-list
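The character-versus-byte offset problem above has a mechanical way out: since the parser reports offsets into the decoded str, re-encoding the prefix recovers the byte offset. A minimal sketch, assuming the file decoded as UTF-8 (the example string is made up):

```python
raw = '<p>K\u00e4se</p><a href="x">'.encode("utf-8")
text = raw.decode("utf-8")          # what the parser actually sees

char_off = text.index("<a")          # offset in characters: 11
# Re-encode the prefix to turn it back into a byte offset: 12,
# because "ä" is one character but two UTF-8 bytes.
byte_off = len(text[:char_off].encode("utf-8"))

print(char_off, byte_off)            # 11 12
print(raw[byte_off:])                # b'<a href="x">'
```

The same trick works for any encoding, as long as you know which one the parser actually used for the decode.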
Re: Mutating an HTML file with BeautifulSoup
On 2022-08-22, Peter J. Holzer wrote:
> On 2022-08-22 00:45:56 -, Jon Ribbens via Python-list wrote:
>> With the offset though, BeautifulSoup made an arbitrary decision to
>> use ISO-8859-1 encoding and so when you chopped the bytestring at
>> that offset it only worked because BeautifulSoup had happened to
>> choose a 1-byte-per-character encoding. Ironically, *without* the
>> "\xed\xa0\x80\xed\xbc\x9f" it wouldn't have worked.
>
> Actually it would. The unit is bytes if you feed it with bytes, and
> characters if you feed it with str.

No it isn't. If you give BeautifulSoup's 'html.parser' bytes as input,
it first chooses an encoding and decodes the bytes before sending that
output to html.parser, which is what provides the offset. So the
offsets it gives are in characters, and you've no simple way of
converting that back to byte offsets.

> (OTOH it seems that the html parser doesn't heed any <meta> tags,
> which seems less than ideal for more pedestrian purposes.)

html.parser doesn't accept bytes as input, so it couldn't do anything
with the encoding even if it knew it. BeautifulSoup's 'html.parser'
however does look for and use <meta charset> (using a regexp, natch).

>> It looks like BeautifulSoup is doing something like that, yes.
>> Personally I would be nervous about some of my files being parsed
>> as UTF-8 and some of them ISO-8859-1 (due to decoding errors rather
>> than some of the files actually *being* ISO-8859-1 ;-) )
>
> Since none of the syntactically meaningful characters have a code >=
> 0x80, you can parse HTML at the byte level if you know that it's
> encoded in a strict superset of ASCII (which all of the ISO-8859
> family and UTF-8 are). Only if that's not true (e.g. if your files
> might be UTF-16, or Shift-JIS or EUC, if I remember correctly) then
> you have to know the character set.
>
> (By parsing I mean only "create a syntax tree". Obviously you have to
> know the encoding to know whether to display «c3 bc» as «ü» or «Ã¼».)
But the job here isn't to create a syntax tree. It's to change some of
the content, which for all we know is not ASCII.
Re: Mutating an HTML file with BeautifulSoup
On 2022-08-22 00:45:56 -, Jon Ribbens via Python-list wrote:
> With the offset though, BeautifulSoup made an arbitrary decision to
> use ISO-8859-1 encoding and so when you chopped the bytestring at
> that offset it only worked because BeautifulSoup had happened to
> choose a 1-byte-per-character encoding. Ironically, *without* the
> "\xed\xa0\x80\xed\xbc\x9f" it wouldn't have worked.

Actually it would. The unit is bytes if you feed it with bytes, and
characters if you feed it with str. So in any case you can use the
offset on the data you fed to the parser. Maybe not what you expected,
but seems quite useful for what Chris has in mind.

(OTOH it seems that the html parser doesn't heed any <meta> tags, which
seems less than ideal for more pedestrian purposes.)

> > So I would probably just let this one go through as 8859-1.
>
> It looks like BeautifulSoup is doing something like that, yes.
> Personally I would be nervous about some of my files being parsed
> as UTF-8 and some of them ISO-8859-1 (due to decoding errors rather
> than some of the files actually *being* ISO-8859-1 ;-) )

Since none of the syntactically meaningful characters have a code >=
0x80, you can parse HTML at the byte level if you know that it's
encoded in a strict superset of ASCII (which all of the ISO-8859 family
and UTF-8 are). Only if that's not true (e.g. if your files might be
UTF-16, or Shift-JIS or EUC, if I remember correctly) then you have to
know the character set.

(By parsing I mean only "create a syntax tree". Obviously you have to
know the encoding to know whether to display «c3 bc» as «ü» or «Ã¼».)

hp
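The byte-level point above can be made concrete: every syntactically meaningful character is ASCII, so an href can be located directly in the raw bytes of any ASCII-superset encoding without decoding first. A sketch, not a full parser; the page content and URL are made up:

```python
import re

# Works on raw UTF-8 or ISO-8859-* bytes alike, because quotes, '=',
# '<' and '>' all sit below 0x80 in any strict superset of ASCII.
raw = '<p>B\u00fccher: <a href="http://example.com/">hier</a></p>'.encode("utf-8")

m = re.search(rb'href\s*=\s*"([^"]*)"', raw)
print(m.group(1))    # b'http://example.com/'
print(m.start(1))    # a *byte* offset, usable for in-place surgery
```

Because the match positions are byte offsets, splicing a replacement URL back into `raw` needs no knowledge of the encoding at all.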
Re: Mutating an HTML file with BeautifulSoup
On 2022-08-22 00:09:01 -, Jon Ribbens via Python-list wrote:
> On 2022-08-21, Peter J. Holzer wrote:
> > On 2022-08-20 21:51:41 -, Jon Ribbens via Python-list wrote:
> >> result = re.sub(
> >> r"""(<\s*a\s+[^>]*href\s*=\s*)(['"])\s*OLD\s*\2""",
> >
> > This will fail on:
>
> I've seen *a lot* of bad/broken/weird HTML over the years, and I don't
> believe I've ever seen anyone do that. (Wrongly putting an 'alt'
> attribute on an 'a' element is very common, on the other hand ;-) )

My bad. I meant title, not alt, of course.

The unescaped > is completely standard conforming HTML, however (both
HTML 4.01 strict and HTML 5). You almost never have to escape > - in
fact I can't think of any case right now - and I generally don't
(sometimes I do for symmetry with <, but that's an aesthetic choice,
not a technical one).

> > The problem can be solved with regular expressions (and given the
> > constraints I think I would prefer that to using Beautiful Soup),
> > but getting the regexps right is not trivial, at least in the
> > general case.
>
> I would like to see the regular expression that could fully parse
> general HTML...

That depends on what you mean by "parse". If you mean "construct a DOM
tree", you can't since regular expressions (in the mathematical sense,
not what's implemented by some programming languages) by definition
describe finite automata, and those don't support recursion. But if you
mean "split into a sequence of tags and PCDATA's (and then each tag
further into its attributes)", that's absolutely possible, and that's
all that is needed here. I don't think I have ever implemented a
complete solution (if only because stuff like a '>' inside an attribute
value is extremely rare), but I should have some Perl code lying around
which worked on a wide variety of HTML. I just have to find it again ...

hp
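The split described above - tags versus PCDATA - can be sketched with a single token regex whose tag branch skips over quoted attribute values, so an unescaped '>' inside an attribute doesn't end the tag early. A rough sketch, not a complete tokenizer (comments, CDATA and script contents are deliberately ignored):

```python
import re

# One alternation: either a tag (quoted attribute values consumed as
# units, so '>' inside them is harmless) or a run of PCDATA.
TOKEN = re.compile(
    r"""<[^>"']*(?:"[^"]*"[^>"']*|'[^']*'[^>"']*)*>|[^<]+"""
)

tokens = TOKEN.findall('<a title="a>b" href="x">link</a> text')
print(tokens)
# ['<a title="a>b" href="x">', 'link', '</a>', ' text']
```

Splitting each tag token further into attributes is a second, similar regex over the tag's interior.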
Re: Mutating an HTML file with BeautifulSoup
On 2022-08-21, Chris Angelico wrote: > On Mon, 22 Aug 2022 at 05:43, Jon Ribbens via Python-list > wrote: >> On 2022-08-21, Chris Angelico wrote: >> > On Sun, 21 Aug 2022 at 09:31, Jon Ribbens via Python-list >> > wrote: >> >> On 2022-08-20, Chris Angelico wrote: >> >> > On Sun, 21 Aug 2022 at 03:27, Stefan Ram >> >> > wrote: >> >> >> 2qdxy4rzwzuui...@potatochowder.com writes: >> >> >> >textual representations. That way, the following two elements are the >> >> >> >same (and similar with a collection of sub-elements in a different >> >> >> >order >> >> >> >in another document): >> >> >> >> >> >> The /elements/ differ. They have the /same/ infoset. >> >> > >> >> > That's the bit that's hard to prove. >> >> > >> >> >> The OP could edit the files with regexps to create a new version. >> >> > >> >> > To you and Jon, who also suggested this: how would that be beneficial? >> >> > With Beautiful Soup, I have the line number and position within the >> >> > line where the tag starts; what does a regex give me that I don't have >> >> > that way? >> >> >> >> You mean you could use BeautifulSoup to read the file and identify the >> >> bits you want to change by line number and offset, and then you could >> >> use that data to try and update the file, hoping like hell that your >> >> definition of "line" and "offset" are identical to BeautifulSoup's >> >> and that you don't mess up later changes when you do earlier ones (you >> >> could do them in reverse order of line and offset I suppose) and >> >> probably resorting to regexps anyway in order to find the part of the >> >> tag you want to change ... >> >> >> >> ... or you could avoid all that faff and just do re.sub()? >> > >> > Stefan answered in part, but I'll add that it is far FAR easier to do >> > the analysis with BS4 than regular expressions. 
> I'm not sure what "hoping like hell" is supposed to mean here, since
> the line and offset have been 100% accurate in my experience;

>> Given the string:
>>
>> b"\n \r\r\n\v\n\r\xed\xa0\x80\xed\xbc\x9f\xcc\x80e\xc3\xa8?"
>>
>> what is the line number and offset of the question mark - and does
>> BeautifulSoup agree with your answer? Does the answer to that second
>> question change depending on what parser you tell BeautifulSoup to
>> use?
>
> I'm not sure, because I don't know how to ask BS4 about the location
> of a question mark. But I replaced that with a tag, and:
>
> >>> raw = b"\n \r\r\n\v\n\r\xed\xa0\x80\xed\xbc\x9f\xcc\x80e\xc3\xa8<body>"
> >>> from bs4 import BeautifulSoup
> >>> soup = BeautifulSoup(raw, "html.parser")
> >>> soup.body.sourceline
> 4
> >>> soup.body.sourcepos
> 12
> >>> raw.split(b"\n")[3]
> b'\r\xed\xa0\x80\xed\xbc\x9f\xcc\x80e\xc3\xa8<body>'
> >>> raw.split(b"\n")[3][12:]
> b'<body>'
>
> So, yes, it seems to be correct. (Slightly odd in that the sourceline
> is 1-based but the sourcepos is 0-based, but that is indeed the case,
> as confirmed with a much more straight-forward string.)
>
> And yes, it depends on the parser, but I'm using html.parser and it's
> fine.

Hah, yes, it appears html.parser does an end-run about my lovely
carefully crafted hard case by not even *trying* to work out what type
of line endings the file uses and is just hard-coded to only recognise
"\n" as a line ending.

With the offset though, BeautifulSoup made an arbitrary decision to use
ISO-8859-1 encoding and so when you chopped the bytestring at that
offset it only worked because BeautifulSoup had happened to choose a
1-byte-per-character encoding. Ironically, *without* the
"\xed\xa0\x80\xed\xbc\x9f" it wouldn't have worked.

>> (If your answer is "if the input contains \xed\xa0\x80\xed\xbc\x9f
>> then I am happy with the program throwing an exception" then feel
>> free to remove that substring from the question.)
>
> Malformed UTF-8 doesn't seem to be a problem. Every file here seems to
> be either UTF-8 or ISO-8859, and in the latter case, I'm assuming
> 8859-1. So I would probably just let this one go through as 8859-1.

It looks like BeautifulSoup is doing something like that, yes.
Personally I would be nervous about some of my files being parsed
as UTF-8 and some of them ISO-8859-1 (due to decoding errors rather
than some of the files actually *being* ISO-8859-1 ;-) )

>>> the only part I'm unsure about is where the _end_ of the tag is
>>> (and maybe there's a way I can use BS4 again to get that??).
>>
>> There doesn't seem to be. More to the point, there doesn't seem to be
>> a way to find out where the *attributes* are, so as I said you'll
>> most likely end up using regexps anyway.
>
> I'm okay with replacing an entire tag that needs to be changed.

Oh, that seems like quite a big change to the original problem.

> Especially if I can replace just the opening tag, not the contents and
> closing tag. And in fact, I may just do that part by scanning for an
> unencoded greater-than, on the assumptions that (a) BS4 will correctly
> encode any greater-thans in attributes,

But your input wasn't created by
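The fallback policy discussed here ("UTF-8, else assume 8859-1") can be written down directly; a sketch, and note that because ISO-8859-1 assigns a character to every byte, the fallback can never fail - only silently mislabel a file:

```python
def decode_guess(raw: bytes) -> tuple[str, str]:
    """Decode as UTF-8 if possible, otherwise fall back to ISO-8859-1."""
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Every byte sequence is valid ISO-8859-1, so this cannot raise;
        # a genuinely different 8-bit encoding would just be mislabelled.
        return raw.decode("iso-8859-1"), "iso-8859-1"

print(decode_guess("K\u00e4se".encode("utf-8"))[1])   # utf-8
print(decode_guess(b"K\xe4se")[1])                    # iso-8859-1
```

Running this once over the whole corpus and logging which branch fired is a cheap way to find the files that would be decoded inconsistently.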
Re: Mutating an HTML file with BeautifulSoup
On 2022-08-21, Peter J. Holzer wrote:
> On 2022-08-20 21:51:41 -, Jon Ribbens via Python-list wrote:
>> On 2022-08-20, Stefan Ram wrote:
>> > Jon Ribbens writes:
>> >> ... or you could avoid all that faff and just do re.sub()?
>> >
>> > source = ''
>> >
>> > # Use Python to change the source, keeping the order of attributes.
>> >
>> > result = re.sub( r'href\s*=\s*"http"', r'href="https"', source )
>> > result = re.sub( r"href\s*=\s*'http'", r"href='https'", result )
>
> Depending on the content of the site, this might replace some stuff
> which is not a link.
>
>> You could go a bit harder with the regexp of course, e.g.:
>>
>> result = re.sub(
>> r"""(<\s*a\s+[^>]*href\s*=\s*)(['"])\s*OLD\s*\2""",
>
> This will fail on:

I've seen *a lot* of bad/broken/weird HTML over the years, and I don't
believe I've ever seen anyone do that. (Wrongly putting an 'alt'
attribute on an 'a' element is very common, on the other hand ;-) )

> The problem can be solved with regular expressions (and given the
> constraints I think I would prefer that to using Beautiful Soup), but
> getting the regexps right is not trivial, at least in the general
> case.

I would like to see the regular expression that could fully parse
general HTML...
Re: Mutating an HTML file with BeautifulSoup
On 22/08/2022 05:30, Chris Angelico wrote:
> On Mon, 22 Aug 2022 at 10:04, Buck Evan wrote:
>> I've had much success doing round trips through the lxml.html parser.
>>
>> https://lxml.de/lxmlhtml.html
>>
>> I ditched bs for lxml long ago and never regretted it.
>>
>> If you find that you have a bunch of invalid html that lxml
>> inadvertently "fixes", I would recommend adding a stutter-step to
>> your project: perform a noop roundtrip thru lxml on all files. I'd
>> then analyze any diff by progressively excluding changes via
>> `grep -vP`. Unless I'm mistaken, all such changes should fall into
>> no more than a dozen groups.
>
> Will this round-trip mutate every single file and reorder the tag
> attributes? Because I really don't want to manually eyeball all those
> changes.

Most certainly not. Reordering is a bs4 feature that is governed by a
formatter. You can easily prevent that attributes are reordered:

>>> import bs4
>>> soup = bs4.BeautifulSoup("")
>>> soup

>>> class Formatter(bs4.formatter.HTMLFormatter):
...     def attributes(self, tag):
...         return [] if tag.attrs is None else list(tag.attrs.items())

>>> soup.decode(formatter=Formatter())
''

Blank space is probably removed by the underlying html parser. It might
be possible to make bs4 instantiate the lxml.html.HTMLParser with
remove_blank_text=False, but I didn't try hard enough ;)

That said, for my humble html scraping needs I have ditched bs4 in
favor of lxml and its xpath capabilities.
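The formatter trick above, restated as a self-contained sketch (the sample anchor tag is made up): `HTMLFormatter.attributes` decides which attributes are emitted and in what order, and returning `tag.attrs.items()` unchanged keeps source order instead of bs4's default alphabetical sort.

```python
import bs4

class KeepOrder(bs4.formatter.HTMLFormatter):
    # Return the attributes exactly as parsed, instead of sorted.
    def attributes(self, tag):
        return [] if tag.attrs is None else list(tag.attrs.items())

soup = bs4.BeautifulSoup('<a href="x" class="sister">Elsie</a>', "html.parser")
out = soup.decode(formatter=KeepOrder())
print(out)   # href stays before class, unlike the default output
```

This only controls serialization; whitespace normalization done by the underlying parser is a separate issue, as noted above.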
Re: Mutating an HTML file with BeautifulSoup
On Mon, 22 Aug 2022 at 10:04, Buck Evan wrote: > > I've had much success doing round trips through the lxml.html parser. > > https://lxml.de/lxmlhtml.html > > I ditched bs for lxml long ago and never regretted it. > > If you find that you have a bunch of invalid html that lxml inadvertently > "fixes", I would recommend adding a stutter-step to your project: perform a > noop roundtrip thru lxml on all files. I'd then analyze any diff by > progressively excluding changes via `grep -vP`. > Unless I'm mistaken, all such changes should fall into no more than a dozen > groups. > Will this round-trip mutate every single file and reorder the tag attributes? Because I really don't want to manually eyeball all those changes. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: Mutating an HTML file with BeautifulSoup
I've had much success doing round trips through the lxml.html parser.

https://lxml.de/lxmlhtml.html

I ditched bs for lxml long ago and never regretted it.

If you find that you have a bunch of invalid html that lxml
inadvertently "fixes", I would recommend adding a stutter-step to your
project: perform a noop roundtrip thru lxml on all files. I'd then
analyze any diff by progressively excluding changes via `grep -vP`.
Unless I'm mistaken, all such changes should fall into no more than a
dozen groups.

On Fri, Aug 19, 2022, 1:34 PM Chris Angelico wrote:
> What's the best way to precisely reconstruct an HTML file after
> parsing it with BeautifulSoup?
>
> Using the Alice example from the BS4 docs:
>
> >>> html_doc = """<html><head><title>The Dormouse's story</title></head>
> <body>
> <p class="title"><b>The Dormouse's story</b></p>
>
> <p class="story">Once upon a time there were three little sisters; and
> their names were
> <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
> <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
> <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
> and they lived at the bottom of a well.</p>
>
> <p class="story">...</p>
> """
> >>> soup = BeautifulSoup(html_doc, 'html.parser')
> >>> print(soup)
> <html><head><title>The Dormouse's story</title></head>
> <body>
> <p class="title"><b>The Dormouse's story</b></p>
> <p class="story">Once upon a time there were three little sisters; and
> their names were
> <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
> <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
> <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
> and they lived at the bottom of a well.</p>
> <p class="story">...</p>
> >>>
>
> Note two distinct changes: firstly, whitespace has been removed, and
> secondly, attributes are reordered (I think alphabetically). There are
> other canonicalizations being done, too.
>
> I'm trying to make some automated changes to a huge number of HTML
> files, with minimal diffs so they're easy to validate. That means that
> spurious changes like these are very much unwanted. Is there a way to
> get BS4 to reconstruct the original precisely?
>
> The mutation itself would be things like finding an anchor tag and
> changing its href attribute. Fairly simple changes, but might alter
> the length of the file (eg changing "http://example.com/" into
> "https://example.com/"). I'd like to do them intelligently rather than
> falling back on element.sourceline and element.sourcepos, but worst
> case, that's what I'll have to do (which would be fiddly).
>
> ChrisA
Re: Mutating an HTML file with BeautifulSoup
On Mon, 22 Aug 2022 at 05:43, Jon Ribbens via Python-list wrote: > > On 2022-08-21, Chris Angelico wrote: > > On Sun, 21 Aug 2022 at 09:31, Jon Ribbens via Python-list > > wrote: > >> On 2022-08-20, Chris Angelico wrote: > >> > On Sun, 21 Aug 2022 at 03:27, Stefan Ram wrote: > >> >> 2qdxy4rzwzuui...@potatochowder.com writes: > >> >> >textual representations. That way, the following two elements are the > >> >> >same (and similar with a collection of sub-elements in a different > >> >> >order > >> >> >in another document): > >> >> > >> >> The /elements/ differ. They have the /same/ infoset. > >> > > >> > That's the bit that's hard to prove. > >> > > >> >> The OP could edit the files with regexps to create a new version. > >> > > >> > To you and Jon, who also suggested this: how would that be beneficial? > >> > With Beautiful Soup, I have the line number and position within the > >> > line where the tag starts; what does a regex give me that I don't have > >> > that way? > >> > >> You mean you could use BeautifulSoup to read the file and identify the > >> bits you want to change by line number and offset, and then you could > >> use that data to try and update the file, hoping like hell that your > >> definition of "line" and "offset" are identical to BeautifulSoup's > >> and that you don't mess up later changes when you do earlier ones (you > >> could do them in reverse order of line and offset I suppose) and > >> probably resorting to regexps anyway in order to find the part of the > >> tag you want to change ... > >> > >> ... or you could avoid all that faff and just do re.sub()? > > > > Stefan answered in part, but I'll add that it is far FAR easier to do > > the analysis with BS4 than regular expressions. I'm not sure what > > "hoping like hell" is supposed to mean here, since the line and offset > > have been 100% accurate in my experience; > > Given the string: > > b"\n \r\r\n\v\n\r\xed\xa0\x80\xed\xbc\x9f\xcc\x80e\xc3\xa8?" 
>
> what is the line number and offset of the question mark - and does
> BeautifulSoup agree with your answer? Does the answer to that second
> question change depending on what parser you tell BeautifulSoup to
> use?

I'm not sure, because I don't know how to ask BS4 about the location
of a question mark. But I replaced that with a tag, and:

>>> raw = b"\n \r\r\n\v\n\r\xed\xa0\x80\xed\xbc\x9f\xcc\x80e\xc3\xa8<body>"
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(raw, "html.parser")
>>> soup.body.sourceline
4
>>> soup.body.sourcepos
12
>>> raw.split(b"\n")[3]
b'\r\xed\xa0\x80\xed\xbc\x9f\xcc\x80e\xc3\xa8<body>'
>>> raw.split(b"\n")[3][12:]
b'<body>'

So, yes, it seems to be correct. (Slightly odd in that the sourceline
is 1-based but the sourcepos is 0-based, but that is indeed the case,
as confirmed with a much more straight-forward string.)

And yes, it depends on the parser, but I'm using html.parser and it's
fine.

> (If your answer is "if the input contains \xed\xa0\x80\xed\xbc\x9f
> then I am happy with the program throwing an exception" then feel
> free to remove that substring from the question.)

Malformed UTF-8 doesn't seem to be a problem. Every file here seems to
be either UTF-8 or ISO-8859, and in the latter case, I'm assuming
8859-1. So I would probably just let this one go through as 8859-1.

> > the only part I'm unsure about is where the _end_ of the tag is
> > (and maybe there's a way I can use BS4 again to get that??).
>
> There doesn't seem to be. More to the point, there doesn't seem to be
> a way to find out where the *attributes* are, so as I said you'll most
> likely end up using regexps anyway.

I'm okay with replacing an entire tag that needs to be changed.
Especially if I can replace just the opening tag, not the contents and
closing tag.
And in fact, I may just do that part by scanning for an unencoded greater-than, on the assumptions that (a) BS4 will correctly encode any greater-thans in attributes, and (b) if there's a mis-encoded one in the input, the diff will be small enough to eyeball, and a human should easily notice that the text has been massively expanded and duplicated. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
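The "scan for an unencoded greater-than" step described above can be made quote-aware cheaply, so that a '>' inside an attribute value doesn't truncate the tag. A sketch, assuming the tag really starts at the position the parser reported:

```python
def end_of_tag(text: str, start: int) -> int:
    """Return the index just past the '>' that closes the tag at `start`,
    skipping over single- and double-quoted attribute values."""
    i, quote = start, None
    while i < len(text):
        c = text[i]
        if quote:
            if c == quote:          # closing quote of an attribute value
                quote = None
        elif c in "\"'":            # opening quote
            quote = c
        elif c == ">":              # a '>' outside quotes ends the tag
            return i + 1
        i += 1
    raise ValueError("unterminated tag")

print(end_of_tag('<a title="a>b" href="x">link', 0))  # 24
```

With the start from sourceline/sourcepos and the end from this scan, replacing just the opening tag becomes a plain string splice.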
Re: Mutating an HTML file with BeautifulSoup
On 2022-08-21, Chris Angelico wrote: > On Sun, 21 Aug 2022 at 09:31, Jon Ribbens via Python-list > wrote: >> On 2022-08-20, Chris Angelico wrote: >> > On Sun, 21 Aug 2022 at 03:27, Stefan Ram wrote: >> >> 2qdxy4rzwzuui...@potatochowder.com writes: >> >> >textual representations. That way, the following two elements are the >> >> >same (and similar with a collection of sub-elements in a different order >> >> >in another document): >> >> >> >> The /elements/ differ. They have the /same/ infoset. >> > >> > That's the bit that's hard to prove. >> > >> >> The OP could edit the files with regexps to create a new version. >> > >> > To you and Jon, who also suggested this: how would that be beneficial? >> > With Beautiful Soup, I have the line number and position within the >> > line where the tag starts; what does a regex give me that I don't have >> > that way? >> >> You mean you could use BeautifulSoup to read the file and identify the >> bits you want to change by line number and offset, and then you could >> use that data to try and update the file, hoping like hell that your >> definition of "line" and "offset" are identical to BeautifulSoup's >> and that you don't mess up later changes when you do earlier ones (you >> could do them in reverse order of line and offset I suppose) and >> probably resorting to regexps anyway in order to find the part of the >> tag you want to change ... >> >> ... or you could avoid all that faff and just do re.sub()? > > Stefan answered in part, but I'll add that it is far FAR easier to do > the analysis with BS4 than regular expressions. I'm not sure what > "hoping like hell" is supposed to mean here, since the line and offset > have been 100% accurate in my experience; Given the string: b"\n \r\r\n\v\n\r\xed\xa0\x80\xed\xbc\x9f\xcc\x80e\xc3\xa8?" what is the line number and offset of the question mark - and does BeautifulSoup agree with your answer? 
Does the answer to that second question change depending on what parser
you tell BeautifulSoup to use?

(If your answer is "if the input contains \xed\xa0\x80\xed\xbc\x9f then
I am happy with the program throwing an exception" then feel free to
remove that substring from the question.)

> the only part I'm unsure about is where the _end_ of the tag is (and
> maybe there's a way I can use BS4 again to get that??).

There doesn't seem to be. More to the point, there doesn't seem to be a
way to find out where the *attributes* are, so as I said you'll most
likely end up using regexps anyway.
Re: Mutating an HTML file with BeautifulSoup
On 2022-08-20 21:51:41 -, Jon Ribbens via Python-list wrote:
> On 2022-08-20, Stefan Ram wrote:
> > Jon Ribbens writes:
> >> ... or you could avoid all that faff and just do re.sub()?
> >
> > source = ''
> >
> > # Use Python to change the source, keeping the order of attributes.
> >
> > result = re.sub( r'href\s*=\s*"http"', r'href="https"', source )
> > result = re.sub( r"href\s*=\s*'http'", r"href='https'", result )

Depending on the content of the site, this might replace some stuff
which is not a link.

> You could go a bit harder with the regexp of course, e.g.:
>
> result = re.sub(
> r"""(<\s*a\s+[^>]*href\s*=\s*)(['"])\s*OLD\s*\2""",

This will fail on:

The problem can be solved with regular expressions (and given the
constraints I think I would prefer that to using Beautiful Soup), but
getting the regexps right is not trivial, at least in the general case.
It may become a lot easier if you know that certain conventions were
followed (e.g. that ">" was always written as "&gt;") or it may become
even harder when the files contain errors.

hp
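The regexp above, made runnable as a sketch. "OLD" and "NEW" stand in for the real URLs, and as noted it will mis-fire when '>' appears inside an earlier attribute value; the helper name and sample tag are made up for illustration:

```python
import re

def rewrite_href(html: str, old: str, new: str) -> str:
    # Group 1: everything up to and including 'href='; group 2: the
    # quote character, matched again at the end via the backreference.
    pattern = (
        r"""(<\s*a\s+[^>]*href\s*=\s*)(['"])\s*"""
        + re.escape(old)
        + r"""\s*\2"""
    )
    # A callable replacement avoids any backslash escaping issues in `new`.
    return re.sub(pattern,
                  lambda m: m.group(1) + m.group(2) + new + m.group(2),
                  html)

html = '<a class="sister" href="http://example.com/">x</a>'
print(rewrite_href(html, "http://example.com/", "https://example.com/"))
# <a class="sister" href="https://example.com/">x</a>
```

Because only the matched span is rewritten, the rest of the file - whitespace, attribute order and all - passes through byte-for-byte, which is exactly the minimal-diff property the thread is after.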
Re: Mutating an HTML file with BeautifulSoup
> On 21 Aug 2022, at 09:12, Chris Angelico wrote: > > On Sun, 21 Aug 2022 at 17:26, Barry wrote: >> >> >> On 19 Aug 2022, at 22:04, Chris Angelico wrote: >>> >>> On Sat, 20 Aug 2022 at 05:12, Barry wrote: >> On 19 Aug 2022, at 19:33, Chris Angelico wrote: > > What's the best way to precisely reconstruct an HTML file after > parsing it with BeautifulSoup? I recall that in bs4 it parses into an object tree and loses the detail of the input. I recently ported from very old bs to bs4 and hit the same issue. So no it will not output the same as went in. If you can trust the input to be parsed as xml, meaning all the rules of closing tags have been followed. Then I think you can parse and unparse thru xml to do what you want. >>> >>> >>> Yeah, no I can't, this is HTML 4 with a ton of inconsistencies. Oh >>> well. Thanks for trying, anyhow. >>> >>> So I'm left with a few options: >>> >>> 1) Give up on validation, give up on verification, and just run this >>> thing on the production site with my fingers crossed >> >> Can you build a beta site with original intack? > > In a naive way, a full copy would be quite a few gigabytes. I could > cut that down a good bit by taking only HTML files and the things they > reference, but then we run into the same problem of broken links, > which is what we're here to solve in the first place. > > But I would certainly not want to run two copies of the site and then > manually compare. > >> Also wonder if using selenium to walk the site may work as a verification >> step? >> I cannot recall if you can get an image of the browser window to do image >> compares with to look for rendering differences. > > Image recognition won't necessarily even be valid; some of the changes > will have visual consequences (eg a broken image reference now > becoming correct), and as soon as that happens, the whole document can > reflow. > >> From my one task using bs4 I did not see it produce any bad results. 
>> In my case the problems were in the code that built on bs1 using bad
>> assumptions.
>
> Did that get run on perfect HTML, or on messy real-world stuff that
> uses quirks mode?

A small number of messy html pages.

Barry
Re: Mutating an HTML file with BeautifulSoup
On Sun, 21 Aug 2022 at 17:26, Barry wrote: > > > > > On 19 Aug 2022, at 22:04, Chris Angelico wrote: > > > > On Sat, 20 Aug 2022 at 05:12, Barry wrote: > >> > >> > >> > On 19 Aug 2022, at 19:33, Chris Angelico wrote: > >>> > >>> What's the best way to precisely reconstruct an HTML file after > >>> parsing it with BeautifulSoup? > >> > >> I recall that in bs4 it parses into an object tree and loses the detail of > >> the input. > >> I recently ported from very old bs to bs4 and hit the same issue. > >> So no it will not output the same as went in. > >> > >> If you can trust the input to be parsed as xml, meaning all the rules of > >> closing > >> tags have been followed. Then I think you can parse and unparse thru xml to > >> do what you want. > >> > > > > > > Yeah, no I can't, this is HTML 4 with a ton of inconsistencies. Oh > > well. Thanks for trying, anyhow. > > > > So I'm left with a few options: > > > > 1) Give up on validation, give up on verification, and just run this > > thing on the production site with my fingers crossed > > Can you build a beta site with original intack? In a naive way, a full copy would be quite a few gigabytes. I could cut that down a good bit by taking only HTML files and the things they reference, but then we run into the same problem of broken links, which is what we're here to solve in the first place. But I would certainly not want to run two copies of the site and then manually compare. > Also wonder if using selenium to walk the site may work as a verification > step? > I cannot recall if you can get an image of the browser window to do image > compares with to look for rendering differences. Image recognition won't necessarily even be valid; some of the changes will have visual consequences (eg a broken image reference now becoming correct), and as soon as that happens, the whole document can reflow. > From my one task using bs4 I did not see it produce any bad results. 
> In my case the problems were in the code that built on bs1 using bad
> assumptions.

Did that get run on perfect HTML, or on messy real-world stuff that uses quirks mode?

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
Re: Mutating an HTML file with BeautifulSoup
> On 19 Aug 2022, at 22:04, Chris Angelico wrote: > > On Sat, 20 Aug 2022 at 05:12, Barry wrote: >> >> >> On 19 Aug 2022, at 19:33, Chris Angelico wrote: >>> >>> What's the best way to precisely reconstruct an HTML file after >>> parsing it with BeautifulSoup? >> >> I recall that in bs4 it parses into an object tree and loses the detail of >> the input. >> I recently ported from very old bs to bs4 and hit the same issue. >> So no it will not output the same as went in. >> >> If you can trust the input to be parsed as xml, meaning all the rules of >> closing >> tags have been followed. Then I think you can parse and unparse thru xml to >> do what you want. >> > > > Yeah, no I can't, this is HTML 4 with a ton of inconsistencies. Oh > well. Thanks for trying, anyhow. > > So I'm left with a few options: > > 1) Give up on validation, give up on verification, and just run this > thing on the production site with my fingers crossed Can you build a beta site with original intact? Also wonder if using selenium to walk the site may work as a verification step? I cannot recall if you can get an image of the browser window to do image compares with to look for rendering differences. From my one task using bs4 I did not see it produce any bad results. In my case the problems were in the code that built on bs1 using bad assumptions. > 2) Instead of doing an intelligent reconstruction, just str.replace() > one URL with another within the file > 3) Split the file into lines, find the Nth line (elem.sourceline) and > str.replace that line only > 4) Attempt to use elem.sourceline and elem.sourcepos to find the start > of the tag, manually find the end, and replace one tag with the > reconstructed form. > > I'm inclined to the first option, honestly. The others just seem like > hard work, and I became a programmer so I could be lazy... > > ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: Mutating an HTML file with BeautifulSoup
On Sun, 21 Aug 2022 at 13:41, dn wrote: > > On 21/08/2022 13.00, Chris Angelico wrote: > > Well, I don't like headaches, but I do appreciate what the G Archive > > has given me over the years, so I'm taking this on as a means of > > giving back to the community. > > This point will be picked-up in the conclusion. NB in the same way that > you want to 'give back', so also do others - even if in minor ways or > 'when-relevant'! Very true. > >> In fact, depending upon frequency, making the changes manually (and with > >> improved confidence in the result). > > > > Unfortunately the frequency is very high. > > Screechingly so? Like you're singing Three Little Maids? You don't want to hear me singing that although I do recall once singing Lady Ella's part at a Qwert, to gales of laughter. > > Yeah. I do a first pass to enumerate all domains that are ever linked > > to with http:// URLs, and then I have a script that goes through and > > checks to see if they redirect me to the same URL on the other > > protocol, or other ways of checking. So yes, the list of valid domains > > is part of the program's effective input. > > Wow! Having got that far, you have achieved data-validity. Is there a > need to perform a before-after check or diff? Yes, to ensure that nothing has changed that I *didn't* plan. The planned changes aren't the problem here, I can verify those elsewhere. > Perhaps start making the one-for-one replacements without further > anxiety. As long as there's no silly-mistake, eg failing to remove an > opening or closing angle-bracket; isn't that about all the checking needed? > (for this category of updates) Maybe, but probably not. > BTW in talk of "line-number", you will have realised the need to re-run > the identification of such after each of these steps - in case the 'new > stuff' relating to earlier steps (assuming above became also a temporal > sequence) is shorter/longer than the current HTML. Yep, that's not usually a problem. 
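The two-pass approach described above (enumerate http:// domains, then confirm which ones really serve the same content over HTTPS) boils down to an allowlist-driven rewrite. A minimal sketch of the rewrite step, assuming a confirmed-domains set (the HTTPS_OK values here are illustrative placeholders, not the real list):

```python
from urllib.parse import urlsplit, urlunsplit

# Domains already confirmed to serve the same content over HTTPS
# (illustrative values, not the real list).
HTTPS_OK = {"www.example.com", "example.org"}

def upgrade_url(url):
    """Rewrite http:// to https://, but only for confirmed domains."""
    parts = urlsplit(url)
    if parts.scheme == "http" and parts.hostname in HTTPS_OK:
        # Keep netloc, path, query and fragment; swap only the scheme.
        return urlunsplit(("https",) + tuple(parts[1:]))
    return url
```

The confirmation step itself (requesting each domain and checking for a matching redirect) stays a separate, network-bound pass; this function only applies its results.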
> >>> And there'll be other fixes to be done too. So it's a bit complicated, > >>> and no simple solution is really sufficient. At the very very least, I > >>> *need* to properly parse with BS4; the only question is whether I > >>> reconstruct from the parse tree, or go back to the raw file and try to > >>> edit it there. > >> > >> At least the diffs would give you something to work-from, but it's a bit > >> like git-diffs claiming a 'change' when the only difference is that my > >> IDE strips blanks from the ends of code-lines, or some-such silliness. > > > > Right; and the reconstructed version has a LOT of those unnecessary > > changes. I'm seeing a lot of changes to whitespace. The only problem > > is whether I can be confident that none of those changes could ever > > matter. > "White-space" has lesser-meaning in HTML - this is NOT Python! In HTML > if I write "HTML  file" (with two spaces), the browser will shorten the > display to a single space (hence some uses of &nbsp; - non-breaking > space). Similarly, if I attempt to use "\n" to start a new line of text... Yes, whitespace has less meaning... except when it doesn't. https://developer.mozilla.org/en-US/docs/Web/CSS/white-space Text can become preformatted by the styling, and there could be nothing whatsoever in the HTML page that shows this. I think most of the HTML files in this site have been created by a WYSIWYG editor, partly because of clues like a single bold space in a non-bold sequence of text, and the styles aren't consistent everywhere. Given that poetry comes up a lot on this site, I wouldn't put it past the editor to have set a whitespace rule on something. But I'm probably going to just ignore that and hope that any such errors are less significant than the current set of broken links. > Is there a danger of 'chasing your own tail', ie seeking a solution to a > problem which really doesn't matter (particularly if we add the phrase: > at the user-level)? Unfortunately not.
I now know of three categories of change that, in theory, shouldn't affect anything: whitespace, order of attributes ("<a href='x' class='y'>" becoming "<a class='y' href='x'>"), and self-closing tags. Whitespace probably won't matter, until it does. Order of attributes is absolutely fine unless one of them is miswritten and now we've lost a lot of information about how it ought to have been written. And self-closing tags are probably insignificant, but I don't know how browsers handle things like "<p>...<p/>" - and I wouldn't know whether the original intention was for the second one to be a self-closing empty paragraph, or a miswritten closing tag. It's easy to say that these changes have no effect on well-formed HTML. It's less easy to know what browsers will do with ill-formed HTML. > Agree with "properly parse". Question was an apparent dedication to BS4 > when there are other tools. Just checking you aren't wearing that type > of 'blinders'. > (didn't think so, but...) No, but there's also always the option of some tool that I've never heard of! The
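The self-closing-tag ambiguity can be made concrete with the stdlib html.parser (the parser underlying BS4's "html.parser" builder): it reports an XHTML-style "<p/>" as a combined start-and-end tag, whereas an HTML5 browser ignores the stray slash on a non-void element and treats it as a plain "<p>". A small sketch (the markup is a made-up example):

```python
from html.parser import HTMLParser

class Events(HTMLParser):
    """Record how the parser classifies each tag it sees."""
    def __init__(self):
        super().__init__()
        self.seen = []
    def handle_starttag(self, tag, attrs):
        self.seen.append(("start", tag))
    def handle_endtag(self, tag):
        self.seen.append(("end", tag))
    def handle_startendtag(self, tag, attrs):
        # Called for XHTML-style "<p/>"; an HTML5 browser would instead
        # treat this as an ordinary <p> start tag.
        self.seen.append(("startend", tag))

p = Events()
p.feed("<p>one<p/>two")
p.close()
# p.seen is [("start", "p"), ("startend", "p")]
```

So a serializer that rewrites "<p/>" as an empty "<p></p>" and a browser that reads it as another unclosed "<p>" genuinely disagree - which is exactly the information-loss worry with ill-formed input.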
Re: Mutating an HTML file with BeautifulSoup
On 21/08/2022 13.00, Chris Angelico wrote: > On Sun, 21 Aug 2022 at 09:48, dn wrote: >> On 20/08/2022 12.38, Chris Angelico wrote: >>> On Sat, 20 Aug 2022 at 10:19, dn wrote: On 20/08/2022 09.01, Chris Angelico wrote: > On Sat, 20 Aug 2022 at 05:12, Barry wrote: >>> On 19 Aug 2022, at 19:33, Chris Angelico wrote: > So I'm left with a few options: > > 1) Give up on validation, give up on verification, and just run this > thing on the production site with my fingers crossed > 2) Instead of doing an intelligent reconstruction, just str.replace() > one URL with another within the file > 3) Split the file into lines, find the Nth line (elem.sourceline) and > str.replace that line only > 4) Attempt to use elem.sourceline and elem.sourcepos to find the start > of the tag, manually find the end, and replace one tag with the > reconstructed form. > > I'm inclined to the first option, honestly. The others just seem like > hard work, and I became a programmer so I could be lazy... +1 - but I've noticed that sometimes I have to work quite hard to be this lazy! >>> >>> Yeah, that's very true... >>> Am assuming that http -> https is not the only 'change' (if it were, you'd just do that without BS). How many such changes are planned/need checking? Care to list them? >> >> This project has many of the same 'smells' as a database-harmonisation >> effort. Particularly one where 'the previous guy' used to use field-X >> for certain data, but his replacement decided that field-Y 'sounded >> better' (or some such user-logic). Arrrggg! >> >> If you like head-aches, and users coming to you with ifs-buts-and-maybes >> AFTER you've 'done stuff', this is your sort of project! > > Well, I don't like headaches, but I do appreciate what the G Archive > has given me over the years, so I'm taking this on as a means of > giving back to the community. This point will be picked-up in the conclusion. 
NB in the same way that you want to 'give back', so also do others - even if in minor ways or 'when-relevant'! >>> Assumption is correct. The changes are more of the form "find all the >>> problems, add to the list of fixes, try to minimize the ones that need >>> to be done manually". So far, what I have is: >> >> Having taken the trouble to identify this list of improvements and given >> the determination to verify each, consider working through one item at a >> time, rather than in a single pass. This will enable individual logging >> of changes, a manual check of each alteration, and the ability to >> choose/tailor the best tool for that specific task. >> >> In fact, depending upon frequency, making the changes manually (and with >> improved confidence in the result). > > Unfortunately the frequency is very high. Screechingly so? Like you're singing Three Little Maids? >> The presence of (or allusion to) the word "some" in this list-items is >> 'the killer'. Automation doesn't like 'some' (cf "all") unless the >> criteria can be clearly and unambiguously defined. Ouch! >> >> (I don't think you need to be told any of this, but hey: dreams are free!) > > Right; the criteria are quite well defined, but I omitted the details > for brevity. > >>> 1) A bunch of http -> https, but not all of them - only domains where >>> I've confirmed that it's valid >> >> The search-criteria is the list of valid domains, rather than the >> "http/https" which is likely the first focus. > > Yeah. I do a first pass to enumerate all domains that are ever linked > to with http:// URLs, and then I have a script that goes through and > checks to see if they redirect me to the same URL on the other > protocol, or other ways of checking. So yes, the list of valid domains > is part of the program's effective input. Wow! Having got that far, you have achieved data-validity. Is there a need to perform a before-after check or diff? 
Perhaps start making the one-for-one replacements without further anxiety. As long as there's no silly-mistake, eg failing to remove an opening or closing angle-bracket; isn't that about all the checking needed? (for this category of updates) >>> 2) Some absolute to relative conversions: >>> https://www.gsarchive.net/whowaswho/index.htm should be referred to as >>> /whowaswho/index.htm instead >> >> Similarly, if you have a list of these. > > It's more just the pattern "https://www.gsarchive.net/" and > "https://gsarchive.net/", and the corresponding "http://" > URLs, plus a few other malformed versions that are worth correcting > (if ever I find a link to "www.gsarchive.net/", it's almost > certainly missing its protocol). Isn't the inspection tool (described elsewhere) reporting an HTML/editor line number? That being the case, won't a bit of Swiss-Army knife Python-string work enable appropriate processing and re-writing - as well as providing the means to statistically-sample for QA? >>> 3) A few outdated URLs for which we
Re: Mutating an HTML file with BeautifulSoup
On Sun, 21 Aug 2022 at 09:48, dn wrote: > > On 20/08/2022 12.38, Chris Angelico wrote: > > On Sat, 20 Aug 2022 at 10:19, dn wrote: > >> On 20/08/2022 09.01, Chris Angelico wrote: > >>> On Sat, 20 Aug 2022 at 05:12, Barry wrote: > > On 19 Aug 2022, at 19:33, Chris Angelico wrote: > > > > What's the best way to precisely reconstruct an HTML file after > > parsing it with BeautifulSoup? > ... > > >>> well. Thanks for trying, anyhow. > >>> > >>> So I'm left with a few options: > >>> > >>> 1) Give up on validation, give up on verification, and just run this > >>> thing on the production site with my fingers crossed > >>> 2) Instead of doing an intelligent reconstruction, just str.replace() > >>> one URL with another within the file > >>> 3) Split the file into lines, find the Nth line (elem.sourceline) and > >>> str.replace that line only > >>> 4) Attempt to use elem.sourceline and elem.sourcepos to find the start > >>> of the tag, manually find the end, and replace one tag with the > >>> reconstructed form. > >>> > >>> I'm inclined to the first option, honestly. The others just seem like > >>> hard work, and I became a programmer so I could be lazy... > >> +1 - but I've noticed that sometimes I have to work quite hard to be > >> this lazy! > > > > Yeah, that's very true... > > > >> Am assuming that http -> https is not the only 'change' (if it were, > >> you'd just do that without BS). How many such changes are planned/need > >> checking? Care to list them? > > This project has many of the same 'smells' as a database-harmonisation > effort. Particularly one where 'the previous guy' used to use field-X > for certain data, but his replacement decided that field-Y 'sounded > better' (or some such user-logic). Arrrggg! > > If you like head-aches, and users coming to you with ifs-buts-and-maybes > AFTER you've 'done stuff', this is your sort of project! 
Well, I don't like headaches, but I do appreciate what the G Archive has given me over the years, so I'm taking this on as a means of giving back to the community. > > Assumption is correct. The changes are more of the form "find all the > > problems, add to the list of fixes, try to minimize the ones that need > > to be done manually". So far, what I have is: > > Having taken the trouble to identify this list of improvements and given > the determination to verify each, consider working through one item at a > time, rather than in a single pass. This will enable individual logging > of changes, a manual check of each alteration, and the ability to > choose/tailor the best tool for that specific task. > > In fact, depending upon frequency, making the changes manually (and with > improved confidence in the result). Unfortunately the frequency is very high. > The presence of (or allusion to) the word "some" in this list-items is > 'the killer'. Automation doesn't like 'some' (cf "all") unless the > criteria can be clearly and unambiguously defined. Ouch! > > (I don't think you need to be told any of this, but hey: dreams are free!) Right; the criteria are quite well defined, but I omitted the details for brevity. > > 1) A bunch of http -> https, but not all of them - only domains where > > I've confirmed that it's valid > > The search-criteria is the list of valid domains, rather than the > "http/https" which is likely the first focus. Yeah. I do a first pass to enumerate all domains that are ever linked to with http:// URLs, and then I have a script that goes through and checks to see if they redirect me to the same URL on the other protocol, or other ways of checking. So yes, the list of valid domains is part of the program's effective input. > > 2) Some absolute to relative conversions: > > https://www.gsarchive.net/whowaswho/index.htm should be referred to as > > /whowaswho/index.htm instead > > Similarly, if you have a list of these. 
It's more just the pattern "https://www.gsarchive.net/" and "https://gsarchive.net/", and the corresponding "http://" URLs, plus a few other malformed versions that are worth correcting (if ever I find a link to "www.gsarchive.net/", it's almost certainly missing its protocol). > > 3) A few outdated URLs for which we know the replacement, eg > > http://www.cris.com/~oakapple/gasdisc/ to > > http://www.gasdisc.oakapplepress.com/ (this one can't go on > > HTTPS, which is one reason I can't shortcut that) > > Again. Same; although those are manually entered as patterns. > > 4) Some internal broken links where the path is wrong - anything that > > resolves to /books/ but can't be found might be better > > rewritten as /html/perf_grps/websites/ if the file can be > > found there > > Again. The fixups are manually entered, but I also need to know about every broken internal link so that I can look through them and figure out what's wrong. > > 5) Any external link that yields a permanent redirect should, to save > > clientside requests, get replaced by the destination. We have some > > Creative
Re: Mutating an HTML file with BeautifulSoup
On Sun, 21 Aug 2022 at 09:31, Jon Ribbens via Python-list wrote: > > On 2022-08-20, Chris Angelico wrote: > > On Sun, 21 Aug 2022 at 03:27, Stefan Ram wrote: > >> 2qdxy4rzwzuui...@potatochowder.com writes: > >> >textual representations. That way, the following two elements are the > >> >same (and similar with a collection of sub-elements in a different order > >> >in another document): > >> > >> The /elements/ differ. They have the /same/ infoset. > > > > That's the bit that's hard to prove. > > > >> The OP could edit the files with regexps to create a new version. > > > > To you and Jon, who also suggested this: how would that be beneficial? > > With Beautiful Soup, I have the line number and position within the > > line where the tag starts; what does a regex give me that I don't have > > that way? > > You mean you could use BeautifulSoup to read the file and identify the > bits you want to change by line number and offset, and then you could > use that data to try and update the file, hoping like hell that your > definition of "line" and "offset" are identical to BeautifulSoup's > and that you don't mess up later changes when you do earlier ones (you > could do them in reverse order of line and offset I suppose) and > probably resorting to regexps anyway in order to find the part of the > tag you want to change ... > > ... or you could avoid all that faff and just do re.sub()? Stefan answered in part, but I'll add that it is far FAR easier to do the analysis with BS4 than regular expressions. I'm not sure what "hoping like hell" is supposed to mean here, since the line and offset have been 100% accurate in my experience; the only part I'm unsure about is where the _end_ of the tag is (and maybe there's a way I can use BS4 again to get that??). ChrisA -- https://mail.python.org/mailman/listinfo/python-list
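On the "where does the tag end" question: for the "html.parser" builder, BS4's sourceline/sourcepos come from the stdlib HTMLParser, and that same parser can hand back the exact source text of a start tag via get_starttag_text(), which gives the end position for free. A sketch (the document string is a made-up example):

```python
from html.parser import HTMLParser

class TagLocator(HTMLParser):
    """Record position and raw source text of each <a> start tag."""
    def __init__(self):
        super().__init__()
        self.found = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            line, col = self.getpos()       # 1-based line, 0-based offset
            raw = self.get_starttag_text()  # exact source text, "<"..">" inclusive
            self.found.append((line, col, raw))

doc = '<p>See <a href="http://example.com/">here</a>.</p>'
loc = TagLocator()
loc.feed(doc)
line, col, raw = loc.found[0]
# The tag occupies doc[col : col + len(raw)] on that line, so a
# replacement tag can be spliced in without touching anything else.
```

This sidesteps reconstructing the tag's extent from the parse tree: the raw text is taken verbatim from the source, attribute order, quoting and whitespace included.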
Re: Mutating an HTML file with BeautifulSoup
On 20/08/2022 12.38, Chris Angelico wrote: > On Sat, 20 Aug 2022 at 10:19, dn wrote: >> On 20/08/2022 09.01, Chris Angelico wrote: >>> On Sat, 20 Aug 2022 at 05:12, Barry wrote: > On 19 Aug 2022, at 19:33, Chris Angelico wrote: > > What's the best way to precisely reconstruct an HTML file after > parsing it with BeautifulSoup? ... >>> well. Thanks for trying, anyhow. >>> >>> So I'm left with a few options: >>> >>> 1) Give up on validation, give up on verification, and just run this >>> thing on the production site with my fingers crossed >>> 2) Instead of doing an intelligent reconstruction, just str.replace() >>> one URL with another within the file >>> 3) Split the file into lines, find the Nth line (elem.sourceline) and >>> str.replace that line only >>> 4) Attempt to use elem.sourceline and elem.sourcepos to find the start >>> of the tag, manually find the end, and replace one tag with the >>> reconstructed form. >>> >>> I'm inclined to the first option, honestly. The others just seem like >>> hard work, and I became a programmer so I could be lazy... >> +1 - but I've noticed that sometimes I have to work quite hard to be >> this lazy! > > Yeah, that's very true... > >> Am assuming that http -> https is not the only 'change' (if it were, >> you'd just do that without BS). How many such changes are planned/need >> checking? Care to list them? This project has many of the same 'smells' as a database-harmonisation effort. Particularly one where 'the previous guy' used to use field-X for certain data, but his replacement decided that field-Y 'sounded better' (or some such user-logic). Arrrggg! If you like head-aches, and users coming to you with ifs-buts-and-maybes AFTER you've 'done stuff', this is your sort of project! > Assumption is correct. The changes are more of the form "find all the > problems, add to the list of fixes, try to minimize the ones that need > to be done manually". 
So far, what I have is: Having taken the trouble to identify this list of improvements and given the determination to verify each, consider working through one item at a time, rather than in a single pass. This will enable individual logging of changes, a manual check of each alteration, and the ability to choose/tailor the best tool for that specific task. In fact, depending upon frequency, making the changes manually (and with improved confidence in the result). The presence of (or allusion to) the word "some" in this list-items is 'the killer'. Automation doesn't like 'some' (cf "all") unless the criteria can be clearly and unambiguously defined. Ouch! (I don't think you need to be told any of this, but hey: dreams are free!) > 1) A bunch of http -> https, but not all of them - only domains where > I've confirmed that it's valid The search-criteria is the list of valid domains, rather than the "http/https" which is likely the first focus. > 2) Some absolute to relative conversions: > https://www.gsarchive.net/whowaswho/index.htm should be referred to as > /whowaswho/index.htm instead Similarly, if you have a list of these. > 3) A few outdated URLs for which we know the replacement, eg > http://www.cris.com/~oakapple/gasdisc/ to > http://www.gasdisc.oakapplepress.com/ (this one can't go on > HTTPS, which is one reason I can't shortcut that) Again. > 4) Some internal broken links where the path is wrong - anything that > resolves to /books/ but can't be found might be better > rewritten as /html/perf_grps/websites/ if the file can be > found there Again. > 5) Any external link that yields a permanent redirect should, to save > clientside requests, get replaced by the destination. We have some > Creative Commons badges that have moved to new URLs. Do you have these as a list, or are you intending the automated-method to auto-magically follow the link to determine any need for action? > And there'll be other fixes to be done too. 
So it's a bit complicated, > and no simple solution is really sufficient. At the very very least, I > *need* to properly parse with BS4; the only question is whether I > reconstruct from the parse tree, or go back to the raw file and try to > edit it there. At least the diffs would give you something to work-from, but it's a bit like git-diffs claiming a 'change' when the only difference is that my IDE strips blanks from the ends of code-lines, or some-such silliness. Which brings me to ask: why "*need* to properly parse with BS4"? What about selective use of tools, previously-mentioned in this thread? Is Selenium worthy of consideration? I'm assuming you've already been using a link-checker utility to locate the links which need to be changed. They can be used in QA-mode after-the-fact too. > For the record, I have very long-term plans to migrate parts of the > site to Markdown, which would make a lot of things easier. But for > now, I need to fix the existing problems in the existing HTML files, > without doing gigantic wholesale layout
Re: Mutating an HTML file with BeautifulSoup
On 2022-08-20, Stefan Ram wrote:
> Jon Ribbens writes:
>> ... or you could avoid all that faff and just do re.sub()?
>
> import bs4
> import re
>
> source = '<a href = "http"></a>'
>
> # Use Python to change the source, keeping the order of attributes.
>
> result = re.sub( r'href\s*=\s*"http"', r'href="https"', source )
> result = re.sub( r"href\s*=\s*'http'", r"href='https'", result )

You could go a bit harder with the regexp of course, e.g.:

result = re.sub(
    r"""(<\s*a\s+[^>]*href\s*=\s*)(['"])\s*OLD\s*\2""",
    r"\1\2NEW\2", source, flags=re.IGNORECASE )

> # Now use BeautifulSoup only for the verification of the result.
>
> reference = bs4.BeautifulSoup( source, features="html.parser" )
> for a in reference.find_all( "a" ):
>     if a[ 'href' ]== 'http': a[ 'href' ]='https'
>
> print( bs4.BeautifulSoup( result, features="html.parser" )== reference )

Hmm, yes that seems like a pretty good idea.
--
https://mail.python.org/mailman/listinfo/python-list
Re: Mutating an HTML file with BeautifulSoup
On 2022-08-20, Chris Angelico wrote: > On Sun, 21 Aug 2022 at 03:27, Stefan Ram wrote: >> 2qdxy4rzwzuui...@potatochowder.com writes: >> >textual representations. That way, the following two elements are the >> >same (and similar with a collection of sub-elements in a different order >> >in another document): >> >> The /elements/ differ. They have the /same/ infoset. > > That's the bit that's hard to prove. > >> The OP could edit the files with regexps to create a new version. > > To you and Jon, who also suggested this: how would that be beneficial? > With Beautiful Soup, I have the line number and position within the > line where the tag starts; what does a regex give me that I don't have > that way? You mean you could use BeautifulSoup to read the file and identify the bits you want to change by line number and offset, and then you could use that data to try and update the file, hoping like hell that your definition of "line" and "offset" are identical to BeautifulSoup's and that you don't mess up later changes when you do earlier ones (you could do them in reverse order of line and offset I suppose) and probably resorting to regexps anyway in order to find the part of the tag you want to change ... ... or you could avoid all that faff and just do re.sub()? -- https://mail.python.org/mailman/listinfo/python-list
Re: Mutating an HTML file with BeautifulSoup
On Sun, 21 Aug 2022 at 03:27, Stefan Ram wrote: > > 2qdxy4rzwzuui...@potatochowder.com writes: > >textual representations. That way, the following two elements are the > >same (and similar with a collection of sub-elements in a different order > >in another document): > > The /elements/ differ. They have the /same/ infoset. That's the bit that's hard to prove. > The OP could edit the files with regexps to create a new version. To you and Jon, who also suggested this: how would that be beneficial? With Beautiful Soup, I have the line number and position within the line where the tag starts; what does a regex give me that I don't have that way? > Soup := BeautifulSoup. > > Then have Soup read both the new version and the old version. > > Then have Soup also edit the old version read in, the same way as > the regexps did and verify that now the old version edited by > Soup and the new version created using regexps agree. > > Or just use Soup as a tool to show the diffs for visual inspection > by having Soup read both the original version and the version edited > with regexps. Now both are normalized by Soup and Soup can show the > diffs (such a diff feature might not be a part of Soup, but it should > not be too much effort to write one using Soup). > But as mentioned, the entire problem *is* the normalization, as I have no proof that it has had no impact on the rendering of the page. Comparing two normalized versions is no better than my original option 1, whereby I simply ignore the normalization and write out the reconstructed content. It's easy if you know for certain that the page is well-formed. Much harder if you do not - or, as in some cases, if you know the page is badly-formed. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
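One way to get verification without trusting the normalized output itself is to compare the parse-event streams of the original and edited files: that checks the infoset ("what a parser sees") rather than the bytes. A rough sketch with the stdlib parser (the sample markup and the str.replace edit are illustrative):

```python
from html.parser import HTMLParser

class EventStream(HTMLParser):
    """Flatten a document into a list of (kind, ...) parse events."""
    def __init__(self):
        super().__init__()
        self.events = []
    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, sorted(attrs)))
    def handle_endtag(self, tag):
        self.events.append(("end", tag))
    def handle_data(self, data):
        self.events.append(("data", data))

def events(doc):
    parser = EventStream()
    parser.feed(doc)
    parser.close()
    return parser.events

old = '<p><a href="http://example.com/">link</a></p>'
new = old.replace("http://", "https://")

# Every event should be identical except the one href we meant to change.
changed = [(a, b) for a, b in zip(events(old), events(new)) if a != b]
```

This still doesn't prove the rendering is unchanged (nothing short of a browser does, especially on ill-formed input), but it does catch any edit that altered tags, attributes or text beyond the intended one.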
Re: Mutating an HTML file with BeautifulSoup
On 2022-08-19, Chris Angelico wrote:
> What's the best way to precisely reconstruct an HTML file after
> parsing it with BeautifulSoup?
>
> Using the Alice example from the BS4 docs:
>
> html_doc = """<html><head><title>The Dormouse's story</title></head>
> <body>
> <p class="title"><b>The Dormouse's story</b></p>
>
> <p class="story">Once upon a time there were three little sisters; and
> their names were
> <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
> <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
> <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
> and they lived at the bottom of a well.</p>
>
> <p class="story">...</p>
> """
>
> soup = BeautifulSoup(html_doc, "html.parser")
> print(soup)
>
> <html><head><title>The Dormouse's story</title></head>
> <body>
> <p class="title"><b>The Dormouse's story</b></p>
> <p class="story">Once upon a time there were three little sisters; and
> their names were
> <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
> <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
> <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
> and they lived at the bottom of a well.</p>
> <p class="story">...</p>
> </body></html>
>
> Note two distinct changes: firstly, whitespace has been removed, and
> secondly, attributes are reordered (I think alphabetically). There are
> other canonicalizations being done, too.
>
> I'm trying to make some automated changes to a huge number of HTML
> files, with minimal diffs so they're easy to validate. That means that
> spurious changes like these are very much unwanted. Is there a way to
> get BS4 to reconstruct the original precisely?
>
> The mutation itself would be things like finding an anchor tag and
> changing its href attribute. Fairly simple changes, but might alter
> the length of the file (eg changing "http://example.com/" into
> "https://example.com/"). I'd like to do them intelligently rather than
> falling back on element.sourceline and element.sourcepos, but worst
> case, that's what I'll have to do (which would be fiddly).

I'm tempting the Wrath of Zalgo by saying it, but ... regexp?
--
https://mail.python.org/mailman/listinfo/python-list
Re: Mutating an HTML file with BeautifulSoup
On Sat, 20 Aug 2022 at 10:19, dn wrote: > > On 20/08/2022 09.01, Chris Angelico wrote: > > On Sat, 20 Aug 2022 at 05:12, Barry wrote: > >> > >> > >> > >>> On 19 Aug 2022, at 19:33, Chris Angelico wrote: > >>> > >>> What's the best way to precisely reconstruct an HTML file after > >>> parsing it with BeautifulSoup? > >> > >> I recall that in bs4 it parses into an object tree and loses the detail of > >> the input. > >> I recently ported from very old bs to bs4 and hit the same issue. > >> So no it will not output the same as went in. > >> > >> If you can trust the input to be parsed as xml, meaning all the rules of > >> closing > >> tags have been followed. Then I think you can parse and unparse thru xml to > >> do what you want. > >> > > > > > > Yeah, no I can't, this is HTML 4 with a ton of inconsistencies. Oh > > well. Thanks for trying, anyhow. > > > > So I'm left with a few options: > > > > 1) Give up on validation, give up on verification, and just run this > > thing on the production site with my fingers crossed > > 2) Instead of doing an intelligent reconstruction, just str.replace() > > one URL with another within the file > > 3) Split the file into lines, find the Nth line (elem.sourceline) and > > str.replace that line only > > 4) Attempt to use elem.sourceline and elem.sourcepos to find the start > > of the tag, manually find the end, and replace one tag with the > > reconstructed form. > > > > I'm inclined to the first option, honestly. The others just seem like > > hard work, and I became a programmer so I could be lazy... > +1 - but I've noticed that sometimes I have to work quite hard to be > this lazy! Yeah, that's very true... > Am assuming that http -> https is not the only 'change' (if it were, > you'd just do that without BS). How many such changes are planned/need > checking? Care to list them? > Assumption is correct. 
The changes are more of the form "find all the problems, add to the list
of fixes, try to minimize the ones that need to be done manually". So
far, what I have is:

1) A bunch of http -> https, but not all of them - only domains where
I've confirmed that it's valid
2) Some absolute to relative conversions:
https://www.gsarchive.net/whowaswho/index.htm should be referred to as
/whowaswho/index.htm instead
3) A few outdated URLs for which we know the replacement, eg
http://www.cris.com/~oakapple/gasdisc/ to
http://www.gasdisc.oakapplepress.com/ (this one can't go on HTTPS, which
is one reason I can't shortcut that)
4) Some internal broken links where the path is wrong - anything that
resolves to /books/ but can't be found might be better rewritten as
/html/perf_grps/websites/ if the file can be found there
5) Any external link that yields a permanent redirect should, to save
clientside requests, get replaced by the destination. We have some
Creative Commons badges that have moved to new URLs.

And there'll be other fixes to be done too. So it's a bit complicated,
and no simple solution is really sufficient. At the very very least, I
*need* to properly parse with BS4; the only question is whether I
reconstruct from the parse tree, or go back to the raw file and try to
edit it there.

For the record, I have very long-term plans to migrate parts of the site
to Markdown, which would make a lot of things easier. But for now, I
need to fix the existing problems in the existing HTML files, without
doing gigantic wholesale layout changes.

ChrisA
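Rules 1-3 above amount to a pure function from old URL to new URL, which can be sketched separately from the question of how the files get edited. A sketch using only the standard library's urllib.parse (the HTTPS_CONFIRMED allow-list entry is a hypothetical stand-in for the manually confirmed domains):

```python
from urllib.parse import urlparse

# Hypothetical rule tables -- the real lists come from manual confirmation.
HTTPS_CONFIRMED = {"example.com"}       # rule 1: domains safe to upgrade
OWN_HOST = "www.gsarchive.net"          # rule 2: make links to our own site relative
KNOWN_MOVES = {                         # rule 3: known outdated URL -> replacement
    "http://www.cris.com/~oakapple/gasdisc/":
        "http://www.gasdisc.oakapplepress.com/",
}

def fix_href(href):
    """Return the repaired URL, or None if no change is needed."""
    if href in KNOWN_MOVES:
        return KNOWN_MOVES[href]
    parts = urlparse(href)
    if parts.hostname == OWN_HOST:
        # Absolute link to our own site -> site-relative path.
        return parts.path or "/"
    if parts.scheme == "http" and parts.hostname in HTTPS_CONFIRMED:
        return "https://" + href[len("http://"):]
    return None
```

Rules 4 and 5 need filesystem checks and live HTTP requests respectively, so they don't fit a pure function, but they can feed the same "old URL -> new URL" table.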
Re: Mutating an HTML file with BeautifulSoup
On 20/08/2022 09.01, Chris Angelico wrote:
> On Sat, 20 Aug 2022 at 05:12, Barry wrote:
>>> On 19 Aug 2022, at 19:33, Chris Angelico wrote:
>>>
>>> What's the best way to precisely reconstruct an HTML file after
>>> parsing it with BeautifulSoup?
>>
>> I recall that in bs4 it parses into an object tree and loses the detail
>> of the input.
>> I recently ported from very old bs to bs4 and hit the same issue.
>> So no it will not output the same as went in.
>>
>> If you can trust the input to be parsed as xml, meaning all the rules
>> of closing tags have been followed. Then I think you can parse and
>> unparse thru xml to do what you want.
>
> Yeah, no I can't, this is HTML 4 with a ton of inconsistencies. Oh
> well. Thanks for trying, anyhow.
>
> So I'm left with a few options:
>
> 1) Give up on validation, give up on verification, and just run this
> thing on the production site with my fingers crossed
> 2) Instead of doing an intelligent reconstruction, just str.replace()
> one URL with another within the file
> 3) Split the file into lines, find the Nth line (elem.sourceline) and
> str.replace that line only
> 4) Attempt to use elem.sourceline and elem.sourcepos to find the start
> of the tag, manually find the end, and replace one tag with the
> reconstructed form.
>
> I'm inclined to the first option, honestly. The others just seem like
> hard work, and I became a programmer so I could be lazy...

+1 - but I've noticed that sometimes I have to work quite hard to be
this lazy!

Am assuming that http -> https is not the only 'change' (if it were,
you'd just do that without BS). How many such changes are planned/need
checking? Care to list them?

--
Regards,
=dn
Re: Mutating an HTML file with BeautifulSoup
On Sat, 20 Aug 2022 at 10:04, David wrote:
> On Sat, 20 Aug 2022 at 04:31, Chris Angelico wrote:
>> What's the best way to precisely reconstruct an HTML file after
>> parsing it with BeautifulSoup?
>>
>> Note two distinct changes: firstly, whitespace has been removed, and
>> secondly, attributes are reordered (I think alphabetically). There are
>> other canonicalizations being done, too.
>>
>> I'm trying to make some automated changes to a huge number of HTML
>> files, with minimal diffs so they're easy to validate. That means that
>> spurious changes like these are very much unwanted. Is there a way to
>> get BS4 to reconstruct the original precisely?
>
> On Sat, 20 Aug 2022 at 07:02, Chris Angelico wrote:
>> On Sat, 20 Aug 2022 at 05:12, Barry wrote:
>>> I recall that in bs4 it parses into an object tree and loses the detail
>>> of the input. I recently ported from very old bs to bs4 and hit the
>>> same issue. So no it will not output the same as went in.
>>
>> So I'm left with a few options:
>>
>> 1) Give up on validation, give up on verification, and just run this
>> thing on the production site with my fingers crossed
>> 2) Instead of doing an intelligent reconstruction, just str.replace()
>> one URL with another within the file
>> 3) Split the file into lines, find the Nth line (elem.sourceline) and
>> str.replace that line only
>> 4) Attempt to use elem.sourceline and elem.sourcepos to find the start
>> of the tag, manually find the end, and replace one tag with the
>> reconstructed form.
>>
>> I'm inclined to the first option, honestly. The others just seem like
>> hard work, and I became a programmer so I could be lazy...
>
> Hi, I don't know if you will like this option, but I don't see it on
> the list yet so ...

Hey, all options are welcomed :)

> I'm assuming that the phrase "with minimal diffs so they're easy to
> validate" means being eyeballed by a human.
>
> Have you considered two passes through BS?
>
> Do the first pass with no modification, so that the intermediate result
> gets the BS default "spurious" changes.
>
> Then do the second pass with the desired changes, so that the human will
> see only the desired changes in the diff.

I'm 100% confident of the actual changes, so that wouldn't really solve
anything. The problem is that, without eyeballing the actual changes, I
can't easily see if there's been something else changed or broken. This
is a scripted change that will affect probably hundreds of HTML files
across a large web site, so making sure I don't break anything means
either (a) minimize the diff so it's clearly correct, or (b) eyeball the
rendered versions of every page - manually - to see if there were any
unintended changes.

(There WILL be intended visual changes, so I can't render the page to
bitmap and ensure that it hasn't changed. This is not React snapshot
testing, which IMO is one of the most useless testing features ever
devised. No, actually, that can't be true, someone MUST have made a
worse one.)

Appreciate the suggestion, though!

ChrisA
Re: Mutating an HTML file with BeautifulSoup
On Sat, 20 Aug 2022 at 04:31, Chris Angelico wrote:
> What's the best way to precisely reconstruct an HTML file after
> parsing it with BeautifulSoup?
>
> Note two distinct changes: firstly, whitespace has been removed, and
> secondly, attributes are reordered (I think alphabetically). There are
> other canonicalizations being done, too.
>
> I'm trying to make some automated changes to a huge number of HTML
> files, with minimal diffs so they're easy to validate. That means that
> spurious changes like these are very much unwanted. Is there a way to
> get BS4 to reconstruct the original precisely?

On Sat, 20 Aug 2022 at 07:02, Chris Angelico wrote:
> On Sat, 20 Aug 2022 at 05:12, Barry wrote:
>> I recall that in bs4 it parses into an object tree and loses the detail
>> of the input. I recently ported from very old bs to bs4 and hit the
>> same issue. So no it will not output the same as went in.
>
> So I'm left with a few options:
>
> 1) Give up on validation, give up on verification, and just run this
> thing on the production site with my fingers crossed
> 2) Instead of doing an intelligent reconstruction, just str.replace()
> one URL with another within the file
> 3) Split the file into lines, find the Nth line (elem.sourceline) and
> str.replace that line only
> 4) Attempt to use elem.sourceline and elem.sourcepos to find the start
> of the tag, manually find the end, and replace one tag with the
> reconstructed form.
>
> I'm inclined to the first option, honestly. The others just seem like
> hard work, and I became a programmer so I could be lazy...

Hi, I don't know if you will like this option, but I don't see it on the
list yet so ...

I'm assuming that the phrase "with minimal diffs so they're easy to
validate" means being eyeballed by a human.

Have you considered two passes through BS?

Do the first pass with no modification, so that the intermediate result
gets the BS default "spurious" changes.

Then do the second pass with the desired changes, so that the human will
see only the desired changes in the diff.
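The point of the two passes is that the human then diffs two already-canonicalized documents, so only the intended edit shows up. Sketched with the standard library's difflib (the two pass outputs here are hypothetical stand-ins for str(BeautifulSoup(...)) results):

```python
import difflib

# Pass 1: the original document after a no-op BeautifulSoup round trip.
pass1 = '<a href="http://example.com/">link</a>\n'
# Pass 2: the same round trip, but with the intended edit applied.
pass2 = '<a href="https://example.com/">link</a>\n'

# Diffing pass1 against pass2 hides BS4's canonicalizations, since both
# sides have them; only the intended change remains.
diff = list(difflib.unified_diff(
    pass1.splitlines(keepends=True),
    pass2.splitlines(keepends=True),
    fromfile="pass1.html", tofile="pass2.html",
))
print("".join(diff))
```

The trade-off, as the follow-up notes, is that this validates the edit against the canonicalized document, not against what's actually deployed.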
Re: Mutating an HTML file with BeautifulSoup
On Sat, 20 Aug 2022 at 05:12, Barry wrote:
>> On 19 Aug 2022, at 19:33, Chris Angelico wrote:
>>
>> What's the best way to precisely reconstruct an HTML file after
>> parsing it with BeautifulSoup?
>
> I recall that in bs4 it parses into an object tree and loses the detail of
> the input.
> I recently ported from very old bs to bs4 and hit the same issue.
> So no it will not output the same as went in.
>
> If you can trust the input to be parsed as xml, meaning all the rules of
> closing tags have been followed. Then I think you can parse and unparse
> thru xml to do what you want.

Yeah, no I can't, this is HTML 4 with a ton of inconsistencies. Oh well.
Thanks for trying, anyhow.

So I'm left with a few options:

1) Give up on validation, give up on verification, and just run this
thing on the production site with my fingers crossed
2) Instead of doing an intelligent reconstruction, just str.replace()
one URL with another within the file
3) Split the file into lines, find the Nth line (elem.sourceline) and
str.replace that line only
4) Attempt to use elem.sourceline and elem.sourcepos to find the start
of the tag, manually find the end, and replace one tag with the
reconstructed form.

I'm inclined to the first option, honestly. The others just seem like
hard work, and I became a programmer so I could be lazy...

ChrisA
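Options 3 and 4 don't strictly need BS4's sourceline/sourcepos: the standard library's html.parser exposes the same positional information via HTMLParser.getpos(). A sketch of option 3, assuming one edit per line is enough (the sample document and the blanket http -> https rewrite are illustrative only):

```python
from html.parser import HTMLParser

class HrefFinder(HTMLParser):
    """Record the 1-based line number of every <a href=...> start tag."""
    def __init__(self):
        super().__init__()
        self.hits = []  # list of (lineno, old_href)

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    # getpos() -> (lineno, offset), like sourceline/sourcepos.
                    self.hits.append((self.getpos()[0], value))

html = 'line one\n<a href="http://example.com/">x</a>\nline three'
finder = HrefFinder()
finder.feed(html)

# Option 3: str.replace only on the identified line, so every other
# line of the file survives byte-for-byte.
lines = html.split("\n")
for lineno, old in finder.hits:
    new = old.replace("http://", "https://", 1)
    lines[lineno - 1] = lines[lineno - 1].replace(old, new)
result = "\n".join(lines)
```

The fiddly part Chris mentions is real: two anchors with the same href on one line, or a tag spanning several lines, would need the option-4 treatment (use the column offset too, and scan forward for the end of the tag).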
Re: Mutating an HTML file with BeautifulSoup
On 2022-08-19 at 20:12:35 +0100, Barry wrote:
>> On 19 Aug 2022, at 19:33, Chris Angelico wrote:
>>
>> What's the best way to precisely reconstruct an HTML file after
>> parsing it with BeautifulSoup?
>
> I recall that in bs4 it parses into an object tree and loses the
> detail of the input. I recently ported from very old bs to bs4 and
> hit the same issue. So no it will not output the same as went in.
>
> If you can trust the input to be parsed as xml, meaning all the rules
> of closing tags have been followed. Then I think you can parse and
> unparse thru xml to do what you want.

XML is in the same boat. Except for "canonical form" (which underlies
cryptographically signed XML documents) the standards explicitly don't
require tools to round-trip the "source code." The preferred method of
comparing XML documents is at the structural level rather than with
textual representations. That way, the following two elements are the
same (and similar with a collection of sub-elements in a different order
in another document):

  and

Dan
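The example elements did not survive the list archive, but a structural comparison of the kind Dan describes is easy to sketch with xml.etree.ElementTree, whose attrib dicts make attribute order irrelevant (child order is treated as significant here; relax that as needed):

```python
import xml.etree.ElementTree as ET

def same_structure(a, b):
    """Structural equality: tag, attributes (order-insensitive),
    trimmed text, and children compared recursively."""
    return (a.tag == b.tag
            and a.attrib == b.attrib  # dict comparison ignores attribute order
            and (a.text or "").strip() == (b.text or "").strip()
            and len(a) == len(b)
            and all(same_structure(c, d) for c, d in zip(a, b)))

# Textually different (attribute order), structurally the same:
x = ET.fromstring('<a href="u" class="c">t</a>')
y = ET.fromstring('<a class="c" href="u">t</a>')
print(same_structure(x, y))  # True
```

This is also why Dan's point cuts against the original goal: a structural comparison proves the documents are equivalent, but says nothing about keeping the textual diff small.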
Re: Mutating an HTML file with BeautifulSoup
> On 19 Aug 2022, at 19:33, Chris Angelico wrote:
>
> What's the best way to precisely reconstruct an HTML file after
> parsing it with BeautifulSoup?

I recall that in bs4 it parses into an object tree and loses the detail of
the input.
I recently ported from very old bs to bs4 and hit the same issue.
So no it will not output the same as went in.

If you can trust the input to be parsed as xml, meaning all the rules of
closing tags have been followed. Then I think you can parse and unparse
thru xml to do what you want.

Barry

> Using the Alice example from the BS4 docs:
>
> html_doc = """<html><head><title>The Dormouse's story</title></head>
> <body>
> <p class="title"><b>The Dormouse's story</b></p>
>
> <p class="story">Once upon a time there were three little sisters; and
> their names were
> <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
> <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
> <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
> and they lived at the bottom of a well.</p>
>
> <p class="story">...</p>
> """
>
> print(soup)
>
> <html><head><title>The Dormouse's story</title></head>
> <body>
> <p class="title"><b>The Dormouse's story</b></p>
> <p class="story">Once upon a time there were three little sisters; and
> their names were
> <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
> <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
> <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
> and they lived at the bottom of a well.</p>
> <p class="story">...</p>
> </body></html>
>
> Note two distinct changes: firstly, whitespace has been removed, and
> secondly, attributes are reordered (I think alphabetically). There are
> other canonicalizations being done, too.
>
> I'm trying to make some automated changes to a huge number of HTML
> files, with minimal diffs so they're easy to validate. That means that
> spurious changes like these are very much unwanted. Is there a way to
> get BS4 to reconstruct the original precisely?
>
> The mutation itself would be things like finding an anchor tag and
> changing its href attribute. Fairly simple changes, but might alter
> the length of the file (eg changing "http://example.com/" into
> "https://example.com/"). I'd like to do them intelligently rather than
> falling back on element.sourceline and element.sourcepos, but worst
> case, that's what I'll have to do (which would be fiddly).
>
> ChrisA