Re: Mutating an HTML file with BeautifulSoup

2022-08-23 Thread Peter J. Holzer
On 2022-08-22 19:27:28 -, Jon Ribbens via Python-list wrote:
> On 2022-08-22, Peter J. Holzer  wrote:
> > On 2022-08-22 00:45:56 -, Jon Ribbens via Python-list wrote:
> >> With the offset though, BeautifulSoup made an arbitrary decision to
> >> use ISO-8859-1 encoding and so when you chopped the bytestring at
> >> that offset it only worked because BeautifulSoup had happened to
> >> choose a 1-byte-per-character encoding. Ironically, *without* the
> >> "\xed\xa0\x80\xed\xbc\x9f" it wouldn't have worked.
> >
> > Actually it would. The unit is bytes if you feed it with bytes, and
> > characters if you feed it with str.
> 
> No it isn't. If you give BeautifulSoup's 'html.parser' bytes as input,
> it first chooses an encoding and decodes the bytes before sending that
> output to html.parser, which is what provides the offset. So the offsets
> it gives are in characters, and you've no simple way of converting that
> back to byte offsets.

Ah, I see. It "worked" for me because "\xed\xa0\x80\xed\xbc\x9f" isn't
valid UTF-8. So BeautifulSoup decided to ignore the
<meta charset="utf-8"> I had inserted before and used ISO-8859-1,
providing me with correct byte offsets. If I replace that gibberish with
a correct UTF-8 sequence (e.g. "\x4B\xC3\xA4\x73\x65") the UTF-8 is
decoded and I get a character offset.
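
To make the unit question concrete, a minimal sketch (the markup is
illustrative; it assumes bs4's last-ditch encoding detection picks
UTF-8 here, which it tries before windows-1252):

    # "ü" is one character but two bytes in UTF-8, so the offsets diverge.
    from bs4 import BeautifulSoup

    text = 'xü<body>'
    print(BeautifulSoup(text, 'html.parser').body.sourcepos)   # 2

    data = text.encode('utf-8')
    # bs4 decodes the bytes before parsing, so this is still a
    # *character* offset into the decoded text ...
    print(BeautifulSoup(data, 'html.parser').body.sourcepos)   # 2
    # ... while the actual byte offset of the tag differs:
    print(data.index(b'<body>'))                               # 3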


> >> It looks like BeautifulSoup is doing something like that, yes.
> >> Personally I would be nervous about some of my files being parsed
> >> as UTF-8 and some of them ISO-8859-1 (due to decoding errors rather
> >> than some of the files actually *being* ISO-8859-1 ;-) )
> >
> > Since none of the syntactically meaningful characters have a code >=
> > 0x80, you can parse HTML at the byte level if you know that it's encoded
> > in a strict superset of ASCII (which all of the ISO-8859 family and
> > UTF-8 are). Only if that's not true (e.g. if your files might be UTF-16,
> > or Shift-JIS or EUC, if I remember correctly) do you have to know
> > the character set.
> >
> > (By parsing I mean only "create a syntax tree". Obviously you have to
> > know the encoding to know whether to display «c3 bc» as «ü» or «Ã¼».)
> 
> But the job here isn't to create a syntax tree. It's to change some of
> the content, which for all we know is not ASCII.

We know it's URLs, and the canonical form of an URL is ASCII. The URLs
in the files may not be, but if they aren't you'll have to deal with
variants anyway. And the start and end of the attribute can be
determined in any strict superset of ASCII including UTF-8.
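
As a sketch of that byte-level approach (the pattern is illustrative,
not a full attribute parser):

    import re

    # All syntactically meaningful characters (<, >, =, quotes) are ASCII,
    # so this works on raw bytes in any strict ASCII superset; non-ASCII
    # text content just passes through untouched.
    html = b'<a href="http://example.com/">caf\xc3\xa9</a>'
    pattern = re.compile(rb'(<a\b[^>]*href\s*=\s*")http://example\.com/([^"]*")',
                         re.IGNORECASE)
    print(pattern.sub(rb'\1https://example.com/\2', html))
    # b'<a href="https://example.com/">caf\xc3\xa9</a>'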

hp

-- 
   _  | Peter J. Holzer| Story must make more sense than reality.
|_|_) ||
| |   | h...@hjp.at |-- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |   challenge!"


-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Mutating an HTML file with BeautifulSoup

2022-08-22 Thread Jon Ribbens via Python-list
On 2022-08-22, Peter J. Holzer  wrote:
> On 2022-08-22 00:45:56 -, Jon Ribbens via Python-list wrote:
>> With the offset though, BeautifulSoup made an arbitrary decision to
>> use ISO-8859-1 encoding and so when you chopped the bytestring at
>> that offset it only worked because BeautifulSoup had happened to
>> choose a 1-byte-per-character encoding. Ironically, *without* the
>> "\xed\xa0\x80\xed\xbc\x9f" it wouldn't have worked.
>
> Actually it would. The unit is bytes if you feed it with bytes, and
> characters if you feed it with str.

No it isn't. If you give BeautifulSoup's 'html.parser' bytes as input,
it first chooses an encoding and decodes the bytes before sending that
output to html.parser, which is what provides the offset. So the offsets
it gives are in characters, and you've no simple way of converting that
back to byte offsets.

> (OTOH it seems that the html parser doesn't heed any <meta charset>
> tags, which seems less than ideal for more pedestrian purposes.)

html.parser doesn't accept bytes as input, so it couldn't do anything
with the encoding even if it knew it. BeautifulSoup's 'html.parser'
however does look for and use <meta charset> declarations (using a
regexp, natch).

>> It looks like BeautifulSoup is doing something like that, yes.
>> Personally I would be nervous about some of my files being parsed
>> as UTF-8 and some of them ISO-8859-1 (due to decoding errors rather
>> than some of the files actually *being* ISO-8859-1 ;-) )
>
> Since none of the syntactically meaningful characters have a code >=
> 0x80, you can parse HTML at the byte level if you know that it's encoded
> in a strict superset of ASCII (which all of the ISO-8859 family and
> UTF-8 are). Only if that's not true (e.g. if your files might be UTF-16,
> or Shift-JIS or EUC, if I remember correctly) do you have to know
> the character set.
>
> (By parsing I mean only "create a syntax tree". Obviously you have to
> know the encoding to know whether to display «c3 bc» as «ü» or «Ã¼».)

But the job here isn't to create a syntax tree. It's to change some of
the content, which for all we know is not ASCII.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Mutating an HTML file with BeautifulSoup

2022-08-22 Thread Peter J. Holzer
On 2022-08-22 00:45:56 -, Jon Ribbens via Python-list wrote:
> With the offset though, BeautifulSoup made an arbitrary decision to
> use ISO-8859-1 encoding and so when you chopped the bytestring at
> that offset it only worked because BeautifulSoup had happened to
> choose a 1-byte-per-character encoding. Ironically, *without* the
> "\xed\xa0\x80\xed\xbc\x9f" it wouldn't have worked.

Actually it would. The unit is bytes if you feed it with bytes, and
characters if you feed it with str. So in any case you can use the
offset on the data you fed to the parser. Maybe not what you expected,
but seems quite useful for what Chris has in mind.

(OTOH it seems that the html parser doesn't heed any <meta charset>
tags, which seems less than ideal for more pedestrian purposes.)

> > So I would probably just let this one go through as 8859-1.
> 
> It looks like BeautifulSoup is doing something like that, yes.
> Personally I would be nervous about some of my files being parsed
> as UTF-8 and some of them ISO-8859-1 (due to decoding errors rather
> than some of the files actually *being* ISO-8859-1 ;-) )

Since none of the syntactically meaningful characters have a code >=
0x80, you can parse HTML at the byte level if you know that it's encoded
in a strict superset of ASCII (which all of the ISO-8859 family and
UTF-8 are). Only if that's not true (e.g. if your files might be UTF-16,
or Shift-JIS or EUC, if I remember correctly) do you have to know
the character set.

(By parsing I mean only "create a syntax tree". Obviously you have to
know the encoding to know whether to display «c3 bc» as «ü» or «Ã¼».)
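
The same two bytes, decoded both ways:

>>> b'\xc3\xbc'.decode('utf-8')
'ü'
>>> b'\xc3\xbc'.decode('iso-8859-1')
'Ã¼'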

hp

-- 
   _  | Peter J. Holzer| Story must make more sense than reality.
|_|_) ||
| |   | h...@hjp.at |-- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |   challenge!"


-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Mutating an HTML file with BeautifulSoup

2022-08-22 Thread Peter J. Holzer
On 2022-08-22 00:09:01 -, Jon Ribbens via Python-list wrote:
> On 2022-08-21, Peter J. Holzer  wrote:
> > On 2022-08-20 21:51:41 -, Jon Ribbens via Python-list wrote:
> >>   result = re.sub(
> >>   r"""(<\s*a\s+[^>]*href\s*=\s*)(['"])\s*OLD\s*\2""",
> >
> > This will fail on:
> > <a alt="a > b" href="OLD">
> 
> I've seen *a lot* of bad/broken/weird HTML over the years, and I don't
> believe I've ever seen anyone do that. (Wrongly putting an 'alt'
> attribute on an 'a' element is very common, on the other hand ;-) )

My bad. I meant title, not alt, of course. The unescaped > is completely
standard conforming HTML, however (both HTML 4.01 strict and HTML 5).
You almost never have to escape > - in fact I can't think of any case
right now - and I generally don't (sometimes I do for symmetry with <,
but that's an aesthetic choice, not a technical one).


> > The problem can be solved with regular expressions (and given the
> > constraints I think I would prefer that to using Beautiful Soup), but
> > getting the regexps right is not trivial, at least in the general case.
> 
> I would like to see the regular expression that could fully parse
> general HTML...

That depends on what you mean by "parse".

If you mean "construct a DOM tree", you can't since regular expressions
(in the mathematical sense, not what's implemented by some programming
languages) by definition describe finite automata, and those don't
support recursion.

But if you mean "split into a sequence of tags and PCDATA's (and then
each tag further into its attributes)", that's absolutely possible, and
that's all that is needed here. I don't think I have ever implemented a
complete solution (if only because stuff like  is
extremely rare), but I should have some Perl code lying around which
worked on a wide variety of HTML. I just have to find it again ...
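
For the record, a rough Python sketch of such a tokenizer (simplified:
comments and doctypes aren't handled, and it assumes quotes in
attribute values are balanced):

    import re

    # A tag is "<", an optional "/", a name, then unquoted stretches and
    # quoted attribute values (which may contain ">") up to the closing ">".
    # Everything between tags is PCDATA.
    TOKEN = re.compile(r"""<[/]?[a-zA-Z][^>"']*
                           (?:"[^"]*"[^>"']*|'[^']*'[^>"']*)*>
                           |[^<]+""", re.VERBOSE)

    html = '<p>a > b <a title="x > y" href="OLD">link</a></p>'
    for m in TOKEN.finditer(html):
        print(repr(m.group()))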

hp

-- 
   _  | Peter J. Holzer| Story must make more sense than reality.
|_|_) ||
| |   | h...@hjp.at |-- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |   challenge!"


-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Mutating an HTML file with BeautifulSoup

2022-08-22 Thread Jon Ribbens via Python-list
On 2022-08-21, Chris Angelico  wrote:
> On Mon, 22 Aug 2022 at 05:43, Jon Ribbens via Python-list
> wrote:
>> On 2022-08-21, Chris Angelico  wrote:
>> > On Sun, 21 Aug 2022 at 09:31, Jon Ribbens via Python-list
>> > wrote:
>> >> On 2022-08-20, Chris Angelico  wrote:
>> >> > On Sun, 21 Aug 2022 at 03:27, Stefan Ram  
>> >> > wrote:
>> >> >> 2qdxy4rzwzuui...@potatochowder.com writes:
>> >> >> >textual representations.  That way, the following two elements are the
>> >> >> >same (and similar with a collection of sub-elements in a different 
>> >> >> >order
>> >> >> >in another document):
>> >> >>
>> >> >>   The /elements/ differ. They have the /same/ infoset.
>> >> >
>> >> > That's the bit that's hard to prove.
>> >> >
>> >> >>   The OP could edit the files with regexps to create a new version.
>> >> >
>> >> > To you and Jon, who also suggested this: how would that be beneficial?
>> >> > With Beautiful Soup, I have the line number and position within the
>> >> > line where the tag starts; what does a regex give me that I don't have
>> >> > that way?
>> >>
>> >> You mean you could use BeautifulSoup to read the file and identify the
>> >> bits you want to change by line number and offset, and then you could
>> >> use that data to try and update the file, hoping like hell that your
>> >> definition of "line" and "offset" are identical to BeautifulSoup's
>> >> and that you don't mess up later changes when you do earlier ones (you
>> >> could do them in reverse order of line and offset I suppose) and
>> >> probably resorting to regexps anyway in order to find the part of the
>> >> tag you want to change ...
>> >>
>> >> ... or you could avoid all that faff and just do re.sub()?
>> >
>> > Stefan answered in part, but I'll add that it is far FAR easier to do
>> > the analysis with BS4 than regular expressions. I'm not sure what
>> > "hoping like hell" is supposed to mean here, since the line and offset
>> > have been 100% accurate in my experience;
>>
>> Given the string:
>>
>> b"\n \r\r\n\v\n\r\xed\xa0\x80\xed\xbc\x9f\xcc\x80e\xc3\xa8?"
>>
>> what is the line number and offset of the question mark - and does
>> BeautifulSoup agree with your answer? Does the answer to that second
>> question change depending on what parser you tell BeautifulSoup to use?
>
> I'm not sure, because I don't know how to ask BS4 about the location
> of a question mark. But I replaced that with a <body> tag, and:
>
> >>> raw = b"\n \r\r\n\v\n\r\xed\xa0\x80\xed\xbc\x9f\xcc\x80e\xc3\xa8<body>"
> >>> from bs4 import BeautifulSoup
> >>> soup = BeautifulSoup(raw, "html.parser")
> >>> soup.body.sourceline
> 4
> >>> soup.body.sourcepos
> 12
> >>> raw.split(b"\n")[3]
> b'\r\xed\xa0\x80\xed\xbc\x9f\xcc\x80e\xc3\xa8<body>'
> >>> raw.split(b"\n")[3][12:]
> b'<body>'
>
> So, yes, it seems to be correct. (Slightly odd in that the sourceline
> is 1-based but the sourcepos is 0-based, but that is indeed the case,
> as confirmed with a much more straight-forward string.)
>
> And yes, it depends on the parser, but I'm using html.parser and it's fine.

Hah, yes, it appears html.parser does an end-run about my lovely
carefully crafted hard case by not even *trying* to work out what
type of line endings the file uses and is just hard-coded to only
recognise "\n" as a line ending.

With the offset though, BeautifulSoup made an arbitrary decision to
use ISO-8859-1 encoding and so when you chopped the bytestring at
that offset it only worked because BeautifulSoup had happened to
choose a 1-byte-per-character encoding. Ironically, *without* the
"\xed\xa0\x80\xed\xbc\x9f" it wouldn't have worked.

>> (If your answer is "if the input contains \xed\xa0\x80\xed\xbc\x9f then
>> I am happy with the program throwing an exception" then feel free to
>> remove that substring from the question.)
>
> Malformed UTF-8 doesn't seem to be a problem. Every file here seems to
> be either UTF-8 or ISO-8859, and in the latter case, I'm assuming
> 8859-1. So I would probably just let this one go through as 8859-1.

It looks like BeautifulSoup is doing something like that, yes.
Personally I would be nervous about some of my files being parsed
as UTF-8 and some of them ISO-8859-1 (due to decoding errors rather
than some of the files actually *being* ISO-8859-1 ;-) )

>> > the only part I'm unsure about is where the _end_ of the tag is (and
>> > maybe there's a way I can use BS4 again to get that??).
>>
>> There doesn't seem to be. More to the point, there doesn't seem to be
>> a way to find out where the *attributes* are, so as I said you'll most
>> likely end up using regexps anyway.
>
> I'm okay with replacing an entire tag that needs to be changed.

Oh, that seems like quite a big change to the original problem.

> Especially if I can replace just the opening tag, not the contents and
> closing tag. And in fact, I may just do that part by scanning for an
> unencoded greater-than, on the assumptions that (a) BS4 will correctly
> encode any greater-thans in attributes,

But your input wasn't created by BS4.

Re: Mutating an HTML file with BeautifulSoup

2022-08-22 Thread Jon Ribbens via Python-list
On 2022-08-21, Peter J. Holzer  wrote:
> On 2022-08-20 21:51:41 -, Jon Ribbens via Python-list wrote:
>> On 2022-08-20, Stefan Ram  wrote:
>> > Jon Ribbens  writes:
>> >>... or you could avoid all that faff and just do re.sub()?
>
>> > source = '<a href="http">'
>> >
>> > # Use Python to change the source, keeping the order of attributes.
>> >
>> > result = re.sub( r'href\s*=\s*"http"', r'href="https"', source )
>> > result = re.sub( r"href\s*=\s*'http'", r"href='https'", result )
>
> Depending on the content of the site, this might replace some stuff
> which is not a link.
>
>> You could go a bit harder with the regexp of course, e.g.:
>> 
>>   result = re.sub(
>>   r"""(<\s*a\s+[^>]*href\s*=\s*)(['"])\s*OLD\s*\2""",
>
> This will fail on:
> <a alt="a > b" href="OLD">

I've seen *a lot* of bad/broken/weird HTML over the years, and I don't
believe I've ever seen anyone do that. (Wrongly putting an 'alt'
attribute on an 'a' element is very common, on the other hand ;-) )

> The problem can be solved with regular expressions (and given the
> constraints I think I would prefer that to using Beautiful Soup), but
> getting the regexps right is not trivial, at least in the general case.

I would like to see the regular expression that could fully parse
general HTML...
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Mutating an HTML file with BeautifulSoup

2022-08-22 Thread Peter Otten

On 22/08/2022 05:30, Chris Angelico wrote:

On Mon, 22 Aug 2022 at 10:04, Buck Evan  wrote:


I've had much success doing round trips through the lxml.html parser.

https://lxml.de/lxmlhtml.html

I ditched bs for lxml long ago and never regretted it.

If you find that you have a bunch of invalid html that lxml inadvertently 
"fixes", I would recommend adding a stutter-step to your project: perform a 
noop roundtrip thru lxml on all files. I'd then analyze any diff by progressively 
excluding changes via `grep -vP`.
Unless I'm mistaken, all such changes should fall into no more than a dozen 
groups.



Will this round-trip mutate every single file and reorder the tag
attributes? Because I really don't want to manually eyeball all those
changes.


Most certainly not. Reordering is a bs4 feature that is governed by a
formatter. You can easily prevent attributes from being reordered:

>>> import bs4
>>> soup = bs4.BeautifulSoup("<p world='2' hello='1'></p>")
>>> soup
<html><body><p hello="1" world="2"></p></body></html>
>>> class Formatter(bs4.formatter.HTMLFormatter):
    def attributes(self, tag):
        return [] if tag.attrs is None else list(tag.attrs.items())

>>> soup.decode(formatter=Formatter())
'<html><body><p world="2" hello="1"></p></body></html>'

Blank space is probably removed by the underlying html parser.
It might be possible to make bs4 instantiate the lxml.html.HTMLParser
with remove_blank_text=False, but I didn't try hard enough ;)

That said, for my humble html scraping needs I have ditched bs4 in favor
of lxml and its xpath capabilities.
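
For comparison, the equivalent href rewrite via lxml and XPath (a
sketch; the URLs are placeholders, and lxml may still normalize other
markup on output):

    import lxml.html

    doc = lxml.html.fromstring('<p><a href="http://example.com/">x</a></p>')
    # XPath selects only anchors whose href starts with the old prefix;
    # attribute order is preserved on serialization.
    for a in doc.xpath('//a[starts-with(@href, "http://")]'):
        a.set('href', 'https://' + a.get('href')[len('http://'):])
    print(lxml.html.tostring(doc).decode())
    # <p><a href="https://example.com/">x</a></p>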


--
https://mail.python.org/mailman/listinfo/python-list


Re: Mutating an HTML file with BeautifulSoup

2022-08-21 Thread Chris Angelico
On Mon, 22 Aug 2022 at 10:04, Buck Evan  wrote:
>
> I've had much success doing round trips through the lxml.html parser.
>
> https://lxml.de/lxmlhtml.html
>
> I ditched bs for lxml long ago and never regretted it.
>
> If you find that you have a bunch of invalid html that lxml inadvertently 
> "fixes", I would recommend adding a stutter-step to your project: perform a 
> noop roundtrip thru lxml on all files. I'd then analyze any diff by 
> progressively excluding changes via `grep -vP`.
> Unless I'm mistaken, all such changes should fall into no more than a dozen 
> groups.
>

Will this round-trip mutate every single file and reorder the tag
attributes? Because I really don't want to manually eyeball all those
changes.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Mutating an HTML file with BeautifulSoup

2022-08-21 Thread Buck Evan
I've had much success doing round trips through the lxml.html parser.

https://lxml.de/lxmlhtml.html

I ditched bs for lxml long ago and never regretted it.

If you find that you have a bunch of invalid html that lxml inadvertently
"fixes", I would recommend adding a stutter-step to your project: perform a
noop roundtrip thru lxml on all files. I'd then analyze any diff by
progressively excluding changes via `grep -vP`.
Unless I'm mistaken, all such changes should fall into no more than a dozen
groups.
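
A sketch of that stutter-step (the file handling is an assumption, not
from the original mail; note lxml.html.tostring drops any doctype, so a
real run needs a little more care):

    import sys
    import lxml.html

    # No-op round trip: parse and re-serialize each file unchanged, so a
    # later diff of the *real* edits shows only intentional changes.
    for path in sys.argv[1:]:
        with open(path, 'rb') as f:
            doc = lxml.html.fromstring(f.read())
        with open(path, 'wb') as f:
            f.write(lxml.html.tostring(doc))

Run it once over the whole tree, commit, and every subsequent diff is
yours alone; `grep -vP` can then whittle the first, mechanical diff
down into explainable groups.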




On Fri, Aug 19, 2022, 1:34 PM Chris Angelico  wrote:

> What's the best way to precisely reconstruct an HTML file after
> parsing it with BeautifulSoup?
>
> Using the Alice example from the BS4 docs:
>
> >>> html_doc = """<html><head><title>The Dormouse's story</title></head>
> <body>
> <p class="title"><b>The Dormouse's story</b></p>
>
> <p class="story">Once upon a time there were three little sisters; and
> their names were
> <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
> <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
> <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
> and they lived at the bottom of a well.</p>
>
> <p class="story">...</p>
> """
> >>> print(soup)
> <html><head><title>The Dormouse's story</title></head>
> <body>
> <p class="title"><b>The Dormouse's story</b></p>
> <p class="story">Once upon a time there were three little sisters; and
> their names were
> <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
> <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
> <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
> and they lived at the bottom of a well.</p>
> <p class="story">...</p>
> </body></html>
> >>>
>
> Note two distinct changes: firstly, whitespace has been removed, and
> secondly, attributes are reordered (I think alphabetically). There are
> other canonicalizations being done, too.
>
> I'm trying to make some automated changes to a huge number of HTML
> files, with minimal diffs so they're easy to validate. That means that
> spurious changes like these are very much unwanted. Is there a way to
> get BS4 to reconstruct the original precisely?
>
> The mutation itself would be things like finding an anchor tag and
> changing its href attribute. Fairly simple changes, but might alter
> the length of the file (eg changing "http://example.com/" into
> "https://example.com/"). I'd like to do them intelligently rather than
> falling back on element.sourceline and element.sourcepos, but worst
> case, that's what I'll have to do (which would be fiddly).
>
> ChrisA
> --
> https://mail.python.org/mailman/listinfo/python-list
>
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Mutating an HTML file with BeautifulSoup

2022-08-21 Thread Chris Angelico
On Mon, 22 Aug 2022 at 05:43, Jon Ribbens via Python-list
 wrote:
>
> On 2022-08-21, Chris Angelico  wrote:
> > On Sun, 21 Aug 2022 at 09:31, Jon Ribbens via Python-list
> > wrote:
> >> On 2022-08-20, Chris Angelico  wrote:
> >> > On Sun, 21 Aug 2022 at 03:27, Stefan Ram  wrote:
> >> >> 2qdxy4rzwzuui...@potatochowder.com writes:
> >> >> >textual representations.  That way, the following two elements are the
> >> >> >same (and similar with a collection of sub-elements in a different 
> >> >> >order
> >> >> >in another document):
> >> >>
> >> >>   The /elements/ differ. They have the /same/ infoset.
> >> >
> >> > That's the bit that's hard to prove.
> >> >
> >> >>   The OP could edit the files with regexps to create a new version.
> >> >
> >> > To you and Jon, who also suggested this: how would that be beneficial?
> >> > With Beautiful Soup, I have the line number and position within the
> >> > line where the tag starts; what does a regex give me that I don't have
> >> > that way?
> >>
> >> You mean you could use BeautifulSoup to read the file and identify the
> >> bits you want to change by line number and offset, and then you could
> >> use that data to try and update the file, hoping like hell that your
> >> definition of "line" and "offset" are identical to BeautifulSoup's
> >> and that you don't mess up later changes when you do earlier ones (you
> >> could do them in reverse order of line and offset I suppose) and
> >> probably resorting to regexps anyway in order to find the part of the
> >> tag you want to change ...
> >>
> >> ... or you could avoid all that faff and just do re.sub()?
> >
> > Stefan answered in part, but I'll add that it is far FAR easier to do
> > the analysis with BS4 than regular expressions. I'm not sure what
> > "hoping like hell" is supposed to mean here, since the line and offset
> > have been 100% accurate in my experience;
>
> Given the string:
>
> b"\n \r\r\n\v\n\r\xed\xa0\x80\xed\xbc\x9f\xcc\x80e\xc3\xa8?"
>
> what is the line number and offset of the question mark - and does
> BeautifulSoup agree with your answer? Does the answer to that second
> question change depending on what parser you tell BeautifulSoup to use?

I'm not sure, because I don't know how to ask BS4 about the location
of a question mark. But I replaced that with a <body> tag, and:

>>> raw = b"\n \r\r\n\v\n\r\xed\xa0\x80\xed\xbc\x9f\xcc\x80e\xc3\xa8<body>"
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(raw, "html.parser")
>>> soup.body.sourceline
4
>>> soup.body.sourcepos
12
>>> raw.split(b"\n")[3]
b'\r\xed\xa0\x80\xed\xbc\x9f\xcc\x80e\xc3\xa8<body>'
>>> raw.split(b"\n")[3][12:]
b'<body>'

So, yes, it seems to be correct. (Slightly odd in that the sourceline
is 1-based but the sourcepos is 0-based, but that is indeed the case,
as confirmed with a much more straight-forward string.)

And yes, it depends on the parser, but I'm using html.parser and it's fine.

> (If your answer is "if the input contains \xed\xa0\x80\xed\xbc\x9f then
> I am happy with the program throwing an exception" then feel free to
> remove that substring from the question.)

Malformed UTF-8 doesn't seem to be a problem. Every file here seems to
be either UTF-8 or ISO-8859, and in the latter case, I'm assuming
8859-1. So I would probably just let this one go through as 8859-1.

> > the only part I'm unsure about is where the _end_ of the tag is (and
> > maybe there's a way I can use BS4 again to get that??).
>
> There doesn't seem to be. More to the point, there doesn't seem to be
> a way to find out where the *attributes* are, so as I said you'll most
> likely end up using regexps anyway.

I'm okay with replacing an entire tag that needs to be changed.
Especially if I can replace just the opening tag, not the contents and
closing tag. And in fact, I may just do that part by scanning for an
unencoded greater-than, on the assumptions that (a) BS4 will correctly
encode any greater-thans in attributes, and (b) if there's a
mis-encoded one in the input, the diff will be small enough to
eyeball, and a human should easily notice that the text has been
massively expanded and duplicated.
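
In sketch form, that scan could look like this (assuming, per (a) and
(b), that the first unencoded greater-than ends the opening tag; the
helper name is mine):

    def find_tag_end(lines, sourceline, sourcepos):
        """Starting from BS4's 1-based sourceline and 0-based sourcepos
        (the '<' of the tag), return (line, column) of the closing '>'.
        Assumes attribute values contain no unencoded '>'."""
        row, col = sourceline - 1, sourcepos
        while row < len(lines):
            i = lines[row].find('>', col)
            if i != -1:
                return row + 1, i
            row, col = row + 1, 0
        raise ValueError('unterminated tag')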

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Mutating an HTML file with BeautifulSoup

2022-08-21 Thread Jon Ribbens via Python-list
On 2022-08-21, Chris Angelico  wrote:
> On Sun, 21 Aug 2022 at 09:31, Jon Ribbens via Python-list
> wrote:
>> On 2022-08-20, Chris Angelico  wrote:
>> > On Sun, 21 Aug 2022 at 03:27, Stefan Ram  wrote:
>> >> 2qdxy4rzwzuui...@potatochowder.com writes:
>> >> >textual representations.  That way, the following two elements are the
>> >> >same (and similar with a collection of sub-elements in a different order
>> >> >in another document):
>> >>
>> >>   The /elements/ differ. They have the /same/ infoset.
>> >
>> > That's the bit that's hard to prove.
>> >
>> >>   The OP could edit the files with regexps to create a new version.
>> >
>> > To you and Jon, who also suggested this: how would that be beneficial?
>> > With Beautiful Soup, I have the line number and position within the
>> > line where the tag starts; what does a regex give me that I don't have
>> > that way?
>>
>> You mean you could use BeautifulSoup to read the file and identify the
>> bits you want to change by line number and offset, and then you could
>> use that data to try and update the file, hoping like hell that your
>> definition of "line" and "offset" are identical to BeautifulSoup's
>> and that you don't mess up later changes when you do earlier ones (you
>> could do them in reverse order of line and offset I suppose) and
>> probably resorting to regexps anyway in order to find the part of the
>> tag you want to change ...
>>
>> ... or you could avoid all that faff and just do re.sub()?
>
> Stefan answered in part, but I'll add that it is far FAR easier to do
> the analysis with BS4 than regular expressions. I'm not sure what
> "hoping like hell" is supposed to mean here, since the line and offset
> have been 100% accurate in my experience;

Given the string:

b"\n \r\r\n\v\n\r\xed\xa0\x80\xed\xbc\x9f\xcc\x80e\xc3\xa8?"

what is the line number and offset of the question mark - and does
BeautifulSoup agree with your answer? Does the answer to that second
question change depending on what parser you tell BeautifulSoup to use?

(If your answer is "if the input contains \xed\xa0\x80\xed\xbc\x9f then
I am happy with the program throwing an exception" then feel free to
remove that substring from the question.)

> the only part I'm unsure about is where the _end_ of the tag is (and
> maybe there's a way I can use BS4 again to get that??).

There doesn't seem to be. More to the point, there doesn't seem to be
a way to find out where the *attributes* are, so as I said you'll most
likely end up using regexps anyway.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Mutating an HTML file with BeautifulSoup

2022-08-21 Thread Peter J. Holzer
On 2022-08-20 21:51:41 -, Jon Ribbens via Python-list wrote:
> On 2022-08-20, Stefan Ram  wrote:
> > Jon Ribbens  writes:
> >>... or you could avoid all that faff and just do re.sub()?

> > source = '<a href="http">'
> >
> > # Use Python to change the source, keeping the order of attributes.
> >
> > result = re.sub( r'href\s*=\s*"http"', r'href="https"', source )
> > result = re.sub( r"href\s*=\s*'http'", r"href='https'", result )

Depending on the content of the site, this might replace some stuff
which is not a link.


> You could go a bit harder with the regexp of course, e.g.:
> 
>   result = re.sub(
>   r"""(<\s*a\s+[^>]*href\s*=\s*)(['"])\s*OLD\s*\2""",

This will fail on:
<a alt="a > b" href="OLD">

The problem can be solved with regular expressions (and given the
constraints I think I would prefer that to using Beautiful Soup), but
getting the regexps right is not trivial, at least in the general case.
It may become a lot easier if you know that certain conventions were
followed (e.g. that ">" was always written as "&gt;") or it may become
even harder when the files contain errors.

hp

-- 
   _  | Peter J. Holzer| Story must make more sense than reality.
|_|_) ||
| |   | h...@hjp.at |-- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |   challenge!"


-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Mutating an HTML file with BeautifulSoup

2022-08-21 Thread Barry


> On 21 Aug 2022, at 09:12, Chris Angelico  wrote:
> 
> On Sun, 21 Aug 2022 at 17:26, Barry  wrote:
>> 
>> 
>> 
 On 19 Aug 2022, at 22:04, Chris Angelico  wrote:
>>> 
>>> On Sat, 20 Aug 2022 at 05:12, Barry  wrote:
 
 
 
>> On 19 Aug 2022, at 19:33, Chris Angelico  wrote:
> 
> What's the best way to precisely reconstruct an HTML file after
> parsing it with BeautifulSoup?
 
 I recall that in bs4 it parses into an object tree and loses the detail of 
 the input.
 I recently ported from very old bs to bs4 and hit the same issue.
 So no it will not output the same as went in.
 
 If you can trust the input to be parsed as xml, meaning all the rules of 
 closing
 tags have been followed. Then I think you can parse and unparse thru xml to
 do what you want.
 
>>> 
>>> 
>>> Yeah, no I can't, this is HTML 4 with a ton of inconsistencies. Oh
>>> well. Thanks for trying, anyhow.
>>> 
>>> So I'm left with a few options:
>>> 
>>> 1) Give up on validation, give up on verification, and just run this
>>> thing on the production site with my fingers crossed
>> 
>> Can you build a beta site with the original intact?
> 
> In a naive way, a full copy would be quite a few gigabytes. I could
> cut that down a good bit by taking only HTML files and the things they
> reference, but then we run into the same problem of broken links,
> which is what we're here to solve in the first place.
> 
> But I would certainly not want to run two copies of the site and then
> manually compare.
> 
>> Also wonder if using selenium to walk the site may work as a verification 
>> step?
>> I cannot recall if you can get an image of the browser window to do image 
>> compares with to look for rendering differences.
> 
> Image recognition won't necessarily even be valid; some of the changes
> will have visual consequences (eg a broken image reference now
> becoming correct), and as soon as that happens, the whole document can
> reflow.
> 
>> From my one task using bs4 I did not see it produce any bad results.
>> In my case the problems were in the code that built on bs1 using bad
>> assumptions.
> 
> Did that get run on perfect HTML, or on messy real-world stuff that
> uses quirks mode?

A small number of messy html pages.

Barry

> 
> ChrisA
> -- 
> https://mail.python.org/mailman/listinfo/python-list

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Mutating an HTML file with BeautifulSoup

2022-08-21 Thread Chris Angelico
On Sun, 21 Aug 2022 at 17:26, Barry  wrote:
>
>
>
> > On 19 Aug 2022, at 22:04, Chris Angelico  wrote:
> >
> > On Sat, 20 Aug 2022 at 05:12, Barry  wrote:
> >>
> >>
> >>
>  On 19 Aug 2022, at 19:33, Chris Angelico  wrote:
> >>>
> >>> What's the best way to precisely reconstruct an HTML file after
> >>> parsing it with BeautifulSoup?
> >>
> >> I recall that in bs4 it parses into an object tree and loses the detail of 
> >> the input.
> >> I recently ported from very old bs to bs4 and hit the same issue.
> >> So no it will not output the same as went in.
> >>
> >> If you can trust the input to be parsed as xml, meaning all the rules of 
> >> closing
> >> tags have been followed. Then I think you can parse and unparse thru xml to
> >> do what you want.
> >>
> >
> >
> > Yeah, no I can't, this is HTML 4 with a ton of inconsistencies. Oh
> > well. Thanks for trying, anyhow.
> >
> > So I'm left with a few options:
> >
> > 1) Give up on validation, give up on verification, and just run this
> > thing on the production site with my fingers crossed
>
> Can you build a beta site with the original intact?

In a naive way, a full copy would be quite a few gigabytes. I could
cut that down a good bit by taking only HTML files and the things they
reference, but then we run into the same problem of broken links,
which is what we're here to solve in the first place.

But I would certainly not want to run two copies of the site and then
manually compare.

> Also wonder if using selenium to walk the site may work as a verification 
> step?
> I cannot recall if you can get an image of the browser window to do image 
> compares with to look for rendering differences.

Image recognition won't necessarily even be valid; some of the changes
will have visual consequences (eg a broken image reference now
becoming correct), and as soon as that happens, the whole document can
reflow.

> From my one task using bs4 I did not see it produce any bad results.
> In my case the problems were in the code that built on bs1 using bad
> assumptions.

Did that get run on perfect HTML, or on messy real-world stuff that
uses quirks mode?

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Mutating an HTML file with BeautifulSoup

2022-08-21 Thread Barry


> On 19 Aug 2022, at 22:04, Chris Angelico  wrote:
> 
> On Sat, 20 Aug 2022 at 05:12, Barry  wrote:
>> 
>> 
>> 
 On 19 Aug 2022, at 19:33, Chris Angelico  wrote:
>>> 
>>> What's the best way to precisely reconstruct an HTML file after
>>> parsing it with BeautifulSoup?
>> 
>> I recall that in bs4 it parses into an object tree and loses the detail of 
>> the input.
>> I recently ported from very old bs to bs4 and hit the same issue.
>> So no it will not output the same as went in.
>> 
>> If you can trust the input to be parsed as xml, meaning all the rules of 
>> closing
>> tags have been followed. Then I think you can parse and unparse thru xml to
>> do what you want.
>> 
> 
> 
> Yeah, no I can't, this is HTML 4 with a ton of inconsistencies. Oh
> well. Thanks for trying, anyhow.
> 
> So I'm left with a few options:
> 
> 1) Give up on validation, give up on verification, and just run this
> thing on the production site with my fingers crossed

Can you build a beta site with the original intact?

Also wonder if using selenium to walk the site may work as a verification step?
I cannot recall if you can get an image of the browser window to do image 
compares with to look for rendering differences.

From my one task using bs4 I did not see it produce any bad results.
In my case the problems were in the code that built on bs1 using bad
assumptions.



> 2) Instead of doing an intelligent reconstruction, just str.replace()
> one URL with another within the file
> 3) Split the file into lines, find the Nth line (elem.sourceline) and
> str.replace that line only
> 4) Attempt to use elem.sourceline and elem.sourcepos to find the start
> of the tag, manually find the end, and replace one tag with the
> reconstructed form.
> 
> I'm inclined to the first option, honestly. The others just seem like
> hard work, and I became a programmer so I could be lazy...
> 
> ChrisA
> -- 
> https://mail.python.org/mailman/listinfo/python-list

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Mutating an HTML file with BeautifulSoup

2022-08-20 Thread Chris Angelico
On Sun, 21 Aug 2022 at 13:41, dn  wrote:
>
> On 21/08/2022 13.00, Chris Angelico wrote:
> > Well, I don't like headaches, but I do appreciate what the G Archive
> > has given me over the years, so I'm taking this on as a means of
> > giving back to the community.
>
> This point will be picked-up in the conclusion. NB in the same way that
> you want to 'give back', so also do others - even if in minor ways or
> 'when-relevant'!

Very true.

> >> In fact, depending upon frequency, making the changes manually (and with
> >> improved confidence in the result).
> >
> > Unfortunately the frequency is very high.
>
> Screechingly so? Like you're singing Three Little Maids?

You don't want to hear me singing that although I do recall once
singing Lady Ella's part at a Qwert, to gales of laughter.

> > Yeah. I do a first pass to enumerate all domains that are ever linked
> > to with http:// URLs, and then I have a script that goes through and
> > checks to see if they redirect me to the same URL on the other
> > protocol, or other ways of checking. So yes, the list of valid domains
> > is part of the program's effective input.
>
> Wow! Having got that far, you have achieved data-validity. Is there a
> need to perform a before-after check or diff?

Yes, to ensure that nothing has changed that I *didn't* plan. The
planned changes aren't the problem here, I can verify those elsewhere.

> Perhaps start making the one-for-one replacements without further
> anxiety. As long as there's no silly-mistake, eg failing to remove an
> opening or closing angle-bracket; isn't that about all the checking needed?
> (for this category of updates)

Maybe, but probably not.

> BTW in talk of "line-number", you will have realised the need to re-run
> the identification of such after each of these steps - in case the 'new
> stuff' relating to earlier steps (assuming above became also a temporal
> sequence) is shorter/longer than the current HTML.

Yep, that's not usually a problem.

> >>> And there'll be other fixes to be done too. So it's a bit complicated,
> >>> and no simple solution is really sufficient. At the very very least, I
> >>> *need* to properly parse with BS4; the only question is whether I
> >>> reconstruct from the parse tree, or go back to the raw file and try to
> >>> edit it there.
> >>
> >> At least the diffs would give you something to work-from, but it's a bit
> >> like git-diffs claiming a 'change' when the only difference is that my
> >> IDE strips blanks from the ends of code-lines, or some-such silliness.
> >
> > Right; and the reconstructed version has a LOT of those unnecessary
> > changes. I'm seeing a lot of changes to whitespace. The only problem
> > is whether I can be confident that none of those changes could ever
> > matter.
>
> "White-space" has lesser-meaning in HTML - this is NOT Python! In HTML
> if I write "HTML  file" (with two spaces), the browser will shorten the
> display to a single space (hence some uses of &nbsp; - non-broken
> space). Similarly, if you attempt to use "\n" to start a new line of text...

Yes, whitespace has less meaning... except when it doesn't.

https://developer.mozilla.org/en-US/docs/Web/CSS/white-space

Text can become preformatted by the styling, and there could be
nothing whatsoever in the HTML page that shows this. I think most of
the HTML files in this site have been created by a WYSIWYG editor,
partly because of clues like a single bold space in a non-bold
sequence of text, and the styles aren't consistent everywhere. Given
that poetry comes up a lot on this site, I wouldn't put it past the
editor to have set a whitespace rule on something.

But I'm probably going to just ignore that and hope that any such
errors are less significant than the current set of broken links.

> Is there a danger of 'chasing your own tail', ie seeking a solution to a
> problem which really doesn't matter (particularly if we add the phrase:
> at the user-level)?

Unfortunately not. I now know of three categories of change that, in
theory, shouldn't affect anything: whitespace, order of attributes
("" becoming ""), and
self-closing tags. Whitespace probably won't matter, until it does.
Order of attributes is absolutely fine unless one of them is
miswritten and now we've lost a lot of information about how it ought
to have been written. And self-closing tags are probably
insignificant, but I don't know how browsers handle things like
"..." - and I wouldn't know whether the original
intention was for the second one to be a self-closing empty paragraph,
or a miswritten closing tag.

It's easy to say that these changes have no effect on well-formed
HTML. It's less easy to know what browsers will do with ill-formed
HTML.

> Agree with "properly parse". Question was an apparent dedication to BS4
> when there are other tools. Just checking you aren't wearing that type
> of 'blinders'.
> (didn't think so, but...)

No, but there's also always the option of some tool that I've never
heard of! The 

Re: Mutating an HTML file with BeautifulSoup

2022-08-20 Thread dn
On 21/08/2022 13.00, Chris Angelico wrote:
> On Sun, 21 Aug 2022 at 09:48, dn  wrote:
>> On 20/08/2022 12.38, Chris Angelico wrote:
>>> On Sat, 20 Aug 2022 at 10:19, dn  wrote:
 On 20/08/2022 09.01, Chris Angelico wrote:
> On Sat, 20 Aug 2022 at 05:12, Barry  wrote:
>>> On 19 Aug 2022, at 19:33, Chris Angelico  wrote:

> So I'm left with a few options:
>
> 1) Give up on validation, give up on verification, and just run this
> thing on the production site with my fingers crossed
> 2) Instead of doing an intelligent reconstruction, just str.replace()
> one URL with another within the file
> 3) Split the file into lines, find the Nth line (elem.sourceline) and
> str.replace that line only
> 4) Attempt to use elem.sourceline and elem.sourcepos to find the start
> of the tag, manually find the end, and replace one tag with the
> reconstructed form.
>
> I'm inclined to the first option, honestly. The others just seem like
> hard work, and I became a programmer so I could be lazy...
 +1 - but I've noticed that sometimes I have to work quite hard to be
 this lazy!
>>>
>>> Yeah, that's very true...
>>>
 Am assuming that http -> https is not the only 'change' (if it were,
 you'd just do that without BS). How many such changes are planned/need
 checking? Care to list them?
>>
>> This project has many of the same 'smells' as a database-harmonisation
>> effort. Particularly one where 'the previous guy' used to use field-X
>> for certain data, but his replacement decided that field-Y 'sounded
>> better' (or some such user-logic). Arrrggg!
>>
>> If you like head-aches, and users coming to you with ifs-buts-and-maybes
>> AFTER you've 'done stuff', this is your sort of project!
> 
> Well, I don't like headaches, but I do appreciate what the G Archive
> has given me over the years, so I'm taking this on as a means of
> giving back to the community.

This point will be picked-up in the conclusion. NB in the same way that
you want to 'give back', so also do others - even if in minor ways or
'when-relevant'!


>>> Assumption is correct. The changes are more of the form "find all the
>>> problems, add to the list of fixes, try to minimize the ones that need
>>> to be done manually". So far, what I have is:
>>
>> Having taken the trouble to identify this list of improvements and given
>> the determination to verify each, consider working through one item at a
>> time, rather than in a single pass. This will enable individual logging
>> of changes, a manual check of each alteration, and the ability to
>> choose/tailor the best tool for that specific task.
>>
>> In fact, depending upon frequency, making the changes manually (and with
>> improved confidence in the result).
> 
> Unfortunately the frequency is very high.

Screechingly so? Like you're singing Three Little Maids?


>> The presence of (or allusion to) the word "some" in this list-items is
>> 'the killer'. Automation doesn't like 'some' (cf "all") unless the
>> criteria can be clearly and unambiguously defined. Ouch!
>>
>> (I don't think you need to be told any of this, but hey: dreams are free!)
> 
> Right; the criteria are quite well defined, but I omitted the details
> for brevity.
> 
>>> 1) A bunch of http -> https, but not all of them - only domains where
>>> I've confirmed that it's valid
>>
>> The search-criteria is the list of valid domains, rather than the
>> "http/https" which is likely the first focus.
> 
> Yeah. I do a first pass to enumerate all domains that are ever linked
> to with http:// URLs, and then I have a script that goes through and
> checks to see if they redirect me to the same URL on the other
> protocol, or other ways of checking. So yes, the list of valid domains
> is part of the program's effective input.

Wow! Having got that far, you have achieved data-validity. Is there a
need to perform a before-after check or diff?

Perhaps start making the one-for-one replacements without further
anxiety. As long as there's no silly-mistake, eg failing to remove an
opening or closing angle-bracket; isn't that about all the checking needed?
(for this category of updates)


>>> 2) Some absolute to relative conversions:
>>> https://www.gsarchive.net/whowaswho/index.htm should be referred to as
>>> /whowaswho/index.htm instead
>>
>> Similarly, if you have a list of these.
> 
> It's more just the pattern "https://www.gsarchive.net/" and
> "https://gsarchive.net/", and the corresponding "http://;
> URLs, plus a few other malformed versions that are worth correcting
> (if ever I find a link to "www.gsarchive.net/", it's almost
> certainly missing its protocol).

Isn't the inspection tool (described elsewhere) reporting an HTML/editor
line number?

That being the case, won't a bit of Swiss-Army knife Python-string work
enable appropriate processing and re-writing - as well as providing the
means to statistically-sample for QA?


> > 3) A few outdated URLs for which we know the replacement

Re: Mutating an HTML file with BeautifulSoup

2022-08-20 Thread Chris Angelico
On Sun, 21 Aug 2022 at 09:48, dn  wrote:
>
> On 20/08/2022 12.38, Chris Angelico wrote:
> > On Sat, 20 Aug 2022 at 10:19, dn  wrote:
> >> On 20/08/2022 09.01, Chris Angelico wrote:
> >>> On Sat, 20 Aug 2022 at 05:12, Barry  wrote:
> > On 19 Aug 2022, at 19:33, Chris Angelico  wrote:
> >
> > What's the best way to precisely reconstruct an HTML file after
> > parsing it with BeautifulSoup?
> ...
>
> >>> well. Thanks for trying, anyhow.
> >>>
> >>> So I'm left with a few options:
> >>>
> >>> 1) Give up on validation, give up on verification, and just run this
> >>> thing on the production site with my fingers crossed
> >>> 2) Instead of doing an intelligent reconstruction, just str.replace()
> >>> one URL with another within the file
> >>> 3) Split the file into lines, find the Nth line (elem.sourceline) and
> >>> str.replace that line only
> >>> 4) Attempt to use elem.sourceline and elem.sourcepos to find the start
> >>> of the tag, manually find the end, and replace one tag with the
> >>> reconstructed form.
> >>>
> >>> I'm inclined to the first option, honestly. The others just seem like
> >>> hard work, and I became a programmer so I could be lazy...
> >> +1 - but I've noticed that sometimes I have to work quite hard to be
> >> this lazy!
> >
> > Yeah, that's very true...
> >
> >> Am assuming that http -> https is not the only 'change' (if it were,
> >> you'd just do that without BS). How many such changes are planned/need
> >> checking? Care to list them?
>
> This project has many of the same 'smells' as a database-harmonisation
> effort. Particularly one where 'the previous guy' used to use field-X
> for certain data, but his replacement decided that field-Y 'sounded
> better' (or some such user-logic). Arrrggg!
>
> If you like head-aches, and users coming to you with ifs-buts-and-maybes
> AFTER you've 'done stuff', this is your sort of project!

Well, I don't like headaches, but I do appreciate what the G Archive
has given me over the years, so I'm taking this on as a means of
giving back to the community.

> > Assumption is correct. The changes are more of the form "find all the
> > problems, add to the list of fixes, try to minimize the ones that need
> > to be done manually". So far, what I have is:
>
> Having taken the trouble to identify this list of improvements and given
> the determination to verify each, consider working through one item at a
> time, rather than in a single pass. This will enable individual logging
> of changes, a manual check of each alteration, and the ability to
> choose/tailor the best tool for that specific task.
>
> In fact, depending upon frequency, making the changes manually (and with
> improved confidence in the result).

Unfortunately the frequency is very high.

> The presence of (or allusion to) the word "some" in this list-items is
> 'the killer'. Automation doesn't like 'some' (cf "all") unless the
> criteria can be clearly and unambiguously defined. Ouch!
>
> (I don't think you need to be told any of this, but hey: dreams are free!)

Right; the criteria are quite well defined, but I omitted the details
for brevity.

> > 1) A bunch of http -> https, but not all of them - only domains where
> > I've confirmed that it's valid
>
> The search-criteria is the list of valid domains, rather than the
> "http/https" which is likely the first focus.

Yeah. I do a first pass to enumerate all domains that are ever linked
to with http:// URLs, and then I have a script that goes through and
checks to see if they redirect me to the same URL on the other
protocol, or other ways of checking. So yes, the list of valid domains
is part of the program's effective input.
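
That check script can be quite small (a sketch, assuming the requests
library; a real run wants retries and politeness delays):

    import requests

    def upgrades_to_https(domain):
        """Does http://<domain>/ permanently redirect to the same URL
        on https? (Sketch -- only the simplest signal is checked.)"""
        r = requests.head('http://%s/' % domain,
                          allow_redirects=False, timeout=10)
        return (r.status_code in (301, 308) and
                r.headers.get('Location') == 'https://%s/' % domain)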

> > 2) Some absolute to relative conversions:
> > https://www.gsarchive.net/whowaswho/index.htm should be referred to as
> > /whowaswho/index.htm instead
>
> Similarly, if you have a list of these.

It's more just the pattern "https://www.gsarchive.net/" and
"https://gsarchive.net/", and the corresponding "http://;
URLs, plus a few other malformed versions that are worth correcting
(if ever I find a link to "www.gsarchive.net/", it's almost
certainly missing its protocol).
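
Those patterns could be collapsed into one substitution, roughly (a
sketch; the exact pattern set is an assumption):

    import re

    # Make links to the archive itself root-relative, catching the
    # http/https, www/no-www and missing-protocol variants.
    ABS = re.compile(r'(href\s*=\s*["\'])(?:https?://)?(?:www\.)?gsarchive\.net/',
                     re.IGNORECASE)

    def relativize(html):
        return ABS.sub(r'\1/', html)

    print(relativize('<a href="https://www.gsarchive.net/whowaswho/index.htm">'))
    # <a href="/whowaswho/index.htm">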

> > 3) A few outdated URLs for which we know the replacement, eg
> > http://www.cris.com/~oakapple/gasdisc/ to
> > http://www.gasdisc.oakapplepress.com/ (this one can't go on
> > HTTPS, which is one reason I can't shortcut that)
>
> Again.

Same; although those are manually entered as patterns.

> > 4) Some internal broken links where the path is wrong - anything that
> > resolves to /books/ but can't be found might be better
> > rewritten as /html/perf_grps/websites/ if the file can be
> > found there
>
> Again.

The fixups are manually entered, but I also need to know about every
broken internal link so that I can look through them and figure out
what's wrong.

> > 5) Any external link that yields a permanent redirect should, to save
> > clientside requests, get replaced by the destination. We have some
> > Creative Commons badges that have moved to new URLs.

Re: Mutating an HTML file with BeautifulSoup

2022-08-20 Thread Chris Angelico
On Sun, 21 Aug 2022 at 09:31, Jon Ribbens via Python-list
 wrote:
>
> On 2022-08-20, Chris Angelico  wrote:
> > On Sun, 21 Aug 2022 at 03:27, Stefan Ram  wrote:
> >> 2qdxy4rzwzuui...@potatochowder.com writes:
> >> >textual representations.  That way, the following two elements are the
> >> >same (and similar with a collection of sub-elements in a different order
> >> >in another document):
> >>
> >>   The /elements/ differ. They have the /same/ infoset.
> >
> > That's the bit that's hard to prove.
> >
> >>   The OP could edit the files with regexps to create a new version.
> >
> > To you and Jon, who also suggested this: how would that be beneficial?
> > With Beautiful Soup, I have the line number and position within the
> > line where the tag starts; what does a regex give me that I don't have
> > that way?
>
> You mean you could use BeautifulSoup to read the file and identify the
> bits you want to change by line number and offset, and then you could
> use that data to try and update the file, hoping like hell that your
> definition of "line" and "offset" are identical to BeautifulSoup's
> and that you don't mess up later changes when you do earlier ones (you
> could do them in reverse order of line and offset I suppose) and
> probably resorting to regexps anyway in order to find the part of the
> tag you want to change ...
>
> ... or you could avoid all that faff and just do re.sub()?

Stefan answered in part, but I'll add that it is far FAR easier to do
the analysis with BS4 than regular expressions. I'm not sure what
"hoping like hell" is supposed to mean here, since the line and offset
have been 100% accurate in my experience; the only part I'm unsure
about is where the _end_ of the tag is (and maybe there's a way I can
use BS4 again to get that??).

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Mutating an HTML file with BeautifulSoup

2022-08-20 Thread dn
On 20/08/2022 12.38, Chris Angelico wrote:
> On Sat, 20 Aug 2022 at 10:19, dn  wrote:
>> On 20/08/2022 09.01, Chris Angelico wrote:
>>> On Sat, 20 Aug 2022 at 05:12, Barry  wrote:
> On 19 Aug 2022, at 19:33, Chris Angelico  wrote:
>
> What's the best way to precisely reconstruct an HTML file after
> parsing it with BeautifulSoup?
...

>>> well. Thanks for trying, anyhow.
>>>
>>> So I'm left with a few options:
>>>
>>> 1) Give up on validation, give up on verification, and just run this
>>> thing on the production site with my fingers crossed
>>> 2) Instead of doing an intelligent reconstruction, just str.replace()
>>> one URL with another within the file
>>> 3) Split the file into lines, find the Nth line (elem.sourceline) and
>>> str.replace that line only
>>> 4) Attempt to use elem.sourceline and elem.sourcepos to find the start
>>> of the tag, manually find the end, and replace one tag with the
>>> reconstructed form.
>>>
>>> I'm inclined to the first option, honestly. The others just seem like
>>> hard work, and I became a programmer so I could be lazy...
>> +1 - but I've noticed that sometimes I have to work quite hard to be
>> this lazy!
> 
> Yeah, that's very true...
> 
>> Am assuming that http -> https is not the only 'change' (if it were,
>> you'd just do that without BS). How many such changes are planned/need
>> checking? Care to list them?

This project has many of the same 'smells' as a database-harmonisation
effort. Particularly one where 'the previous guy' used to use field-X
for certain data, but his replacement decided that field-Y 'sounded
better' (or some such user-logic). Arrrggg!

If you like head-aches, and users coming to you with ifs-buts-and-maybes
AFTER you've 'done stuff', this is your sort of project!


> Assumption is correct. The changes are more of the form "find all the
> problems, add to the list of fixes, try to minimize the ones that need
> to be done manually". So far, what I have is:

Having taken the trouble to identify this list of improvements and given
the determination to verify each, consider working through one item at a
time, rather than in a single pass. This will enable individual logging
of changes, a manual check of each alteration, and the ability to
choose/tailor the best tool for that specific task.

In fact, depending upon frequency, making the changes manually (and with
improved confidence in the result).

The presence of (or allusion to) the word "some" in this list-items is
'the killer'. Automation doesn't like 'some' (cf "all") unless the
criteria can be clearly and unambiguously defined. Ouch!

(I don't think you need to be told any of this, but hey: dreams are free!)


> 1) A bunch of http -> https, but not all of them - only domains where
> I've confirmed that it's valid

The search-criteria is the list of valid domains, rather than the
"http/https" which is likely the first focus.


> 2) Some absolute to relative conversions:
> https://www.gsarchive.net/whowaswho/index.htm should be referred to as
> /whowaswho/index.htm instead

Similarly, if you have a list of these.


> 3) A few outdated URLs for which we know the replacement, eg
> http://www.cris.com/~oakapple/gasdisc/ to
> http://www.gasdisc.oakapplepress.com/ (this one can't go on
> HTTPS, which is one reason I can't shortcut that)

Again.


> 4) Some internal broken links where the path is wrong - anything that
> resolves to /books/ but can't be found might be better
> rewritten as /html/perf_grps/websites/ if the file can be
> found there

Again.


> 5) Any external link that yields a permanent redirect should, to save
> clientside requests, get replaced by the destination. We have some
> Creative Commons badges that have moved to new URLs.

Do you have these as a list, or are you intending the automated-method
to auto-magically follow the link to determine any need for action?
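
The auto-magical version is only a few lines (a sketch, again assuming
requests):

    import requests

    def permanent_destination(url):
        """Return the redirect target if url permanently redirects,
        else None. (Sketch -- no retries, no relative-Location handling.)"""
        r = requests.head(url, allow_redirects=False, timeout=10)
        if r.status_code in (301, 308):
            return r.headers.get('Location')
        return None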


> And there'll be other fixes to be done too. So it's a bit complicated,
> and no simple solution is really sufficient. At the very very least, I
> *need* to properly parse with BS4; the only question is whether I
> reconstruct from the parse tree, or go back to the raw file and try to
> edit it there.

At least the diffs would give you something to work-from, but it's a bit
like git-diffs claiming a 'change' when the only difference is that my
IDE strips blanks from the ends of code-lines, or some-such silliness.

Which brings me to ask: why "*need* to properly parse with BS4"?

What about selective use of tools, previously-mentioned in this thread?

Is Selenium worthy of consideration?

I'm assuming you've already been using a link-checker utility to locate
the links which need to be changed. They can be used in QA-mode
after-the-fact too.


> For the record, I have very long-term plans to migrate parts of the
> site to Markdown, which would make a lot of things easier. But for
> now, I need to fix the existing problems in the existing HTML files,
> without doing gigantic wholesale layout changes.

Re: Mutating an HTML file with BeautifulSoup

2022-08-20 Thread Jon Ribbens via Python-list
On 2022-08-20, Stefan Ram  wrote:
> Jon Ribbens  writes:
>>... or you could avoid all that faff and just do re.sub()?
>
> import bs4
> import re
>
> source = '<a href="http"></a>'
>
> # Use Python to change the source, keeping the order of attributes.
>
> result = re.sub( r'href\s*=\s*"http"', r'href="https"', source )
> result = re.sub( r"href\s*=\s*'http'", r"href='https'", result )

You could go a bit harder with the regexp of course, e.g.:

  result = re.sub(
      r"""(<\s*a\s+[^>]*href\s*=\s*)(['"])\s*OLD\s*\2""",
      r"\1\2NEW\2",
      source,
      flags=re.IGNORECASE
  )
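
One caveat if OLD is a literal URL rather than a regexp fragment: escape it
first, and use a callable replacement so NEW cannot be misread as a group
reference (sketch; the function name is illustrative):

  import re

  def sub_href(source, old, new):
      pattern = (r"""(<\s*a\s+[^>]*href\s*=\s*)(['"])\s*"""
                 + re.escape(old) + r"\s*\2")
      return re.sub(pattern,
                    lambda m: m.group(1) + m.group(2) + new + m.group(2),
                    source, flags=re.IGNORECASE)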

> # Now use BeautifulSoup only for the verification of the result.
>
> reference = bs4.BeautifulSoup( source, features="html.parser" )
> for a in reference.find_all( "a" ):
>     if a[ 'href' ]== 'http': a[ 'href' ]='https'
>
> print( bs4.BeautifulSoup( result, features="html.parser" )== reference )

Hmm, yes that seems like a pretty good idea.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Mutating an HTML file with BeautifulSoup

2022-08-20 Thread Jon Ribbens via Python-list
On 2022-08-20, Chris Angelico  wrote:
> On Sun, 21 Aug 2022 at 03:27, Stefan Ram  wrote:
>> 2qdxy4rzwzuui...@potatochowder.com writes:
>> >textual representations.  That way, the following two elements are the
>> >same (and similar with a collection of sub-elements in a different order
>> >in another document):
>>
>>   The /elements/ differ. They have the /same/ infoset.
>
> That's the bit that's hard to prove.
>
>>   The OP could edit the files with regexps to create a new version.
>
> To you and Jon, who also suggested this: how would that be beneficial?
> With Beautiful Soup, I have the line number and position within the
> line where the tag starts; what does a regex give me that I don't have
> that way?

You mean you could use BeautifulSoup to read the file and identify the
bits you want to change by line number and offset, and then you could
use that data to try and update the file, hoping like hell that your
definition of "line" and "offset" are identical to BeautifulSoup's
and that you don't mess up later changes when you do earlier ones (you
could do them in reverse order of line and offset I suppose) and
probably resorting to regexps anyway in order to find the part of the
tag you want to change ...

... or you could avoid all that faff and just do re.sub()?
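
For the record, that sourceline/sourcepos approach would look something
like this sketch (assumes the html.parser backend, where sourceline is
1-based and sourcepos is a 0-based column; fix() is an assumed callable
mapping an old URL to a new one):

    from bs4 import BeautifulSoup

    def rewrite_hrefs(text, fix):
        soup = BeautifulSoup(text, "html.parser")
        lines = text.splitlines(keepends=True)
        edits = []
        for a in soup.find_all("a", href=True):
            new = fix(a["href"])
            if new != a["href"] and a.sourceline is not None:
                edits.append((a.sourceline - 1, a.sourcepos, a["href"], new))
        # apply bottom-up so earlier (line, column) positions stay valid
        for lineno, col, old, new in sorted(edits, reverse=True):
            line = lines[lineno]
            lines[lineno] = line[:col] + line[col:].replace(old, new, 1)
        return "".join(lines)

It still breaks if a tag spans lines or the href is entity-encoded, which
is exactly the faff being objected to.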
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Mutating an HTML file with BeautifulSoup

2022-08-20 Thread Chris Angelico
On Sun, 21 Aug 2022 at 03:27, Stefan Ram  wrote:
>
> 2qdxy4rzwzuui...@potatochowder.com writes:
> >textual representations.  That way, the following two elements are the
> >same (and similar with a collection of sub-elements in a different order
> >in another document):
>
>   The /elements/ differ. They have the /same/ infoset.

That's the bit that's hard to prove.

>   The OP could edit the files with regexps to create a new version.

To you and Jon, who also suggested this: how would that be beneficial?
With Beautiful Soup, I have the line number and position within the
line where the tag starts; what does a regex give me that I don't have
that way?

>   Soup := BeautifulSoup.
>
>   Then have Soup read both the new version and the old version.
>
>   Then have Soup also edit the old version read in, the same way as
>   the regexps did and verify that now the old version edited by
>   Soup and the new version created using regexps agree.
>
>   Or just use Soup as a tool to show the diffs for visual inspection
>   by having Soup read both the original version and the version edited
>   with regexps. Now both are normalized by Soup and Soup can show the
>   diffs (such a diff feature might not be a part of Soup, but it should
>   not be too much effort to write one using Soup).
>

But as mentioned, the entire problem *is* the normalization, as I have
no proof that it has had no impact on the rendering of the page.
Comparing two normalized versions is no better than my original option
1, whereby I simply ignore the normalization and write out the
reconstructed content.

It's easy if you know for certain that the page is well-formed. Much
harder if you do not - or, as in some cases, if you know the page is
badly-formed.
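
(For completeness, the normalise-and-diff step Stefan describes is only a
few lines, assuming prettify() as the normal form:

    import difflib
    from bs4 import BeautifulSoup

    def soup_diff(old_text, new_text):
        a = BeautifulSoup(old_text, "html.parser").prettify().splitlines()
        b = BeautifulSoup(new_text, "html.parser").prettify().splitlines()
        return "\n".join(difflib.unified_diff(a, b, "before", "after",
                                              lineterm=""))

but a clean diff of two normalised trees still says nothing about what the
normalisation itself did to the rendering.)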

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Mutating an HTML file with BeautifulSoup

2022-08-20 Thread Jon Ribbens via Python-list
On 2022-08-19, Chris Angelico  wrote:
> What's the best way to precisely reconstruct an HTML file after
> parsing it with BeautifulSoup?
>
> Using the Alice example from the BS4 docs:
>
> >>> html_doc = """<html><head><title>The Dormouse's story</title></head>
> <body>
> <p class="title"><b>The Dormouse's story</b></p>
>
> <p class="story">Once upon a time there were three little sisters; and
> their names were
> <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
> <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
> <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
> and they lived at the bottom of a well.</p>
>
> <p class="story">...</p>
> """
> >>> print(soup)
> <html><head><title>The Dormouse's story</title></head>
> <body>
> <p class="title"><b>The Dormouse's story</b></p>
> <p class="story">Once upon a time there were three little sisters; and
> their names were
> <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
> <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
> <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
> and they lived at the bottom of a well.</p>
> <p class="story">...</p>
> </body></html>

>
> Note two distinct changes: firstly, whitespace has been removed, and
> secondly, attributes are reordered (I think alphabetically). There are
> other canonicalizations being done, too.
>
> I'm trying to make some automated changes to a huge number of HTML
> files, with minimal diffs so they're easy to validate. That means that
> spurious changes like these are very much unwanted. Is there a way to
> get BS4 to reconstruct the original precisely?
>
> The mutation itself would be things like finding an anchor tag and
> changing its href attribute. Fairly simple changes, but might alter
> the length of the file (eg changing "http://example.com/" into
> "https://example.com/"). I'd like to do them intelligently rather than
> falling back on element.sourceline and element.sourcepos, but worst
> case, that's what I'll have to do (which would be fiddly).

I'm tempting the Wrath of Zalgo by saying it, but ... regexp?
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Mutating an HTML file with BeautifulSoup

2022-08-19 Thread Chris Angelico
On Sat, 20 Aug 2022 at 10:19, dn  wrote:
>
> On 20/08/2022 09.01, Chris Angelico wrote:
> > On Sat, 20 Aug 2022 at 05:12, Barry  wrote:
> >>
> >>
> >>
> >>> On 19 Aug 2022, at 19:33, Chris Angelico  wrote:
> >>>
> >>> What's the best way to precisely reconstruct an HTML file after
> >>> parsing it with BeautifulSoup?
> >>
> >> I recall that in bs4 it parses into an object tree and loses the detail of 
> >> the input.
> >> I recently ported from very old bs to bs4 and hit the same issue.
> >> So no it will not output the same as went in.
> >>
> >> If you can trust the input to be parsed as xml, meaning all the rules of 
> >> closing
> >> tags have been followed, then I think you can parse and unparse thru xml to
> >> do what you want.
> >>
> >
> >
> > Yeah, no I can't, this is HTML 4 with a ton of inconsistencies. Oh
> > well. Thanks for trying, anyhow.
> >
> > So I'm left with a few options:
> >
> > 1) Give up on validation, give up on verification, and just run this
> > thing on the production site with my fingers crossed
> > 2) Instead of doing an intelligent reconstruction, just str.replace()
> > one URL with another within the file
> > 3) Split the file into lines, find the Nth line (elem.sourceline) and
> > str.replace that line only
> > 4) Attempt to use elem.sourceline and elem.sourcepos to find the start
> > of the tag, manually find the end, and replace one tag with the
> > reconstructed form.
> >
> > I'm inclined to the first option, honestly. The others just seem like
> > hard work, and I became a programmer so I could be lazy...
> +1 - but I've noticed that sometimes I have to work quite hard to be
> this lazy!

Yeah, that's very true...

> Am assuming that http -> https is not the only 'change' (if it were,
> you'd just do that without BS). How many such changes are planned/need
> checking? Care to list them?
>

Assumption is correct. The changes are more of the form "find all the
problems, add to the list of fixes, try to minimize the ones that need
to be done manually". So far, what I have is:

1) A bunch of http -> https, but not all of them - only domains where
I've confirmed that it's valid
2) Some absolute to relative conversions:
https://www.gsarchive.net/whowaswho/index.htm should be referred to as
/whowaswho/index.htm instead
3) A few outdated URLs for which we know the replacement, eg
http://www.cris.com/~oakapple/gasdisc/ to
http://www.gasdisc.oakapplepress.com/ (this one can't go on
HTTPS, which is one reason I can't shortcut that)
4) Some internal broken links where the path is wrong - anything that
resolves to /books/ but can't be found might be better
rewritten as /html/perf_grps/websites/ if the file can be
found there
5) Any external link that yields a permanent redirect should, to save
clientside requests, get replaced by the destination. We have some
Creative Commons badges that have moved to new URLs.

And there'll be other fixes to be done too. So it's a bit complicated,
and no simple solution is really sufficient. At the very very least, I
*need* to properly parse with BS4; the only question is whether I
reconstruct from the parse tree, or go back to the raw file and try to
edit it there.

For the record, I have very long-term plans to migrate parts of the
site to Markdown, which would make a lot of things easier. But for
now, I need to fix the existing problems in the existing HTML files,
without doing gigantic wholesale layout changes.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Mutating an HTML file with BeautifulSoup

2022-08-19 Thread dn
On 20/08/2022 09.01, Chris Angelico wrote:
> On Sat, 20 Aug 2022 at 05:12, Barry  wrote:
>>
>>
>>
>>> On 19 Aug 2022, at 19:33, Chris Angelico  wrote:
>>>
>>> What's the best way to precisely reconstruct an HTML file after
>>> parsing it with BeautifulSoup?
>>
>> I recall that in bs4 it parses into an object tree and loses the detail of 
>> the input.
>> I recently ported from very old bs to bs4 and hit the same issue.
>> So no it will not output the same as went in.
>>
>> If you can trust the input to be parsed as xml, meaning all the rules of 
>> closing
>> tags have been followed, then I think you can parse and unparse thru xml to
>> do what you want.
>>
> 
> 
> Yeah, no I can't, this is HTML 4 with a ton of inconsistencies. Oh
> well. Thanks for trying, anyhow.
> 
> So I'm left with a few options:
> 
> 1) Give up on validation, give up on verification, and just run this
> thing on the production site with my fingers crossed
> 2) Instead of doing an intelligent reconstruction, just str.replace()
> one URL with another within the file
> 3) Split the file into lines, find the Nth line (elem.sourceline) and
> str.replace that line only
> 4) Attempt to use elem.sourceline and elem.sourcepos to find the start
> of the tag, manually find the end, and replace one tag with the
> reconstructed form.
> 
> I'm inclined to the first option, honestly. The others just seem like
> hard work, and I became a programmer so I could be lazy...
+1 - but I've noticed that sometimes I have to work quite hard to be
this lazy!


Am assuming that http -> https is not the only 'change' (if it were,
you'd just do that without BS). How many such changes are planned/need
checking? Care to list them?

-- 
-- 
Regards,
=dn
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Mutating an HTML file with BeautifulSoup

2022-08-19 Thread Chris Angelico
On Sat, 20 Aug 2022 at 10:04, David  wrote:
>
> On Sat, 20 Aug 2022 at 04:31, Chris Angelico  wrote:
>
> > What's the best way to precisely reconstruct an HTML file after
> > parsing it with BeautifulSoup?
>
> > Note two distinct changes: firstly, whitespace has been removed, and
> > secondly, attributes are reordered (I think alphabetically). There are
> > other canonicalizations being done, too.
>
> > I'm trying to make some automated changes to a huge number of HTML
> > files, with minimal diffs so they're easy to validate. That means that
> > spurious changes like these are very much unwanted. Is there a way to
> > get BS4 to reconstruct the original precisely?
>
> On Sat, 20 Aug 2022 at 07:02, Chris Angelico  wrote:
> > On Sat, 20 Aug 2022 at 05:12, Barry  wrote:
>
> > > I recall that in bs4 it parses into an object tree and loses the detail
> > > of the input.  I recently ported from very old bs to bs4 and hit the
> > > same issue.  So no it will not output the same as went in.
>
> > So I'm left with a few options:
>
> > 1) Give up on validation, give up on verification, and just run this
> >thing on the production site with my fingers crossed
>
> > 2) Instead of doing an intelligent reconstruction, just str.replace() one
> >URL with another within the file
>
> > 3) Split the file into lines, find the Nth line (elem.sourceline) and
> >str.replace that line only
>
> > 4) Attempt to use elem.sourceline and elem.sourcepos to find the start of
> >the tag, manually find the end, and replace one tag with the
> >reconstructed form.
>
> > I'm inclined to the first option, honestly. The others just seem like
> > hard work, and I became a programmer so I could be lazy...
>
> Hi, I don't know if you will like this option, but I don't see it on the
> list yet so ...

Hey, all options are welcomed :)

> I'm assuming that the phrase "with minimal diffs so they're easy to
> validate" means being eyeballed by a human.
>
> Have you considered two passes through BS? Do the first pass with no
> modification, so that the intermediate result gets the BS default
> "spurious" changes.
>
> Then do the second pass with the desired changes, so that the human will
> see only the desired changes in the diff.

I'm 100% confident of the actual changes, so that wouldn't really
solve anything. The problem is that, without eyeballing the actual
changes, I can't easily see if there's been something else changed or
broken. This is a scripted change that will affect probably hundreds
of HTML files across a large web site, so making sure I don't break
anything means either (a) minimize the diff so it's clearly correct,
or (b) eyeball the rendered versions of every page - manually - to see
if there were any unintended changes. (There WILL be intended visual
changes, so I can't render the page to bitmap and ensure that it
hasn't changed. This is not React snapshot testing, which IMO is one
of the most useless testing features ever devised. No, actually, that
can't be true, someone MUST have made a worse one.)

Appreciate the suggestion, though!

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Mutating an HTML file with BeautifulSoup

2022-08-19 Thread David
On Sat, 20 Aug 2022 at 04:31, Chris Angelico  wrote:

> What's the best way to precisely reconstruct an HTML file after
> parsing it with BeautifulSoup?

> Note two distinct changes: firstly, whitespace has been removed, and
> secondly, attributes are reordered (I think alphabetically). There are
> other canonicalizations being done, too.

> I'm trying to make some automated changes to a huge number of HTML
> files, with minimal diffs so they're easy to validate. That means that
> spurious changes like these are very much unwanted. Is there a way to
> get BS4 to reconstruct the original precisely?

On Sat, 20 Aug 2022 at 07:02, Chris Angelico  wrote:
> On Sat, 20 Aug 2022 at 05:12, Barry  wrote:

> > I recall that in bs4 it parses into an object tree and loses the detail
> > of the input.  I recently ported from very old bs to bs4 and hit the
> > same issue.  So no it will not output the same as went in.

> So I'm left with a few options:

> 1) Give up on validation, give up on verification, and just run this
>thing on the production site with my fingers crossed

> 2) Instead of doing an intelligent reconstruction, just str.replace() one
>URL with another within the file

> 3) Split the file into lines, find the Nth line (elem.sourceline) and
>str.replace that line only

> 4) Attempt to use elem.sourceline and elem.sourcepos to find the start of
>the tag, manually find the end, and replace one tag with the
>reconstructed form.

> I'm inclined to the first option, honestly. The others just seem like
> hard work, and I became a programmer so I could be lazy...

Hi, I don't know if you will like this option, but I don't see it on the
list yet so ...

I'm assuming that the phrase "with minimal diffs so they're easy to
validate" means being eyeballed by a human.

Have you considered two passes through BS? Do the first pass with no
modification, so that the intermediate result gets the BS default
"spurious" changes.

Then do the second pass with the desired changes, so that the human will
see only the desired changes in the diff.
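
A sketch of that two-pass pipeline, assuming the edits are expressed as a
callable that mutates the soup in place:

    from bs4 import BeautifulSoup

    def round_trip(text):
        # pass 1: no edits, just BS4's own canonicalisation
        return str(BeautifulSoup(text, "html.parser"))

    def transform(text, fix):
        # pass 2: the same canonicalisation plus the intended edits
        soup = BeautifulSoup(text, "html.parser")
        fix(soup)
        return str(soup)

Diffing round_trip(original) against transform(original, fix) then shows
only the intended changes.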
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Mutating an HTML file with BeautifulSoup

2022-08-19 Thread Chris Angelico
On Sat, 20 Aug 2022 at 05:12, Barry  wrote:
>
>
>
> > On 19 Aug 2022, at 19:33, Chris Angelico  wrote:
> >
> > What's the best way to precisely reconstruct an HTML file after
> > parsing it with BeautifulSoup?
>
> I recall that in bs4 it parses into an object tree and loses the detail of 
> the input.
> I recently ported from very old bs to bs4 and hit the same issue.
> So no it will not output the same as went in.
>
> If you can trust the input to be parsed as xml, meaning all the rules of 
> closing
> tags have been followed, then I think you can parse and unparse thru xml to
> do what you want.
>


Yeah, no I can't, this is HTML 4 with a ton of inconsistencies. Oh
well. Thanks for trying, anyhow.

So I'm left with a few options:

1) Give up on validation, give up on verification, and just run this
thing on the production site with my fingers crossed
2) Instead of doing an intelligent reconstruction, just str.replace()
one URL with another within the file
3) Split the file into lines, find the Nth line (elem.sourceline) and
str.replace that line only
4) Attempt to use elem.sourceline and elem.sourcepos to find the start
of the tag, manually find the end, and replace one tag with the
reconstructed form.

I'm inclined to the first option, honestly. The others just seem like
hard work, and I became a programmer so I could be lazy...

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Mutating an HTML file with BeautifulSoup

2022-08-19 Thread 2QdxY4RzWzUUiLuE
On 2022-08-19 at 20:12:35 +0100,
Barry  wrote:

> > On 19 Aug 2022, at 19:33, Chris Angelico  wrote:
> > 
> > What's the best way to precisely reconstruct an HTML file after
> > parsing it with BeautifulSoup?
> 
> I recall that in bs4 it parses into an object tree and loses the
> detail of the input.  I recently ported from very old bs to bs4 and
> hit the same issue.  So no it will not output the same as went in.
> 
> If you can trust the input to be parsed as xml, meaning all the rules
> of closing tags have been followed, then I think you can parse and
> unparse thru xml to do what you want.

XML is in the same boat.  Except for "canonical form" (which underlies
cryptographically signed XML documents) the standards explicitly don't
require tools to round-trip the "source code."  The preferred method of
comparing XML documents is at the structural level rather than with
textual representations.  That way, the following two elements are the
same (and similar with a collection of sub-elements in a different order
in another document):

    <element attribute1="value1" attribute2="value2"/>

and

    <element attribute2="value2" attribute1="value1"/>

Dan
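
A structural comparison along those lines is short with the standard
library's ElementTree; attribute order falls out for free because attrib
is a dict (sketch; the whitespace-stripping is a policy choice):

    import xml.etree.ElementTree as ET

    def same_structure(e1, e2):
        return (e1.tag == e2.tag
                and e1.attrib == e2.attrib
                and (e1.text or "").strip() == (e2.text or "").strip()
                and (e1.tail or "").strip() == (e2.tail or "").strip()
                and len(e1) == len(e2)
                and all(same_structure(c1, c2) for c1, c2 in zip(e1, e2)))

    # same_structure(ET.fromstring(doc_a), ET.fromstring(doc_b))

(Child order is still significant here; sort the children first if it
should not be.)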
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Mutating an HTML file with BeautifulSoup

2022-08-19 Thread Barry


> On 19 Aug 2022, at 19:33, Chris Angelico  wrote:
> 
> What's the best way to precisely reconstruct an HTML file after
> parsing it with BeautifulSoup?

I recall that in bs4 it parses into an object tree and loses the detail of the 
input.
I recently ported from very old bs to bs4 and hit the same issue.
So no it will not output the same as went in.

If you can trust the input to be parsed as xml, meaning all the rules of closing
tags have been followed, then I think you can parse and unparse thru xml to
do what you want.

Barry


> 
> Using the Alice example from the BS4 docs:
> 
> >>> html_doc = """<html><head><title>The Dormouse's story</title></head>
> <body>
> <p class="title"><b>The Dormouse's story</b></p>
>
> <p class="story">Once upon a time there were three little sisters; and
> their names were
> <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
> <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
> <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
> and they lived at the bottom of a well.</p>
>
> <p class="story">...</p>
> """
> >>> print(soup)
> <html><head><title>The Dormouse's story</title></head>
> <body>
> <p class="title"><b>The Dormouse's story</b></p>
> <p class="story">Once upon a time there were three little sisters; and
> their names were
> <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
> <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
> <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
> and they lived at the bottom of a well.</p>
> <p class="story">...</p>
> </body></html>
 
> 
> Note two distinct changes: firstly, whitespace has been removed, and
> secondly, attributes are reordered (I think alphabetically). There are
> other canonicalizations being done, too.
> 
> I'm trying to make some automated changes to a huge number of HTML
> files, with minimal diffs so they're easy to validate. That means that
> spurious changes like these are very much unwanted. Is there a way to
> get BS4 to reconstruct the original precisely?
> 
> The mutation itself would be things like finding an anchor tag and
> changing its href attribute. Fairly simple changes, but might alter
> the length of the file (eg changing "http://example.com/" into
> "https://example.com/"). I'd like to do them intelligently rather than
> falling back on element.sourceline and element.sourcepos, but worst
> case, that's what I'll have to do (which would be fiddly).
> 
> ChrisA
> -- 
> https://mail.python.org/mailman/listinfo/python-list
> 

-- 
https://mail.python.org/mailman/listinfo/python-list