On 2022-10-25 03:09:33 +1100, Chris Angelico wrote:
> On Tue, 25 Oct 2022 at 02:45, Jon Ribbens via Python-list
> <[email protected]> wrote:
> > On 2022-10-24, Chris Angelico <[email protected]> wrote:
> > > On Mon, 24 Oct 2022 at 23:22, Peter J. Holzer <[email protected]> wrote:
> > >> Yes, I got that. What I wanted to say was that this is indeed a bug in
> > >> html.parser and not an error (or sloppyness, as you called it) in the
> > >> input or ambiguity in the HTML standard.
> > >
> > > I described the HTML as "sloppy" for a number of reasons, but I was of
> > > the understanding that it's generally recommended to have the closing
> > > tags. Not that it matters much.
> >
> > Some elements don't need close tags, or even open tags. Unless you're
> > using XHTML you don't need them and indeed for the case of void tags
> > (e.g. <br>, <img>) you must not include the close tags.
>
> Yep, I'm aware of void tags, but I'm talking about the container tags
> - in this case, <li> and <p> - which, in a lot of older HTML pages,
> are treated as "separator" tags. Consider this content:
>
> <HTML>
> Hello, world!
> <P>
> Paragraph 2
> <P>
> Hey look, a third paragraph!
> </HTML>
>
> Stick a doctype onto that and it should be valid HTML5, but as it is,
> it's the exact sort of thing that was quite common in the 90s.
>
> The <p> tag is not a void tag, but according to the spec, it's legal
> to omit the </p> if the element is followed directly by another <p>
> element (or any of a specific set of others), or if there is no
> further content.
Right. The parser knows the structure of an HTML document, which tags
are optional and which elements can be inside of which other elements.
For SGML-based HTML versions (2.0 to 4.01) this is formally described by
the DTD.
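To make that a bit more concrete, here is a small sketch of the kind of
information the DTD carries about tag omission. The table is a hand-copied
subset of the element declarations in the HTML 4.01 DTD; the names TagRule,
RULES and end_tag_rule are made up for this illustration and are not part
of any library.

```python
from typing import NamedTuple


class TagRule(NamedTuple):
    start_optional: bool  # may the start tag be omitted?
    end: str              # "required", "optional" or "forbidden"


# Subset of the HTML 4.01 element declarations (illustrative, not complete).
# In the DTD, "O O" means both tags are omissible, "- O" means the start tag
# is required and the end tag is omissible; EMPTY elements have no end tag.
RULES = {
    "HTML": TagRule(True, "optional"),
    "HEAD": TagRule(True, "optional"),
    "BODY": TagRule(True, "optional"),
    "P": TagRule(False, "optional"),
    "LI": TagRule(False, "optional"),
    "DIV": TagRule(False, "required"),
    "TABLE": TagRule(False, "required"),
    "BR": TagRule(False, "forbidden"),
    "IMG": TagRule(False, "forbidden"),
}


def end_tag_rule(name):
    """Look up the end tag rule for an element name (case-insensitive)."""
    return RULES[name.upper()].end


if __name__ == "__main__":
    for tag in ("p", "br", "div"):
        print(tag, "->", end_tag_rule(tag))
```

A real SGML parser gets all of this (plus the content models) from the DTD
itself instead of from a hard-coded table.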
So when parsing your file, an HTML parser would work like this:

<HTML> - Yup, I expect an HTML element here:

    HTML

Hello, world! - #PCDATA? Not allowed as a child of HTML. There must
    be a HEAD and a BODY, both of which have optional start tags.
    HEAD can't contain #PCDATA either, so we must be inside of BODY
    and HEAD was empty:

    HTML
    ├─ HEAD
    └─ BODY
       └─ #PCDATA: Hello, world!

<P> - Allowed in BODY, so just add that:

    HTML
    ├─ HEAD
    └─ BODY
       ├─ #PCDATA: Hello, world!
       └─ P

Paragraph 2 - #PCDATA is allowed in P, so add it as a child:

    HTML
    ├─ HEAD
    └─ BODY
       ├─ #PCDATA: Hello, world!
       └─ P
          └─ #PCDATA: Paragraph 2

<P> - Not allowed inside of P, so that implicitly closes the
    previous P element and we go up one level:

    HTML
    ├─ HEAD
    └─ BODY
       ├─ #PCDATA: Hello, world!
       ├─ P
       │  └─ #PCDATA: Paragraph 2
       └─ P

Hey look, a third paragraph! - Same as above:

    HTML
    ├─ HEAD
    └─ BODY
       ├─ #PCDATA: Hello, world!
       ├─ P
       │  └─ #PCDATA: Paragraph 2
       └─ P
          └─ #PCDATA: Hey look, a third paragraph!

</HTML> - The end tags of P and BODY are optional, so the end of
    HTML closes them implicitly, and we have our final parse tree
    (unchanged from the last step):

    HTML
    ├─ HEAD
    └─ BODY
       ├─ #PCDATA: Hello, world!
       ├─ P
       │  └─ #PCDATA: Paragraph 2
       └─ P
          └─ #PCDATA: Hey look, a third paragraph!

For a human, the <p> tags might feel like separators here. But
syntactically they aren't - they start a new element. Note especially
that "Hello, world!" is not part of a P element but a direct child of
BODY (which may or may not be intended by the author).
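
The walk-through above can be sketched in Python on top of html.parser,
which by itself only reports tags and text as it encounters them. This is a
toy and not a conforming parser: the names Node, ImpliedTagBuilder and
parse_tree are invented for this example, the rules are hard-coded instead
of being read from a DTD, and only HTML, HEAD, BODY, P and text are handled.

```python
from html.parser import HTMLParser


class Node:
    """Parse tree node: an element, or a '#PCDATA' leaf carrying text."""

    def __init__(self, name, text=None):
        self.name = name
        self.text = text
        self.children = []

    def dump(self):
        """Return the subtree as nested tuples (easy to compare and print)."""
        if self.text is not None:
            return (self.name, self.text)
        return (self.name, [child.dump() for child in self.children])


class ImpliedTagBuilder(HTMLParser):
    """Toy tree builder that infers omitted HEAD, BODY and P tags."""

    def __init__(self):
        super().__init__()
        self.root = None
        self.stack = []  # currently open elements, innermost last

    def _open(self, name):
        node = Node(name)
        if self.stack:
            self.stack[-1].children.append(node)
        else:
            self.root = node
        self.stack.append(node)

    def _ensure_body(self):
        # Content that may not appear directly in HTML implies the optional
        # HEAD (empty in this sketch) and BODY start tags.
        if not self.stack:
            self._open("HTML")
        if self.stack[-1].name == "HTML":
            self._open("HEAD")
            self.stack.pop()  # the HEAD end tag is optional as well
            self._open("BODY")

    def handle_starttag(self, tag, attrs):
        name = tag.upper()  # html.parser reports tag names in lower case
        if name == "HTML":
            self._open("HTML")
        elif name == "P":
            self._ensure_body()
            if self.stack[-1].name == "P":
                self.stack.pop()  # P may not contain P: close the open one
            self._open("P")
        else:
            raise NotImplementedError(f"no rule for <{name}> in this sketch")

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return  # ignore whitespace between tags
        self._ensure_body()
        self.stack[-1].children.append(Node("#PCDATA", text))

    def handle_endtag(self, tag):
        name = tag.upper()
        # An end tag also closes every element still open inside it.
        if any(node.name == name for node in self.stack):
            while self.stack.pop().name != name:
                pass


def parse_tree(markup):
    """Parse markup and return the resulting tree as nested tuples."""
    builder = ImpliedTagBuilder()
    builder.feed(markup)
    builder.close()
    return builder.root.dump()


SAMPLE = """<HTML>
Hello, world!
<P>
Paragraph 2
<P>
Hey look, a third paragraph!
</HTML>
"""

if __name__ == "__main__":
    print(parse_tree(SAMPLE))
```

Feeding it the sample yields the same shape as the final tree above: an
empty HEAD, then a BODY with the text as its first child followed by two
P elements.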
>
> > Adding in the omitted <head>, </head>, <body>, </body>, and </html>
> > would make no difference and there's no particular reason to recommend
> > doing so as far as I'm aware.
>
> And yet most people do it. Why?
There may be several reasons:
* Historically, some browsers differed in which end tags were actually
  optional. Since (AFAIK) no mainstream browser ever implemented a real
  SGML parser (they were always "tag soup" parsers with lots of ad-hoc
  rules), this sometimes even changed within the same browser depending
  on context (e.g. a simple table might work but nested tables wouldn't).
  So people started to use end tags defensively.
* XHTML was popular for some time and it doesn't have any optional tags.
  So people got into the habit of always using end tags and writing
  empty tags as <XXX />.
* Aesthetics: Always writing the end tags is more consistent and may
  look more balanced.
* Cargo-cult: People saw other people do that and copied the habit
  without thinking about it.
> Are you saying that it's better to omit them all?
If you want to conserve keystrokes :-)
I think it doesn't matter. Both are valid.
> More importantly: Would you omit all the </p> closing tags you can, or
> would you include them?
I usually write them. I also indent the contents of an element, so I
would write your example as:
    <!DOCTYPE html>
    <html>
        <body>
            Hello, world!
            <p>
                Paragraph 2
            </p>
            <p>
                Hey look, a third paragraph!
            </p>
        </body>
    </html>
(As you can see, I would also include the body tags to make that element
explicit. I would normally also add a bit of boilerplate, especially a
head with a charset and viewport definition, but I omit it here since
it would change the parse tree.)
hp
--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | [email protected]  |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"
