On 2022-10-25 03:09:33 +1100, Chris Angelico wrote:
> On Tue, 25 Oct 2022 at 02:45, Jon Ribbens via Python-list
> <python-list@python.org> wrote:
> > On 2022-10-24, Chris Angelico <ros...@gmail.com> wrote:
> > > On Mon, 24 Oct 2022 at 23:22, Peter J. Holzer <hjp-pyt...@hjp.at> wrote:
> > >> Yes, I got that. What I wanted to say was that this is indeed a bug in
> > >> html.parser and not an error (or sloppyness, as you called it) in the
> > >> input or ambiguity in the HTML standard.
> > >
> > > I described the HTML as "sloppy" for a number of reasons, but I was of
> > > the understanding that it's generally recommended to have the closing
> > > tags. Not that it matters much.
> >
> > Some elements don't need close tags, or even open tags. Unless you're
> > using XHTML you don't need them and indeed for the case of void tags
> > (e.g. <br>, <img>) you must not include the close tags.
>
> Yep, I'm aware of void tags, but I'm talking about the container tags
> - in this case, <li> and <p> - which, in a lot of older HTML pages,
> are treated as "separator" tags. Consider this content:
>
> <HTML>
> Hello, world!
> <P>
> Paragraph 2
> <P>
> Hey look, a third paragraph!
> </HTML>
>
> Stick a doctype onto that and it should be valid HTML5, but as it is,
> it's the exact sort of thing that was quite common in the 90s.
>
> The <p> tag is not a void tag, but according to the spec, it's legal
> to omit the </p> if the element is followed directly by another <p>
> element (or any of a specific set of others), or if there is no
> further content.

Right. The parser knows the structure of an HTML document: which tags
are optional and which elements can be inside of which other elements.
For SGML-based HTML versions (2.0 to 4.01) this is formally described by
the DTD.

So when parsing your file, an HTML parser would work like this:

<HTML> - Yup, I expect an HTML element here:

    HTML

Hello, world! - #PCDATA? Not allowed as a child of HTML. There must be a
HEAD and a BODY, both of which have optional start tags. HEAD can't
contain #PCDATA either, so we must be inside of BODY and HEAD was empty:

    HTML
    ├─ HEAD
    └─ BODY
       └─ #PCDATA: Hello, world!

<P> - Allowed in BODY, so just add that:

    HTML
    ├─ HEAD
    └─ BODY
       ├─ #PCDATA: Hello, world!
       └─ P

Paragraph 2 - #PCDATA is allowed in P, so add it as a child:

    HTML
    ├─ HEAD
    └─ BODY
       ├─ #PCDATA: Hello, world!
       └─ P
          └─ #PCDATA: Paragraph 2

<P> - Not allowed inside of P, so that implicitly closes the previous P
element and we go up one level:

    HTML
    ├─ HEAD
    └─ BODY
       ├─ #PCDATA: Hello, world!
       ├─ P
       │  └─ #PCDATA: Paragraph 2
       └─ P

Hey look, a third paragraph! - Same as above:

    HTML
    ├─ HEAD
    └─ BODY
       ├─ #PCDATA: Hello, world!
       ├─ P
       │  └─ #PCDATA: Paragraph 2
       └─ P
          └─ #PCDATA: Hey look, a third paragraph!

</HTML> - The end tags of P and BODY are optional, so the end of HTML
closes them implicitly, and we have our final parse tree (unchanged from
the last step):

    HTML
    ├─ HEAD
    └─ BODY
       ├─ #PCDATA: Hello, world!
       ├─ P
       │  └─ #PCDATA: Paragraph 2
       └─ P
          └─ #PCDATA: Hey look, a third paragraph!

For a human, the <p> tags might feel like separators here. But
syntactically they aren't - they start a new element. Note especially
that "Hello, world!" is not part of a P element but a direct child of
BODY (which may or may not be intended by the author).

> > Adding in the omitted <head>, </head>, <body>, </body>, and </html>
> > would make no difference and there's no particular reason to recommend
> > doing so as far as I'm aware.
>
> And yet most people do it. Why?

There may be several reasons:

* Historically, some browsers differed in which end tags were actually
  optional. Since (AFAIK) no mainstream browser ever implemented a real
  SGML parser (they were always "tag soup" parsers with lots of ad-hoc
  rules) this sometimes even changed within the same browser depending
  on context (e.g. a simple table might work but nested tables
  wouldn't). So people started to use end tags defensively.
* XHTML was popular for some time and it doesn't have any optional tags.
  So people got into the habit of always using end tags and writing
  empty tags as <XXX />.
* Aesthetics: Always writing the end tags is more consistent and may
  look more balanced.
* Cargo-cult: People saw other people do that and copied the habit
  without thinking about it.

> Are you saying that it's better to omit them all?

If you want to conserve keystrokes :-)

I think it doesn't matter. Both are valid.

> More importantly: Would you omit all the </p> closing tags you can, or
> would you include them?

I usually write them. I also indent the contents of an element, so I
would write your example as:

<!DOCTYPE html>
<html>
  <body>
    Hello, world!
    <p>
      Paragraph 2
    </p>
    <p>
      Hey look, a third paragraph!
    </p>
  </body>
</html>

(As you can see I would also include the body tags to make that element
explicit. I would normally also add a bit of boilerplate (especially a
head with a charset and viewport definition), but I omit it here since
it would change the parse tree.)

        hp

-- 
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | h...@hjp.at        |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"
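
PS: For anyone who wants to play with the parse walk earlier in this
message, here is a minimal sketch in Python. It is a toy, not a real
SGML or HTML5 parser: it hard-codes just the two rules the example
needs (content directly under HTML implies an empty HEAD and an open
BODY; a P start tag closes a P that is still open). ImpliedTagBuilder
and parse_with_implied_tags are names made up for this sketch; only
HTMLParser and its handle_* callbacks come from the standard library,
and HTMLParser itself only tokenizes, so all tree building happens in
the subclass.

```python
from html.parser import HTMLParser


class ImpliedTagBuilder(HTMLParser):
    """Toy tree builder; nodes are (name, children) tuples.

    Hypothetical helper, not part of html.parser. It knows only two
    rules: content directly under <html> implies an empty head plus an
    open body, and a <p> start tag closes a <p> that is still open.
    """

    def __init__(self):
        super().__init__()
        self.root = ("#root", [])
        self.stack = [self.root]  # chain of currently open elements

    def _open(self, name):
        node = (name, [])
        self.stack[-1][1].append(node)
        self.stack.append(node)

    def _imply_body(self):
        # Only fires when we are sitting directly inside <html>.
        if self.stack[-1][0] == "html":
            self._open("head")
            self.stack.pop()  # the implied head stays empty
            self._open("body")

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names already lowercased.
        if tag not in ("html", "head", "body"):
            self._imply_body()
        if tag == "p" and self.stack[-1][0] == "p":
            self.stack.pop()  # P may not contain P: close the open one
        self._open(tag)

    def handle_data(self, data):
        text = data.strip()
        if text:  # ignore whitespace-only runs between tags
            self._imply_body()
            self.stack[-1][1].append(text)

    def handle_endtag(self, tag):
        # Closing an element also closes everything still open inside it.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def parse_with_implied_tags(markup):
    """Return the top-level element of markup as a (name, children) tuple."""
    builder = ImpliedTagBuilder()
    builder.feed(markup)
    builder.close()
    return builder.root[1][0]
```

Feeding it the <HTML> ... </HTML> snippet quoted at the top should give
an html node with an empty head and a body whose children are the text
"Hello, world!" followed by two p nodes, i.e. the same shape as the
final tree in the walk-through.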
-- https://mail.python.org/mailman/listinfo/python-list