Re: Beautiful Soup - close tags more promptly?
On Wed, 26 Oct 2022 at 04:59, Tim Delaney wrote:
> On Mon, 24 Oct 2022 at 19:03, Chris Angelico wrote:
>> Ah, cool. Thanks. I'm not entirely sure of the various advantages and
>> disadvantages of the different parsers; is there a tabulation
>> anywhere, or at least a list of recommendations on choosing a suitable
>> parser?
>
> Coming to this a bit late, but from my experience with BeautifulSoup
> and HTML produced by other people ...
>
> lxml is easily the fastest, but also the least forgiving.
> html.parser is middling on performance, but as you've seen sometimes
> makes mistakes.
> html5lib is the slowest, but is most forgiving of malformed input and
> edge cases.
>
> I use html5lib - it's fast enough for what I do, and the most likely
> to return results matching what the author saw when they maybe tried
> it in a single web browser.

Cool cool. It sounds like html5lib should really be the recommended
parser for HTML, unless performance or dependency reduction is important
enough to change your plans.

(But only for HTML. For XML, lxml would still be the right choice.)

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
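[A rough sketch of that recommendation in code. The parser names are the ones Beautiful Soup's documentation lists; lxml and html5lib are separate installs, so the runnable line below sticks to the bundled "html.parser" builder and the alternatives appear only in comments.]

```python
from bs4 import BeautifulSoup

# The kind of input this thread is about: valid HTML with omitted </li>.
messy = "<ol><li>one<li>two<li>three</ol>"

# Beautiful Soup takes the tree builder as the second argument:
#   "html.parser" - stdlib-based, no extra dependency, middling leniency
#   "lxml"        - fastest, needs the lxml package ("xml" for XML input)
#   "html5lib"    - slowest, browser-grade error recovery, needs html5lib
soup = BeautifulSoup(messy, "html.parser")

# However the <li> elements end up arranged (nested vs. siblings),
# all three of them are still found by a recursive search:
print(len(soup.find_all("li")))   # 3
```

Swapping "html5lib" in for "html.parser" is a one-word change, which makes it easy to re-run a problem file under each builder and compare.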
Re: Beautiful Soup - close tags more promptly?
On Mon, 24 Oct 2022 at 19:03, Chris Angelico wrote:
> Ah, cool. Thanks. I'm not entirely sure of the various advantages and
> disadvantages of the different parsers; is there a tabulation
> anywhere, or at least a list of recommendations on choosing a suitable
> parser?

Coming to this a bit late, but from my experience with BeautifulSoup and
HTML produced by other people ...

lxml is easily the fastest, but also the least forgiving.
html.parser is middling on performance, but as you've seen sometimes
makes mistakes.
html5lib is the slowest, but is most forgiving of malformed input and
edge cases.

I use html5lib - it's fast enough for what I do, and the most likely to
return results matching what the author saw when they maybe tried it in
a single web browser.

Tim Delaney
Re: Beautiful Soup - close tags more promptly?
On Tue, 25 Oct 2022 at 09:34, Peter J. Holzer wrote:
> > One thing I find quite interesting, though, is the way that browsers
> > *differ* in the face of bad nesting of tags. Recently I was
> > struggling to figure out a problem with an HTML form, and eventually
> > found that there was a spurious <form> tag way up higher in the
> > page. Forms don't nest, so that's invalid, but different browsers
> > had slightly different ways of showing it.
>
> Yeah, mismatched form tags can have weird effects. I don't remember
> the details but I scratched my head over that one more than once.

Yeah. I think my weirdest issue was one time when I inadvertently had a
<dialog> element (with a form inside it) inside something else with a
form (because the </form> was missing). Neither "dialog inside main"
nor "form in dialog separate from form in main" is a problem, and even
"oops, missed a closing form tag" isn't that big a deal, but put them
all together, and you end up with a bizarre situation where Firefox 91
behaves one way and Chrome (some-version) behaves another way.

That was a fun day.

Remember, folks, even if you think you ran the W3C validator on your
code recently, it can still be worth checking. Just in case.

ChrisA
Re: Beautiful Soup - close tags more promptly?
On 2022-10-25 06:56:58 +1100, Chris Angelico wrote:
> On Tue, 25 Oct 2022 at 04:22, Peter J. Holzer wrote:
> > There may be several reasons:
> >
> > * Historically, some browsers differed in which end tags were
> >   actually optional. Since (AFAIK) no mainstream browser ever
> >   implemented a real SGML parser (they were always "tag soup"
> >   parsers with lots of ad-hoc rules) this sometimes even changed
> >   within the same browser depending on context (e.g. a simple table
> >   might work but nested tables wouldn't). So people started to use
> >   end-tags defensively.
> > * XHTML was for some time popular and it doesn't have any optional
> >   tags. So people got into the habit of always using end tags and
> >   writing empty tags as <br/>.
> > * Aesthetics: Always writing the end tags is more consistent and may
> >   look more balanced.
> > * Cargo-cult: People saw other people do that and copied the habit
> >   without thinking about it.
> >
> > > Are you saying that it's better to omit them all?
> >
> > If you want to conserve keystrokes :-)
> >
> > I think it doesn't matter. Both are valid.
> >
> > > More importantly: Would you omit all the closing tags you can, or
> > > would you include them?
> >
> > I usually write them.
>
> Interesting. So which of the above reasons is yours?

Mostly the third one at this point, I think. The first one has gone away
for me with HTML5. The second one still lingers at the back of my brain,
but I've gotten rid of the habit of writing <br/>, so I'm recovering
;-). But I still like my code to be nice and tidy, and whether my sense
of tidiness was influenced by XML or not, if the end tags are missing it
looks off, somehow. (That said, I do sometimes leave them off to reduce
visual clutter.)

> One thing I find quite interesting, though, is the way that browsers
> *differ* in the face of bad nesting of tags. Recently I was struggling
> to figure out a problem with an HTML form, and eventually found that
> there was a spurious <form> tag way up higher in the page. Forms don't
> nest, so that's invalid, but different browsers had slightly different
> ways of showing it.

Yeah, mismatched form tags can have weird effects. I don't remember the
details but I scratched my head over that one more than once.

hp
Re: Beautiful Soup - close tags more promptly?
On Tue, 25 Oct 2022 at 04:22, Peter J. Holzer wrote:
> There may be several reasons:
>
> * Historically, some browsers differed in which end tags were actually
>   optional. Since (AFAIK) no mainstream browser ever implemented a
>   real SGML parser (they were always "tag soup" parsers with lots of
>   ad-hoc rules) this sometimes even changed within the same browser
>   depending on context (e.g. a simple table might work but nested
>   tables wouldn't). So people started to use end-tags defensively.
> * XHTML was for some time popular and it doesn't have any optional
>   tags. So people got into the habit of always using end tags and
>   writing empty tags as <br/>.
> * Aesthetics: Always writing the end tags is more consistent and may
>   look more balanced.
> * Cargo-cult: People saw other people do that and copied the habit
>   without thinking about it.
>
> > Are you saying that it's better to omit them all?
>
> If you want to conserve keystrokes :-)
>
> I think it doesn't matter. Both are valid.
>
> > More importantly: Would you omit all the closing tags you can, or
> > would you include them?
>
> I usually write them.

Interesting. So which of the above reasons is yours? Personally, I do it
for a slightly different reason: Many end tags are *situationally*
optional, and it's much easier to debug code when you
change/insert/remove something and nothing changes, than when doing so
affects the implicit closing tags.

> I also indent the contents of an element, so I would write your
> example as:
>
> <html>
>   <body>
>     Hello, world!
>     <p>
>       Paragraph 2
>     </p>
>     <p>
>       Hey look, a third paragraph!
>     </p>
>   </body>
> </html>
>
> (As you can see I would also include the body tags to make that
> element explicit. I would normally also add a bit of boilerplate
> (especially a head with a charset and viewport definition), but I omit
> them here since they would change the parse tree)

Yeah - any REAL page would want quite a bit (very few pages these days
manage without a style sheet, and it seems that hardly any survive
without importing a few gigabytes of JavaScript, but that's not
mandatory), but in ancient pages, there's still a well-defined parse
structure for every tag sequence.

One thing I find quite interesting, though, is the way that browsers
*differ* in the face of bad nesting of tags. Recently I was struggling
to figure out a problem with an HTML form, and eventually found that
there was a spurious <form> tag way up higher in the page. Forms don't
nest, so that's invalid, but different browsers had slightly different
ways of showing it.

(Obviously the W3C Validator was the most helpful tool here, since it
reports it as an error rather than constructing any sort of DOM tree.)

ChrisA
Re: Beautiful Soup - close tags more promptly?
Jon Ribbens via Python-list schreef op 24/10/2022 om 19:01:
> On 2022-10-24, Chris Angelico wrote:
> > On Tue, 25 Oct 2022 at 02:45, Jon Ribbens via Python-list wrote:
> >> Adding in the omitted <html>, <head>, <body>, </body>, and </html>
> >> would make no difference and there's no particular reason to
> >> recommend doing so as far as I'm aware.
> >
> > And yet most people do it. Why?
>
> They agree with Tim Peters that "Explicit is better than implicit",
> I suppose? ;-)

I don't write all that much HTML, but when I do, I include those tags,
largely for that reason indeed. We don't write HTML just for the
browser, we also write it for the web developer. And I think it's
easier for the web developer when the different sections are clearly
distinguished, and what better way to do it than use their tags.

> > More importantly: Would you omit all the closing tags you can, or
> > would you include them?
>
> It would depend on how much content was inside them I guess. Something
> like:
>
> <ul>
>   <li>First item
>   <li>Second item
>   <li>Third item
> </ul>
>
> is very easy to understand, but if each item was many lines long then
> it may be less confusing to explicitly close - not least for
> indentation purposes.

I mostly include closing tags, if for no other reason than that I have
the impression that editors generally work better (i.e. get things like
indentation and syntax highlighting right) that way.
Re: Beautiful Soup - close tags more promptly?
On 2022-10-25 03:09:33 +1100, Chris Angelico wrote:
> On Tue, 25 Oct 2022 at 02:45, Jon Ribbens via Python-list wrote:
> > On 2022-10-24, Chris Angelico wrote:
> > > On Mon, 24 Oct 2022 at 23:22, Peter J. Holzer wrote:
> > >> Yes, I got that. What I wanted to say was that this is indeed a
> > >> bug in html.parser and not an error (or sloppiness, as you called
> > >> it) in the input or ambiguity in the HTML standard.
> > >
> > > I described the HTML as "sloppy" for a number of reasons, but I
> > > was of the understanding that it's generally recommended to have
> > > the closing tags. Not that it matters much.
> >
> > Some elements don't need close tags, or even open tags. Unless
> > you're using XHTML you don't need them and indeed for the case of
> > void tags (e.g. <br>, <img>) you must not include the close tags.
>
> Yep, I'm aware of void tags, but I'm talking about the container tags
> - in this case, <p> and <li> - which, in a lot of older HTML pages,
> are treated as "separator" tags. Consider this content:
>
> <html>
> Hello, world!
> <p>Paragraph 2
> <p>Hey look, a third paragraph!
> </html>
>
> Stick a doctype onto that and it should be valid HTML5, but as it is,
> it's the exact sort of thing that was quite common in the 90s.
>
> The <p> tag is not a void tag, but according to the spec, it's legal
> to omit the </p> if the element is followed directly by another
> element (or any of a specific set of others), or if there is no
> further content.

Right. The parser knows the structure of an HTML document, which tags
are optional and which elements can be inside of which other elements.
For SGML-based HTML versions (2.0 to 4.01) this is formally described by
the DTD. So when parsing your file, an HTML parser would work like this:

- <html>: Yup, I expect an HTML element here:

  HTML

- Hello, world!: #PCDATA? Not allowed as a child of HTML. There must be
  a HEAD and a BODY, both of which have optional start tags. HEAD can't
  contain #PCDATA either, so we must be inside of BODY and HEAD was
  empty:

  HTML
  ├─ HEAD
  └─ BODY
     └─ Hello, world!

- <p>: Allowed in BODY, so just add that:

  HTML
  ├─ HEAD
  └─ BODY
     ├─ #PCDATA: Hello, world!
     └─ P

- Paragraph 2: #PCDATA is allowed in P, so add it as a child:

  HTML
  ├─ HEAD
  └─ BODY
     ├─ #PCDATA: Hello, world!
     └─ P
        └─ #PCDATA: Paragraph 2

- <p>: Not allowed inside of P, so that implicitly closes the previous
  P element and we go up one level:

  HTML
  ├─ HEAD
  └─ BODY
     ├─ #PCDATA: Hello, world!
     ├─ P
     │  └─ #PCDATA: Paragraph 2
     └─ P

- Hey look, a third paragraph!: Same as above:

  HTML
  ├─ HEAD
  └─ BODY
     ├─ #PCDATA: Hello, world!
     ├─ P
     │  └─ #PCDATA: Paragraph 2
     └─ P
        └─ #PCDATA: Hey look, a third paragraph!

- </html>: The end tags of P and BODY are optional, so the end of HTML
  closes them implicitly, and we have our final parse tree (unchanged
  from the last step):

  HTML
  ├─ HEAD
  └─ BODY
     ├─ #PCDATA: Hello, world!
     ├─ P
     │  └─ #PCDATA: Paragraph 2
     └─ P
        └─ #PCDATA: Hey look, a third paragraph!

For a human, the <p> tags might feel like separators here. But
syntactically they aren't - they start a new element. Note especially
that "Hello, world!" is not part of a P element but a direct child of
BODY (which may or may not be intended by the author).

> > Adding in the omitted <html>, <head>, <body>, </body>, and </html>
> > would make no difference and there's no particular reason to
> > recommend doing so as far as I'm aware.
>
> And yet most people do it. Why?

There may be several reasons:

* Historically, some browsers differed in which end tags were actually
  optional. Since (AFAIK) no mainstream browser ever implemented a real
  SGML parser (they were always "tag soup" parsers with lots of ad-hoc
  rules) this sometimes even changed within the same browser depending
  on context (e.g. a simple table might work but nested tables
  wouldn't). So people started to use end-tags defensively.
* XHTML was for some time popular and it doesn't have any optional
  tags. So people got into the habit of always using end tags and
  writing empty tags as <br/>.
* Aesthetics: Always writing the end tags is more consistent and may
  look more balanced.
* Cargo-cult: People saw other people do that and copied the habit
  without thinking about it.

> Are you saying that it's better to omit them all?

If you want to conserve keystrokes :-)

I think it doesn't matter. Both are valid.

> More importantly: Would you omit all the closing tags you can, or
> would you include them?

I usually write them. I also indent the contents of an element, so I
would write your example as:

<html>
  <body>
    Hello, world!
    <p>
      Paragraph 2
    </p>
    <p>
      Hey look, a third paragraph!
    </p>
  </body>
</html>

(As you can see I would also include the body tags to make that element
explicit. I would normally also add a bit of boilerplate (especially a
head with a charset and viewport definition), but I omit them here
since they would change the parse tree)

hp
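[The parse walkthrough in this thread can be sketched with the stdlib event parser. This is not a DTD-driven parser - it just hard-codes two of the rules described in the walkthrough (a new <p> implicitly closes an open <p>; an explicit end tag implicitly closes anything still open inside it), and leaves out HEAD/BODY synthesis.]

```python
from html.parser import HTMLParser

class TreeSketch(HTMLParser):
    """Tiny tree builder: each node is a (name, children) tuple."""
    def __init__(self):
        super().__init__()
        self.root = ("root", [])
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        # Rule from the walkthrough: P is not allowed inside P, so a
        # new <p> implicitly closes the previous one.
        if tag == "p" and self.stack[-1][0] == "p":
            self.stack.pop()
        node = (tag, [])
        self.stack[-1][1].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        # An explicit end tag implicitly closes anything still open
        # inside the matching element (e.g. </html> closes the last P).
        if any(n[0] == tag for n in self.stack[1:]):
            while self.stack.pop()[0] != tag:
                pass

    def handle_data(self, data):
        if data.strip():
            self.stack[-1][1].append(("#text", data.strip()))

parser = TreeSketch()
parser.feed("<html>Hello, world!"
            "<p>Paragraph 2"
            "<p>Hey look, a third paragraph!</html>")
html = parser.root[1][0]
print([child[0] for child in html[1]])   # ['#text', 'p', 'p']
```

As in the hand-parse above, "Hello, world!" ends up a direct child of the outer element, with the two P elements as its siblings rather than nested inside each other.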
Re: Beautiful Soup - close tags more promptly?
On 2022-10-24, Chris Angelico wrote:
> On Tue, 25 Oct 2022 at 02:45, Jon Ribbens via Python-list wrote:
>> On 2022-10-24, Chris Angelico wrote:
>> > On Mon, 24 Oct 2022 at 23:22, Peter J. Holzer wrote:
>> >> Yes, I got that. What I wanted to say was that this is indeed a
>> >> bug in html.parser and not an error (or sloppiness, as you called
>> >> it) in the input or ambiguity in the HTML standard.
>> >
>> > I described the HTML as "sloppy" for a number of reasons, but I was
>> > of the understanding that it's generally recommended to have the
>> > closing tags. Not that it matters much.
>>
>> Some elements don't need close tags, or even open tags. Unless you're
>> using XHTML you don't need them and indeed for the case of void tags
>> (e.g. <br>, <img>) you must not include the close tags.
>
> Yep, I'm aware of void tags, but I'm talking about the container tags
> - in this case, <p> and <li> - which, in a lot of older HTML pages,
> are treated as "separator" tags.

Yes, hence why I went on to talk about container tags.

> Consider this content:
>
> <html>
> Hello, world!
> <p>Paragraph 2
> <p>Hey look, a third paragraph!
> </html>
>
> Stick a doctype onto that and it should be valid HTML5,

Nope, it's missing a <title>.

>> Adding in the omitted <html>, <head>, <body>, </body>, and </html>
>> would make no difference and there's no particular reason to
>> recommend doing so as far as I'm aware.
>
> And yet most people do it. Why?

They agree with Tim Peters that "Explicit is better than implicit",
I suppose? ;-)

> Are you saying that it's better to omit them all?

No, I'm saying neither option is necessarily better than the other.

> More importantly: Would you omit all the closing tags you can, or
> would you include them?

It would depend on how much content was inside them I guess. Something
like:

<ul>
  <li>First item
  <li>Second item
  <li>Third item
</ul>

is very easy to understand, but if each item was many lines long then
it may be less confusing to explicitly close - not least for
indentation purposes.
Re: Beautiful Soup - close tags more promptly?
On Tue, 25 Oct 2022 at 02:45, Jon Ribbens via Python-list wrote:
> On 2022-10-24, Chris Angelico wrote:
> > On Mon, 24 Oct 2022 at 23:22, Peter J. Holzer wrote:
> >> Yes, I got that. What I wanted to say was that this is indeed a bug
> >> in html.parser and not an error (or sloppiness, as you called it)
> >> in the input or ambiguity in the HTML standard.
> >
> > I described the HTML as "sloppy" for a number of reasons, but I was
> > of the understanding that it's generally recommended to have the
> > closing tags. Not that it matters much.
>
> Some elements don't need close tags, or even open tags. Unless you're
> using XHTML you don't need them and indeed for the case of void tags
> (e.g. <br>, <img>) you must not include the close tags.

Yep, I'm aware of void tags, but I'm talking about the container tags -
in this case, <p> and <li> - which, in a lot of older HTML pages, are
treated as "separator" tags. Consider this content:

<html>
Hello, world!
<p>Paragraph 2
<p>Hey look, a third paragraph!
</html>

Stick a doctype onto that and it should be valid HTML5, but as it is,
it's the exact sort of thing that was quite common in the 90s. (I'm not
sure when lowercase tags became more popular, but in any case (pun
intended), that won't affect validity.)

The <p> tag is not a void tag, but according to the spec, it's legal to
omit the </p> if the element is followed directly by another element
(or any of a specific set of others), or if there is no further
content.

> Adding in the omitted <html>, <head>, <body>, </body>, and </html>
> would make no difference and there's no particular reason to recommend
> doing so as far as I'm aware.

And yet most people do it. Why? Are you saying that it's better to omit
them all?

More importantly: Would you omit all the closing tags you can, or would
you include them?

ChrisA
Re: Beautiful Soup - close tags more promptly?
On 2022-10-24, Chris Angelico wrote:
> On Mon, 24 Oct 2022 at 23:22, Peter J. Holzer wrote:
>> Yes, I got that. What I wanted to say was that this is indeed a bug
>> in html.parser and not an error (or sloppiness, as you called it) in
>> the input or ambiguity in the HTML standard.
>
> I described the HTML as "sloppy" for a number of reasons, but I was of
> the understanding that it's generally recommended to have the closing
> tags. Not that it matters much.

Some elements don't need close tags, or even open tags. Unless you're
using XHTML you don't need them and indeed for the case of void tags
(e.g. <br>, <img>) you must not include the close tags. A minimal HTML
file might look like this:

<!DOCTYPE html>
<title>Minimal HTML file</title>
<h1>Minimal HTML file</h1>
<p>This is a minimal HTML file.

which would be parsed into this:

<!DOCTYPE html>
<html>
  <head>
    <title>Minimal HTML file</title>
  </head>
  <body>
    <h1>Minimal HTML file</h1>
    <p>This is a minimal HTML file.</p>
  </body>
</html>

Adding in the omitted <html>, <head>, <body>, </body>, and </html>
would make no difference and there's no particular reason to recommend
doing so as far as I'm aware.
Re: Beautiful Soup - close tags more promptly?
On Mon, 24 Oct 2022 at 23:22, Peter J. Holzer wrote:
> On 2022-10-24 21:56:13 +1100, Chris Angelico wrote:
> > On Mon, 24 Oct 2022 at 21:33, Peter J. Holzer wrote:
> > > Ron has already noted that the lxml and html5 parser do the right
> > > thing, so just for the record:
> > >
> > > The HTML fragment above is well-formed and contains a number of li
> > > elements at the same level directly below the ol element, not lots
> > > of nested li elements. The end tag of the li element is optional
> > > (except in XHTML) and li elements don't nest.
> >
> > That's correct. However, parsing it with html.parser and then
> > reconstituting it as shown in the example code results in all the
> > </li> tags coming up right before the </ol>, indicating that the
> > <li> tags were parsed as deeply nested rather than as siblings.
>
> Yes, I got that. What I wanted to say was that this is indeed a bug in
> html.parser and not an error (or sloppiness, as you called it) in the
> input or ambiguity in the HTML standard.

I described the HTML as "sloppy" for a number of reasons, but I was of
the understanding that it's generally recommended to have the closing
tags. Not that it matters much.

> > which html5lib seems to be doing fine. Whether it has other issues,
> > I don't know, but I guess I'll find out
>
> The link somebody posted mentions that it's "very slow". Which may or
> may not be a problem when you have to parse 9000 files. But if it does
> implement HTML5 correctly, it should parse any file the same as a
> modern browser does (maybe excluding quirks mode).

Yeah. TBH I think the two-hour run time is primarily dominated by
network delays, not parsing time, but if I had a service where people
could upload HTML to be parsed, that might affect throughput. For the
record, if anyone else is considering html5lib: It is likely "fast
enough", even if not fast. Give it a try.

(And I know what slow parsing feels like. Parsing a ~100MB file with a
decently-fast grammar-based lexer takes a good while. Parsing the same
content after it's been converted to JSON? Fast.)

ChrisA
Re: Beautiful Soup - close tags more promptly?
On 2022-10-24 21:56:13 +1100, Chris Angelico wrote:
> On Mon, 24 Oct 2022 at 21:33, Peter J. Holzer wrote:
> > Ron has already noted that the lxml and html5 parser do the right
> > thing, so just for the record:
> >
> > The HTML fragment above is well-formed and contains a number of li
> > elements at the same level directly below the ol element, not lots
> > of nested li elements. The end tag of the li element is optional
> > (except in XHTML) and li elements don't nest.
>
> That's correct. However, parsing it with html.parser and then
> reconstituting it as shown in the example code results in all the
> </li> tags coming up right before the </ol>, indicating that the <li>
> tags were parsed as deeply nested rather than as siblings.

Yes, I got that. What I wanted to say was that this is indeed a bug in
html.parser and not an error (or sloppiness, as you called it) in the
input or ambiguity in the HTML standard.

> In order to get a successful parse out of this, I need something which
> sees them as siblings,

Right, but Roel (correct name this time) had already posted that lxml
and html5lib parse this correctly, so I saw no need to belabour that
point.

> which html5lib seems to be doing fine. Whether it has other issues, I
> don't know, but I guess I'll find out

The link somebody posted mentions that it's "very slow". Which may or
may not be a problem when you have to parse 9000 files. But if it does
implement HTML5 correctly, it should parse any file the same as a
modern browser does (maybe excluding quirks mode).

hp
Re: Beautiful Soup - close tags more promptly?
On Mon, 24 Oct 2022 at 21:33, Peter J. Holzer wrote:
> Ron has already noted that the lxml and html5 parser do the right
> thing, so just for the record:
>
> The HTML fragment above is well-formed and contains a number of li
> elements at the same level directly below the ol element, not lots of
> nested li elements. The end tag of the li element is optional (except
> in XHTML) and li elements don't nest.

That's correct. However, parsing it with html.parser and then
reconstituting it as shown in the example code results in all the </li>
tags coming up right before the </ol>, indicating that the <li> tags
were parsed as deeply nested rather than as siblings.

In order to get a successful parse out of this, I need something which
sees them as siblings, which html5lib seems to be doing fine. Whether
it has other issues, I don't know, but I guess I'll find out; it's
currently running on the live site and taking several hours (due to
network delays and the server being slow, so I don't really want to
parallelize and overload the thing).

ChrisA
Re: Beautiful Soup - close tags more promptly?
On 2022-10-24 12:32:11 +0200, Peter J. Holzer wrote:
> Ron has already noted that the lxml and html5 parser do the right
  ^^^
Oops, sorry. That was Roel.

hp
Re: Beautiful Soup - close tags more promptly?
On 2022-10-24 13:29:13 +1100, Chris Angelico wrote:
> Parsing ancient HTML files is something Beautiful Soup is normally
> great at. But I've run into a small problem, caused by this sort of
> sloppy HTML:
>
> from bs4 import BeautifulSoup
> # See: https://gsarchive.net/gilbert/plays/princess/tennyson/tenniv.htm
> blob = b"""<ol>
> <li>'THERE sinks the nebulous star we call the Sun,
> <li>If that hypothesis of theirs be sound,'
[...]
> <li>Stirring a sudden transport rose and fell.
> </ol>
> """
> soup = BeautifulSoup(blob, "html.parser")
> print(soup)
>
> On this small snippet, it works acceptably, but puts a large number of
> </li> tags immediately before the </ol>.

Ron has already noted that the lxml and html5 parser do the right
thing, so just for the record:

The HTML fragment above is well-formed and contains a number of li
elements at the same level directly below the ol element, not lots of
nested li elements. The end tag of the li element is optional (except
in XHTML) and li elements don't nest.

hp
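[One way to see why a tree builder sitting on top of the stdlib parser can get this wrong: html.parser is purely event-based and reports only the tags literally present in the input, leaving all implied end tags to the layer above. A quick sketch:]

```python
from html.parser import HTMLParser

# Log the raw tag events the stdlib parser reports for an <ol> whose
# </li> end tags are omitted (which is valid HTML - </li> is optional).
class EventLog(HTMLParser):
    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

log = EventLog()
log.feed("<ol><li>one<li>two<li>three</ol>")
print(log.events)
# [('start', 'ol'), ('start', 'li'), ('start', 'li'),
#  ('start', 'li'), ('end', 'ol')]
```

No ("end", "li") event ever arrives, so a builder that closes elements only on end-tag events stacks each <li> inside the previous one; lxml and html5lib instead apply HTML's implied-end-tag rules before building the tree.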
Re: Beautiful Soup - close tags more promptly?
(Oops, accidentally only sent to Chris instead of to the list)

Op 24/10/2022 om 10:02 schreef Chris Angelico:
> On Mon, 24 Oct 2022 at 18:43, Roel Schroeven wrote:
> > Using html5lib (install package html5lib) instead of html.parser
> > seems to do the trick: it inserts </li> right before the next <li>,
> > and one before the closing </ol>. On my system the same happens when
> > I don't specify a parser, but IIRC that's a bit fragile because
> > other systems can choose different parsers if you don't explicitly
> > specify one.
>
> Ah, cool. Thanks. I'm not entirely sure of the various advantages and
> disadvantages of the different parsers; is there a tabulation
> anywhere, or at least a list of recommendations on choosing a suitable
> parser?

There's a bit of information here:
https://beautiful-soup-4.readthedocs.io/en/latest/#installing-a-parser
Not much but maybe it can be helpful.

> I'm dealing with a HUGE mess of different coding standards, all the
> way from 1990s-level stuff (images for indentation, tables for
> formatting) up through HTML4 (a good few of the pages have at least
> some <meta> tags and declare their encodings, mostly ISO-8859-1 or
> similar), to fairly modern HTML5. There's even a couple of pages that
> use frames - yes, the old style with a <noframes> block in case the
> browser can't handle it.
>
> I went with html.parser on the expectation that it'd give the best
> "across all standards" results, but I'll give html5lib a try and see
> if it does better. Would rather not try to use different parsers for
> different files, but if necessary, I'll figure something out. (For
> reference, this is roughly 9000 HTML files that have to be parsed.
> Doing things by hand is basically not an option.)

I'd give lxml a try too. Maybe try to preprocess the HTML using
html-tidy (https://www.html-tidy.org/), that might actually do a pretty
good job of getting rid of all kinds of historical inconsistencies.
Somehow checking if any solution works for thousands of input files
will always be a pain, I'm afraid.
Re: Beautiful Soup - close tags more promptly?
Op 24/10/2022 om 9:42 schreef Roel Schroeven:
> Using html5lib (install package html5lib) instead of html.parser seems
> to do the trick: it inserts </li> right before the next <li>, and one
> before the closing </ol>. On my system the same happens when I don't
> specify a parser, but IIRC that's a bit fragile because other systems
> can choose different parsers if you don't explicitly specify one.

Just now I noticed: when I don't specify a parser, BeautifulSoup emits
a warning with the parser it selected. In one of my venv's it's
html5lib, in another it's lxml. Both seem to get a correct result.
Re: Beautiful Soup - close tags more promptly?
On Mon, 24 Oct 2022 at 18:43, Roel Schroeven wrote:
>
> On 24/10/2022 04:29, Chris Angelico wrote:
> > Parsing ancient HTML files is something Beautiful Soup is normally
> > great at. But I've run into a small problem, caused by this sort of
> > sloppy HTML:
> >
> > from bs4 import BeautifulSoup
> > # See: https://gsarchive.net/gilbert/plays/princess/tennyson/tenniv.htm
> > blob = b"""
> > <ol>
> > <li>'THERE sinks the nebulous star we call the Sun,
> > <li>If that hypothesis of theirs be sound,'
> > <li>Said Ida;' let us down and rest:' and we
> > <li>Down from the lean and wrinkled precipices,
> > <li>By every coppice-feather'd chasm and cleft,
> > <li>Dropt thro' the ambrosial gloom to where below
> > <li>No bigger than a glow-worm shone the tent
> > <li>Lamp-lit from the inner. Once she lean'd on me,
> > <li>Descending; once or twice she lent her hand,
> > <li>And blissful palpitations in the blood,
> > <li>Stirring a sudden transport rose and fell.
> > </ol>
> > """
> > soup = BeautifulSoup(blob, "html.parser")
> > print(soup)
> >
> > On this small snippet, it works acceptably, but puts a large number
> > of </li> tags immediately before the </ol>. On the original file
> > (see link if you want to try it), this blows right through the
> > default recursion limit, due to the crazy number of "nested" list
> > items.
> >
> > Is there a way to tell BS4 on parse that these <li> elements end at
> > the next <li>, rather than waiting for the final </ol>? This would
> > make tidier output, and also eliminate most of the recursion levels.
> >
> Using html5lib (install package html5lib) instead of html.parser seems
> to do the trick: it inserts </li> right before the next <li>, and one
> before the closing </ol>. On my system the same happens when I don't
> specify a parser, but IIRC that's a bit fragile because other systems
> can choose different parsers if you don't explicitly specify one.
>
Ah, cool. Thanks.

I'm not entirely sure of the various advantages and disadvantages of
the different parsers; is there a tabulation anywhere, or at least a
list of recommendations on choosing a suitable parser?

I'm dealing with a HUGE mess of different coding standards, all the
way from 1990s-level stuff (images for indentation, tables for
formatting, and <font>) up through HTML4 (a good few of the pages have
at least some <meta> tags and declare their encodings, mostly
ISO-8859-1 or similar), to fairly modern HTML5. There's even a couple
of pages that use frames - yes, the old style with a <noframes> block
in case the browser can't handle it.

I went with html.parser on the expectation that it'd give the best
"across all standards" results, but I'll give html5lib a try and see
if it does better. Would rather not try to use different parsers for
different files, but if necessary, I'll figure something out.

(For reference, this is roughly 9000 HTML files that have to be
parsed. Doing things by hand is basically not an option.)

ChrisA
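Chris's mix of HTML4 pages with declared encodings (mostly ISO-8859-1) can at least be triaged up front before choosing how to decode each of the ~9000 files. A rough stdlib-only helper along these lines: `declared_encoding` is a hypothetical name, not something from the thread or from bs4, and real pages can declare charsets in ways this regex misses.

```python
import re

# Look for a charset declaration near the top of the raw bytes, e.g.
#   <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
# Falls back to ISO-8859-1, which the thread says is the common case here.
META_CHARSET = re.compile(rb'charset=["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE)

def declared_encoding(raw: bytes, default: str = "iso-8859-1") -> str:
    match = META_CHARSET.search(raw[:2048])  # declarations sit in <head>
    if match:
        return match.group(1).decode("ascii", "replace").lower()
    return default

page = b'<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">'
print(declared_encoding(page))             # -> iso-8859-1
print(declared_encoding(b"<html></html>")) # -> iso-8859-1 (fallback)
```

Note that BeautifulSoup already does its own (more thorough) encoding detection; a helper like this is only useful for bucketing files by declared encoding before deciding on a parsing strategy.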
Re: Beautiful Soup - close tags more promptly?
On 24/10/2022 04:29, Chris Angelico wrote:
> Parsing ancient HTML files is something Beautiful Soup is normally
> great at. But I've run into a small problem, caused by this sort of
> sloppy HTML:
>
> from bs4 import BeautifulSoup
> # See: https://gsarchive.net/gilbert/plays/princess/tennyson/tenniv.htm
> blob = b"""
> <ol>
> <li>'THERE sinks the nebulous star we call the Sun,
> <li>If that hypothesis of theirs be sound,'
> <li>Said Ida;' let us down and rest:' and we
> <li>Down from the lean and wrinkled precipices,
> <li>By every coppice-feather'd chasm and cleft,
> <li>Dropt thro' the ambrosial gloom to where below
> <li>No bigger than a glow-worm shone the tent
> <li>Lamp-lit from the inner. Once she lean'd on me,
> <li>Descending; once or twice she lent her hand,
> <li>And blissful palpitations in the blood,
> <li>Stirring a sudden transport rose and fell.
> </ol>
> """
> soup = BeautifulSoup(blob, "html.parser")
> print(soup)
>
> On this small snippet, it works acceptably, but puts a large number
> of </li> tags immediately before the </ol>. On the original file
> (see link if you want to try it), this blows right through the
> default recursion limit, due to the crazy number of "nested" list
> items.
>
> Is there a way to tell BS4 on parse that these <li> elements end at
> the next <li>, rather than waiting for the final </ol>? This would
> make tidier output, and also eliminate most of the recursion levels.

Using html5lib (install package html5lib) instead of html.parser seems
to do the trick: it inserts </li> right before the next <li>, and one
before the closing </ol>. On my system the same happens when I don't
specify a parser, but IIRC that's a bit fragile because other systems
can choose different parsers if you don't explicitly specify one.
Beautiful Soup - close tags more promptly?
Parsing ancient HTML files is something Beautiful Soup is normally
great at. But I've run into a small problem, caused by this sort of
sloppy HTML:

from bs4 import BeautifulSoup
# See: https://gsarchive.net/gilbert/plays/princess/tennyson/tenniv.htm
blob = b"""
<ol>
<li>'THERE sinks the nebulous star we call the Sun,
<li>If that hypothesis of theirs be sound,'
<li>Said Ida;' let us down and rest:' and we
<li>Down from the lean and wrinkled precipices,
<li>By every coppice-feather'd chasm and cleft,
<li>Dropt thro' the ambrosial gloom to where below
<li>No bigger than a glow-worm shone the tent
<li>Lamp-lit from the inner. Once she lean'd on me,
<li>Descending; once or twice she lent her hand,
<li>And blissful palpitations in the blood,
<li>Stirring a sudden transport rose and fell.
</ol>
"""
soup = BeautifulSoup(blob, "html.parser")
print(soup)

On this small snippet, it works acceptably, but puts a large number of
</li> tags immediately before the </ol>. On the original file (see
link if you want to try it), this blows right through the default
recursion limit, due to the crazy number of "nested" list items.

Is there a way to tell BS4 on parse that these <li> elements end at
the next <li>, rather than waiting for the final </ol>? This would
make tidier output, and also eliminate most of the recursion levels.

ChrisA
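To see why html.parser leaves those <li> elements open, it helps to look at the raw token stream: the stdlib tokenizer only reports tags that are literally present in the input, so every missing </li> has to be synthesized by whatever tree builder sits on top (BeautifulSoup, html5lib, lxml), and each builder makes that call differently. A small stdlib-only sketch (the `TagLogger` class here is illustrative, not part of bs4):

```python
from html.parser import HTMLParser

class TagLogger(HTMLParser):
    """Record every start/end tag the stdlib tokenizer actually sees."""
    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

sloppy = "<ol><li>one<li>two<li>three</ol>"
logger = TagLogger()
logger.feed(sloppy)
print(logger.events)
# Three ("start", "li") events and no ("end", "li") at all: closing the
# list items early (html5lib-style) or only at the final </ol> (as seen
# above) is entirely the tree builder's decision, not the tokenizer's.
```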