Re: Beautiful Soup - close tags more promptly?

2022-10-25 Thread Chris Angelico
On Wed, 26 Oct 2022 at 04:59, Tim Delaney  wrote:
>
> On Mon, 24 Oct 2022 at 19:03, Chris Angelico  wrote:
>>
>>
>> Ah, cool. Thanks. I'm not entirely sure of the various advantages and
>> disadvantages of the different parsers; is there a tabulation
>> anywhere, or at least a list of recommendations on choosing a suitable
>> parser?
>
>
> Coming to this a bit late, but from my experience with BeautifulSoup and HTML 
> produced by other people ...
>
> lxml is easily the fastest, but also the least forgiving.
> html.parser is middling on performance, but as you've seen sometimes makes 
> mistakes.
> html5lib is the slowest, but is most forgiving of malformed input and edge 
> cases.
>
> I use html5lib - it's fast enough for what I do, and the most likely to 
> return results matching what the author saw when they maybe tried it in a 
> single web browser.

Cool cool. It sounds like html5lib should really be the recommended
parser for HTML, unless performance or dependency reduction is
important enough to change your plans. (But only for HTML. For XML,
lxml would still be the right choice.)

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Beautiful Soup - close tags more promptly?

2022-10-25 Thread Tim Delaney
On Mon, 24 Oct 2022 at 19:03, Chris Angelico  wrote:

>
> Ah, cool. Thanks. I'm not entirely sure of the various advantages and
> disadvantages of the different parsers; is there a tabulation
> anywhere, or at least a list of recommendations on choosing a suitable
> parser?
>

Coming to this a bit late, but from my experience with BeautifulSoup and
HTML produced by other people ...

lxml is easily the fastest, but also the least forgiving.
html.parser is middling on performance, but as you've seen sometimes makes
mistakes.
html5lib is the slowest, but is most forgiving of malformed input and edge
cases.

I use html5lib - it's fast enough for what I do, and the most likely to
return results matching what the author saw when they maybe tried it in a
single web browser.

Tim Delaney


Re: Beautiful Soup - close tags more promptly?

2022-10-24 Thread Chris Angelico
On Tue, 25 Oct 2022 at 09:34, Peter J. Holzer  wrote:
> > One thing I find quite interesting, though, is the way that browsers
> > *differ* in the face of bad nesting of tags. Recently I was struggling
> > to figure out a problem with an HTML form, and eventually found that
> > there was a spurious <form> tag way up higher in the page. Forms don't
> > nest, so that's invalid, but different browsers had slightly different
> > ways of showing it.
>
> Yeah, mismatched form tags can have weird effects. I don't remember the
> details but I scratched my head over that one more than once.
>

Yeah. I think my weirdest issue was one time when I inadvertently had
a <dialog> element (with a form inside it) inside something else with
a form (because the </form> was missing). Neither "dialog inside main"
nor "form in dialog separate from form in main" is a problem, and
even "oops, missed a closing form tag" isn't that big a deal, but put
them all together, and you end up with a bizarre situation where
Firefox 91 behaves one way and Chrome (some-version) behaves another
way.

That was a fun day. Remember, folks, even if you think you ran the W3C
validator on your code recently, it can still be worth checking. Just
in case.

ChrisA


Re: Beautiful Soup - close tags more promptly?

2022-10-24 Thread Peter J. Holzer
On 2022-10-25 06:56:58 +1100, Chris Angelico wrote:
> On Tue, 25 Oct 2022 at 04:22, Peter J. Holzer  wrote:
> > There may be several reasons:
> >
> > * Historically, some browsers differed in which end tags were actually
> >   optional. Since (AFAIK) no mainstream browser ever implemented a real
> >   SGML parser (they were always "tag soup" parsers with lots of ad-hoc
> >   rules) this sometimes even changed within the same browser depending
> > >   on context (e.g. a simple table might work but nested tables wouldn't).
> >   So people started to use end-tags defensively.
> > * XHTML was for some time popular and it doesn't have any optional tags.
> >   So people got into the habit of always using end tags and writing
> > >   empty tags as <br />.
> > * Aesthetics: Always writing the end tags is more consistent and may
> >   look more balanced.
> > * Cargo-cult: People saw other people do that and copied the habit
> >   without thinking about it.
> >
> >
> > > Are you saying that it's better to omit them all?
> >
> > If you want to conserve keystrokes :-)
> >
> > I think it doesn't matter. Both are valid.
> >
> > > > More importantly: Would you omit all the </p> closing tags you can, or
> > > would you include them?
> >
> > I usually write them.
> 
> Interesting. So which of the above reasons is yours?

Mostly the third one at this point I think. The first one has gone away
for me with HTML5. The second one still lingers at the back of
my brain, but I've gotten rid of the habit of writing <br />, so I'm
recovering ;-). But I still like my code to be nice and tidy, and
whether my sense of tidiness was influenced by XML or not, if the end
tags are missing it looks off, somehow.

(That said, I do sometimes leave them off to reduce visual clutter.)


> One thing I find quite interesting, though, is the way that browsers
> *differ* in the face of bad nesting of tags. Recently I was struggling
> to figure out a problem with an HTML form, and eventually found that
> there was a spurious <form> tag way up higher in the page. Forms don't
> nest, so that's invalid, but different browsers had slightly different
> ways of showing it.

Yeah, mismatched form tags can have weird effects. I don't remember the
details but I scratched my head over that one more than once.

hp

-- 
   _  | Peter J. Holzer| Story must make more sense than reality.
|_|_) ||
| |   | h...@hjp.at |-- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |   challenge!"




Re: Beautiful Soup - close tags more promptly?

2022-10-24 Thread Chris Angelico
On Tue, 25 Oct 2022 at 04:22, Peter J. Holzer  wrote:
> There may be several reasons:
>
> * Historically, some browsers differed in which end tags were actually
>   optional. Since (AFAIK) no mainstream browser ever implemented a real
>   SGML parser (they were always "tag soup" parsers with lots of ad-hoc
>   rules) this sometimes even changed within the same browser depending
>   on context (e.g. a simple table might work but nested tables wouldn't).
>   So people started to use end-tags defensively.
> * XHTML was for some time popular and it doesn't have any optional tags.
>   So people got into the habit of always using end tags and writing
>   empty tags as <br />.
> * Aesthetics: Always writing the end tags is more consistent and may
>   look more balanced.
> * Cargo-cult: People saw other people do that and copied the habit
>   without thinking about it.
>
>
> > Are you saying that it's better to omit them all?
>
> If you want to conserve keystrokes :-)
>
> I think it doesn't matter. Both are valid.
>
> > More importantly: Would you omit all the </p> closing tags you can, or
> > would you include them?
>
> I usually write them.

Interesting. So which of the above reasons is yours? Personally, I do
it for a slightly different reason: Many end tags are *situationally*
optional, and it's much easier to debug code when you
change/insert/remove something and nothing changes, than when doing so
affects the implicit closing tags.

> I also indent the contents of an element, so I
> would write your example as:
>
> 
> <html>
>   <body>
>     Hello, world!
>     <p>
>       Paragraph 2
>     </p>
>     <p>
>       Hey look, a third paragraph!
>     </p>
>   </body>
> </html>
>
> (As you can see I would also include the body tags to make that element
> explicit. I would normally also add a bit of boilerplate (especially a
> head with a charset and viewport definition), but I omit them here since
> they would change the parse tree)
>

Yeah - any REAL page would want quite a bit (very few pages these days
manage without a style sheet, and it seems that hardly any survive
without importing a few gigabytes of JavaScript, but that's not
mandatory), but in ancient pages, there's still a well-defined parse
structure for every tag sequence.

One thing I find quite interesting, though, is the way that browsers
*differ* in the face of bad nesting of tags. Recently I was struggling
to figure out a problem with an HTML form, and eventually found that
there was a spurious <form> tag way up higher in the page. Forms don't
nest, so that's invalid, but different browsers had slightly different
ways of showing it. (Obviously the W3C Validator was the most helpful
tool here, since it reports it as an error rather than constructing
any sort of DOM tree.)

ChrisA


Re: Beautiful Soup - close tags more promptly?

2022-10-24 Thread Roel Schroeven

Jon Ribbens via Python-list wrote on 24/10/2022 at 19:01:

> On 2022-10-24, Chris Angelico  wrote:
> > On Tue, 25 Oct 2022 at 02:45, Jon Ribbens via Python-list wrote:
> >> Adding in the omitted <head>, </head>, <body>, </body>, and </html>
> >> would make no difference and there's no particular reason to recommend
> >> doing so as far as I'm aware.
> >
> > And yet most people do it. Why?
>
> They agree with Tim Peters that "Explicit is better than implicit",
> I suppose? ;-)


I don't write all that much HTML, but when I do, I include those tags 
largely for that reason indeed. We don't write HTML just for the 
browser, we also write it for the web developer. And I think it's easier 
for the web developer when the different sections are clearly 
distinguished, and what better way to do it than use their tags.



> > More importantly: Would you omit all the </p> closing tags you can, or
> > would you include them?
>
> It would depend on how much content was inside them I guess.
> Something like:
>
>   <ul>
>     <li>First item
>     <li>Second item
>     <li>Third item
>   </ul>
>
> is very easy to understand, but if each item was many lines long then it
> may be less confusing to explicitly close - not least for indentation
> purposes.

I mostly include closing tags, if for no other reason than that I have 
the impression that editors generally work better (i.e. get things like 
indentation and syntax highlighting right) that way.


--
"Je ne suis pas d’accord avec ce que vous dites, mais je me battrai jusqu’à
la mort pour que vous ayez le droit de le dire."
-- Attribué à Voltaire
"I disapprove of what you say, but I will defend to the death your right to
say it."
-- Attributed to Voltaire
"Ik ben het niet eens met wat je zegt, maar ik zal je recht om het te zeggen
tot de dood toe verdedigen"
-- Toegeschreven aan Voltaire


Re: Beautiful Soup - close tags more promptly?

2022-10-24 Thread Peter J. Holzer
On 2022-10-25 03:09:33 +1100, Chris Angelico wrote:
> On Tue, 25 Oct 2022 at 02:45, Jon Ribbens via Python-list
>  wrote:
> > On 2022-10-24, Chris Angelico  wrote:
> > > On Mon, 24 Oct 2022 at 23:22, Peter J. Holzer  wrote:
> > >> Yes, I got that. What I wanted to say was that this is indeed a bug in
> > >> html.parser and not an error (or sloppiness, as you called it) in the
> > >> input or ambiguity in the HTML standard.
> > >
> > > I described the HTML as "sloppy" for a number of reasons, but I was of
> > > the understanding that it's generally recommended to have the closing
> > > tags. Not that it matters much.
> >
> > Some elements don't need close tags, or even open tags. Unless you're
> > using XHTML you don't need them and indeed for the case of void tags
> > > (e.g. <br>, <img>) you must not include the close tags.
> 
> Yep, I'm aware of void tags, but I'm talking about the container tags
> - in this case, <li> and <p> - which, in a lot of older HTML pages,
> are treated as "separator" tags. Consider this content:
> 
> <HTML>
> Hello, world!
> <P>
> Paragraph 2
> <P>
> Hey look, a third paragraph!
> </HTML>
> 
> Stick a doctype onto that and it should be valid HTML5, but as it is,
> it's the exact sort of thing that was quite common in the 90s.
> 
> The <p> tag is not a void tag, but according to the spec, it's legal
> to omit the </p> if the element is followed directly by another <p>
> element (or any of a specific set of others), or if there is no
> further content.

Right. The parser knows the structure of an HTML document, which tags
are optional and which elements can be inside of which other elements.
For SGML-based HTML versions (2.0 to 4.01) this is formally described by
the DTD.

So when parsing your file, an HTML parser would work like this:

<HTML> - Yup, I expect an HTML element here:
    HTML
Hello, world! - #PCDATA? Not allowed as a child of HTML. There must
    be a HEAD and a BODY, both of which have optional start tags.
    HEAD can't contain #PCDATA either, so we must be inside of BODY
    and HEAD was empty:
    HTML
      ├─ HEAD
      └─ BODY
          └─ #PCDATA: Hello, world!
<P> - Allowed in BODY, so just add that:
    HTML
      ├─ HEAD
      └─ BODY
          ├─ #PCDATA: Hello, world!
          └─ P
Paragraph 2 - #PCDATA is allowed in P, so add it as a child:
    HTML
      ├─ HEAD
      └─ BODY
          ├─ #PCDATA: Hello, world!
          └─ P
              └─ #PCDATA: Paragraph 2
<P> - Not allowed inside of P, so that implicitly closes the
    previous P element and we go up one level:
    HTML
      ├─ HEAD
      └─ BODY
          ├─ #PCDATA: Hello, world!
          ├─ P
          │   └─ #PCDATA: Paragraph 2
          └─ P
Hey look, a third paragraph! - Same as above:
    HTML
      ├─ HEAD
      └─ BODY
          ├─ #PCDATA: Hello, world!
          ├─ P
          │   └─ #PCDATA: Paragraph 2
          └─ P
              └─ #PCDATA: Hey look, a third paragraph!
</HTML> - The end tags of P and BODY are optional, so the end of
    HTML closes them implicitly, and we have our final parse tree
    (unchanged from the last step):
    HTML
      ├─ HEAD
      └─ BODY
          ├─ #PCDATA: Hello, world!
          ├─ P
          │   └─ #PCDATA: Paragraph 2
          └─ P
              └─ #PCDATA: Hey look, a third paragraph!

For a human, the <P> tags might feel like separators here. But
syntactically they aren't - they start a new element. Note especially
that "Hello, world!" is not part of a P element but a direct child of
BODY (which may or may not be intended by the author).
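
The walk-through above can be sketched in a few lines of Python. This is
only a toy, assuming a document that uses nothing but HTML and P tags; the
names TinyTreeBuilder, Node, outline and SAMPLE are made up for this
illustration, and it is not how any real browser or Beautiful Soup builder
is implemented. It uses the standard library's html.parser purely as a
tokenizer and applies two rules by hand: content directly under HTML implies
an empty HEAD and an open BODY, and a P start tag closes an open P.

```python
from html.parser import HTMLParser


class Node:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []  # child Nodes or plain text strings


class TinyTreeBuilder(HTMLParser):
    """Toy tree builder: only understands HTML and P, as in the walk-through."""

    def __init__(self):
        super().__init__()
        self.root = None
        self.current = None

    def _open(self, name):
        node = Node(name, self.current)
        if self.current is None:
            self.root = node
        else:
            self.current.children.append(node)
        self.current = node

    def _ensure_body(self):
        # Content directly under HTML implies an empty HEAD and an open BODY.
        if self.current is None:
            self._open("html")
        if self.current.name == "html":
            self._open("head")
            self.current = self.current.parent  # HEAD stays empty
            self._open("body")

    def handle_starttag(self, tag, attrs):
        if tag == "html":
            self._open("html")
            return
        self._ensure_body()
        if tag == "p" and self.current.name == "p":
            # P may not contain P, so the open P is closed implicitly.
            self.current = self.current.parent
        self._open(tag)

    def handle_data(self, data):
        text = data.strip()
        if text:
            self._ensure_body()
            self.current.children.append(text)

    def handle_endtag(self, tag):
        # An end tag also closes everything still open inside that element.
        node = self.current
        while node is not None and node.name != tag:
            node = node.parent
        if node is not None:
            self.current = node.parent


def outline(node, depth=0):
    lines = ["  " * depth + node.name.upper()]
    for child in node.children:
        if isinstance(child, Node):
            lines.extend(outline(child, depth + 1))
        else:
            lines.append("  " * (depth + 1) + "#PCDATA: " + child)
    return lines


SAMPLE = (
    "<HTML>\nHello, world!\n<P>\nParagraph 2\n<P>\n"
    "Hey look, a third paragraph!\n</HTML>\n"
)

builder = TinyTreeBuilder()
builder.feed(SAMPLE)
builder.close()
print("\n".join(outline(builder.root)))
```

The printed outline has "Hello, world!" as a direct child of BODY and the
two P elements as siblings, matching the final tree in the walk-through.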

> 
> > Adding in the omitted <head>, </head>, <body>, </body>, and </html>
> > would make no difference and there's no particular reason to recommend
> > doing so as far as I'm aware.
> 
> And yet most people do it. Why?

There may be several reasons:

* Historically, some browsers differed in which end tags were actually
  optional. Since (AFAIK) no mainstream browser ever implemented a real
  SGML parser (they were always "tag soup" parsers with lots of ad-hoc
  rules) this sometimes even changed within the same browser depending
  on context (e.g. a simple table might work but nested tables wouldn't).
  So people started to use end-tags defensively.
* XHTML was for some time popular and it doesn't have any optional tags.
  So people got into the habit of always using end tags and writing
  empty tags as <br />.
* Aesthetics: Always writing the end tags is more consistent and may
  look more balanced.
* Cargo-cult: People saw other people do that and copied the habit
  without thinking about it.


> Are you saying that it's better to omit them all?

If you want to conserve keystrokes :-)

I think it doesn't matter. Both are valid.

> More importantly: Would you omit all the </p> closing tags you can, or
> would you include them?

I usually write them. I also indent the contents of an 

Re: Beautiful Soup - close tags more promptly?

2022-10-24 Thread Jon Ribbens via Python-list
On 2022-10-24, Chris Angelico  wrote:
> On Tue, 25 Oct 2022 at 02:45, Jon Ribbens via Python-list
> wrote:
>>
>> On 2022-10-24, Chris Angelico  wrote:
>> > On Mon, 24 Oct 2022 at 23:22, Peter J. Holzer  wrote:
>> >> Yes, I got that. What I wanted to say was that this is indeed a bug in
>> >> html.parser and not an error (or sloppiness, as you called it) in the
>> >> input or ambiguity in the HTML standard.
>> >
>> > I described the HTML as "sloppy" for a number of reasons, but I was of
>> > the understanding that it's generally recommended to have the closing
>> > tags. Not that it matters much.
>>
>> Some elements don't need close tags, or even open tags. Unless you're
>> using XHTML you don't need them and indeed for the case of void tags
>> (e.g. <br>, <img>) you must not include the close tags.
>
> Yep, I'm aware of void tags, but I'm talking about the container tags
> - in this case, <li> and <p> - which, in a lot of older HTML pages,
> are treated as "separator" tags.

Yes, hence why I went on to talk about container tags.

> Consider this content:
>
> <HTML>
> Hello, world!
> <P>
> Paragraph 2
> <P>
> Hey look, a third paragraph!
> </HTML>
>
> Stick a doctype onto that and it should be valid HTML5,

Nope, it's missing a <title>.

>> Adding in the omitted <head>, </head>, <body>, </body>, and </html>
>> would make no difference and there's no particular reason to recommend
>> doing so as far as I'm aware.
>
> And yet most people do it. Why?

They agree with Tim Peters that "Explicit is better than implicit",
I suppose? ;-)

> Are you saying that it's better to omit them all?

No, I'm saying neither option is necessarily better than the other.

> More importantly: Would you omit all the </p> closing tags you can, or
> would you include them?

It would depend on how much content was inside them I guess.
Something like:

  <ul>
    <li>First item
    <li>Second item
    <li>Third item
  </ul>

is very easy to understand, but if each item was many lines long then it
may be less confusing to explicitly close - not least for indentation
purposes.


Re: Beautiful Soup - close tags more promptly?

2022-10-24 Thread Chris Angelico
On Tue, 25 Oct 2022 at 02:45, Jon Ribbens via Python-list
 wrote:
>
> On 2022-10-24, Chris Angelico  wrote:
> > On Mon, 24 Oct 2022 at 23:22, Peter J. Holzer  wrote:
> >> Yes, I got that. What I wanted to say was that this is indeed a bug in
> >> html.parser and not an error (or sloppiness, as you called it) in the
> >> input or ambiguity in the HTML standard.
> >
> > I described the HTML as "sloppy" for a number of reasons, but I was of
> > the understanding that it's generally recommended to have the closing
> > tags. Not that it matters much.
>
> Some elements don't need close tags, or even open tags. Unless you're
> using XHTML you don't need them and indeed for the case of void tags
> (e.g. <br>, <img>) you must not include the close tags.

Yep, I'm aware of void tags, but I'm talking about the container tags
- in this case, <li> and <p> - which, in a lot of older HTML pages,
are treated as "separator" tags. Consider this content:

<HTML>
Hello, world!
<P>
Paragraph 2
<P>
Hey look, a third paragraph!
</HTML>

Stick a doctype onto that and it should be valid HTML5, but as it is,
it's the exact sort of thing that was quite common in the 90s. (I'm
not sure when lowercase tags became more popular, but in any case (pun
intended), that won't affect validity.)

The <p> tag is not a void tag, but according to the spec, it's legal
to omit the </p> if the element is followed directly by another <p>
element (or any of a specific set of others), or if there is no
further content.

> Adding in the omitted <head>, </head>, <body>, </body>, and </html>
> would make no difference and there's no particular reason to recommend
> doing so as far as I'm aware.

And yet most people do it. Why? Are you saying that it's better to
omit them all?

More importantly: Would you omit all the </p> closing tags you can, or
would you include them?

ChrisA


Re: Beautiful Soup - close tags more promptly?

2022-10-24 Thread Jon Ribbens via Python-list
On 2022-10-24, Chris Angelico  wrote:
> On Mon, 24 Oct 2022 at 23:22, Peter J. Holzer  wrote:
>> Yes, I got that. What I wanted to say was that this is indeed a bug in
>> html.parser and not an error (or sloppiness, as you called it) in the
>> input or ambiguity in the HTML standard.
>
> I described the HTML as "sloppy" for a number of reasons, but I was of
> the understanding that it's generally recommended to have the closing
> tags. Not that it matters much.

Some elements don't need close tags, or even open tags. Unless you're
using XHTML you don't need them and indeed for the case of void tags
(e.g. <br>, <img>) you must not include the close tags.

A minimal HTML file might look like this:


Minimal HTML file
Minimal HTML fileThis is a minimal HTML file.

which would be parsed into this:



  

Minimal HTML file
  
  

  Minimal HTML file
  This is a minimal HTML file.

  


Adding in the omitted <head>, </head>, <body>, </body>, and </html>
would make no difference and there's no particular reason to recommend
doing so as far as I'm aware.


Re: Beautiful Soup - close tags more promptly?

2022-10-24 Thread Chris Angelico
On Mon, 24 Oct 2022 at 23:22, Peter J. Holzer  wrote:
>
> On 2022-10-24 21:56:13 +1100, Chris Angelico wrote:
> > On Mon, 24 Oct 2022 at 21:33, Peter J. Holzer  wrote:
> > > Ron has already noted that the lxml and html5 parser do the right thing,
> > > so just for the record:
> > >
> > > The HTML fragment above is well-formed and contains a number of li
> > > elements at the same level directly below the ol element, not lots of
> > > nested li elements. The end tag of the li element is optional (except in
> > > XHTML) and li elements don't nest.
> >
> > That's correct. However, parsing it with html.parser and then
> > reconstituting it as shown in the example code results in all the
> >  tags coming up right before the , indicating that the 
> > tags were parsed as deeply nested rather than as siblings.
>
> Yes, I got that. What I wanted to say was that this is indeed a bug in
> html.parser and not an error (or sloppyness, as you called it) in the
> input or ambiguity in the HTML standard.

I described the HTML as "sloppy" for a number of reasons, but I was of
the understanding that it's generally recommended to have the closing
tags. Not that it matters much.

> > which html5lib seems to be doing fine. Whether
> > it has other issues, I don't know, but I guess I'll find out
>
> The link somebody posted mentions that it's "very slow". Which may or
> may not be a problem when you have to parse 9000 files. But if it does
> implement HTML5 correctly, it should parse any file the same as a modern
> browser does (maybe excluding quirks mode).
>

Yeah. TBH I think the two-hour run time is primarily dominated by
network delays, not parsing time, but if I had a service where people
could upload HTML to be parsed, that might affect throughput.

For the record, if anyone else is considering html5lib: It is likely
"fast enough", even if not fast. Give it a try.

(And I know what slow parsing feels like. Parsing a ~100MB file with a
decently-fast grammar-based lexer takes a good while. Parsing the same
content after it's been converted to JSON? Fast.)

ChrisA


Re: Beautiful Soup - close tags more promptly?

2022-10-24 Thread Peter J. Holzer
On 2022-10-24 21:56:13 +1100, Chris Angelico wrote:
> On Mon, 24 Oct 2022 at 21:33, Peter J. Holzer  wrote:
> > Ron has already noted that the lxml and html5 parser do the right thing,
> > so just for the record:
> >
> > The HTML fragment above is well-formed and contains a number of li
> > elements at the same level directly below the ol element, not lots of
> > nested li elements. The end tag of the li element is optional (except in
> > XHTML) and li elements don't nest.
> 
> That's correct. However, parsing it with html.parser and then
> reconstituting it as shown in the example code results in all the
> </li> tags coming up right before the </ol>, indicating that the <li>
> tags were parsed as deeply nested rather than as siblings.

Yes, I got that. What I wanted to say was that this is indeed a bug in
html.parser and not an error (or sloppiness, as you called it) in the
input or ambiguity in the HTML standard.


> In order to get a successful parse out of this, I need something which
> sees them as siblings,

Right, but Roel (correct name this time) had already posted that lxml
and html5lib parse this correctly, so I saw no need to belabour that
point.

> which html5lib seems to be doing fine. Whether
> it has other issues, I don't know, but I guess I'll find out

The link somebody posted mentions that it's "very slow". Which may or
may not be a problem when you have to parse 9000 files. But if it does
implement HTML5 correctly, it should parse any file the same as a modern
browser does (maybe excluding quirks mode).

hp

-- 
   _  | Peter J. Holzer| Story must make more sense than reality.
|_|_) ||
| |   | h...@hjp.at |-- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |   challenge!"




Re: Beautiful Soup - close tags more promptly?

2022-10-24 Thread Chris Angelico
On Mon, 24 Oct 2022 at 21:33, Peter J. Holzer  wrote:
> Ron has already noted that the lxml and html5 parser do the right thing,
> so just for the record:
>
> The HTML fragment above is well-formed and contains a number of li
> elements at the same level directly below the ol element, not lots of
> nested li elements. The end tag of the li element is optional (except in
> XHTML) and li elements don't nest.

That's correct. However, parsing it with html.parser and then
reconstituting it as shown in the example code results in all the
</li> tags coming up right before the </ol>, indicating that the <li>
tags were parsed as deeply nested rather than as siblings.

In order to get a successful parse out of this, I need something which
sees them as siblings, which html5lib seems to be doing fine. Whether
it has other issues, I don't know, but I guess I'll find out - it's
currently running on the live site and taking several hours (due to
network delays and the server being slow, so I don't really want to
parallelize and overload the thing).

ChrisA


Re: Beautiful Soup - close tags more promptly?

2022-10-24 Thread Peter J. Holzer
On 2022-10-24 12:32:11 +0200, Peter J. Holzer wrote:
> Ron has already noted that the lxml and html5 parser do the right thing,
  ^^^
  Oops, sorry. That was Roel.

hp



-- 
   _  | Peter J. Holzer| Story must make more sense than reality.
|_|_) ||
| |   | h...@hjp.at |-- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |   challenge!"




Re: Beautiful Soup - close tags more promptly?

2022-10-24 Thread Peter J. Holzer
On 2022-10-24 13:29:13 +1100, Chris Angelico wrote:
> Parsing ancient HTML files is something Beautiful Soup is normally
> great at. But I've run into a small problem, caused by this sort of
> sloppy HTML:
> 
> from bs4 import BeautifulSoup
> # See: https://gsarchive.net/gilbert/plays/princess/tennyson/tenniv.htm
> blob = b"""
> <OL>
> <LI>'THERE sinks the nebulous star we call the Sun,
> <LI>If that hypothesis of theirs be sound,'
[...]
> <LI>Stirring a sudden transport rose and fell.
> </OL>
> """
> soup = BeautifulSoup(blob, "html.parser")
> print(soup)
> 
> 
> On this small snippet, it works acceptably, but puts a large number of
> </li> tags immediately before the </ol>.

Ron has already noted that the lxml and html5 parser do the right thing,
so just for the record:

The HTML fragment above is well-formed and contains a number of li
elements at the same level directly below the ol element, not lots of
nested li elements. The end tag of the li element is optional (except in
XHTML) and li elements don't nest.
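
To illustrate the difference, here is a minimal sketch using only the
standard library's html.parser as a tokenizer. DepthProbe and max_depth are
hypothetical names, and the stack handling is a simplification written for
this example, not the code of any real tree builder. It measures how deep
the open-element stack gets for a long list of unclosed <li> items, once
treating every start tag as nesting until an explicit end tag, and once
closing an open li when the next li starts.

```python
from html.parser import HTMLParser


class DepthProbe(HTMLParser):
    """Track the deepest open-element stack seen while tokenizing."""

    def __init__(self, imply_li_end):
        super().__init__()
        self.imply_li_end = imply_li_end
        self.stack = []
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        if self.imply_li_end and tag == "li":
            # Look upwards for an open li in the same list; stop at the
            # enclosing ol/ul so items of an outer list are left alone.
            for i in range(len(self.stack) - 1, -1, -1):
                if self.stack[i] in ("ol", "ul"):
                    break
                if self.stack[i] == "li":
                    del self.stack[i:]  # implicit </li>
                    break
        self.stack.append(tag)
        self.max_depth = max(self.max_depth, len(self.stack))

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the innermost matching open element.
            idx = len(self.stack) - 1 - self.stack[::-1].index(tag)
            del self.stack[idx:]


def max_depth(markup, imply_li_end):
    probe = DepthProbe(imply_li_end)
    probe.feed(markup)
    probe.close()
    return probe.max_depth


markup = "<ol>" + "".join(f"<li>line {n}\n" for n in range(2000)) + "</ol>"
print(max_depth(markup, imply_li_end=False))  # 2001: every li nests in the last
print(max_depth(markup, imply_li_end=True))   # 2: all li are siblings under ol
```

With a few thousand lines of verse, the first reading produces a tree deeper
than Python's default recursion limit (typically 1000), which is consistent
with the recursion error reported at the start of the thread.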

hp

-- 
   _  | Peter J. Holzer| Story must make more sense than reality.
|_|_) ||
| |   | h...@hjp.at |-- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |   challenge!"




Re: Beautiful Soup - close tags more promptly?

2022-10-24 Thread Roel Schroeven

(Oops, accidentally only sent to Chris instead of to the list)

On 24/10/2022 at 10:02, Chris Angelico wrote:
> On Mon, 24 Oct 2022 at 18:43, Roel Schroeven wrote:
> > Using html5lib (install package html5lib) instead of html.parser seems
> > to do the trick: it inserts </li> right before the next <li>, and one
> > before the closing </ol>. On my system the same happens when I don't
> > specify a parser, but IIRC that's a bit fragile because other systems
> > can choose different parsers if you don't explicitly specify one.
>
> Ah, cool. Thanks. I'm not entirely sure of the various advantages and
> disadvantages of the different parsers; is there a tabulation
> anywhere, or at least a list of recommendations on choosing a suitable
> parser?

There's a bit of information here:
https://beautiful-soup-4.readthedocs.io/en/latest/#installing-a-parser

Not much but maybe it can be helpful.

> I'm dealing with a HUGE mess of different coding standards, all the
> way from 1990s-level stuff (images for indentation, tables for
> formatting, and ) up through HTML4 (a good few
> of the pages have at least some  tags and declare their
> encodings, mostly ISO-8859-1 or similar), to fairly modern HTML5.
> There's even a couple of pages that use frames - yes, the old style
> with a <noframes> block in case the browser can't handle it. I went
> with html.parser on the expectation that it'd give the best "across
> all standards" results, but I'll give html5lib a try and see if it
> does better.
>
> Would rather not try to use different parsers for different files, but
> if necessary, I'll figure something out.
>
> (For reference, this is roughly 9000 HTML files that have to be
> parsed. Doing things by hand is basically not an option.)

I'd give lxml a try too. Maybe try to preprocess the HTML using 
html-tidy (https://www.html-tidy.org/), that might actually do a pretty 
good job of getting rid of all kinds of historical inconsistencies.
Somehow checking if any solution works for thousands of input files will 
always be a pain, I'm afraid.


--
"I've come up with a set of rules that describe our reactions to technologies:
1. Anything that is in the world when you’re born is normal and ordinary and is
   just a natural part of the way the world works.
2. Anything that's invented between when you’re fifteen and thirty-five is new
   and exciting and revolutionary and you can probably get a career in it.
3. Anything invented after you're thirty-five is against the natural order of 
things."
-- Douglas Adams, The Salmon of Doubt



Re: Beautiful Soup - close tags more promptly?

2022-10-24 Thread Roel Schroeven

On 24/10/2022 at 9:42, Roel Schroeven wrote:
> Using html5lib (install package html5lib) instead of html.parser seems
> to do the trick: it inserts </li> right before the next <li>, and one
> before the closing </ol>. On my system the same happens when I don't
> specify a parser, but IIRC that's a bit fragile because other systems
> can choose different parsers if you don't explicitly specify one.


Just now I noticed: when I don't specify a parser, BeautifulSoup emits a 
warning with the parser it selected. In one of my venv's it's html5lib, 
in another it's lxml. Both seem to get a correct result.
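
One way to avoid depending on whichever parser happens to be installed is to
choose the name explicitly before building the soup. The following is a
minimal sketch, assuming the usual Beautiful Soup parser names "html5lib",
"lxml" and "html.parser"; pick_parser and PARSER_MODULES are hypothetical
helpers written for this example, not part of bs4.

```python
from importlib.util import find_spec

# Beautiful Soup parser name -> third-party module it needs (assumed mapping).
PARSER_MODULES = {"html5lib": "html5lib", "lxml": "lxml"}


def pick_parser(preferred=("html5lib", "lxml")):
    """Return the first preferred parser whose module is importable."""
    for name in preferred:
        module = PARSER_MODULES.get(name)
        if module is not None and find_spec(module) is not None:
            return name
    # html.parser ships with Python, so it is always available.
    return "html.parser"


parser_name = pick_parser()
print(parser_name)

# Then, assuming bs4 is installed, pass the name explicitly:
#   from bs4 import BeautifulSoup
#   soup = BeautifulSoup(markup, parser_name)
```

The preference order here follows the thread's suggestion of html5lib for
messy HTML; swap the order if speed matters more than leniency.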


--

"I love science, and it pains me to think that to so many are terrified
of the subject or feel that choosing science means you cannot also
choose compassion, or the arts, or be awed by nature. Science is not
meant to cure us of mystery, but to reinvent and reinvigorate it."
-- Robert Sapolsky



Re: Beautiful Soup - close tags more promptly?

2022-10-24 Thread Chris Angelico
On Mon, 24 Oct 2022 at 18:43, Roel Schroeven  wrote:
>
> On 24/10/2022 at 4:29, Chris Angelico wrote:
> > Parsing ancient HTML files is something Beautiful Soup is normally
> > great at. But I've run into a small problem, caused by this sort of
> > sloppy HTML:
> >
> > from bs4 import BeautifulSoup
> > # See: https://gsarchive.net/gilbert/plays/princess/tennyson/tenniv.htm
> > blob = b"""
> > <OL>
> > <LI>'THERE sinks the nebulous star we call the Sun,
> > <LI>If that hypothesis of theirs be sound,'
> > <LI>Said Ida;' let us down and rest:' and we
> > <LI>Down from the lean and wrinkled precipices,
> > <LI>By every coppice-feather'd chasm and cleft,
> > <LI>Dropt thro' the ambrosial gloom to where below
> > <LI>No bigger than a glow-worm shone the tent
> > <LI>Lamp-lit from the inner. Once she lean'd on me,
> > <LI>Descending; once or twice she lent her hand,
> > <LI>And blissful palpitations in the blood,
> > <LI>Stirring a sudden transport rose and fell.
> > </OL>
> > """
> > soup = BeautifulSoup(blob, "html.parser")
> > print(soup)
> >
> >
> > On this small snippet, it works acceptably, but puts a large number of
> > </li> tags immediately before the </ol>. On the original file (see
> > link if you want to try it), this blows right through the default
> > recursion limit, due to the crazy number of "nested" list items.
> >
> > Is there a way to tell BS4 on parse that these <li> elements end at
> > the next <li>, rather than waiting for the final </ol>? This would
> > make tidier output, and also eliminate most of the recursion levels.
> >
> Using html5lib (install package html5lib) instead of html.parser seems
> to do the trick: it inserts </li> right before the next <li>, and one
> before the closing </ol>. On my system the same happens when I don't
> specify a parser, but IIRC that's a bit fragile because other systems
> can choose different parsers if you don't explicitly specify one.
>

Ah, cool. Thanks. I'm not entirely sure of the various advantages and
disadvantages of the different parsers; is there a tabulation
anywhere, or at least a list of recommendations on choosing a suitable
parser?

I'm dealing with a HUGE mess of different coding standards, all the
way from 1990s-level stuff (images for indentation, tables for
formatting, and ) up through HTML4 (a good few
of the pages have at least some <meta> tags and declare their
encodings, mostly ISO-8859-1 or similar), to fairly modern HTML5.
There's even a couple of pages that use frames - yes, the old style
with a <noframes> block in case the browser can't handle it. I went
with html.parser on the expectation that it'd give the best "across
all standards" results, but I'll give html5lib a try and see if it
does better.

Would rather not try to use different parsers for different files, but
if necessary, I'll figure something out.

(For reference, this is roughly 9000 HTML files that have to be
parsed. Doing things by hand is basically not an option.)
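
A rough sketch of how a batch run over that many files might be organised
so that one badly nested page doesn't stop the rest. Everything here is
hypothetical: the parse_tree name, the "*.htm*" pattern, and the parse
callable, which stands in for whatever parses and processes one file (for
example a wrapper around BeautifulSoup with an explicit parser). Files that
hit the recursion limit are recorded instead of aborting the run:

```python
from pathlib import Path


def parse_tree(root, parse, pattern="*.htm*"):
    """Apply parse(bytes) to every matching file under root.

    Returns (results, failures), both dicts keyed by Path. A file whose
    processing raises RecursionError is recorded in failures.
    """
    results, failures = {}, {}
    for path in sorted(Path(root).rglob(pattern)):
        if not path.is_file():
            continue  # skip directories that happen to match the pattern
        try:
            results[path] = parse(path.read_bytes())
        except RecursionError as exc:
            failures[path] = exc
    return results, failures
```

The failures dict then gives a short list of pages to retry with a
different parser, rather than picking a parser per file up front.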

ChrisA


Re: Beautiful Soup - close tags more promptly?

2022-10-24 Thread Roel Schroeven

On 24/10/2022 at 4:29, Chris Angelico wrote:
> Parsing ancient HTML files is something Beautiful Soup is normally
> great at. But I've run into a small problem, caused by this sort of
> sloppy HTML:
>
> from bs4 import BeautifulSoup
> # See: https://gsarchive.net/gilbert/plays/princess/tennyson/tenniv.htm
> blob = b"""
> <OL>
> <LI>'THERE sinks the nebulous star we call the Sun,
> <LI>If that hypothesis of theirs be sound,'
> <LI>Said Ida;' let us down and rest:' and we
> <LI>Down from the lean and wrinkled precipices,
> <LI>By every coppice-feather'd chasm and cleft,
> <LI>Dropt thro' the ambrosial gloom to where below
> <LI>No bigger than a glow-worm shone the tent
> <LI>Lamp-lit from the inner. Once she lean'd on me,
> <LI>Descending; once or twice she lent her hand,
> <LI>And blissful palpitations in the blood,
> <LI>Stirring a sudden transport rose and fell.
> </OL>
> """
> soup = BeautifulSoup(blob, "html.parser")
> print(soup)
>
> On this small snippet, it works acceptably, but puts a large number of
> </li> tags immediately before the </ol>. On the original file (see
> link if you want to try it), this blows right through the default
> recursion limit, due to the crazy number of "nested" list items.
>
> Is there a way to tell BS4 on parse that these <li> elements end at
> the next <li>, rather than waiting for the final </ol>? This would
> make tidier output, and also eliminate most of the recursion levels.

Using html5lib (install package html5lib) instead of html.parser seems
to do the trick: it inserts </li> right before the next <li>, and one
before the closing </ol>. On my system the same happens when I don't
specify a parser, but IIRC that's a bit fragile because other systems
can choose different parsers if you don't explicitly specify one.


--
"I love science, and it pains me to think that so many are terrified
of the subject or feel that choosing science means you cannot also
choose compassion, or the arts, or be awed by nature. Science is not
meant to cure us of mystery, but to reinvent and reinvigorate it."
-- Robert Sapolsky



Beautiful Soup - close tags more promptly?

2022-10-23 Thread Chris Angelico
Parsing ancient HTML files is something Beautiful Soup is normally
great at. But I've run into a small problem, caused by this sort of
sloppy HTML:

from bs4 import BeautifulSoup
# See: https://gsarchive.net/gilbert/plays/princess/tennyson/tenniv.htm
blob = b"""
<OL>
<LI>'THERE sinks the nebulous star we call the Sun,
<LI>If that hypothesis of theirs be sound,'
<LI>Said Ida;' let us down and rest:' and we
<LI>Down from the lean and wrinkled precipices,
<LI>By every coppice-feather'd chasm and cleft,
<LI>Dropt thro' the ambrosial gloom to where below
<LI>No bigger than a glow-worm shone the tent
<LI>Lamp-lit from the inner. Once she lean'd on me,
<LI>Descending; once or twice she lent her hand,
<LI>And blissful palpitations in the blood,
<LI>Stirring a sudden transport rose and fell.
</OL>
"""
soup = BeautifulSoup(blob, "html.parser")
print(soup)


On this small snippet, it works acceptably, but puts a large number of
</li> tags immediately before the </ol>. On the original file (see
link if you want to try it), this blows right through the default
recursion limit, due to the crazy number of "nested" list items.

Is there a way to tell BS4 on parse that these <li> elements end at
the next <li>, rather than waiting for the final </ol>? This would
make tidier output, and also eliminate most of the recursion levels.
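
To make the nesting problem concrete, here is a small sketch using only the
stdlib html.parser module. It is not how Beautiful Soup builds its tree;
DepthCounter and max_depth are invented names, and the sketch ignores void
elements such as <br>. It tracks the stack of open elements and compares two
policies: leaving each <li> open until its list ends, versus closing the
open <li> when the next <li> starts:

```python
from html.parser import HTMLParser


class DepthCounter(HTMLParser):
    """Record the deepest stack of open elements seen while parsing."""

    def __init__(self, close_li):
        super().__init__()
        self.close_li = close_li  # True: a new <li> ends the open one
        self.stack = []
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        if self.close_li and tag == "li":
            # Look back for an open <li>, stopping at the enclosing list.
            for i in range(len(self.stack) - 1, -1, -1):
                if self.stack[i] in ("ol", "ul"):
                    break
                if self.stack[i] == "li":
                    del self.stack[i:]  # implicitly close it
                    break
        self.stack.append(tag)
        self.max_depth = max(self.max_depth, len(self.stack))

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Close everything opened since the matching start tag.
            while self.stack.pop() != tag:
                pass


def max_depth(html, close_li):
    counter = DepthCounter(close_li)
    counter.feed(html)
    counter.close()
    return counter.max_depth
```

With the first policy the depth grows by one for every unclosed <LI> in the
list, which is what eats the recursion limit on a long poem; with the second
it stays at two (the list plus one item) however long the list is.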

ChrisA