On 19/05/2026 13:00, [email protected] wrote:
Today's Topics:
1. Re: Web scraping (Graeme Gemmill)
2. Re: Web scraping (Ralph Corderoy)
3. Re: Books no longer opening in Calibre (Terry Coles)
----------------------------------------------------------------------
Message: 1
Date: Mon, 18 May 2026 16:41:56 +0100
From: Graeme Gemmill <[email protected]>
To: [email protected]
Subject: Re: [Dorset] Web scraping
Message-ID: <[email protected]>
Content-Type: text/plain; charset=UTF-8; format=flowed
Ralph, can you help with a problem relating to this chunk of HTML please?
I found my place in the tree by finding all the <h2 lines, selecting the
one for area7, then selecting the parent with
section = area_lists[13].parent, which begins
<section aria-labelledby.....
I need to find the value of the class attribute, which could be
"marine-card" or "marine-card warning" but
print(section.contents) starts with the <h2 line, and
print(section.contents[0]) prints only a blank line.
I tried converting the content to a list of lines, but again, it started
with the <h2 line.
Thanks for your help.
Graeme
On 13/05/2026 14:41, Ralph Corderoy wrote:
Hi Graeme,
https://weather.metoffice.gov.uk/specialist-forecasts/coast-and-sea/inshore-waters-forecast
Look particularly in aria-labelledby area7
class="card-content"
class="forecast-block"
etc.
Looking at the HTML, as soon as you click on a lower level tag,
the == $0 disappears
I'm not sure what clicking you mean, but the <section> for 'area7' looks
filled out in the HTML returned by the web server to me.
$ curl -sSg https://weather.metoffice.gov.uk/specialist-forecasts/coast-and-sea/inshore-waters-forecast |
    sed -rn '/labelledby.*\<area7\>/,/<\/section>/p'
<section aria-labelledby="area7" id="inshore-waters-7" class="marine-card warning"
data-value="inshore-waters-7">
<h2 id="area7" class="card-name">Selsey Bill to Lyme Regis (7)</h2>
<div class="card-content">
<p>Strong winds are forecast</p>
<div class="forecast-block">
<h3>24 hour forecast:</h3>
<div class="forecast-info">
<dl>
<dt>Wind</dt>
<dd>West veering northwest 5 or 6, occasionally 7 at first, then decreasing 4
at times later.</dd>
<dt>Sea state</dt>
<dd>Slight or moderate.</dd>
<dt>Weather</dt>
<dd>Showers, thundery at first.</dd>
<dt>Visibility</dt>
<dd>Good, occasionally poor at first.</dd>
</dl>
</div>
</div>
<div class="outlook-block">
<h3>Outlook for the following 24 hours:</h3>
<div class="forecast-info">
<dl>
<dt>Wind</dt>
<dd>North or northwest 3 to 5.</dd>
<dt>Sea state</dt>
<dd>Slight, occasionally smooth later.</dd>
<dt>Weather</dt>
<dd>Showers, thundery at first.</dd>
<dt>Visibility</dt>
<dd>Good, occasionally poor at first.</dd>
</dl>
</div>
</div>
</div>
</section>
$
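For anyone who would rather pull the forecast text out in Python than with sed, here is a minimal sketch. It uses only the standard library's html.parser rather than BeautifulSoup, and it runs on a pasted fragment of the markup above instead of the live page. The class name ForecastPairs and the SAMPLE string are made up for illustration.

```python
from html.parser import HTMLParser

# Fragment copied from the area7 output above, including one wrapped <dd>.
SAMPLE = """\
<dl>
<dt>Wind</dt>
<dd>West veering northwest 5 or 6, occasionally 7 at first, then decreasing 4
at times later.</dd>
<dt>Sea state</dt>
<dd>Slight or moderate.</dd>
</dl>
"""


class ForecastPairs(HTMLParser):
    """Collect (term, description) pairs from <dt>/<dd> elements."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._tag = None   # 'dt' or 'dd' while inside one, otherwise None
        self._term = ''
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in ('dt', 'dd'):
            self._tag = tag
            self._text = []

    def handle_data(self, data):
        if self._tag:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return
        # Collapse the line wrap inside a <dd> into single spaces.
        text = ' '.join(''.join(self._text).split())
        if tag == 'dt':
            self._term = text
        else:
            self.pairs.append((self._term, text))
        self._tag = None


parser = ForecastPairs()
parser.feed(SAMPLE)
parser.close()
for term, text in parser.pairs:
    print(f'{term}: {text}')
```

This assumes every <dd> directly follows the <dt> it belongs to, which holds for the markup quoted here but is not guaranteed for the page in general.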
------------------------------
Message: 2
Date: Tue, 19 May 2026 09:09:10 +0100
From: Ralph Corderoy <[email protected]>
To: Graeme Gemmill <[email protected]>
Cc: [email protected]
Subject: Re: [Dorset] Web scraping
Message-ID: <[email protected]>
Content-Type: text/plain; charset=utf-8
Hi Graeme,
I need to find the value of the class attribute, which could be
"marine-card" or "marine-card warning" but
print(section.contents) starts with the <h2 line, and
print(section.contents[0]) prints only a blank line.
I think you need to know about a Tag's attributes.
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#Tag.attrs
My knowledge of Python is out of date, and I've not used BeautifulSoup
before, but here's some working code.
$ cat parse
#! /usr/bin/python3

import bs4
import requests
import warnings

def main():
    # No parser is named below, so silence bs4's warning about guessing one.
    warnings.simplefilter('ignore', category=bs4.GuessedAtParserWarning)

    url = 'https://weather.metoffice.gov.uk/specialist-forecasts/coast-and-sea/inshore-waters-forecast'
    resp = requests.get(url)
    html = bs4.BeautifulSoup(resp.content)

    # Ask for up to two matches so a duplicated id trips the assert.
    sect = html('section', id='inshore-waters-7', limit=2)
    assert len(sect) == 1, sect
    sect = sect[0]

    # Attributes are read by subscripting the Tag; 'class' is a list.
    cls = sect['class']
    print(sorted(cls))

main()
$
$ ./parse
['marine-card', 'warning']
$
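To tie this back to the original approach of finding the <h2> and taking its parent, here is a small offline sketch. It assumes BeautifulSoup 4 is installed and works on a trimmed copy of the area7 markup rather than the live page. The function name section_classes and the SAMPLE string are made up for illustration. It also shows why section.contents[0] printed a blank line: .contents holds the tag's children, and the first child is just the newline after the opening tag, whereas the class lives in the tag's attributes.

```python
import bs4

# Trimmed copy of the area7 markup quoted earlier in this thread.
SAMPLE = """\
<section aria-labelledby="area7" id="inshore-waters-7" class="marine-card warning"
data-value="inshore-waters-7">
<h2 id="area7" class="card-name">Selsey Bill to Lyme Regis (7)</h2>
<div class="card-content">
<p>Strong winds are forecast</p>
</div>
</section>
"""


def section_classes(html_text, heading_id):
    """Return the class list of the <section> that holds the given <h2>."""
    # Name the parser explicitly so bs4 has nothing to guess.
    soup = bs4.BeautifulSoup(html_text, 'html.parser')
    h2 = soup.find('h2', id=heading_id)
    section = h2.parent

    # .contents lists children only; the first is the newline text node
    # that sits between the opening <section ...> tag and the <h2>.
    assert section.contents[0] == '\n'

    # Attributes belong to the tag itself; 'class' comes back as a list.
    return section.get('class', [])


classes = section_classes(SAMPLE, 'area7')
print(classes)
print('warning' in classes)
```

Testing for 'warning' in the list avoids having to compare against the two possible strings "marine-card" and "marine-card warning".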
That worked a treat. Thank you.
Graeme
--
Next meeting: Online, Jitsi, Tuesday, 2026-06-02 20:00
Check to whom you are replying
Meetings, mailing list, IRC, ... https://dorset.lug.org.uk
New thread, don't hijack: mailto:[email protected]