Re: [Dorset] Web scraping (Ralph Corderoy)

Graeme Gemmill Tue, 19 May 2026 07:52:55 -0700

On 19/05/2026 13:00, [email protected] wrote:

Today's Topics:

    1. Re: Web scraping (Graeme Gemmill)
    2. Re: Web scraping (Ralph Corderoy)
    3. Re: Books no longer opening in Calibre (Terry Coles)


----------------------------------------------------------------------

Message: 1
Date: Mon, 18 May 2026 16:41:56 +0100
From: Graeme Gemmill <[email protected]>
To: [email protected]
Subject: Re: [Dorset] Web scraping
Message-ID: <[email protected]>
Content-Type: text/plain; charset=UTF-8; format=flowed

Ralph, can you help with a problem relating to this chunk of HTML please?
I found my place in the tree by finding all the <h2 lines, selecting the
one for area7, then selecting the parent with
section = area_lists[13].parent, which begins
<section aria-labelledby.....
I need to find the value of the class attribute, which could be
"marine-card" or "marine-card warning" but
print(section.contents) starts with the <h2 line, and
print(section.contents[0]) prints only a blank line.
I tried converting the content to a list of lines, but again, it started
with the <h2 line.
Thanks for your help.
Graeme
On 13/05/2026 14:41, Ralph Corderoy wrote:

Hi Graeme,

https://weather.metoffice.gov.uk/specialist-forecasts/coast-and-sea/inshore-waters-forecast
Look particularly in aria-labelledby="area7"
class="card-content"
class="forecast-block"
etc.
Looking at the HTML, as soon as you click on a lower level tag,
the == $0 disappears

I'm not sure what clicking you mean, but the <section> for 'area7' looks
filled out in the HTML returned by the web server to me.

      $ curl -sSg https://weather.metoffice.gov.uk/specialist-forecasts/coast-and-sea/inshore-waters-forecast |
        sed -rn '/labelledby.*\<area7\>/,/<\/section>/p'
      <section aria-labelledby="area7" id="inshore-waters-7" class="marine-card warning" data-value="inshore-waters-7">
      <h2 id="area7" class="card-name">Selsey Bill to Lyme Regis (7)</h2>
      <div class="card-content">
      <p>Strong winds are forecast</p>
      <div class="forecast-block">
      <h3>24 hour forecast:</h3>
      <div class="forecast-info">
      <dl>
      <dt>Wind</dt>
      <dd>West veering northwest 5 or 6, occasionally 7 at first, then decreasing 4 at times later.</dd>
      <dt>Sea state</dt>
      <dd>Slight or moderate.</dd>
      <dt>Weather</dt>
      <dd>Showers, thundery at first.</dd>
      <dt>Visibility</dt>
      <dd>Good, occasionally poor at first.</dd>
      </dl>
      </div>
      </div>
      <div class="outlook-block">
      <h3>Outlook for the following 24 hours:</h3>
      <div class="forecast-info">
      <dl>
      <dt>Wind</dt>
      <dd>North or northwest 3 to 5.</dd>
      <dt>Sea state</dt>
      <dd>Slight, occasionally smooth later.</dd>
      <dt>Weather</dt>
      <dd>Showers, thundery at first.</dd>
      <dt>Visibility</dt>
      <dd>Good, occasionally poor at first.</dd>
      </dl>
      </div>
      </div>
      </div>
      </section>
      $
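
For anyone who would rather stay in Python, the sed address range above
(/start/,/end/p) can be approximated with a small generator. This is only a
sketch: sed_range() and the SAMPLE lines are made up for illustration, and
sed's \< and \> word anchors are replaced by Python's \b.

```python
import re

def sed_range(lines, start_pat, end_pat):
    """Yield each run of lines from one matching start_pat through the
    next later line matching end_pat, like sed -n '/start/,/end/p'."""
    start = re.compile(start_pat)
    end = re.compile(end_pat)
    inside = False
    for line in lines:
        if not inside:
            if start.search(line):
                # As in sed, the end pattern is only tested on later lines.
                inside = True
                yield line
        else:
            yield line
            if end.search(line):
                inside = False

# Hypothetical, cut-down input; the real page has far more markup.
SAMPLE = """\
<section aria-labelledby="area6" class="marine-card">
<h2 id="area6">(placeholder)</h2>
</section>
<section aria-labelledby="area7" class="marine-card warning">
<h2 id="area7">Selsey Bill to Lyme Regis (7)</h2>
</section>
<section aria-labelledby="area70" class="marine-card">
</section>
""".splitlines()

selected = list(sed_range(SAMPLE, r'labelledby.*\barea7\b', r'</section>'))
print('\n'.join(selected))
```

The \b after area7 plays the part of \> and stops "area70" from matching.
Like the sed one-liner, this relies on each tag sitting on its own line, so
an HTML parser is the more robust choice for anything beyond a quick look.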




------------------------------

Message: 2
Date: Tue, 19 May 2026 09:09:10 +0100
From: Ralph Corderoy <[email protected]>
To: Graeme Gemmill <[email protected]>
Cc: [email protected]
Subject: Re: [Dorset] Web scraping
Message-ID: <[email protected]>
Content-Type: text/plain; charset=utf-8

Hi Graeme,

I need to find the value of the class attribute, which could be
"marine-card" or "marine-card warning" but
print(section.contents) starts with the <h2 line, and
print(section.contents[0]) prints only a blank line.

I think you need to know about a Tag's attributes.
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#Tag.attrs

My knowledge of Python is out of date, and I've not used BeautifulSoup
before, but here's some working code.

     $ cat parse
     #! /usr/bin/python3

     import bs4
     import requests
     import warnings

     def main():
        warnings.simplefilter('ignore', category=bs4.GuessedAtParserWarning)
        url = 'https://weather.metoffice.gov.uk/specialist-forecasts/coast-and-sea/inshore-waters-forecast'
        resp = requests.get(url)
        html = bs4.BeautifulSoup(resp.content)

        sect = html('section', id='inshore-waters-7', limit=2)
        assert len(sect) == 1, sect
        sect = sect[0]

        cls = sect['class']
        print(sorted(cls))

     main()
     $
     $ ./parse
     ['marine-card', 'warning']
     $
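
The difference between a tag's children and its attributes can be seen
without fetching the page. The sketch below assumes a trimmed-down copy of
the <section> quoted earlier in the thread (the HTML string is illustrative,
not the live page) and names the parser explicitly as 'html.parser'. It
starts from the <h2> and takes .parent, as in the original question.

```python
import bs4

# Illustrative, shortened copy of the <section> from the thread.
HTML = """\
<section aria-labelledby="area7" id="inshore-waters-7" class="marine-card warning">
<h2 id="area7" class="card-name">Selsey Bill to Lyme Regis (7)</h2>
<div class="card-content"><p>Strong winds are forecast</p></div>
</section>
"""

soup = bs4.BeautifulSoup(HTML, 'html.parser')

# Start from the <h2>, then step up to the enclosing <section>.
h2 = soup.find('h2', id='area7')
section = h2.parent

# .contents lists the children only.  The first child is the newline
# between '<section ...>' and '<h2 ...>', hence the "blank line".
first_child = section.contents[0]

# Attributes belong to the tag itself.  'class' is multi-valued, so
# BeautifulSoup hands back a list of class names.
classes = section['class']
has_warning = 'warning' in classes

print(section.name, repr(first_child), classes, has_warning)
```

section.attrs gives the whole attribute dictionary, and section.get('class')
returns None rather than raising KeyError when the attribute is absent.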

That worked a treat. Thank you.
Graeme

--
 Next meeting: Online, Jitsi, Tuesday, 2026-06-02 20:00
 Check to whom you are replying
 Meetings, mailing list, IRC, ...  https://dorset.lug.org.uk
 New thread, don't hijack:  mailto:[email protected]
