On Fri, 2021-09-17 at 12:15 +0200, Julius Hamilton via xml wrote:
> Hey,
> 
> I would like to write a script which extracts article text content
> from webpage HTML.

You might want to look at xidel for that.
> 
> I believe I should first inspect the HTML tree, i.e. the raw HTML
> returned by wget, to figure out which nodes tend to have the text
> content I am seeking. Should I do that in Firefox or Chrome, or is
> there a good standalone tool for that?

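The browsers will *modify* the HTML: the element inspector shows the
tree the browser built after parsing, not the raw markup that wget
returns. For example, browsers will insert tbody elements into tables,
and they will change element nesting in some cases.

But that happens in relatively few cases, so the element inspector in
the browser isn't a bad start.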
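
If you want to check the raw markup itself, here is a minimal sketch
in Python. It is not something from this thread, just one way to do
it with the standard library's html.parser, which reports tags as
they appear in the source and does not apply browser-style fix-ups.
The names TextByPath and text_by_path and the sample markup are made
up for illustration. It tallies how many characters of text sit under
each element path, which gives a rough idea of which nodes carry the
article text.

```python
from collections import Counter
from html.parser import HTMLParser

# Void elements never have an end tag, so they are not pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}
# Text inside these elements is code, not article content.
SKIP = {"script", "style"}


class TextByPath(HTMLParser):
    """Hypothetical helper: count text characters per raw element path."""

    def __init__(self):
        super().__init__()
        self.stack = []          # currently open elements, outermost first
        self.chars = Counter()   # "html/body/div/p" -> number of characters

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray end tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and not SKIP.intersection(self.stack):
            self.chars["/".join(self.stack)] += len(text)


def text_by_path(html):
    """Return a Counter mapping element paths to text character counts."""
    parser = TextByPath()
    parser.feed(html)
    parser.close()
    return parser.chars


if __name__ == "__main__":
    # Made-up sample; in practice, read the file that wget saved.
    sample = """<html><body>
    <table><tr><td>cell text</td></tr></table>
    <div class="article"><p>First paragraph.</p><p>Second one.</p></div>
    <script>var x = 1;</script>
    </body></html>"""
    for path, count in text_by_path(sample).most_common():
        print(count, path)
```

The table path comes out as html/body/table/tr/td, with no tbody,
because nothing here rewrites the tree the way a browser does. The
flip side is that badly nested markup is reported as written, so the
paths are only as sane as the source.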

Liam


-- 
Liam Quin, https://www.delightfulcomputing.com/
Available for XML/Document/Information Architecture/XSLT/
XSL/XQuery/Web/Text Processing/A11Y training, work & consulting.
Barefoot Web-slave, antique illustrations:  http://www.fromoldbooks.org

_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
https://mail.gnome.org/mailman/listinfo/xml
