On Fri, 2021-09-17 at 12:15 +0200, Julius Hamilton via xml wrote:
> Hey,
>
> I would like to write a script which extracts article text content
> from webpage HTML.

You might want to look at xidel for that.

> I believe I should first inspect the HTML tree, i.e. the raw HTML
> returned by wget, to figure out which nodes tend to have the text
> content I am seeking. Should I do that in Firefox or Chrome, or is
> there a good standalone tool for that?

The browsers will *modify* the HTML. For example, they will insert
tbody elements into tables, and they will change element nesting in
some cases. But that happens in relatively few cases, so the element
inspector in the browser isn't a bad start.

Liam

-- 
Liam Quin, https://www.delightfulcomputing.com/
Available for XML/Document/Information Architecture/XSLT/
XSL/XQuery/Web/Text Processing/A11Y training, work & consulting.
Barefoot Web-slave, antique illustrations: http://www.fromoldbooks.org
_______________________________________________
xml mailing list, project page http://xmlsoft.org/
xml@gnome.org
https://mail.gnome.org/mailman/listinfo/xml
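
To illustrate the point about browsers rewriting markup, here is a minimal
sketch that prints the element outline of raw HTML exactly as it was served,
using only Python's standard html.parser module. The names OutlineParser,
outline and SAMPLE are made up for this example, and the sample markup is
hypothetical. Since html.parser does not repair tag soup, the depth is only
approximate on pages that omit end tags such as </p> or </li>.

```python
from html.parser import HTMLParser

# Elements that never take an end tag, so they must not increase the depth.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class OutlineParser(HTMLParser):
    """Record each start tag with its nesting depth, as written in the source."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)
        if tag not in VOID:
            self.depth += 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/>: record it, depth is unchanged.
        self.lines.append("  " * self.depth + tag)

    def handle_endtag(self, tag):
        if tag not in VOID and self.depth > 0:
            self.depth -= 1


def outline(html):
    """Return the indented list of element names found in the raw HTML."""
    parser = OutlineParser()
    parser.feed(html)
    parser.close()
    return parser.lines


# Hypothetical sample: a table written without a tbody element.
SAMPLE = "<html><body><table><tr><td>cell</td></tr></table></body></html>"

if __name__ == "__main__":
    print("\n".join(outline(SAMPLE)))
```

Run on the sample, the outline goes straight from table to tr, whereas the
element inspector in a browser would show a tbody in between. Feeding the
function a file saved by wget gives the same kind of view of the markup as
it was actually sent.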