Hey,

I would like to write a script which extracts the article text from the
HTML of web pages. The pages have a similar structure because they are all
documentation pages from the same website, the Microsoft Visual Basic for
Applications homepage.

I believe I should first inspect the HTML tree, i.e. the raw HTML returned
by wget, to figure out which nodes tend to have the text content I am
seeking. Should I do that in Firefox or Chrome, or is there a good
standalone tool for that?

Then, could I use this XML parsing library, or is there some other standard
one, to retrieve the text content of the nodes I have identified?
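
To make the second step concrete, here is a rough sketch of what I have in
mind. It uses Python's standard html.parser purely as a stand-in, not this
library, and the id "content" for the article container is a made-up
placeholder until I have inspected the real pages. The names
ArticleTextExtractor and extract_text are my own as well.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# End of one of these tags is treated as a line break in the output.
BLOCK_TAGS = {"p", "div", "li", "tr", "pre", "h1", "h2", "h3", "h4", "h5", "h6"}
# Text inside these tags is code or styling, not article text.
SKIP_TAGS = {"script", "style"}


class ArticleTextExtractor(HTMLParser):
    """Collect the text inside the element whose id equals target_id.

    Limitation: depth is tracked by counting tags, so this assumes the
    markup inside the target element has matching open and close tags.
    """

    def __init__(self, target_id):
        super().__init__(convert_charrefs=True)
        self.target_id = target_id  # hypothetical id, e.g. "content"
        self.depth = 0              # 0 means we are outside the target
        self.skip = 0               # > 0 while inside <script>/<style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            if tag == "br" and self.depth:
                self.chunks.append("\n")
            return
        if self.depth:
            self.depth += 1
            if tag in SKIP_TAGS:
                self.skip += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        if tag in BLOCK_TAGS:
            self.chunks.append("\n")
        if tag in SKIP_TAGS and self.skip:
            self.skip -= 1
        self.depth -= 1

    def handle_data(self, data):
        if self.depth and not self.skip:
            self.chunks.append(data)


def extract_text(html, target_id="content"):
    """Return the target element's text, one non-empty line per block."""
    parser = ArticleTextExtractor(target_id)
    parser.feed(html)
    parser.close()
    text = "".join(parser.chunks)
    # Collapse runs of whitespace within each line and drop empty lines.
    lines = (" ".join(line.split()) for line in text.splitlines())
    return "\n".join(line for line in lines if line)
```

I would call it as extract_text(page_html, "content") on each downloaded
page, replacing "content" with whatever id the inspection step turns up.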

Thanks very much,
Julius
_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
https://mail.gnome.org/mailman/listinfo/xml
