Hey, I would like to write a script that extracts the article text from webpage HTML. The pages have a similar structure because they are all documentation pages from the same website, the Microsoft Visual Basic for Applications homepage.
I believe I should first inspect the HTML tree, i.e. the raw HTML returned by wget, to figure out which nodes tend to hold the text content I am seeking. Should I do that in Firefox or Chrome, or is there a good standalone tool for it? Then, could I use this XML parsing library to retrieve the text content at the nodes I have identified, or is there some other standard one?

Thanks very much,
Julius
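
P.S. To make the question concrete, here is a rough sketch of the kind of extraction I have in mind. It is only an illustration: it uses Python's standard html.parser module rather than libxml2, and the target element (a "main" tag with id "main") is a placeholder guess, not something I have confirmed on the real pages. It also assumes the tags inside the target element are reasonably well balanced. With libxml2 I would expect the equivalent to be its HTML parser plus an XPath query, but that is exactly what I am asking about.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Text inside these elements is code or styling, not article content.
SKIP_TAGS = {"script", "style"}


class ArticleText(HTMLParser):
    """Collect text inside the first element matching a tag name and id."""

    def __init__(self, tag, element_id):
        super().__init__(convert_charrefs=True)
        self.tag = tag
        self.element_id = element_id
        self.depth = 0     # nesting depth inside the target; 0 means outside
        self.skip = 0      # nesting depth inside script/style
        self.done = False  # set once the target element has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS or self.done:
            return
        if self.depth:
            self.depth += 1
            if tag in SKIP_TAGS:
                self.skip += 1
        elif tag == self.tag and dict(attrs).get("id") == self.element_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        if tag in SKIP_TAGS and self.skip:
            self.skip -= 1
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        text = data.strip()
        if self.depth and not self.skip and text:
            self.chunks.append(text)


def extract_text(html, tag="main", element_id="main"):
    """Return the text chunks of the target element, one per line."""
    parser = ArticleText(tag, element_id)
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

The idea would be to save each page with wget, read the file into a string, and pass it to extract_text with whatever tag and id turn out to wrap the article body.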
_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
https://mail.gnome.org/mailman/listinfo/xml