Tim Chase wrote:
Is it possible to use a program to get only the text of the HTML, as if I opened the HTML in a browser, pressed Ctrl+A, and then copied and pasted all the selected text?
You need to define what you mean by this much more precisely. Do you want title elements to be included? Do you want initial values for input elements (which are attributes, not text nodes) to be included? Do you want text nodes in a pre element handled specially? Can you constrain the input to be valid HTML (most web sites aren't)? If not, what error recovery do you want? Etc.
Sounds like you're reaching for the "-dump" parameter that Lynx supports, as described in the man-page: lynx -dump http://www.example.com
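As a rough sketch of how that command might be driven from a script, the Python below builds the argument list and runs Lynx. The function names lynx_dump_command and dump_page are made up for illustration; only the -dump option, and the -nolist option mentioned later in this thread, come from the discussion, and a lynx binary on the PATH is assumed.

```python
import subprocess


def lynx_dump_command(url, with_link_list=True):
    """Build the argument list for a lynx -dump invocation (hypothetical helper)."""
    cmd = ["lynx", "-dump"]
    if not with_link_list:
        cmd.append("-nolist")  # leave out the list of links
    cmd.append(url)
    return cmd


def dump_page(url, with_link_list=True):
    """Run lynx and return its rendered text. Assumes lynx is on the PATH."""
    result = subprocess.run(
        lynx_dump_command(url, with_link_list),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if lynx exits non-zero
    )
    return result.stdout
```

Calling dump_page("http://www.example.com") should return the same text that lynx -dump prints on the terminal.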
This will insert extra characters to achieve a rendering of the text. If only the input characters are wanted, one might be better off using a Perl script to strip out all the tags, directives, etc., and resolve entities.
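To make the tag-stripping idea concrete, here is a minimal sketch that swaps in Python's standard html.parser module for the Perl script suggested above. The names TextOnly and html_to_text are invented for the example. It also hard-codes one set of answers to the questions raised earlier: title text is kept, attribute values are ignored, whitespace in pre is passed through untouched, and the contents of script and style are dropped. Those are assumptions, not the only reasonable choices.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect text nodes, skipping the contents of script and style."""

    SKIP = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True resolves entities such as &amp; in text
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(markup):
    """Return the concatenated text nodes of markup, with no added layout."""
    parser = TextOnly()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)
```

Because nothing is inserted between text nodes, adjacent block elements run together; deciding where to add newlines is left open, as it is in the discussion.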
You could also use the nsgmls tools, provided the input is valid, to get the infoset representation and then strip out the tag and attribute lines. You still need to decide how you will deal with the resulting newlines and any newlines in the original text nodes.
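A possible filter for that output, sketched in Python under some assumptions about the nsgmls line format: each output line starts with a one-character code, data lines start with "-", and within data "\n" stands for a record end, "\\" for a backslash, and "\|" brackets SDATA entities. The function name esis_text is made up, and other escapes (such as octal character codes) are deliberately left undecoded.

```python
import re

# Escapes decoded by this sketch; anything else is passed through as-is.
_ESCAPES = {"\\n": "\n", "\\\\": "\\", "\\|": ""}


def esis_text(lines):
    """Keep only data lines from nsgmls-style output and decode basic escapes."""
    chunks = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("-"):  # data line; tag and attribute lines are dropped
            data = re.sub(r"\\[n\\|]", lambda m: _ESCAPES[m.group(0)], line[1:])
            chunks.append(data)
    return "".join(chunks)
```

Here record ends become real newlines and nothing is added between data lines; collapsing or dropping those newlines instead is exactly the decision mentioned above.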
You could probably modify lynx to dump the text nodes as it identifies them, but Lynx is quite big and complex.
P.S. when posting to support lists, please use a subject that is a precis of the complete question.
This can then be automated via a script, or you may be able to use the '-crawl' parameter in conjunction with -dump to walk a site. I didn't see anything in my man-page to limit link recursion-depth as wget offers. If you don't want the link-lists, you can use the -nolist parameter as well.
-tim
_______________________________________________
Lynx-dev mailing list
[email protected]
http://lists.nongnu.org/mailman/listinfo/lynx-dev
-- 
David Woolley
Emails are not formal business letters, whatever businesses may want.
RFC1855 says there should be an address here, but, in a world of spam, that is no longer good advice, as archive address hiding may not work.
