On Tue, January 14, 2014 12:45 pm, kfos...@tpg.com.au wrote:

> There are many HTML reader libraries out there.  I have used one for Perl,
> for example.  It enables you to look for some other tag to find your data
> (e.g. the CSS class name of that particular element) and rip the data by
> walking the HTML tree.
>
> Pick a language and let us know; I am sure you will get specific
> recommendations on HTML reader / parser libraries (e.g. HTML Agility Pack
> for C#).

Ken,

thanks. I think you might have answered my next question already:

what I'm doing is like: wget url > html; lynx html > text

initially, I was getting no text output, as the html was html5 something;

neither links nor lynx knew how to process the html, so the output was blank

after installing the latest dev build of lynx, I started getting data in text;

- some data in text gets 'trimmed' at 30 cols;
editing 'cols=30' in the html and re-dumping the text fixes such fields;

- some other data in text gets 'trimmed' at 20 cols;
I can't find any 'cols=' statements in the html for these, and haven't
found a way to output all data from these fields

opening html in lynx, these fields are also 'trimmed' on screen, BUT,
scrollable <>

so, I guess the libraries you suggest would overcome such html5? issues?

even though it's somewhat outside of my abilities, I guess I could try Perl?
(if I can find some sample code)
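
For reference, a minimal sketch of the "walk the html and rip the data" idea.
It is in Python rather than Perl only because Python's standard library
includes an HTML parser (html.parser), so nothing extra needs installing.
It reads the full contents of form fields straight from the markup, so the
on-screen width (cols= or size=) never comes into it. The class name
FieldExtractor, the sample markup and the field names are all made up for
illustration, and it is only a guess that the fields trimmed at 20 cols are
<input> elements (whose default display size in HTML is 20).

```python
from html.parser import HTMLParser


class FieldExtractor(HTMLParser):
    """Collect the full contents of <input> and <textarea> form fields."""

    def __init__(self):
        super().__init__()
        self.fields = {}        # field name -> full, untrimmed value
        self._textarea = None   # name of the <textarea> being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = attrs.get("name") or ""
        if tag == "input" and "value" in attrs:
            # The whole value is in the attribute, whatever size= says.
            self.fields[name] = attrs["value"] or ""
        elif tag == "textarea":
            self._textarea = name
            self.fields[name] = ""

    def handle_data(self, data):
        # Text between <textarea> and </textarea> is the field's content.
        if self._textarea is not None:
            self.fields[self._textarea] += data

    def handle_endtag(self, tag):
        if tag == "textarea":
            self._textarea = None


# Hypothetical markup standing in for the page saved by wget.
sample = """
<form>
  <input type="text" name="title" size="20"
         value="A value much longer than twenty columns">
  <textarea name="notes" cols="30">Some notes that run well past thirty columns wide</textarea>
</form>
"""

parser = FieldExtractor()
parser.feed(sample)
parser.close()

for name, value in parser.fields.items():
    print(f"{name}: {value}")
```

To try it on a real page, replace the sample string with the contents of the
saved html file, e.g. open("html", encoding="utf-8").read(), assuming the
file is UTF-8.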

thanks again
V

-- 
SLUG - Sydney Linux User's Group Mailing List - http://slug.org.au/
Subscription info and FAQs: http://slug.org.au/faq/mailinglists.html
