Hey there hi there ho there,
I was wondering what others have used to strip the content out of web
pages? I am working on a system that collects pages and archives them;
however, only the content needs to be stored (i.e. not the navigation,
images, extra page fodder).
The sites it is archiving are vast, so it would have to be a rather
generic solution. I have seen this kind of thing before, but only for
single specific sites. Does anyone know a good method to do it generically?
I was leaning toward one of these, but I am open to whatever works:
* run the collected html through tidy (or jtidy) then (somehow) use xslt
* (somehow) use a regular expression on the collected html
If anyone has done this before, please let me know of pitfalls or
recommendations. BTW, I have time, not money, so any paid solutions are
right out.
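
To make the question concrete, here is a rough sketch of a third option,
offered as an illustration rather than a tested recommendation. It uses
neither tidy/XSLT nor a regular expression: it walks the HTML with
Python's standard-library parser and keeps only the text found outside
tags that usually hold page furniture. The SKIP_TAGS set and the names
ContentExtractor and extract_content are assumptions made up for this
example; real sites often mark navigation with plain div or table
elements, so the list would need tuning per site.

```python
from html.parser import HTMLParser

# Assumption: text inside these tags is navigation or other page fodder.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside", "form"}


class ContentExtractor(HTMLParser):
    """Collects text that is not nested inside any tag in SKIP_TAGS."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0   # how many skipped tags we are currently inside
        self.chunks = []      # text fragments kept so far

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            text = " ".join(data.split())  # collapse runs of whitespace
            if text:
                self.chunks.append(text)


def extract_content(html):
    """Return the kept text fragments, one per line."""
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


if __name__ == "__main__":
    sample = (
        "<html><body><nav><a href='/'>Home</a></nav>"
        "<p>Real &amp; useful text.</p><img src='x.png'>"
        "<footer>Copyright</footer></body></html>"
    )
    print(extract_content(sample))
```

Images drop out on their own because an img tag carries no text. The
obvious pitfall is that this only works as well as the tag list matches
the markup, which is the same genericity problem described above.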
Thanks
--
Vale,
Rob
Luxuria immodica insaniam creat.
Sanam formam viatae conservate!
http://www.rohanclan.com
http://treebeard.sourceforge.net
http://ashpool.sourceforge.net