Hey there hi there ho there,

I was wondering what others have used to extract just the content from
web pages? I am working on a system that collects pages and archives them;
however, only the content needs to be stored (i.e. not the navigation,
images, extra page fodder).

The sites it is archiving are vast, so it would have to be a rather generic
solution. I have seen this kind of thing before, but only for single
specific sites. Does anyone know a good method to do it generically?

I was leaning toward one of these, but I am open to whatever:

* run the collected html through tidy (or jtidy) then (somehow) use xslt
* (somehow) use a regular expression on the collected html
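
To give the discussion something concrete, here is a rough sketch of the
kind of generic filter I have in mind. It is neither of the two options
above: it swaps in Python's standard html.parser plus a simple heuristic
(drop obvious non-content tags, then keep only text blocks that are long
enough and are not mostly link text). The names ContentExtractor and
extract_content, the tag lists, and the thresholds are all made up for
illustration and would certainly need tuning.

```python
from html.parser import HTMLParser

# Assumption: text inside these tags is never page content.
SKIP_TAGS = {"script", "style", "noscript", "nav", "header", "footer",
             "aside", "form"}
# Assumption: a rough list of tags that start or end a block of text.
BLOCK_TAGS = {"p", "div", "td", "tr", "table", "li", "ul", "ol", "br",
              "h1", "h2", "h3", "h4", "h5", "h6", "blockquote", "pre",
              "section", "article"}


class ContentExtractor(HTMLParser):
    """Collect text blocks that look like content rather than navigation."""

    def __init__(self, min_words=10, max_link_ratio=0.3):
        super().__init__(convert_charrefs=True)
        self.min_words = min_words            # hypothetical threshold
        self.max_link_ratio = max_link_ratio  # hypothetical threshold
        self.skip_depth = 0   # > 0 while inside a SKIP_TAGS element
        self.link_depth = 0   # > 0 while inside an <a> element
        self.words = []       # words of the block being collected
        self.link_words = 0   # how many of those words were link text
        self.blocks = []      # blocks accepted so far

    def _flush(self):
        # Keep the current block only if it is long enough and is not
        # dominated by link text, then start a new block.
        count = len(self.words)
        if count >= self.min_words and \
                self.link_words / count <= self.max_link_ratio:
            self.blocks.append(" ".join(self.words))
        self.words = []
        self.link_words = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "a":
            self.link_depth += 1
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag == "a":
            self.link_depth = max(0, self.link_depth - 1)
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self.skip_depth:
            return
        words = data.split()
        self.words.extend(words)
        if self.link_depth:
            self.link_words += len(words)

    def close(self):
        super().close()
        self._flush()  # do not lose a trailing block


def extract_content(html, **options):
    """Return the accepted text blocks, separated by blank lines."""
    parser = ContentExtractor(**options)
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.blocks)
```

Calling extract_content(page_html) returns plain text. One pitfall I can
already see: if a page never closes one of the skipped tags, everything
after it is thrown away, which is probably a good argument for running
the html through tidy first.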

If anyone has done this before, please let me know of any pitfalls or
recommendations. BTW, I have time but not money, so any paid solutions
are right out.

Thanks

--
Vale,
Rob

Luxuria immodica insaniam creat.
Sanam formam viatae conservate!

http://www.rohanclan.com
http://treebeard.sourceforge.net
http://ashpool.sourceforge.net