+++ Raj Mathur [2004-08-22 08:27:21]:
>
> find . -type f -print0 | xargs -0 file | fgrep 'HTML document text' |
> cut -d: -f1 | while read f ; do perl -ne 'undef $/ ;
> while(s/<a\s*href="([^>]*)">//is){print "$1\n";}' "$f" ; done
The regex is not very good: it grabs everything between <a href=" and ">,
including attributes like target=, class= and so on.
It also doesn't take <base href> into account.
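To illustrate the point about stray attributes, here is a rough sketch (not a replacement for a real HTML parser) that stops the capture at the closing quote of the href value. The function name extract_hrefs is made up for this example. It assumes GNU or BSD grep (for -o) and sed -E, that each tag fits on one line, and that href values are double-quoted. It still ignores <base href>.

```shell
# extract_hrefs (hypothetical helper): read HTML on stdin and print the
# value of each double-quoted href found inside an <a ...> tag.
# Assumptions: the tag does not span lines; grep supports -o; sed supports -E.
extract_hrefs() {
    # Step 1: cut out each '<a ... href="value"' fragment, one per line.
    grep -Eio '<a[[:space:]]([^>]*[[:space:]])?href="[^"]*"' |
    # Step 2: keep only the text between the quotes after href=.
    sed -E 's/.*[Hh][Rr][Ee][Ff]="([^"]*)"$/\1/'
}
```

Something like `extract_hrefs < index.html` would then print one href value per line, without any target= or class= text attached.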
This one is pretty crude too, but it gives complete URLs:
find . -name "*html" | xargs -I{} lynx -dump {} | grep http:// | cut -d. -f2-
[Matching on http:// is only crude filtering; you can replace it with a better
regex for more accurate results (ftp://, mailto:, javascript:, etc.).]
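As one possible version of that "better regex", the sketch below replaces the grep and cut steps with a single filter that prints only the URL itself. The function name extract_urls, the list of schemes and the set of characters that end a URL are all assumptions made for this example. It relies on grep -o, which GNU and BSD grep provide but POSIX does not require.

```shell
# extract_urls (hypothetical helper): read text on stdin (for example the
# output of lynx -dump) and print every URL-like token on its own line.
# The schemes matched here (http, https, ftp, mailto) are just an example set.
extract_urls() {
    grep -Eo '(https?|ftp)://[^[:space:]"<>]+|mailto:[^[:space:]"<>]+'
}
```

It would sit at the end of the pipeline, e.g. `... lynx -dump {} | extract_urls | sort -u`, with sort -u dropping duplicates.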
Kingsly
_______________________________________________
linux-india-help mailing list
https://lists.sourceforge.net/lists/listinfo/linux-india-help