Re: [LIH]script to extract urls from web pages

Kingsly John Sun, 22 Aug 2004 01:07:12 -0700

+++ Raj Mathur [2004-08-22 08:27:21]:

> 
> find . -type f -print0 | xargs -0 file | fgrep 'HTML document text' | cut
> -d: -f1 | while read f ; do perl -ne 'undef $/ ;
> while(s/<a\s*href="([^>]*)">//is){print "$1\n";}' "$f" ; done


The regex is not very good, as it captures everything between <a href=" and ">,
including attributes such as target= and class=.

And it doesn't check for base href etc...
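
A rough sketch of how the attribute problem could be avoided, staying in
shell: stop the capture at the closing quote of href instead of at the ">".
The function name extract_hrefs is made up for illustration. It assumes
double-quoted href values and a grep that supports -o (GNU and BSD grep do),
and it still does not resolve relative links against a base href.

```shell
# Hypothetical helper: print the href value of each <a> tag read from stdin.
# Assumes href values are double-quoted; base href is NOT handled.
extract_hrefs() {
    # Join lines so that tags split across lines still match.
    tr '\n' ' ' |
    # Keep only the part of each opening <a ...> tag up to the end of href="...".
    grep -oiE '<a[[:space:]][^>]*href="[^"]*"' |
    # Drop everything up to and including href=", then the closing quote.
    sed -E 's/.*[Hh][Rr][Ee][Ff]="//; s/"$//'
}
```

It could be fed from find in the same way as the original pipeline, for
example: find . -name "*html" -print0 | xargs -0 cat | extract_hrefs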

This is pretty crude too... but it gives complete urls.

find . -name "*html" -print0 | xargs -0 -I{} lynx -dump {} | grep http:// | cut -d. -f2-

[The http:// match is crude filtering; you can replace it with a better regex
to get more accurate results (ftp://, mailto:, javascript:, etc.).]
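
One possible shape for such a filter, as a sketch only: the function name
filter_urls and the choice of schemes (http, https, ftp, mailto) are
assumptions, so adjust the pattern to whatever you want to keep. Because
grep -o prints only the matching part, it can stand in for both the grep and
the cut stage.

```shell
# Hypothetical helper: print every URL with one of the chosen schemes found on stdin.
# The scheme list is an assumption; extend or trim the alternation as needed.
filter_urls() {
    # -o prints only the matched URL, so no cut is needed afterwards.
    grep -oE '(https?|ftp)://[^[:space:]]+|mailto:[^[:space:]]+'
}
```

Any URL that appears in the page text itself will match too, just as with
the plain grep http:// version.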

Kingsly
-- 


_______________________________________________
linux-india-help mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/linux-india-help
