This isn't exactly what you're requesting, but it will get the job done in about
the same time, possibly even less.
Let's use grep on that file:
grep '\.co\.uk/' urls > co-uk-urls
The "\" escapes the ".", so grep matches a literal dot; unescaped, "." is a
wildcard that matches any single character. The forward slash at the end
matters more with other TLDs. For example, searching for ".ca" without it
would also return domains like www.caexample.net, because ".ca" still matches
there. The ">" redirects the output into our new file, "co-uk-urls", which is
ready to be injected into the Nutch DB.
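If you want something a little stricter, here is a rough sketch that ties the match to the host part of the URL, so a path that merely contains ".co.uk/" is not picked up. It assumes one URL per line starting with http:// or https:// and no port number; the sample URLs are made up for illustration and are not taken from the Dmoz data.

```shell
# Hypothetical sample input; the real "urls" file would be the Dmoz URL list.
printf '%s\n' \
  'http://www.example.co.uk/' \
  'http://shop.example.co.uk/page.html' \
  'http://www.caexample.net/' \
  'http://www.example.ca/' \
  'http://www.example.com/mirror.co.uk/' > urls

# Keep a line only when the host itself ends in .co.uk:
#   ^https?://   scheme at the start of the line
#   [^/]+        the host, which cannot contain a slash
#   \.co\.uk     literal ".co.uk" at the end of the host
#   (/|$)        followed by a slash or the end of the line
grep -E '^https?://[^/]+\.co\.uk(/|$)' urls > co-uk-urls
```

With this sample, only the two example.co.uk lines end up in "co-uk-urls". The plain '\.co\.uk/' pattern would also have kept the www.example.com/mirror.co.uk/ line.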
Lazy man's solution right here. Enjoy!
----- Original Message ----
From: Justin Hartman <[EMAIL PROTECTED]>
To: [email protected]
Sent: Thursday, December 28, 2006 5:08:30 AM
Subject: DmozParser Question
Hi All
I'm a newbie to Nutch and as such have a few questions. For now I'll
limit my questions simply because I want to try and see if I can get
my issues resolved myself but there is a question about the DmozParser
which I would like to ask.
Does anyone know if it is possible to filter the Dmoz file to only
include certain TLDs, such as .co.uk only, in the dmoz/url file?
I noticed that DmozParser supports both boolean and pattern however
I'm not really sure how to implement it.
Any help appreciated.
--
Regards
Justin Hartman
PGP Key ID: 102CC123
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general