This isn't exactly what you're requesting, but it will get the job done in about 
the same time, possibly even less.
 
Let's use grep on that file:
 
grep '\.co\.uk/' urls > co-uk-urls

The "\" tells grep to match a literal "."; unescaped, "." is a wildcard that 
matches any single character. The forward slash at the end matters more with 
other TLDs. For example, searching for ".ca" without it would also return 
domains like www.caexample.net, because ".ca" still matches there. The ">" 
redirects the output into our new file, "co-uk-urls", which is then ready to 
be injected into the Nutch DB.
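
To make the trailing-slash point concrete, here is a small sketch you can run in a scratch directory. The sample URLs and the output names "ca-urls" and "co-uk-urls-strict" are made up for illustration, and the last command is an optional stricter variant of my own, not something Nutch requires. It assumes every line of the file starts with http:// or https://.

```shell
# Work in a throwaway directory so nothing real gets overwritten.
cd "$(mktemp -d)"

# Hypothetical sample of a URL list, one URL per line.
cat > urls <<'EOF'
http://www.example.co.uk/
http://www.caexample.net/
http://www.example.ca/
http://www.example.ca/page.html
http://www.example.com/mirror/site.co.uk/index.html
EOF

# Without the trailing slash, ".ca" also matches www.caexample.net.
grep '\.ca' urls

# With the trailing slash, only the example.ca lines are kept.
grep '\.ca/' urls > ca-urls

# The command from above. Note that it also keeps the last sample line,
# because ".co.uk/" appears in that URL's path.
grep '\.co\.uk/' urls > co-uk-urls

# Optional stricter variant (an assumption, not part of the original tip):
# only match ".co.uk" at the end of the host part of the URL.
grep -E '^https?://[^/]*\.co\.uk(/|$)' urls > co-uk-urls-strict
```

If your list never has ".co.uk/" inside a path, the simple pattern is enough.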
 
Lazy man's solution right here. Enjoy!
 
----- Original Message ----
From: Justin Hartman <[EMAIL PROTECTED]>
To: [email protected]
Sent: Thursday, December 28, 2006 5:08:30 AM
Subject: DmozParser Question


Hi All

I'm a newbie to Nutch and as such have a few questions. For now I'll
limit my questions simply because I want to try and see if I can get
my issues resolved myself but there is a question about the DmozParser
which I would like to ask.

Does anyone know if it is possible to filter the Dmoz file to only
include certain tld's such as .co.uk only in the dmoz/url file?

I noticed that DmozParser supports both boolean and pattern however
I'm not really sure how to implement it.

Any help appreciated.
-- 
Regards
Justin Hartman
PGP Key ID: 102CC123
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
