Thanks Sean, I appreciate this help. In the interim what I did was create a regular expression in my regex-urlfilter.txt as follows:
+^http://([a-z0-9]*\.)([a-z0-9]*\.)co.uk/

and then duplicated this for other domains such as .org, .net etc. In my nutch-site.xml file I added this:

<property>
  <name>urlfilter.regex.file</name>
  <value>regex-urlfilter.txt</value>
</property>

Because I didn't know how to parse the Dmoz file, I figured this would be a workaround: crawl only the sites for the TLDs I want indexed. The above works.

I'm pleased you sent your mail because now I can exclude the other domains before the crawl, which is exactly what I wanted to do.

The only potential problem I see with my regular expression in regex-urlfilter.txt is that domains such as www.my-domain-name.co.uk will not be included, because the character classes in the expression don't include hyphens. Any ideas how I can refine the regex?

Regards
Justin

On 12/28/06, Sean Dean <[EMAIL PROTECTED]> wrote:
> This isn't exactly what you're requesting, but it will get the job done in
> about the same time, possibly even less.
>
> Let's use grep on that file:
>
> grep '\.co\.uk/' urls > co-uk-urls
>
> The "\" tells it to match a literal "." in the search; normally it's used for
> wildcarding. The forward slash at the end is more useful with other TLDs.
> For example, using ".ca" without it you would get domains like
> www.caexample.net because it still matches. The ">" outputs it into our
> new file, which is "co-uk-urls" and ready to be injected into the Nutch DB.
>
> Lazy man's solution right here. Enjoy!
>
> ----- Original Message ----
> From: Justin Hartman <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Thursday, December 28, 2006 5:08:30 AM
> Subject: DmozParser Question
>
> Hi All
>
> I'm a newbie to Nutch and as such have a few questions. For now I'll
> limit my questions simply because I want to try and see if I can get
> my issues resolved myself but there is a question about the DmozParser
> which I would like to ask.
>
> Does anyone know if it is possible to filter the Dmoz file to
> include only certain TLDs such as .co.uk in the dmoz/url file?
>
> I noticed that DmozParser supports both boolean and pattern; however,
> I'm not really sure how to implement it.
>
> Any help appreciated.
> --
> Regards
> Justin Hartman
> PGP Key ID: 102CC123

--
Regards
Justin Hartman
PGP Key ID: 102CC123

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
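
On the hyphen question raised in the message, one possible refinement is to add "-" to each character class and escape the dots in "co.uk". The snippet below is only a rough sketch for trying such a pattern against sample URLs before putting it in regex-urlfilter.txt. The pattern, the function name accepted, and the sample URLs are assumptions made for illustration and are not from the thread. Nutch evaluates its filter with Java regular expressions, so this Python check only approximates that behaviour, although the constructs used here are written the same way in both. Unlike the original expression, this variant accepts one or more labels before "co.uk" rather than exactly two.

```python
import re

# Hypothetical refinement of the filter line from the message:
#   - hyphens are allowed inside each host label
#   - the dots in "co.uk" are escaped so they match only a literal "."
# The leading "+" used in regex-urlfilter.txt (meaning "accept") is not
# part of the regular expression itself, so it is left out here.
PATTERN = re.compile(r"^http://([a-z0-9-]+\.)+co\.uk/")


def accepted(url: str) -> bool:
    """Return True if the URL would match the sketched .co.uk pattern."""
    return PATTERN.match(url) is not None


if __name__ == "__main__":
    # Illustrative URLs only.
    samples = [
        "http://www.my-domain-name.co.uk/",
        "http://example.co.uk/index.html",
        "http://www.example.org/",
    ]
    for url in samples:
        print(url, accepted(url))
```

If this behaves as wanted, the corresponding filter line would be the same expression with "+" in front of it. The pattern is lowercase-only, as in the original, so hosts written in upper case would not match.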
