Thanks Sean, I appreciate this help. In the interim what I did was
create a regular expression in my regex-urlfilter.txt as follows:

+^http://([a-z0-9]*\.)([a-z0-9]*\.)co\.uk/

and then duplicated this for other domains such as .org, .net etc.

In my nutch-site.xml file I added this:

<property>
  <name>urlfilter.regex.file</name>
  <value>regex-urlfilter.txt</value>
</property>

Because I didn't know how to parse the Dmoz file, I figured this would
be a workaround: crawl only the sites for the TLDs I want indexed. The
above works. I'm pleased you sent your mail, though, because now I can
exclude the other domains before the crawl, which is exactly what I
wanted to do.
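
For several TLDs, the grep approach from Sean's mail could be wrapped in a
small loop. This is only a sketch under assumptions: the URL list is taken to
be one URL per line in a file (called "urls" in Sean's example), and the
split_by_tld function and the "<tld>-urls" output names are hypothetical
helpers, not part of Nutch.

```shell
#!/bin/sh
# Sketch: write one seed file per TLD from a one-URL-per-line list.
# split_by_tld is a hypothetical helper, not a Nutch tool.
# Usage: split_by_tld INFILE TLD [TLD...]
split_by_tld() {
    infile=$1
    shift
    for tld in "$@"; do
        # Escape the dots so grep matches them literally.
        pattern=$(printf '%s' "$tld" | sed 's/\./\\./g')
        # co.uk -> co-uk-urls, matching the file name in Sean's example.
        outfile=$(printf '%s' "$tld" | tr '.' '-')-urls
        # The trailing slash stops ".ca" from matching www.caexample.net.
        grep "\\.${pattern}/" "$infile" > "$outfile"
    done
}
```

Calling "split_by_tld urls co.uk org net" would then leave co-uk-urls,
org-urls and net-urls in the current directory, each ready to be injected.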

The only potential problem I see with my regular expression in
regex-urlfilter.txt is that domains such as www.my-domain-name.co.uk
will not be included, because the character classes in the regex
don't allow hyphens.
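
To make the problem concrete, here is a rough check that uses grep -E rather
than Nutch itself, so treat it as an approximation: Nutch applies Java regular
expressions, and the assumption is that these simple character classes behave
the same in both. The accepts helper and the candidate pattern (the same
expression with a hyphen added to each character class) are illustrative
guesses, not something verified in a crawl.

```shell
#!/bin/sh
# Pattern from regex-urlfilter.txt, without the leading "+", dots escaped.
current='^http://([a-z0-9]*\.)([a-z0-9]*\.)co\.uk/'
# Candidate: "-" added to each character class (untested in Nutch).
candidate='^http://([a-z0-9-]*\.)([a-z0-9-]*\.)co\.uk/'

# accepts PATTERN URL: hypothetical helper, prints "yes" or "no".
accepts() {
    if printf '%s\n' "$2" | grep -Eq "$1"; then
        echo yes
    else
        echo no
    fi
}

accepts "$current"   'http://www.my-domain-name.co.uk/'   # prints: no
accepts "$candidate" 'http://www.my-domain-name.co.uk/'   # prints: yes
accepts "$candidate" 'http://www.example.org/'            # prints: no
```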

Any ideas how I can refine the regex better?

Regards
Justin

On 12/28/06, Sean Dean <[EMAIL PROTECTED]> wrote:
> This isn't exactly what you're requesting, but it will get the job done in
> about the same time, possibly even less.
>
> Let's use grep on that file:
>
> grep '\.co\.uk/' urls > co-uk-urls
>
> The "\" makes grep match a literal "."; unescaped, "." matches any
> character. The forward slash at the end is more useful with other TLDs.
> For example, with ".ca" and no trailing slash you would also get domains
> like www.caexample.net, because ".ca" still matches there. The ">" redirects
> the output into our new file, "co-uk-urls", which is ready to be injected
> into the Nutch DB.
>
> Lazy man's solution right here. Enjoy!
>
> ----- Original Message ----
> From: Justin Hartman <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Thursday, December 28, 2006 5:08:30 AM
> Subject: DmozParser Question
>
>
> Hi All
>
> I'm a newbie to Nutch and as such have a few questions. For now I'll
> limit them, because I want to see if I can resolve my issues myself,
> but there is one question about the DmozParser which I would like
> to ask.
>
> Does anyone know if it is possible to filter the Dmoz file so that the
> dmoz/url file only includes certain TLDs, such as .co.uk?
>
> I noticed that DmozParser supports both boolean and pattern; however,
> I'm not really sure how to implement it.
>
> Any help appreciated.
> --
> Regards
> Justin Hartman
> PGP Key ID: 102CC123
>


-- 
Regards
Justin Hartman
PGP Key ID: 102CC123

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
