Hello,

I want to exclude some subdirectories of a website from indexing,
but I have not found the right parameters.
I use Nutch-0.7.2 because I cannot
index with Nutch-0.8.1 (it crashes).

I want to exclude the following subdirectories of my website:
/de/*
/en/*
/fr/mv/*

I tried the rule
-^http://toto.web-site.net/de/([a-z0-9]*)
and
-^http://toto.web-site.net/de/*
in my crawl-urlfilter.txt file, but
neither works: Nutch still indexes these URLs, which I do not want.
Any ideas?
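
As a quick check outside Nutch, the two patterns above can be tried against sample URLs. This is only a sketch: it uses Python's `re` module as a stand-in for the regex engine Nutch uses, and the sample paths (such as `/de/index.html`) are made up for illustration.

```python
import re

# The two exclusion patterns quoted above, without the leading '-' sign.
PATTERNS = [
    r"^http://toto.web-site.net/de/([a-z0-9]*)",
    r"^http://toto.web-site.net/de/*",
]

# Hypothetical sample URLs; the real site paths may differ.
SAMPLE_URLS = [
    "http://toto.web-site.net/de/index.html",
    "http://toto.web-site.net/de/",
    "http://toto.web-site.net/fr/index.html",
]


def matching_patterns(url):
    """Return the patterns found in the URL (search-style matching)."""
    return [p for p in PATTERNS if re.search(p, url)]


for url in SAMPLE_URLS:
    print(url, "->", len(matching_patterns(url)), "pattern(s) match")
```

Note that in a regular expression `/de/*` means `/de` followed by any number of slashes, not a shell-style wildcard. Even so, both patterns match the `/de/` URLs in this sketch, so the patterns themselves may not be the cause of the problem.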

I have the default regex-urlfilter.txt,
and my personal crawl-urlfilter.txt is:

# The url filter file used by the crawl command.

# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# accept hosts in MY.DOMAIN.NAME
#+^http://([a-z0-9]*.)*MY.DOMAIN.NAME/

# Website hostname for indexing
+^http://toto.web-site.net

# URLs to exclude from indexing
-^http://toto.web-site.net/de/([a-z0-9]*)
-^http://toto.web-site.net/en/([a-z0-9]*)
-^http://toto.web-site.net/fr/mv/([a-z0-9]*)

# skip everything else
-.
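
The comments in the file say that the first matching pattern decides whether a URL is included or ignored. A minimal sketch of that rule, assuming Nutch applies it exactly as the comments describe: the `first_match` helper is hypothetical (it is not Nutch code), Python's `re` stands in for Nutch's regex engine, only the rules relevant to the question are listed, and the sample URL is made up.

```python
import re

# Relevant rules in the same order as in the file above: (sign, regex).
RULES = [
    ("-", r"^(file|ftp|mailto):"),
    ("+", r"^http://toto.web-site.net"),
    ("-", r"^http://toto.web-site.net/de/([a-z0-9]*)"),
    ("-", r"^http://toto.web-site.net/en/([a-z0-9]*)"),
    ("-", r"^http://toto.web-site.net/fr/mv/([a-z0-9]*)"),
    ("-", r"."),
]


def first_match(url, rules=RULES):
    """Return (accepted, regex) for the first rule whose regex is found in url.

    If no rule matches, the URL is ignored, as the file's comments state.
    """
    for sign, regex in rules:
        if re.search(regex, url):
            return sign == "+", regex
    return False, None


# Hypothetical URL under the /de/ subdirectory.
accepted, rule = first_match("http://toto.web-site.net/de/index.html")
print("accepted:", accepted, "by rule:", rule)
```

In this sketch the `+` rule for the hostname is reached before the `/de/` exclusion, so the URL is accepted. If Nutch behaves as the comments say, moving the three `-` rules above the `+` rule may be worth testing.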


*********** my default regex-urlfilter.txt file is **************

# The default url filter.
# Better for whole-internet crawling.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# accept anything else
+.
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
