Hello,

I'm beginning to play with Nutch to index our own web site.
I have done a first crawl and I have tried the recrawl script.
While fetching, I get lines like these:

fetching http://www.yourdictionary.com/grammars.html
fetching http://www.cours.polymtl.ca/if540/hiv_00.htm
fetching http://www.maxim-ic.com/quick_view2.cfm/qv_pk/</font></a>

but my crawl-urlfilter.txt is:

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png)$

# skip URLs containing certain characters as probable queries, etc.
[EMAIL PROTECTED]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/

# accept hosts in MY.DOMAIN.NAME
#+^http://([a-z0-9]*\.)*enseirb.fr/
+^http://www.enseirb.fr/

# skip everything else
-.
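
To make the expectation concrete, here is a minimal sketch of how these rules would be evaluated, assuming the convention described in the stock filter file: rules are tried in order, the first pattern found in the URL decides, and a URL matching nothing is rejected. It uses Python's `re` module as a stand-in for the Java regex engine, so it is only an approximation of what Nutch does. The names `RULES` and `accepts` are made up for this illustration, and the obfuscated "probable queries" rule is left out because its pattern is not readable above.

```python
import re

# Rules copied from the filter above, in file order.
# "+" means accept, "-" means reject. The obfuscated rule is omitted.
RULES = [
    ("-", r"^(file|ftp|mailto):"),
    ("-", r"\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls"
          r"|gz|rpm|tgz|mov|MOV|exe|png)$"),
    ("-", r".*(/.+?)/.*?\1/.*?\1/"),
    ("+", r"^http://www.enseirb.fr/"),
    ("-", r"."),
]


def accepts(url):
    """Return True if the first rule whose pattern is found in url is a '+' rule."""
    for sign, pattern in RULES:
        if re.search(pattern, url):
            return sign == "+"
    # Assumption: a URL that matches no rule at all is rejected.
    return False


if __name__ == "__main__":
    urls = [
        "http://www.yourdictionary.com/grammars.html",
        "http://www.cours.polymtl.ca/if540/hiv_00.htm",
        "http://www.maxim-ic.com/quick_view2.cfm/qv_pk/</font></a>",
        "http://www.enseirb.fr/index.html",  # hypothetical page on our site
    ]
    for url in urls:
        print("accept" if accepts(url) else "reject", url)
```

Under this reading, the three URLs from the fetch log would all be rejected by the final `-.` rule, and only pages under http://www.enseirb.fr/ would pass.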


So... I think I am missing something: these hosts should not pass this filter.

Btw, as a beginner who is totally ignorant of Java, and a system engineer with
no time who is in charge of too many things, is there any documentation that
really explains the behaviour of Nutch?

f.g.






_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general