Hello,
I'm beginning to play with Nutch to index our own web site. I have done a first crawl and I have tried the recrawl script. While fetching, I get lines like these:

    fetching http://www.yourdictionary.com/grammars.html
    fetching http://www.cours.polymtl.ca/if540/hiv_00.htm
    fetching http://www.maxim-ic.com/quick_view2.cfm/qv_pk/</font></a>

but my crawl-urlfilter.txt is:

    # skip file:, ftp:, & mailto: urls
    -^(file|ftp|mailto):

    # skip image and other suffixes we can't yet parse
    -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png)$

    # skip URLs containing certain characters as probable queries, etc.
    [EMAIL PROTECTED]

    # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
    -.*(/.+?)/.*?\1/.*?\1/

    # accept hosts in MY.DOMAIN.NAME
    #+^http://([a-z0-9]*\.)*enseirb.fr/
    +^http://www.enseirb.fr/

    # skip everything else
    -.

So I think I am missing something: none of the fetched URLs above are on www.enseirb.fr, yet they are fetched anyway.

By the way, as a beginner who knows no Java, and a system engineer short on time and in charge of too many things: is there any documentation that really explains the behaviour of Nutch?

f.g.

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
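
For reference, here is a minimal Python sketch of how the filter rules quoted above would be expected to behave. It is not Nutch code. It assumes that rules are tried in file order, that the first rule whose regex is found anywhere in the URL decides (`+` accepts, `-` rejects), and that a URL matching no rule is rejected. The names `parse_rules` and `accepts` are made up for this illustration, and the rule shown as `[EMAIL PROTECTED]` is left out because its original text is not recoverable from the post. Python's `re` module stands in for Java regular expressions, which behave the same for these particular patterns.

```python
import re

# Rules copied from the crawl-urlfilter.txt quoted in the post.
# The "[EMAIL PROTECTED]" rule is omitted: its real text is unknown.
RULES_TEXT = r"""
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png)$
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/
# accept hosts in MY.DOMAIN.NAME
+^http://www.enseirb.fr/
# skip everything else
-.
"""


def parse_rules(text):
    """Turn '+regex' / '-regex' lines into (accept, compiled_regex) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rule
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Assumed semantics: the first rule found in the URL decides."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # assumed default when no rule matches


RULES = parse_rules(RULES_TEXT)

if __name__ == "__main__":
    for url in [
        "http://www.enseirb.fr/index.html",
        "http://www.yourdictionary.com/grammars.html",
        "http://www.cours.polymtl.ca/if540/hiv_00.htm",
    ]:
        print(url, "accepted" if accepts(RULES, url) else "rejected")
```

Under these assumptions only URLs starting with http://www.enseirb.fr/ get past the final `-.` rule, so the external URLs in the fetch log would not be expected if this file were the one being applied during the recrawl.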
