2006/10/18, Frederic Goudal <[EMAIL PROTECTED]>:
>
> Hello,
>
> I'm beginning to play with Nutch to index our own web site.
> I have done a first crawl and I have tried the recrawl script.
> While fetching I get lines like these:
>
> fetching http://www.yourdictionary.com/grammars.html
> fetching http://www.cours.polymtl.ca/if540/hiv_00.htm
> fetching http://www.maxim-ic.com/quick_view2.cfm/qv_pk/</font></a>
>
> but my crawl-urlfilter.txt is:
>
> # skip file:, ftp:, & mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
> -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png)$
>
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/.+?)/.*?\1/.*?\1/
>
> # accept hosts in MY.DOMAIN.NAME
> #+^http://([a-z0-9]*\.)*enseirb.fr/
> +^http://www.enseirb.fr/
>
> # skip everything else
> -.
>
> So... I think I'm missing something.

Frederic, what exactly is the problem? Would you like the recrawl not to
leave your web site? You can do that very easily: set the
"db.ignore.external.links" property in nutch-site.xml to "true" (you
can copy the XML property from nutch-default.xml and then change the
value to "true").
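
For reference, the override would look roughly like the fragment below. This
is only a sketch: the property name is the one from nutch-default.xml, but the
description text is my own paraphrase, and I am assuming your
conf/nutch-site.xml already has the usual <configuration> root element to
paste it into.

```xml
<!-- Goes inside the existing <configuration> element of conf/nutch-site.xml.
     Values set here override the defaults from nutch-default.xml. -->
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <!-- Description paraphrased, not copied verbatim from nutch-default.xml. -->
  <description>If true, outlinks that lead to external hosts are ignored,
  which keeps the crawl on the hosts that were initially injected.</description>
</property>
```

With this set, links pointing to other hosts should be dropped when the crawl
db is updated, so the next generate/fetch cycle should no longer pick them up.
URLs from other sites that are already in your crawl db may still be fetched
unless a URL filter rejects them.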

> Btw, as a beginner, totally ignorant of Java, and a system engineer with no
> time and in charge of too many things, is there any doc that really explains
> the behaviour of Nutch?

A good place to read about Nutch is the Nutch wiki:
http://wiki.apache.org/nutch/

Cheers,
t.n.a.

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
