Hi, thanks for your answer. It is definitely the same host. I'll give you an example:
In crawl-urlfilter, the host is set to "uni-siegen.de".
http://www.uni-siegen.de/dept/fb05/dekanat/ is indexed, but
http://www.uni-siegen.de/~merk/ isn't indexed.
Any idea?
What about the code? I'd like to see how it works.

Doğacan Güney wrote:
> Hi,
>
> Peter Swoboda wrote:
>
>> Hi,
>> we're using Nutch 0.8.
>> In default.xml, "ignore external links" is set to "true".
>> Can anybody tell me where we can find the code for this property?
>> Our problem is that there are now many "internal" pages that
>> aren't indexed.
>> This doesn't seem to make sense, because they are on the same server as
>> other indexed pages.
>> When we set "ignore external links" to "false", they are indexed.
>> What could be the problem?
>>
>
> Do you have different hosts on your server?
>
> The ignore.external.links property, if set to true, ignores links whose
> _host_ is different from that of the source page.
>
> For example, assume the page www.bar.com/index.html contains a link to
> foo.bar.com/page.html.
> If ignore.external.links is true, the host of the source page (www.bar.com)
> and the host of the link (foo.bar.com) are compared, and since they are
> different, this link is ignored,
> even though they are probably on the same server.
>
> So only links within the exact same host (in this case, www.bar.com) are
> followed.
>
> --
> Doğacan Güney
>
>> Peter
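
The host comparison described in the quoted reply can be sketched in Java as follows. This is only an illustration of that description, not the Nutch source: the class name `ExternalLinkSketch` and the method `isExternal` are hypothetical, and treating an unparseable URL as external is an assumption made for the sketch.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

class ExternalLinkSketch {

    // Hypothetical helper: returns the lower-cased host of a URL,
    // or null if the URL cannot be parsed or has no host part.
    static String hostOf(String url) {
        try {
            String host = new URI(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (URISyntaxException e) {
            return null;
        }
    }

    // True when the link's host differs from the source page's host.
    // Assumption of this sketch: URLs without a usable host count as external.
    static boolean isExternal(String fromUrl, String toUrl) {
        String fromHost = hostOf(fromUrl);
        String toHost = hostOf(toUrl);
        if (fromHost == null || toHost == null) {
            return true;
        }
        return !fromHost.equals(toHost);
    }

    public static void main(String[] args) {
        // Different hosts (www.bar.com vs. foo.bar.com): prints "true"
        System.out.println(isExternal(
            "http://www.bar.com/index.html",
            "http://foo.bar.com/page.html"));

        // Same host (www.uni-siegen.de): prints "false"
        System.out.println(isExternal(
            "http://www.uni-siegen.de/dept/fb05/dekanat/",
            "http://www.uni-siegen.de/~merk/"));
    }
}
```

Under this comparison the two uni-siegen.de URLs from the example share a host, so what matters is the host of the page that links to /~merk/. If the property behaves as described, a link from a page on uni-siegen.de (without "www") or on any other host name would count as external.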
