Hi,
thanks for your answer.

It is definitely the same host.
I'll give you an example:

In crawl-urlfilter, the host is set to "uni-siegen.de".

http://www.uni-siegen.de/dept/fb05/dekanat/
is indexed, but
http://www.uni-siegen.de/~merk/
isn't indexed.
Any idea?
And what about the code? I'd like to see how it works.


Doğacan Güney wrote:
> Hi,
>
> Peter Swoboda wrote:
>   
>> Hi,
>> we're using Nutch 0.8.
>> In default.xml, "ignore external links" is set to "true".
>> Can anybody tell me where we can find the code for this property?
>> The problem is that many "internal" pages are now not indexed.
>> That doesn't seem to make sense, because they are on the same server
>> as other pages that are indexed.
>> When we set "ignore external links" to "false", they are indexed.
>> What could be the problem?
>>
>>     
> Do you have different hosts on your server?
>
> The ignore.external.links property, if set to true, ignores links whose
> _host_ is different from that of the source page.
>
> For example,
> assume page www.bar.com/index.html contains a link to foo.bar.com/page.html.
> If ignore.external.links is true, the host of the source page (www.bar.com)
> and the host of the link (foo.bar.com) will be compared, and since they are
> different, this link will be ignored, even though both hosts are probably
> on the same server.
>
> So only links within the exact same host (in this case, www.bar.com) are
> followed.
>
> --
> Doğacan Güney
>
>   
>> Peter
>>
>>
>>
>> .
>>
>>     
>
>   
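
The host comparison described in the quoted reply can be sketched roughly as follows. This is only an illustration of the idea, not Nutch's actual source: the class name HostFilterSketch and the method sameHost are hypothetical, and it assumes that "same host" means an exact, case-insensitive match of the host part of the two URLs.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical sketch of an "ignore external links" check; not taken from Nutch.
public class HostFilterSketch {

    // Extracts the lower-cased host of a URL, or null if it cannot be parsed.
    static String hostOf(String url) {
        try {
            String host = new URI(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (URISyntaxException e) {
            return null;
        }
    }

    // Returns true only when both URLs have exactly the same host name.
    // Assumption: unparsable URLs are treated as external.
    static boolean sameHost(String fromUrl, String toUrl) {
        String fromHost = hostOf(fromUrl);
        String toHost = hostOf(toUrl);
        return fromHost != null && fromHost.equals(toHost);
    }

    public static void main(String[] args) {
        // Different hosts: this link would be ignored.
        System.out.println(sameHost("http://www.bar.com/index.html",
                                    "http://foo.bar.com/page.html"));
        // Same host: this link would be followed.
        System.out.println(sameHost("http://www.uni-siegen.de/dept/fb05/dekanat/",
                                    "http://www.uni-siegen.de/~merk/"));
    }
}
```

Under this check, the two uni-siegen.de URLs from the example above count as the same host (www.uni-siegen.de), while www.bar.com and foo.bar.com do not.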


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general