Niti - currently Nutch does not resolve host names back to their IP address
to figure out which are virtual hosts. Even if it did, a shared IP address
would not prove that two hosts are the same site, since small sites are often
virtually hosted together at an ISP.

Figuring out whether a site is the same as (or an alias of) another requires
a little more checking, which is not really required for the default Nutch
installation.
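
As a rough illustration of that extra checking (this is not something Nutch
does, and the function name and the injectable `resolve` parameter are made up
for this sketch), one could first group the crawl's host names by the IP
address they resolve to. Hosts that end up in the same group are only alias
candidates, because of the virtual hosting caveat above, so their content would
still have to be compared:

```python
import socket
from collections import defaultdict
from urllib.parse import urlsplit


def group_hosts_by_ip(urls, resolve=socket.gethostbyname):
    """Group the host names found in `urls` by the IP address they resolve to.

    `resolve` is a hypothetical hook (host name -> IP string) so that the
    lookup can be replaced, e.g. by a cached or offline table.
    """
    groups = defaultdict(set)
    for url in urls:
        host = urlsplit(url).hostname
        if host is None:
            continue  # not an absolute URL, nothing to resolve
        try:
            ip = resolve(host)
        except OSError:
            continue  # host could not be resolved, skip it
        groups[ip].add(host)
    # Sort the names so the result is stable between runs.
    return {ip: sorted(hosts) for ip, hosts in groups.items()}
```

Any group with more than one host name is a candidate set of aliases, nothing
more.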

(BTW, the segment merge tool will remove pages that have the same content,
so that should solve the problem after the fact)
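
The idea behind removing pages with the same content can be sketched roughly
as follows. This is only an illustration and not the segment merge tool's
actual code; the function name, the (url, content) input shape and the choice
of SHA-256 are all assumptions of the sketch:

```python
import hashlib


def dedupe_by_content(pages):
    """Keep the first URL seen for each distinct page body.

    `pages` is an iterable of (url, content_bytes) pairs; this input shape
    is an assumption of the sketch, not a Nutch data structure.
    """
    first_url_for_digest = {}
    for url, content in pages:
        digest = hashlib.sha256(content).hexdigest()
        # setdefault leaves an existing entry alone, so later duplicates
        # of the same content are dropped.
        first_url_for_digest.setdefault(digest, url)
    return list(first_url_for_digest.values())
```

With aliased hosts, the same page fetched under several host names hashes to
the same digest, so only one copy survives.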
 

-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Niti
Witthayawiroj
Sent: Wednesday, February 23, 2005 9:11 AM
To: [EMAIL PROTECTED]
Subject: Re: [Nutch-dev] Why found the unabsolute links from Nutch

Hi Olaf,
   
    I used Nutch's intranet crawling with the following root URLs:

http://www.l3s.uni-hannover.de/
http://www.l3s.de/
http://www.learninglab.uni-hannover.de/
http://www.learninglab.de/

The domain names of the root URLs above all refer to the same IP address
(host name aliases). After the crawl completed, I used the WebDBReader
command line (bin/nutch readdb <db> -dumplinks) to get the link data for the
URLs.

In the dumplinks output, I found that some links are not correct (see the
example below). Why does the source page
(/morob/Galleries/ER1/pages/09_DSCF0492.html) on the host
http://www.l3s.uni-hannover.de have outlinks to pages on the other hosts
(http://www.learninglab.de/ and http://www.learninglab.uni-hannover.de/)?

In fact, the source page has only 3 outlinks (absolute outlinks), but
according to the dumplinks output it has 9 outlinks in total (6 of them are
false). The pages behind the 6 false outlinks are the same as the 3 pages of
the absolute outlinks, but under other host names.

Is this maybe a problem with the host name aliases, and can you tell me why?

Thanks a lot!
Niti


Date: Mon, 21 Feb 2005 20:50:54 +0100
From: Olaf Thiele <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Subject: Re: [Nutch-dev] Why found the unabsolute links from Nutch
Reply-To: [EMAIL PROTECTED]

Hi Niti,
I don't get your question. Just write it in German and I will post it in
English.

Bye
Olaf



On Mon, 21 Feb 2005 05:29:43 -0800 (PST), Niti Witthayawiroj
<[EMAIL PROTECTED]> wrote:
> Hi,
>   
> I have used Nutch to crawl four hosts, and the four host names correspond to
> the same IP address. I used the WebDBReader to get the dump links of URLs.
> Why did it find the unabsolute links (pages on one host that have links to
> pages on other hosts)?
>   
> For example: 
>   
> from
> http://www.l3s.uni-hannover.de/morob/Galleries/ER1/pages/09_DSCF0492.html
>  to
> http://www.l3s.uni-hannover.de/morob/Galleries/ER1/index.html
>  to
> http://www.l3s.uni-hannover.de/morob/Galleries/ER1/pages/08_DSCF0493.html
>  to
> http://www.l3s.uni-hannover.de/morob/Galleries/ER1/pages/10_DSCF0499.html
>  to
> http://www.learninglab.de/morob/Galleries/ER1/index.html
>  to
> http://www.learninglab.de/morob/Galleries/ER1/pages/08_DSCF0493.html
>  to
> http://www.learninglab.de/morob/Galleries/ER1/pages/10_DSCF0499.html
>  to
> http://www.learninglab.uni-hannover.de/morob/Galleries/ER1/index.html
>  to
> http://www.learninglab.uni-hannover.de/morob/Galleries/ER1/pages/08_DSCF0493.html
>  to
> http://www.learninglab.uni-hannover.de/morob/Galleries/ER1/pages/10_DSCF0499.html
>  
> regards,
> Niti


                


-------------------------------------------------------
SF email is sponsored by - The IT Product Guide
Read honest & candid reviews on hundreds of IT Products from real users.
Discover which products truly live up to the hype. Start reading now.
http://ads.osdn.com/?ad_id=6595&alloc_id=14396&op=click
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers




