Hi Nuther,

I am not sure whether URL filtering is the only way to solve this problem, but
it works very well for me on an intranet.

Which one of the following two do you want to achieve by coding?

1. Block one domain name completely.
2. Allow both domain names, but remember that both point to the
same resource. When a page is fetched from one domain, keep a note
of it and do not request the same page from the other domain.

I don't think coding for point 1 is a good idea because that can
already be achieved through URL filters.

For point 2, a good starting point would be
src/java/org/apache/nutch/crawl/Generator.java and
src/java/org/apache/nutch/fetcher/Fetcher.java
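
To make the idea in point 2 concrete, here is a rough, standalone sketch of
the bookkeeping involved. It is not Nutch code and does not use any Nutch
API: the class AliasDeduper, its methods, and the alias map are all
hypothetical names made up for illustration. It assumes you already know
which host names are aliases of each other. Something along these lines
would have to be adapted to fit wherever Generator or Fetcher decides
which URLs to request.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Nutch.
class AliasDeduper {
    // Maps an alias host (lower case) to the host treated as canonical.
    private final Map<String, String> aliasToCanonical;
    // Canonical URLs that have already been accepted for fetching.
    private final Set<String> seen = new HashSet<>();

    AliasDeduper(Map<String, String> aliasToCanonical) {
        this.aliasToCanonical = aliasToCanonical;
    }

    // Rewrites the host of the URL to its canonical host, if it is a known alias.
    // URLs that cannot be parsed, or that have no host, are returned unchanged.
    String canonicalize(String url) {
        try {
            URI u = new URI(url);
            String host = u.getHost();
            if (host == null) {
                return url;
            }
            String lower = host.toLowerCase(Locale.ROOT);
            String canonical = aliasToCanonical.getOrDefault(lower, lower);
            return new URI(u.getScheme(), u.getUserInfo(), canonical, u.getPort(),
                           u.getPath(), u.getQuery(), u.getFragment()).toString();
        } catch (URISyntaxException e) {
            return url;
        }
    }

    // Returns true the first time a page is seen under any of its aliases,
    // and false for every later request of the same page.
    boolean shouldFetch(String url) {
        return seen.add(canonicalize(url));
    }

    public static void main(String[] args) {
        AliasDeduper deduper =
            new AliasDeduper(Map.of("www.site2.com", "www.site1.com"));
        String[] urls = {
            "http://www.site1.com/a.html",
            "http://www.site2.com/a.html",
            "http://www.site2.com/b.html"
        };
        for (String url : urls) {
            System.out.println(url + " fetch=" + deduper.shouldFetch(url));
        }
    }
}
```

With the alias map above, the second URL is skipped because it names the
same page as the first one, while the third is fetched since b.html has
not been seen under either domain. The in-memory set is only for the
sketch; a real crawl would need to keep this state wherever the crawl
keeps its list of known URLs.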

Regards,
Susam Pal
http://susam.in/

On 7/6/07, Nuther <[EMAIL PROTECTED]> wrote:
> Hi, Susam.
>
> But that's wrong. Your solution is the easiest way to get rid of duplicates.
> If you know the DataParkSearch engine, it has this option.
> So, is the usage of url filter the only way to avoid duplicates?
> Or is there any way to code this feature, and if so, then how?
>
> > I have faced this issue. I block the duplicate domain using the URL
> > filters. So only one domain is crawled by the bot and the other domain
> > is ignored.
>
> > Regards,
> > Susam Pal
> > http://susam.in/
>
> > On 7/6/07, Nuther <[EMAIL PROTECTED]> wrote:
> >> Hi,
> >> I was wondering if nutch has an alias option.
> >> Let's say we have two domains, www.site1.com and www.site2.com, that point to
> >> one site. How can I tell nutch that they point to that site? This is a
> >> problem because there are a lot of duplicates in search results.
> >> Thanks.
>
> >> --
> >> Regards,
> >>  Nuther                          mailto:[EMAIL PROTECTED]
