Hi Nuther, I am not sure whether this is the only way to solve this problem, but it works very well for me on an intranet.
Which one of the following two do you want to achieve by coding?

1. Block one domain name completely.
2. Allow both the domain names but remember that both point to the
   same resource. So when a page is obtained from one domain, keep a
   note of it and do not request the same page from another domain.

I don't think coding for point 1 is a good idea because that can
already be achieved through URL filters. For point 2, a good starting
point would be src/java/org/apache/nutch/crawl/Generator.java and
src/java/org/apache/nutch/fetcher/Fetcher.java

Regards,
Susam Pal
http://susam.in/

On 7/6/07, Nuther <[EMAIL PROTECTED]> wrote:
> Hi, Susam.
>
> But that's wrong. Your solution is the easiest way to get rid of duplicates
> If you know DataParkSearch engine, it has this option.
> So, is the usage of url filter the only way to avoid duplicates?
> Or is there any way to code this feature, and if so, then how?
>
> > I have faced this issue. I block the duplicate domain using the URL
> > filters. So only one domain is crawled by the bot and the other domain
> > is ignored.
> >
> > Regards,
> > Susam Pal
> > http://susam.in/
> >
> > On 7/6/07, Nuther <[EMAIL PROTECTED]> wrote:
> >> Hi,
> >> I was wondering if nutch has alias option
> >> Let's say we have two domains www.site1.com and www.site2.com that point on
> >> one site. How can I tell nutch that they point on that site? This is
> >> problem
> >> because there are a lot of duplicates in search results.
> >> Thanks.
> >>
> >> --
> >> Regards,
> >> Nuther mailto:[EMAIL PROTECTED]

_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
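For point 2 above (accept both domain names, but note which pages have already been obtained and skip the same page on the other name), the idea can be sketched roughly as below. This is only an illustration under stated assumptions, not Nutch code: the class HostAliasDeduper, its alias map and its method names are all hypothetical, and nothing here reflects the actual logic inside Generator.java or Fetcher.java. It also assumes that the same path on both hosts really is the same page.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, NOT part of Nutch: maps alias hosts to one canonical
// host and remembers which canonical URLs have already been requested.
class HostAliasDeduper {
    // Assumption: the alias table is supplied by hand, e.g. site2 -> site1.
    private final Map<String, String> aliasToCanonical;
    private final Set<String> seen = new HashSet<>();

    HostAliasDeduper(Map<String, String> aliasToCanonical) {
        this.aliasToCanonical = aliasToCanonical;
    }

    // Rewrite the host part of the URL to its canonical name if it is a
    // known alias; every other part of the URL is kept.
    String canonicalize(String url) {
        try {
            URI uri = new URI(url);
            String host = uri.getHost();
            if (host == null) {
                return url; // no host component; leave the URL unchanged
            }
            host = host.toLowerCase(Locale.ROOT);
            String canonicalHost = aliasToCanonical.getOrDefault(host, host);
            return new URI(uri.getScheme(), uri.getUserInfo(), canonicalHost,
                    uri.getPort(), uri.getPath(), uri.getQuery(),
                    uri.getFragment()).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Malformed URL: " + url, e);
        }
    }

    // Returns true only the first time a given canonical URL is offered.
    boolean shouldFetch(String url) {
        return seen.add(canonicalize(url));
    }

    public static void main(String[] args) {
        HostAliasDeduper deduper =
                new HostAliasDeduper(Map.of("www.site2.com", "www.site1.com"));
        System.out.println(deduper.shouldFetch("http://www.site1.com/a.html"));
        System.out.println(deduper.shouldFetch("http://www.site2.com/a.html"));
        System.out.println(deduper.shouldFetch("http://www.site2.com/b.html"));
    }
}
```

With www.site2.com registered as an alias of www.site1.com, the second call in main is refused because the same page was already taken from www.site1.com, while the third is accepted because /b.html has not been seen under either name. Where such a check would hook into the crawl cycle is left open here; that is what reading the two source files mentioned above would have to answer.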
