Hello - yes, you need to maintain the list manually. There is, however, experimental code capable of emitting these rules automatically, but it is not open source at this time.
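If the list grows unwieldy, a generic www-stripping rule may also be possible via the urlnormalizer-regex plugin, whose rules are regular expressions rather than per-host mappings. A minimal sketch, assuming the stock conf/regex-normalize.xml layout (the exact pattern below is illustrative, not something shipped in the default file):

  <!-- Strip a leading www. from any http/https URL, e.g.
       http://www.example.com/page -> http://example.com/page -->
  <regex>
    <pattern>^(https?://)www\.</pattern>
    <substitution>$1</substitution>
  </regex>

Note that this forces non-www everywhere, so it only helps if none of your sites use the www form as canonical. If I remember correctly, you can test what a normalizer does to a given URL by running the checker class through the nutch script and feeding URLs on stdin, e.g.:

  echo "http://www.example.com/somepage.html" | bin/nutch org.apache.nutch.net.URLNormalizerChecker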
Markus

-----Original message-----
> From: Arthur Yarwood <art...@fubaby.com>
> Sent: Wednesday 8th July 2015 18:49
> To: user@nutch.apache.org
> Subject: RE: Duplicate pages with and without www. prefix being indexed
>
> OK, many thanks for the clarification. One thing I'm still uncertain of
> is: do I need a config line for each specific domain that needs
> normalizing? For example, if I notice I have 12 different sites with
> duplicate pages indexed, all of them in www. and non-www versions, do I
> need 12 lines of config in this text file?
>
> e.g.
> www.apples.com apples.com
> www.oranges.com oranges.com
> www.pears.com pears.com
> www.blackberries.com blackberries.com
> www.blueberries.com blueberries.com
> www.bananas.com bananas.com
> ...
> and so on.
>
> That is, from a long-term maintainability point of view, will I be
> continuously adding new lines to this host-urlnormalizer.txt file as
> and when I notice another site has content duplicated? Or is there some
> way to write a www. => no-www rule to cover any domain Nutch happens to
> encounter in the future?
>
> Arthur.
>
> On 2015-07-07 12:59, Markus Jelsma wrote:
> > Hello, I added an example to the issue. Hope it helps.
> >
> > # Force all sub domains to non-www.
> > *.example.com example.com
> >
> > # Force www sub domain to non-www.
> > www.example.net example.net
> >
> > # Force non-www sub domain to www.
> > example.org www.example.org
> >
> > -----Original message-----
> >> From: Arthur Yarwood <art...@fubaby.com>
> >> Sent: Tuesday 7th July 2015 11:14
> >> To: user@nutch.apache.org
> >> Subject: Re: Duplicate pages with and without www. prefix being
> >> indexed
> >>
> >> Cool, thanks, that looks like it may do the job.
> >> Am I correct in saying I need to define a mapping for each specific
> >> domain in host-urlnormalizer.txt? Or can I add a generic rule to
> >> cover any host?
> >>
> >> i.e.
> >> www.example1.org .example1.org
> >> www.example2.org .example2.org
> >> www.example3.org .example3.org
> >> ...
> >>
> >> or can I do something along the lines of:
> >> www.* .($1)
> >>
> >> Arthur.
> >>
> >> On 07/07/2015 09:01, Markus Jelsma wrote:
> >> > You can use the host normalizer for this.
> >> > https://issues.apache.org/jira/browse/NUTCH-1319
> >> >
> >> > -----Original message-----
> >> >> From: Arthur Yarwood <art...@fubaby.com>
> >> >> Sent: Tuesday 7th July 2015 0:02
> >> >> To: user@nutch.apache.org
> >> >> Subject: Duplicate pages with and without www. prefix being
> >> >> indexed
> >> >>
> >> >> I have a Nutch 1.10 and Solr 5.2 setup. I'm still playing around
> >> >> with it all and it's quite new to me, but I've noticed that for
> >> >> one site I have crawled, content is indexed twice in Solr: once
> >> >> with the www. domain prefix and once without. E.g.
> >> >>
> >> >> http://www.example.com/somepage.html
> >> >> http://example.com/somepage.html
> >> >>
> >> >> How can I avoid this duplication, at least in what gets indexed
> >> >> into Solr? Preferably with a generic solution that will work with
> >> >> any other site I crawl in the future, some of which may default
> >> >> to www.xyz.com and some to xyz.com. I.e. I know I could add a
> >> >> regex-urlfilter rule for this one domain, but I'd like to avoid
> >> >> this duplication in all instances it arises.
> >> >>
> >> >> Thanks!
> >> >>
> >> >>
> >> >> --
> >> >> Arthur Yarwood
> >> >>
> >> >>
> >>
> >>
> >> --
> >> Arthur Yarwood
> >>
> >>
>
>
> --
> Arthur Yarwood
>
>