Hello - yes, you need to maintain the list manually. There is, however, experimental code capable of emitting these rules automatically, but it is not open source at this time.
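If the list grows unwieldy, a generic www-stripping rule may also be possible via the urlnormalizer-regex plugin, whose rules are regular expressions rather than per-host mappings. A minimal sketch, assuming the stock conf/regex-normalize.xml layout (the exact pattern below is illustrative, not something shipped in the default file):

  <!-- Strip a leading www. from any http/https URL, e.g.
       http://www.example.com/page -> http://example.com/page -->
  <regex>
    <pattern>^(https?://)www\.</pattern>
    <substitution>$1</substitution>
  </regex>

Note that this forces non-www everywhere, so it only helps if none of your sites use the www form as canonical. If I remember correctly, you can test what a normalizer does to a given URL by running the checker class through the nutch script and feeding URLs on stdin, e.g.:

  echo "http://www.example.com/somepage.html" | bin/nutch org.apache.nutch.net.URLNormalizerChecker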
Markus

-----Original message-----
> From: Arthur Yarwood <art...@fubaby.com>
> Sent: Wednesday 8th July 2015 18:49
> To: user@nutch.apache.org
> Subject: RE: Duplicate pages with and without www. prefix being indexed
>
> OK, many thanks for the clarification. One thing I'm still uncertain of
> is: do I need a config line for each specific domain that needs
> normalizing? For example, if I notice I have 12 different sites with
> duplicate pages indexed, all of them in www. and non-www versions, do I
> need 12 lines of config in this text file?
>
> e.g.
> www.apples.com apples.com
> www.oranges.com oranges.com
> www.pears.com pears.com
> www.blackberries.com blackberries.com
> www.blueberries.com blueberries.com
> www.bananas.com bananas.com
> ...
> and so on.
>
> That is, from a long-term maintainability point of view, will I be
> continuously adding new lines to this host-urlnormalizer.txt file as
> and when I notice another site has content duplicated? Or is there some
> way to write a www. => no-www rule to cover any domain Nutch happens to
> encounter in the future?
>
> Arthur.
>
> On 2015-07-07 12:59, Markus Jelsma wrote:
> > Hello, I added an example to the issue. Hope it helps.
> >
> > # Force all sub domains to non-www.
> > *.example.com example.com
> >
> > # Force www sub domain to non-www.
> > www.example.net example.net
> >
> > # Force non-www sub domain to www.
> > example.org www.example.org
> >
> > -----Original message-----
> >> From: Arthur Yarwood <art...@fubaby.com>
> >> Sent: Tuesday 7th July 2015 11:14
> >> To: user@nutch.apache.org
> >> Subject: Re: Duplicate pages with and without www. prefix being
> >> indexed
> >>
> >> Cool, thanks, that looks like it may do the job.
> >> Am I correct in saying I need to define a mapping for each specific
> >> domain in host-urlnormalizer.txt? Or can I add a generic rule to
> >> cover any host?
> >>
> >> i.e.
> >> www.example1.org .example1.org
> >> www.example2.org .example2.org
> >> www.example3.org .example3.org
> >> ...
> >>
> >> or can I do something along the lines of:
> >> www.* .($1)
> >>
> >> Arthur.
> >>
> >> On 07/07/2015 09:01, Markus Jelsma wrote:
> >> > You can use the host normalizer for this.
> >> > https://issues.apache.org/jira/browse/NUTCH-1319
> >> >
> >> > -----Original message-----
> >> >> From: Arthur Yarwood <art...@fubaby.com>
> >> >> Sent: Tuesday 7th July 2015 0:02
> >> >> To: user@nutch.apache.org
> >> >> Subject: Duplicate pages with and without www. prefix being
> >> >> indexed
> >> >>
> >> >> I have a Nutch 1.10 and Solr 5.2 setup. I'm still playing around
> >> >> with it all and it's quite new to me, but I've noticed that for
> >> >> one site I have crawled, content is indexed twice in Solr: once
> >> >> with the www. domain prefix and once without. E.g.
> >> >>
> >> >> http://www.example.com/somepage.html
> >> >> http://example.com/somepage.html
> >> >>
> >> >> How can I avoid this duplication, at least in what gets indexed
> >> >> into Solr? Preferably with a generic solution that will work with
> >> >> any other site I crawl in the future, some of which may default
> >> >> to www.xyz.com and some to xyz.com. I.e. I know I could add a
> >> >> regex-urlfilter rule for this one domain, but I'd like to avoid
> >> >> this duplication in all instances it arises.
> >> >>
> >> >> Thanks!
> >> >>
> >> >>
> >> >> --
> >> >> Arthur Yarwood
> >> >>
> >> >>
> >>
> >>
> >> --
> >> Arthur Yarwood
> >>
> >>
>
>
> --
> Arthur Yarwood
>
>