Cool, thanks, that looks like it may do the job.
Am I correct in saying I need to define a mapping for each specific domain in host-urlnormalizer.txt? Or can I add a generic rule to cover any host?

i.e.
www.example1.org .example1.org
www.example2.org .example2.org
www.example3.org .example3.org
...

or can I do something along the lines of:-
www.* .($1)
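
To make the intended behaviour concrete, here is a minimal sketch in Python of the generic mapping I am after. It is only an illustration, not Nutch configuration: the function name strip_www is made up, and it does not show host-urlnormalizer.txt syntax (whether that file accepts a wildcard rule is exactly what I am asking).

```python
from urllib.parse import urlsplit, urlunsplit


def strip_www(url: str) -> str:
    """Return url with a leading 'www.' removed from its host, if present."""
    parts = urlsplit(url)
    host = parts.hostname or ""  # hostname is lower-cased by urlsplit
    # Only strip when something domain-like remains (e.g. not for 'www.com').
    if host.startswith("www.") and "." in host[len("www."):]:
        host = host[len("www."):]
    # Re-attach an explicit port if there was one; any user:password
    # part of the URL is dropped in this simplified sketch.
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    return urlunsplit(
        (parts.scheme, netloc, parts.path, parts.query, parts.fragment)
    )
```

With a rule like this, both http://www.example.com/somepage.html and http://example.com/somepage.html would end up as the same URL, whatever the domain.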

Arthur.

On 07/07/2015 09:01, Markus Jelsma wrote:
You can use the host normalizer for this.
https://issues.apache.org/jira/browse/NUTCH-1319
-----Original message-----
From:Arthur Yarwood <art...@fubaby.com>
Sent: Tuesday 7th July 2015 0:02
To: user@nutch.apache.org
Subject: Duplicate pages with and without www. prefix being indexed

I have a Nutch 1.10 and Solr 5.2 setup. Still playing around with it all
and quite new to me, but I've noticed for one site I have crawled, I'm
getting content indexed twice in Solr, once with the www. domain prefix
and once without. E.g.

http://www.example.com/somepage.html
http://example.com/somepage.html

How can I avoid this duplication? At least from being indexed into Solr.
And preferably a generic solution that will work with any other site I
crawl in the future, some of which may default to www.xyz.com and some to
xyz.com. I.e. I know I could add a regex-urlfilter for this one domain,
but I'd like to avoid this duplication in all instances as they arise.

Thanks!


--
Arthur Yarwood




--
Arthur Yarwood
http://www.fubaby.com
