Hi Arthur,

this problem was recently discussed in
https://issues.apache.org/jira/browse/NUTCH-2065
and is addressed by the urlnormalizer-protocol plugin:
https://issues.apache.org/jira/browse/NUTCH-2190
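
To illustrate the idea only (this is a rough sketch in Python, not the plugin's actual code or configuration format): a per-host table decides which protocol a URL is rewritten to, and hosts missing from the table are left untouched. The host names and the `PREFERRED_PROTOCOL` table are made-up placeholders.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical host -> protocol table; the hosts are placeholders.
PREFERRED_PROTOCOL = {
    "example.com": "http",
    "secure.example.org": "https",
}


def normalize_protocol(url, table=PREFERRED_PROTOCOL):
    """Rewrite http/https to the protocol chosen for the URL's host."""
    parts = urlsplit(url)
    preferred = table.get(parts.hostname or "")
    # Leave the URL alone if it is not http(s) or the host is not listed.
    if parts.scheme not in ("http", "https") or preferred is None:
        return url
    # An explicit port is tied to the original protocol, so do not touch it.
    if parts.port is not None or parts.scheme == preferred:
        return url
    return urlunsplit(
        (preferred, parts.netloc, parts.path, parts.query, parts.fragment)
    )
```

With such a table both variants of a page collapse to one URL, so only one of them ends up in the crawl.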
Of course, you have to decide for every host which protocol shall be used.

Cheers,
Sebastian

On 03/04/2016 08:50 PM, Arthur Yarwood wrote:
> I have recently discovered my crawl had fetched a number of sites in
> duplicate - once over http, and again over https. In a similar manner one
> can add a host to the host-urlnormalizer file to avoid a similar issue with
> www.example.com vs example.com urls - is there a tactic to address http vs
> https?
>
> Ideally always favouring http over https (for efficiency), but not
> discounting https totally, if an entire host is set up to always serve
> over https. i.e. I don't really want to block all https hosts via a
> regex-urlfilter.
>
> I have worked around it to some degree via specific regex-urlfilters, but
> it would be nice if there was a global option, rather than have to tweak
> config every time I discover duplicate content in my crawl.