I recently discovered that my crawl had fetched a number of sites in duplicate: once over http, and again over https. In the same way that one can add a host to the host-urlnormalizer file to avoid a similar issue with www.example.com vs example.com URLs, is there a tactic to address http vs https?

Ideally it would always favour http over https (for efficiency), without discounting https entirely when an entire host is set up to always serve over https, i.e. I don't really want to block all https hosts via a regex-urlfilter.

I have worked around it to some degree via specific regex-urlfilter rules, but it would be nice if there were a global option, rather than having to tweak config every time I discover duplicate content in my crawl.
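To make the behaviour I'm after concrete, here is a rough Python sketch. It is not Nutch code and does not use any Nutch API; the function name prefer_http is made up for illustration, and it assumes two URLs are duplicates when they differ only in scheme.

```python
from urllib.parse import urlsplit


def prefer_http(urls):
    """Collapse http/https duplicates, keeping the http URL when both exist.

    Hypothetical helper for illustration only. A URL seen only over https
    is kept as-is, so https-only hosts are not dropped.
    """
    chosen = {}
    for url in urls:
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https"):
            # Leave other schemes untouched.
            chosen[url] = url
            continue
        # Identity of a page, ignoring the scheme.
        key = (parts.netloc.lower(), parts.path or "/", parts.query)
        current = chosen.get(key)
        if current is None:
            chosen[key] = url
        elif parts.scheme == "http" and urlsplit(current).scheme == "https":
            # An http variant replaces a previously seen https variant.
            chosen[key] = url
    return list(chosen.values())


urls = [
    "https://example.com/a",
    "http://example.com/a",
    "https://secure.example.org/",
]
result = prefer_http(urls)
```

Here the two example.com URLs would collapse to the http one, while the https-only host is kept. In a real crawl this decision would presumably have to be made per host at normalisation or filtering time, which is the part I don't know how to configure globally.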

Arthur Yarwood
