I have recently discovered that my crawl had fetched a number of sites in
duplicate - once over http, and again over https. One can add a host to
the host-urlnormalize file to avoid a similar issue with www.example.com
vs example.com urls - is there a comparable tactic to address http vs
https?
Ideally it would always favour http over https (for efficiency), without
discounting https entirely if an entire host is set up to always serve
over https, i.e. I don't really want to block all https hosts via a
regex-urlfilter.
I have worked around it to some degree via specific regex-urlfilters,
but it would be nice if there was a global option, rather than having to
tweak config every time I discover duplicate content in my crawl.
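To make the behaviour I am after concrete, here is a rough Python sketch of
the preference described above. It is not Nutch code and does not use any
Nutch API; the function name prefer_http and the sample URLs are made up
for illustration, and it assumes two URLs are duplicates whenever host,
path and query match (ports are ignored for simplicity).

```python
from urllib.parse import urlsplit


def prefer_http(urls):
    """Collapse http/https duplicates, keeping the http variant when both exist.

    Hypothetical helper for illustration only; not part of Nutch.
    """
    chosen = {}  # (host, path, query) -> URL kept for that page
    for url in urls:
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https"):
            continue  # only http/https URLs are considered here
        # Assumption: same host + path + query means the same page.
        key = (parts.hostname, parts.path or "/", parts.query)
        current = chosen.get(key)
        if current is None:
            # First sighting of this page: keep it, even if it is https-only.
            chosen[key] = url
        elif parts.scheme == "http" and urlsplit(current).scheme == "https":
            # An http variant exists, so it replaces the https one.
            chosen[key] = url
    return list(chosen.values())


if __name__ == "__main__":
    sample = [
        "https://example.com/a",
        "http://example.com/a",
        "https://secure.example.org/login",
    ]
    for kept in prefer_http(sample):
        print(kept)
```

With the sample list, the http copy of example.com/a is kept, while the
https-only host is left alone rather than being filtered out.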
--
Arthur Yarwood