URL:
  <http://savannah.gnu.org/bugs/?50516>

                 Summary: domain.com vs www.domain.com site duplication
                 Project: GNU Wget
            Submitted by: ages2500
            Submitted on: Sat 11 Mar 2017 08:01:57 PM UTC
                Category: Feature Request
                Severity: 3 - Normal
                Priority: 5 - Normal
                  Status: None
                 Privacy: Public
             Assigned to: None
         Originator Name: 
        Originator Email: 
             Open/Closed: Open
         Discussion Lock: Any
                 Release: None
        Operating System: None
         Reproducibility: None
           Fixed Release: None
         Planned Release: None
              Regression: None
           Work Required: None
          Patch Included: No

    _______________________________________________________

Details:

When retrieving http://www.domain.com/, the site author may have linked some
files via domain.com, without the www. The reverse case also occurs.

Either scenario results in the website being downloaded twice, creating a
haphazard mesh of file links between:

/domain.com/

and

/www.domain.com/

It also means that 404 pages will link to http://domain.com/ in the HTML
files of one folder, and to http://www.domain.com/ in the other.

Even if one overlooks the local mess this creates, it still puts extra
strain on a large wget process, which crawls and downloads nearly twice as
much data as it needs to.

Restricting the site to -D www.domain.com runs the risk of missing data. To
ensure I get all of the data from the domain in question, I use -D
domain.com.

It would be nice to have an extra flag that makes wget treat domain.com and
www.domain.com content as the same, and store it in a single folder without
content duplication.

I am not requesting that this feature be a default function, but rather an
additional flag/feature that treats www.domain.com and domain.com as coming
from the same domain.
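
As a rough illustration of the requested host mapping (this is not an
existing wget option, and the function name canonical_host is hypothetical),
the idea could be sketched in shell as follows. It assumes that dropping a
single leading "www." label is enough to identify the two hosts as one.

```shell
#!/bin/sh
# Hypothetical helper, not part of wget: drop one leading "www." label so
# that www.domain.com and domain.com map to the same name. A host without
# that prefix is returned unchanged.
canonical_host() {
    printf '%s\n' "${1#www.}"
}
```

With such a mapping, both www.domain.com and domain.com would resolve to the
single local folder /domain.com/, and a URL already fetched under one host
name could be recognized when it is linked under the other.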

The following command exhibits this behavior in wget:

wget -rkE -np -l inf -D runequake.com http://www.runequake.com/

    _______________________________________________________

Reply to this item at:

  <http://savannah.gnu.org/bugs/?50516>

_______________________________________________
  Message sent via/by Savannah
  http://savannah.gnu.org/

