I am trying to use wget 1.19.1 to back up a club website. Here is a reduced version of my wget command, which only accesses the public parts of the website:

    cd /volume1/Backup/
    wget -EkKrNpH \
      --output-file=wget.log \
      --domains=imcz.club,sf.wildapricot.org \
      --exclude-domains=webmail.imcz.club \
      --exclude-directories=calendar,Club-Events,External-Events,Sys,Fonts,fonts \
      --ignore-case \
      --level=2 \
      --no-parent \
      --no-proxy \
      --random-wait \
      --reject=ashx,"overlay*" \
      --reject-regex="calendar[@\?].*|Club-Events[@\?].*|External-Events[@\?].*|event-\d+[@\?].*" \
      --rejected-log=wget-rejected.log \
      --restrict-file-names=windows \
      --wait=1 \
      https://imcz.club/
Two of the exclusions in the command are failing:

1. --exclude-directories=Fonts,fonts

This is a workaround for wget’s creation of spurious font directories. The server has only one such directory, but the website’s backend platform (over which I have no control) sometimes addresses it as “fonts” and sometimes as “Fonts”. I expected the option "--ignore-case", in the absence of "--no-clobber", to take care of this problem. It did not, so I tried excluding the directory instead; its contents are static, so I don’t need to back it up regularly. Despite the exclusion, wget still creates the following directories:

    "W:\imcz.club\BuiltTheme\whiteboard_maya_blue.v3.0\10e4279e\fonts"
    "W:\imcz.club\BuiltTheme\whiteboard_maya_blue.v3.0\10e4279e\Fonts_ADMIN_Jun-27-230456-2021_Conflict"
    "W:\imcz.club\BuiltTheme\whiteboard_maya_blue.v3.0\10e4279e\Fonts_ADMIN_Jun-27-230459-2021_Conflict"
    "W:\imcz.club\BuiltTheme\whiteboard_maya_blue.v3.0\10e4279e\Fonts_ADMIN_Jun-27-230501-2021_Conflict"
    "W:\imcz.club\BuiltTheme\whiteboard_maya_blue.v3.0\10e4279e\Fonts_ADMIN_Jun-27-230504-2021_Conflict"

The mirrored site does not find the fonts in the "_Conflict" directories; they have to be copied into the "fonts" directory for its pages to display properly.

2. --reject-regex="calendar[@\?].*|Club-Events[@\?].*|External-Events[@\?].*|event-\d+[@\?].*"

This is an attempt to prevent duplicate downloading of files. The following file is downloaded, even though https://regex101.com says that it matches my regex:

    "W:\imcz.club\event-4193082@CalendarViewType=1&SelectedDate=6%2F27%2F2021.html"

It is effectively a duplicate of:

    "W:\imcz.club\event-4193082.html"

Increasing "--level" produces additional examples.

I am aware that 1.19.1 is not the latest version, but wget is running on a Synology DiskStation, which makes it difficult to update. I haven't found any indication that these problems are known bugs that have since been fixed.

Any advice is welcome!
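For reference, here is a minimal sketch of how the second regex could be checked outside wget. It rests on assumptions I have not confirmed on the DiskStation: that this wget build uses its default "--regex-type=posix" (POSIX regular expressions, where the Perl-style "\d" is not a digit class), and that "grep -E" is a fair stand-in for wget's matcher. The helper name url_is_rejected is made up for this illustration, and the URL is reconstructed from the local file name, with "?" in place of the "@" that "--restrict-file-names=windows" writes to disk.

```shell
#!/bin/sh
# Hypothetical helper (not part of wget): succeeds if the URL matches the
# POSIX extended regex, which is assumed to approximate how wget applies
# --reject-regex under its default --regex-type=posix.
url_is_rejected() {
    # $1 = URL to test, $2 = POSIX extended regular expression
    printf '%s\n' "$1" | grep -Eq -- "$2"
}

# Assumed remote URL: the server sees "?" before the query string; the "@"
# only shows up in the local file name.
url='https://imcz.club/event-4193082?CalendarViewType=1&SelectedDate=6%2F27%2F2021'

# Variant of the reject pattern that uses [0-9] instead of \d, so that it
# does not depend on Perl-style shorthand classes.
pattern='calendar[@?].*|Club-Events[@?].*|External-Events[@?].*|event-[0-9]+[@?].*'

if url_is_rejected "$url" "$pattern"; then
    echo "rejected"
else
    echo "downloaded"
fi
```

With GNU grep this reports the URL as rejected, which suggests trying "event-[0-9]+" in the wget command itself. regex101.com tests Perl-style flavours by default, as far as I can tell, which may explain why the original pattern appears to match there.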