Thanks for the explanations. Unfortunately, I don't find them convincing:

>> So the fonts/ directory is not automatically deleted by wget when it is empty. It was used for temporary files during the download. <<

Actually, the "fonts" directory is *not* empty, nor are the "Fonts_*_Conflict" directories.

>> Why should '@CalendarView' match 'calendar[@/?]' ? <<

The component of the regex which should match is not "calendar[@\?].*" (the first term in the regex). It is "event-\d+[@\?].*" (the fourth and last term in the regex). Once again, https://regex101.com/ confirms that "event-4193082@CalendarViewType=1&SelectedDate=6%2F27%2F2021.html" matches this term.

Thanks for your support.

-----Original Message-----
From: Tim Rühsen <tim.rueh...@gmx.de>
Sent: Monday, July 5, 2021 4:09 PM
To: Roger Brooks <r.s.bro...@ieee.org>; bug-wget@gnu.org
Subject: Re: Exclusion failures

On 28.06.21 19:36, Roger Brooks wrote:
> I am trying to use wget 1.19.1 to back up a club website. Here is a
> reduced version of my wget command, which only accesses the public
> parts of the website:
>
> >>
> cd /volume1/Backup/
> wget -EkKrNpH \
>   --output-file=wget.log \
>   --domains=imcz.club,sf.wildapricot.org \
>   --exclude-domains=webmail.imcz.club \
>   --exclude-directories=calendar,Club-Events,External-Events,Sys,Fonts,fonts \
>   --ignore-case \
>   --level=2 \
>   --no-parent \
>   --no-proxy \
>   --random-wait \
>   --reject=ashx,"overlay*" \
>   --reject-regex="calendar[@\?].*|Club-Events[@\?].*|External-Events[@\?].*|event-\d+[@\?].*" \
>   --rejected-log=wget-rejected.log \
>   --restrict-file-names=windows \
>   --wait=1 \
>   https://imcz.club/
> <<
>
> Two of the exclusions in the command are failing:
>
> 1. --exclude-directories=Fonts,fonts
> This is a workaround for wget’s creation of spurious font directories.
> The server has only one such directory, but the website’s backend
> platform (over which I have no control) sometimes addresses it as
> “fonts” and sometimes as “Fonts”.
> I expected that the option "--ignore-case" in the absence of
> "--no-clobber" would take care of this problem, but since the contents
> are static, I don’t need to back it up regularly.
> Despite the exclusion, wget still insists on creating the following
> directories:
> "W:\imcz.club\BuiltTheme\whiteboard_maya_blue.v3.0\10e4279e\fonts"
> "W:\imcz.club\BuiltTheme\whiteboard_maya_blue.v3.0\10e4279e\Fonts_ADMIN_Jun-27-230456-2021_Conflict"
> "W:\imcz.club\BuiltTheme\whiteboard_maya_blue.v3.0\10e4279e\Fonts_ADMIN_Jun-27-230459-2021_Conflict"
> "W:\imcz.club\BuiltTheme\whiteboard_maya_blue.v3.0\10e4279e\Fonts_ADMIN_Jun-27-230501-2021_Conflict"
> "W:\imcz.club\BuiltTheme\whiteboard_maya_blue.v3.0\10e4279e\Fonts_ADMIN_Jun-27-230504-2021_Conflict"
> The resulting backup website does not find the fonts in the "_Conflict"
> directories; they have to be copied into the "fonts" directory for the
> pages in the mirrored site to display properly.

So the fonts/ directory is not automatically deleted by wget when it is empty. It was used for temporary files during the download.

This is a known "issue", but since an empty directory doesn't eat too much space on a disk, it wasn't fixed yet (maybe nobody thought it is relevant). Wget2 doesn't have this issue.

I don't know where the *_Conflict/ directories are from. Seems like a server thing.

> 2.
> --reject-regex="calendar[@\?].*|Club-Events[@\?].*|External-Events[@\?].*|event-\d+[@\?].*" \
> This is an attempt to prevent duplicate downloading of files. The
> following file is downloaded, even though https://regex101.com says
> that it matches my regex:
> "W:\imcz.club\event-4193082@CalendarViewType=1&SelectedDate=6%2F27%2F2021.html"
> It is effectively a duplicate of:
> "W:\imcz.club\event-4193082.html"
> Increasing "--level" produces additional examples.

Why should '@CalendarView' match 'calendar[@/?]' ?
Maybe your regex should be '[@\?]calendar.*' !?

Regards, Tim
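
A note on the regex question above, offered as a sketch rather than a confirmed diagnosis. As far as I know, regex101.com tests against a PCRE-style flavour by default, whereas wget's --regex-type option defaults to posix unless pcre is selected (and wget was built with PCRE support). POSIX extended regular expressions do not define \d, so a term that matches on regex101 may not behave the same way inside wget. Spelling the digit class as [0-9] gives a term that should mean the same thing in both flavours. The snippet uses grep -E as a stand-in for a POSIX ERE engine. The helper name matches_ere and the '?' form of the URL are assumptions for illustration only; the '@' in the saved file name is what --restrict-file-names=windows substitutes for '?'.

```shell
#!/bin/sh
# Hypothetical helper (not part of wget): report whether a POSIX
# extended regex ($1) matches anywhere in a string ($2).
matches_ere() {
    if printf '%s\n' "$2" | grep -Eq -- "$1"; then
        echo "match"
    else
        echo "no match"
    fi
}

# The reject pattern from the report, with \d rewritten as [0-9] and the
# redundant backslash inside the bracket expressions dropped.
pattern='calendar[@?].*|Club-Events[@?].*|External-Events[@?].*|event-[0-9]+[@?].*'

# Local file name as saved with --restrict-file-names=windows.
saved='event-4193082@CalendarViewType=1&SelectedDate=6%2F27%2F2021.html'

# Assumed form of the URL before '?' was mapped to '@'.
url='https://imcz.club/event-4193082?CalendarViewType=1&SelectedDate=6%2F27%2F2021'

matches_ere "$pattern" "$saved"                            # prints: match
matches_ere "$pattern" "$url"                              # prints: match
matches_ere "$pattern" 'https://imcz.club/event-4193082'   # prints: no match
```

If the regex flavour is indeed the cause, either the [0-9] spelling or --regex-type=pcre (where available) would be worth trying. According to the wget manual, --reject-regex is applied to the complete URL rather than to the saved file name, so the '?' form is the one that matters during the crawl.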