Thanks for the further tips.

Adding --regex-type=pcre resolved the problem with
"event-4193082@CalendarViewType=1&SelectedDate=6%2F27%2F2021.html", even
though I am using wget 1.19.1. I am running wget on a Synology NAS, so the
newest Windows build won't help. I am using --restrict-file-names=windows
to allow the resulting mirrored website to be viewed on a Windows client.

The advice in this forum post:
https://serverfault.com/questions/324555/how-to-exclude-certain-directories-while-using-wget
made me realize that --exclude-directories probably didn't work for "fonts"
and "Fonts" because they are subdirectories. The workaround suggested there
of using --reject-regex instead is working satisfactorily for me.

That said, I am still curious as to why directories of the form
"Fonts_ADMIN_<date>_Conflict" are being created at all. Their parent
directory is being recreated with a new GUID more often than I anticipated,
so I will pursue that question under a different title.

Here is the script with the working exclusions:
>>
wget -EkKrNpH \
--output-file=wget.log \
--domains=imcz.club,sf.wildapricot.org \
--exclude-domains=webmail.imcz.club \
--exclude-directories=calendar,Club-Events,External-Events,Fonts,fonts,Sys \
--ignore-case \
--level=2 \
--no-parent \
--no-proxy \
--random-wait \
--regex-type=pcre \
--reject=ashx,"overlay*" \
--reject-regex="calendar[@\?].*|Club-Events[@\?].*|External-Events[@\?].*|event-\d+[@\?].*|/[Ff]onts" \
--rejected-log=wget-rejected.log \
--restrict-file-names=windows \
--wait=1 \
https://imcz.club/
<<

Thanks for your help!

Regards,
Roger
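P.S. For anyone finding this thread in the archive: the sketch below is a rough way to see why the "event-\d+" term depends on --regex-type=pcre. It uses grep -E as a stand-in for a POSIX extended regex engine. That is an assumption made for illustration only; it is not wget's own matcher, and wget tests the pattern against the full URL rather than the saved file name. The helper name "check" and the variable names are invented for this example.

```shell
#!/bin/sh
# Illustration only: grep -E stands in for a POSIX extended regex engine.
# It is NOT wget's matcher; results are indicative, not authoritative.
name='event-4193082@CalendarViewType=1&SelectedDate=6%2F27%2F2021.html'

# check PATTERN: prints "match" or "no match" for $name.
# stderr is discarded because some grep versions warn about "\d".
check() {
    if printf '%s\n' "$name" | grep -Eq -- "$1" 2>/dev/null; then
        echo "match"
    else
        echo "no match"
    fi
}

# PCRE-style term: \d is a Perl/PCRE shorthand, not a POSIX construct.
pcre_style=$(check 'event-\d+[@?].*')

# Equivalent term written with POSIX constructs only.
posix_style=$(check 'event-[0-9]+[@?].*')

echo "PCRE-style term under grep -E:  $pcre_style"
echo "POSIX-style term under grep -E: $posix_style"
```

The second term should match with any POSIX engine. The first depends on how the engine treats "\d"; a strictly POSIX engine would not be expected to read it as "digit", which would be consistent with the behaviour I saw before switching to PCRE. Writing the term as "event-[0-9]+[@?].*" might therefore be an alternative on wget builds without PCRE support, but I have not tried that with wget itself.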
-----Original Message-----
From: Tim Rühsen <tim.rueh...@gmx.de>
Sent: Thursday, July 8, 2021 7:54 PM
To: Roger Brooks <r.s.bro...@ieee.org>
Cc: bug-wget@gnu.org
Subject: Re: Exclusion failures

I think I don't understand your font/ problem correctly, sorry.

The regex issue seems to be that wget is using POSIX regex by default.
Please try to use --regex-type=pcre for PCRE regex.

You can get the latest version of wget built for Windows (incl. PCRE
support) at https://eternallybored.org/misc/wget/.

Regards, Tim

On 08.07.21 16:26, Roger Brooks wrote:
> Thanks for the explanations. Unfortunately, I don't find them convincing:
>
> >>
> So the fonts/ directory is not automatically deleted by wget when it
> is empty. It was used for temporary files during the download.
> <<
> Actually, the "fonts" directory is *not* empty, nor are the
> "Fonts_*_Conflict" directories.
>
> >>
> Why should '@CalendarView' match 'calendar[@/?]' ?
> <<
> The component of the regex which should match is not "calendar[@\?].*"
> (the first term in the regex). It is "event-\d+[@\?].*" (the fourth
> and last term in the regex).
> Once again, https://regex101.com/ confirms that
> "event-4193082@CalendarViewType=1&SelectedDate=6%2F27%2F2021.html"
> matches this term.
>
> Thanks for your support.
>
> -----Original Message-----
> From: Tim Rühsen <tim.rueh...@gmx.de>
> Sent: Monday, July 5, 2021 4:09 PM
> To: Roger Brooks <r.s.bro...@ieee.org>; bug-wget@gnu.org
> Subject: Re: Exclusion failures
>
> On 28.06.21 19:36, Roger Brooks wrote:
>> I am trying to use wget 1.19.1 to back up a club website. Here is a
>> reduced version of my wget command, which only accesses the public
>> parts of the website:
>> >>
>> cd /volume1/Backup/
>> wget -EkKrNpH \
>> --output-file=wget.log \
>> --domains=imcz.club,sf.wildapricot.org \
>> --exclude-domains=webmail.imcz.club \
>> --exclude-directories=calendar,Club-Events,External-Events,Sys,Fonts,fonts \
>> --ignore-case \
>> --level=2 \
>> --no-parent \
>> --no-proxy \
>> --random-wait \
>> --reject=ashx,"overlay*" \
>> --reject-regex="calendar[@\?].*|Club-Events[@\?].*|External-Events[@\?].*|event-\d+[@\?].*" \
>> --rejected-log=wget-rejected.log \
>> --restrict-file-names=windows \
>> --wait=1 \
>> https://imcz.club/
>> <<
>>
>> Two of the exclusions in the command are failing:
>>
>> 1. --exclude-directories=Fonts,fonts
>> This is a workaround for wget’s creation of spurious font directories.
>> The server has only one such directory, but the website’s backend
>> platform (over which I have no control) sometimes addresses it as
>> “fonts” and sometimes as “Fonts”.
>> I expected that the option "--ignore-case" in the absence of
>> "--no-clobber" would take care of this problem, but since the contents
>> are static, I don’t need to back it up regularly. Despite the
>> exclusion, wget still insists on creating the following directories:
>> "W:\imcz.club\BuiltTheme\whiteboard_maya_blue.v3.0\10e4279e\fonts"
>> "W:\imcz.club\BuiltTheme\whiteboard_maya_blue.v3.0\10e4279e\Fonts_ADMIN_Jun-27-230456-2021_Conflict"
>> "W:\imcz.club\BuiltTheme\whiteboard_maya_blue.v3.0\10e4279e\Fonts_ADMIN_Jun-27-230459-2021_Conflict"
>> "W:\imcz.club\BuiltTheme\whiteboard_maya_blue.v3.0\10e4279e\Fonts_ADMIN_Jun-27-230501-2021_Conflict"
>> "W:\imcz.club\BuiltTheme\whiteboard_maya_blue.v3.0\10e4279e\Fonts_ADMIN_Jun-27-230504-2021_Conflict"
>> The resulting backup website does not find the fonts in the "_Conflict"
>> directories; they have to be copied into the "fonts" directory for
>> the pages in the mirrored site to display properly.
>
> So the fonts/ directory is not automatically deleted by wget when it
> is empty. It was used for temporary files during the download.
> This is a known "issue", but since an empty directory doesn't eat too
> much space on a disk, it wasn't fixed yet (maybe nobody thought it is
> relevant).
> Wget2 doesn't have this issue.
>
> I don't know where the *_Conflict/ directories are from. Seems like a
> server thing.
>
>> 2. --reject-regex="calendar[@\?].*|Club-Events[@\?].*|External-Events[@\?].*|event-\d+[@\?].*" \
>> This is an attempt to prevent duplicate downloading of files. The
>> following file is downloaded, even though https://regex101.com says
>> that it matches my regex:
>> "W:\imcz.club\event-4193082@CalendarViewType=1&SelectedDate=6%2F27%2F2021.html"
>> It is effectively a duplicate of:
>> "W:\imcz.club\event-4193082.html"
>> Increasing "--level" produces additional examples.
>
> Why should '@CalendarView' match 'calendar[@/?]' ?
> Maybe your regex should be '[@\?]calendar.*' !?
>
> Regards, Tim