The man page of wget 1.21.2 (also 1.20.3) describes the following options concerning regular expressions.
> --accept-regex urlregex
> --reject-regex urlregex
>     Specify a regular expression to accept or reject
>     the complete URL.
>
> --regex-type regextype
>     Specify the regular expression type.
>     Possible types are posix or pcre.
>     Note that to be able to use pcre type
>     wget has to be compiled with libpcre support.

However, the above option description forgets to mention which kind of POSIX regular expression wget uses. The info page of wget also forgets to mention which.

There are two kinds of POSIX regular expressions:

1. POSIX Extended Regular Expression (ERE)
2. POSIX Basic Regular Expression (BRE)

The difference between BRE and ERE follows:

POSIX ERE
    ? + | ( ) { } have special meanings by themselves without being preceded by a backslash (\). To be literal, they need to be escaped.

POSIX BRE
    ? + | are always literal and never have special meanings, no matter whether preceded by a backslash (\).
    ( ) { } are literal by themselves, but have special meanings if and only if they are escaped as in \( \) \{ \}

All other special symbols have no difference between POSIX ERE and POSIX BRE.

While the man page of the latest version of wget still forgets to mention whether wget uses ERE or BRE, a very old mail in the mailing list system suggests that wget should use ERE.

Gijs van Tulder wrote on 11 Apr 2012 (https://lists.gnu.org/archive/html/bug-wget/2012-05/msg00021.html):

> Here is a new version of the regular expressions patch.
> The new version combines POSIX (always, from gnulib)
> and PCRE (if available).
>
> The patch adds these options:
>
> --accept-regex="..."
> --reject-regex="..."
>
> --regex-type=posix for POSIX extended regexes (the default)
> --regex-type=pcre for PCRE regexes (if PCRE is available)

Please verify that wget currently uses ERE (as opposed to BRE) and that it is the default, by looking at the source code and by running wget. If so verified, then, please add the sentence "posix is the default, and refers to POSIX Extended Regular Expression (ERE)."
to the manpage and the infopage. Thus, the option description should become:

--regex-type regextype
    Specify the regular expression type.
    Possible types are posix or pcre.
    posix is the default, and refers to
    POSIX Extended Regular Expression (ERE).
    Note that to be able to use pcre type
    wget has to be compiled with libpcre support.

To test whether the regex of wget is ERE, you need to know the following. ? + | ( ) { } have the following meanings when they have special meanings.

?       zero or one of the preceding element
+       one or more of the preceding element
|       alternation
( )     grouping
{n}     the preceding element occurs exactly n times
{n,}    the preceding element occurs at least n times
{n,m}   the preceding element occurs at least n times but at most m times

Before actually running `wget` to see whether the posix regex of wget is ERE, let us get familiar with the behavior of ERE by running `grep`. The -E option of GNU grep enables POSIX Extended Regular Expression (ERE). Without -E, the regex of GNU grep is basic but deviates slightly from POSIX BRE. Here is the difference between the three:

POSIX ERE
    ? + | ( ) { } have special meanings by themselves without being preceded by a backslash (\). To be literal, they need to be escaped.

POSIX BRE
    ? + | are always literal and never have special meanings, no matter whether preceded by a backslash (\).
    ( ) { } are literal by themselves, but have special meanings if and only if they are escaped as in \( \) \{ \}

GNU-grep basic (default for GNU grep)
    ? + | ( ) { } are literal by themselves, but have special meanings if and only if escaped as in \? \+ \| \( \) \{ \}

All other special symbols have no difference between POSIX ERE, POSIX BRE, and GNU-grep basic. Let me mention two such symbols.

*       zero or more of the preceding element
.       matches any character except newline

The dot character '.' appears in a domain name such as "ftp.gnu.org" and before a file extension such as "report.pdf". For '.'
to literally mean a dot in regex, it has to be escaped like "ftp\.gnu\.org" and "report\.pdf".

Note that, in the context of regular expression, a special character means a character that has a meaning special to regular expression. This is not to be confused with a special character for bash. Many characters special to regex are also special to bash (but the meanings to regex and the meanings to bash may differ). Thus, when passing a regex string to `grep` or `wget` on command line, characters that happen to be special to bash must be protected from bash. This protection is usually done by enclosing the regex string with single-quotes (''). The only special characters that double-quotes fail to protect from bash are the following four:

    dollar ($), backslash (\), backtick (`), exclamation (!)

Now, let us run `grep` to get familiar with the behavior of ERE. In the following, output lines are commented out by '#' to distinguish them from commands.

[code]
# ? question mark
quest='ac
abc
abbc
ab?c'
echo "$quest" | grep -E 'ab?c'
# ac
# abc
echo "$quest" | grep -E 'ab\?c'
# ab?c
echo "$quest" | grep 'ab?c'
# ab?c

# + plus
plus='ac
abc
abbc
ab+c'
echo "$plus" | grep -E 'ab+c'
# abc
# abbc
echo "$plus" | grep -E 'ab\+c'
# ab+c
echo "$plus" | grep 'ab+c'
# ab+c

# | vertical line
vert='ab
cd
ad
bc
b|c'
echo "$vert" | grep -E 'ab|cd'
# ab
# cd
echo "$vert" | grep -E 'ab\|cd'
# none matched
echo "$vert" | grep 'ab|cd'
# none matched

# () parentheses
paren='ad
abcd
abcbcd
bc
acbd
ebcf
a(bc)d
a(bcd'
echo "$paren" | grep -E 'a(bc)*d'
# ad
# abcd
# abcbcd
echo "$paren" | grep -E 'a\(bc\)*d'
# a(bc)d
# a(bcd
echo "$paren" | grep 'a(bc)*d'
# a(bc)d
# a(bcd
echo "$paren" | grep 'a\(bc\)*d'
# ad
# abcd
# abcbcd

# {} curly braces
brace='ac
abc
abbc
ab{0,1}c'
echo "$brace" | grep -E 'ab{0,1}c' # same as 'ab?c'
# ac
# abc
echo "$brace" | grep -E 'ab\{0,1\}c'
# ab{0,1}c
echo "$brace" | grep 'ab{0,1}c'
# ab{0,1}c
echo "$brace" | grep 'ab\{0,1\}c'
# ac
# abc
[/code]

If I were an admin of a web
site, I would test `wget` with the same strings and regexes as the above `grep` test. I would create files whose names are the same as the sample strings of the above `grep` test. Then, I would run `wget` with the regex for --accept-regex being fundamentally the same as in the above `grep` test.

The following code creates such files in directories whose names are the same as the five variables of the above `grep` test.

[code]
mkdir /quest /plus /vert /paren /brace

cd /quest
questAry=(ac abc abbc 'ab?c')
echo foo | tee "${questAry[@]}" > /dev/null

cd /plus
plusAry=(ac abc abbc 'ab+c')
echo foo | tee "${plusAry[@]}" > /dev/null

cd /vert
vertAry=(ab cd ad bc 'b|c')
echo foo | tee "${vertAry[@]}" > /dev/null

cd /paren
parenAry=(ad abcd abcbcd bc acbd ebcf 'a(bc)d' 'a(bcd')
echo foo | tee "${parenAry[@]}" > /dev/null

cd /brace
braceAry=(ac abc abbc 'ab{0,1}c')
echo foo | tee "${braceAry[@]}" > /dev/null
[/code]

Suppose unrealistically that the above directories and files had been created in "https://www.gnu.org", so that the URLs were "https://www.gnu.org/quest/abc", "https://www.gnu.org/plus/abc" and so on. The following code would test `wget` to see how `wget` would handle '+', which ERE and BRE handle differently.

[code]
links='<a href="https://www.gnu.org/plus/ac">
<a href="https://www.gnu.org/plus/abc">
<a href="https://www.gnu.org/plus/abbc">
<a href="https://www.gnu.org/plus/ab+c">'
re=".*/ab+c"
wget -rl1 --accept-regex "$re" -Fi <(echo "$links") -w1
[/code]

I thought that `wget` would request only the files whose names match the regex and would not request the files whose names do not match the regex. Hence, even though www.gnu.org actually lacks such directories and files as "plus/abc", "plus/abbc" and so on, I thought that, by looking at which files `wget` would request, I would be able to see whether `wget` works identically to the ERE of `grep`. Unfortunately, however, `wget` requests all the files no matter whether they match the regex.
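
For comparison, the set of URLs that an ERE filter would be expected to accept can be previewed by running `grep` on the same four URLs with the same regex. This is only a sketch of what `grep` does, not of what `wget` does; the variable names `ere_hits` and `bre_hits` are made up for this illustration.

```shell
# The four sample URLs from the wget test, one per line.
links='https://www.gnu.org/plus/ac
https://www.gnu.org/plus/abc
https://www.gnu.org/plus/abbc
https://www.gnu.org/plus/ab+c'
re='.*/ab+c'

# ERE: an unescaped '+' means "one or more of the preceding element".
ere_hits=$(printf '%s\n' "$links" | grep -E -- "$re")

# Basic regex: an unescaped '+' is a literal plus sign.
bre_hits=$(printf '%s\n' "$links" | grep -- "$re")

printf 'ERE would accept:\n%s\n' "$ere_hits"
printf 'basic would accept:\n%s\n' "$bre_hits"
```

If the posix type of `wget` is ERE, the first list is the set of requests I expected to see from a working filter; if it were BRE, I would expect the second list instead.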
`wget` may be asking the web server about the existence of every file before the regex filter is applied, which would be inefficient behavior.

Because I am not an admin of any web site, and because `wget` requests all the files no matter whether they match the regex, I gave up testing whether the posix regex of `wget` works identically to the ERE of `grep`.

Anyway, I would like the maintainer of `wget` to verify that wget currently uses ERE (as opposed to BRE) and that it is the default, by looking at the source code and by running wget. If so verified, then, please add the sentence "posix is the default, and refers to POSIX Extended Regular Expression (ERE)." to the manpage and the infopage.

--- Rabvit