Manpage and infopage of wget need to mention whether regex of wget is Extended or Basic

Rabvit via Primary discussion list for GNU Wget Fri, 17 Dec 2021 23:18:42 -0800

The man page of wget 1.21.2 (also 1.20.3) describes the following options 
concerning regular expressions.


> --accept-regex urlregex
> --reject-regex urlregex
>     Specify a regular expression to accept or reject
>     the complete URL.
>
>
> --regex-type regextype
>     Specify the regular expression type.
>     Possible types are posix or pcre.
>     Note that to be able to use pcre type
>     wget has to be compiled with libpcre support.

However, the above option description does not say which kind of POSIX 
regular expression wget uses.  The info page of wget omits this as well.

There are two kinds of POSIX regular expressions:
1. POSIX Extended Regular Expression (ERE)
2. POSIX Basic Regular Expression (BRE)

The difference between BRE and ERE follows:

  POSIX ERE
    ?  +  |    ( )   { }   have special meanings by themselves
    without being preceded by a backslash (\).
    To be literal, they need to be escaped.

  POSIX BRE
    ?  +  |   are literal by themselves and
    have no special meaning defined by POSIX.
    The escaped forms  \?  \+  \|  are left undefined by POSIX.

    ( )   { }   are literal by themselves,
    but have special meanings if and only if
    they are escaped as in   \(  \)    \{  \}

The remaining common special characters behave the same in POSIX ERE and
POSIX BRE.


While the man page of the latest version of wget still does not say 
whether wget uses ERE or BRE, an old message on the mailing list 
suggests that wget uses ERE.

Gijs van Tulder wrote on 11 Apr 2012
(https://lists.gnu.org/archive/html/bug-wget/2012-05/msg00021.html):

> Here is a new version of the regular expressions patch.
> The new version combines POSIX (always, from gnulib)
> and PCRE (if available).
>
> The patch adds these options:
>
>  --accept-regex="..."
>  --reject-regex="..."
>
>  --regex-type=posix   for POSIX extended regexes (the default)
>  --regex-type=pcre    for PCRE regexes (if PCRE is available)


Please verify, by looking at the source code and by running wget, that wget 
currently uses ERE (as opposed to BRE) and that it is the default.  If so, 
please add the sentence "posix is the default, and refers to 
POSIX Extended Regular Expression (ERE)." to the manpage and the infopage.  
The option description would then become:

  --regex-type regextype
      Specify the regular expression type.
      Possible types are posix or pcre.
      posix is the default, and refers to
      POSIX Extended Regular Expression (ERE).
      Note that to be able to use pcre type
      wget has to be compiled with libpcre support.


To test whether the regex of wget is ERE, you need to know the following.

When they are special,  ?  +  |    ( )   { }   have the following meanings.

  ?    zero or one of the preceding element
  +    one or more of the preceding element
  |    alternation
  ( )  grouping

  {n}    the preceding element occurs exactly n times
  {n,}   the preceding element occurs at least n times

  {n,m}  the preceding element occurs at least n times
                                  but at most m times
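
The interval forms can be tried with `grep -E`.  The following is only a 
minimal sketch: the sample lines and variable names are made up for 
illustration, and it assumes a grep that implements POSIX ERE, such as GNU 
grep.  Expected output is shown as indented '#' comments.

```shell
# Hypothetical sample: 'a', then zero to four 'b', then 'c'.
sample='ac
abc
abbc
abbbc
abbbbc'

exactly=$(echo "$sample" | grep -E 'ab{2}c')     # exactly two b
at_least=$(echo "$sample" | grep -E 'ab{2,}c')   # two or more b
between=$(echo "$sample" | grep -E 'ab{1,3}c')   # one to three b

echo "$exactly"
  # abbc

echo "$at_least"
  # abbc
  # abbbc
  # abbbbc

echo "$between"
  # abc
  # abbc
  # abbbc
```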



Before actually running `wget` to see whether the posix regex of wget is ERE, 
let us get familiar with the behavior of ERE by running `grep`.  The -E option 
of GNU grep enables POSIX Extended Regular Expression (ERE).  Without -E, the 
regex of GNU grep is basic but deviates slightly from POSIX BRE.

Here are the differences among the three:

  POSIX ERE
    ?  +  |    ( )   { }   have special meanings by themselves
    without being preceded by a backslash (\).
    To be literal, they need to be escaped.

  POSIX BRE
    ?  +  |   are literal by themselves and
    have no special meaning defined by POSIX.
    The escaped forms  \?  \+  \|  are left undefined by POSIX.

    ( )   { }   are literal by themselves,
    but have special meanings if and only if
    they are escaped as in   \(  \)    \{  \}


  GNU-grep basic (default for GNU grep)
    ?  +  |    ( )   { }   are literal by themselves,
    but have special meanings if and only if
    escaped as in
    \?  \+  \|     \(  \)   \{  \}

The remaining common special characters behave the same in POSIX ERE, POSIX 
BRE, and GNU-grep basic.  Let me mention two of them.

  *    zero or more of the preceding element
  .    matches any character except newline

The dot character '.' appears in a domain name such as "ftp.gnu.org" and before 
a file extension such as "report.pdf".  For '.' to literally mean a dot in 
regex, it has to be escaped like "ftp\.gnu\.org" and "report\.pdf".
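
The effect of leaving the dot unescaped can be seen with a small `grep -E` 
sketch.  The look-alike string "ftpXgnuYorg" and the variable names are made 
up for illustration.  Expected output is shown as indented '#' comments.

```shell
# Hypothetical sample: one real host name and one look-alike string.
hosts='ftp.gnu.org
ftpXgnuYorg'

# Unescaped dots match any character, so both lines match.
unescaped=$(echo "$hosts" | grep -E 'ftp.gnu.org')

# Escaped dots match only a literal dot.
escaped=$(echo "$hosts" | grep -E 'ftp\.gnu\.org')

echo "$unescaped"
  # ftp.gnu.org
  # ftpXgnuYorg

echo "$escaped"
  # ftp.gnu.org
```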

Note that, in the context of regular expressions, a special character means a 
character that has a special meaning to the regex engine.  This is not to be 
confused with a character that is special to bash.  Many characters special to 
regex are also special to bash (though their meanings to regex and to bash 
may differ).  Thus, when passing a regex string to `grep` or `wget` on the 
command line, characters that happen to be special to bash must be protected 
from bash.  This protection is usually done by enclosing the regex string in 
single quotes (''), which protect every character.  Double quotes ("") also 
work, except that they fail to protect the following four characters from bash 
(the last one only where history expansion is enabled, as in interactive 
shells):

    dollar ($), backslash (\), backtick (`), exclamation (!)
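
The difference between the two quoting styles can be sketched as follows.  
The strings and variable names are made up for illustration.  Inside double 
quotes bash turns \$ into a plain $, so the regex that reaches `grep` is not 
the one that was typed.  Expected output is shown as indented '#' comments.

```shell
single='price\$'   # single quotes: the backslash reaches grep
double="price\$"   # double quotes: bash removes the backslash

printf '%s\n' "$single"
  # price\$
printf '%s\n' "$double"
  # price$

# Hypothetical sample lines.
lines='price$
price'

with_single=$(echo "$lines" | grep "$single")   # \$ is a literal dollar
with_double=$(echo "$lines" | grep "$double")   # $ anchors the end of line

echo "$with_single"
  # price$
echo "$with_double"
  # price
```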



Now, let us run `grep` to get familiar with the behavior of ERE.  In the 
following, output lines are commented out by '#' to distinguish them from 
commands.

[code]

# ? question mark
quest='ac
abc
abbc
ab?c'

echo "$quest" | grep -E 'ab?c'
  # ac
  # abc

echo "$quest" | grep -E 'ab\?c'
  # ab?c

echo "$quest" | grep 'ab?c'
  # ab?c


# +
plus='ac
abc
abbc
ab+c'

echo "$plus" | grep -E 'ab+c'
  # abc
  # abbc

echo "$plus" | grep -E 'ab\+c'
  # ab+c

echo "$plus" | grep 'ab+c'
  # ab+c


# | vertical line
vert='ab
cd
ad
bc
b|c'

echo "$vert" | grep -E 'ab|cd'
  # ab
  # cd

echo "$vert" | grep -E 'ab\|cd'
  # none matched

echo "$vert" | grep 'ab|cd'
  # none matched


# () parentheses
paren='ad
abcd
abcbcd
bc
acbd
ebcf
a(bc)d
a(bcd'


echo "$paren" | grep -E 'a(bc)*d'
  # ad
  # abcd
  # abcbcd

echo "$paren" | grep -E 'a\(bc\)*d'
  # a(bc)d
  # a(bcd

echo "$paren" | grep 'a(bc)*d'
  # a(bc)d
  # a(bcd

echo "$paren" | grep 'a\(bc\)*d'
  # ad
  # abcd
  # abcbcd


# {} curly braces
brace='ac
abc
abbc
ab{0,1}c'

echo "$brace" | grep -E 'ab{0,1}c'  # same as 'ab?c'
  # ac
  # abc

echo "$brace" | grep -E 'ab\{0,1\}c'
  # ab{0,1}c

echo "$brace" | grep 'ab{0,1}c'
  # ab{0,1}c

echo "$brace" | grep 'ab\{0,1\}c'
  # ac
  # abc

[/code]
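
The tests above do not exercise the escaped forms \? \+ \| without -E.  The 
following minimal sketch assumes GNU grep; other grep implementations may 
treat these escapes differently.  The sample strings and variable names are 
made up for illustration.

```shell
# Hypothetical sample lines.
gnu='ac
abc
abbc
cd'

opt=$(echo "$gnu" | grep 'ab\?c')    # \? : zero or one b
rep=$(echo "$gnu" | grep 'ab\+c')    # \+ : one or more b
alt=$(echo "$gnu" | grep 'ab\|cd')   # \| : lines containing ab or cd

echo "$opt"
  # ac
  # abc

echo "$rep"
  # abc
  # abbc

echo "$alt"
  # abc
  # abbc
  # cd
```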



If I were an admin of a web site, I would test `wget` with the same strings and 
regexes as the above `grep` test.  I would create files whose names are the 
same as the sample strings of the above `grep` test.  Then, I would run `wget` 
with the regex for --accept-regex being fundamentally the same as the above 
`grep` test.

The following code creates such files in the directories whose names are the 
same as the five variables of the above `grep` test.

[code]

mkdir /quest /plus /vert /paren /brace


cd /quest
questAry=(ac abc abbc 'ab?c')
echo foo | tee "${questAry[@]}" > /dev/null


cd /plus
plusAry=(ac abc abbc 'ab+c')
echo foo | tee "${plusAry[@]}" > /dev/null


cd /vert
vertAry=(ab cd ad bc 'b|c')
echo foo | tee "${vertAry[@]}" > /dev/null


cd /paren
parenAry=(ad abcd abcbcd bc acbd ebcf 'a(bc)d' 'a(bcd')
echo foo | tee "${parenAry[@]}" > /dev/null


cd /brace
braceAry=(ac abc abbc 'ab{0,1}c')
echo foo | tee "${braceAry[@]}" > /dev/null

[/code]




Suppose, unrealistically, that the above directories and files had been created 
on "https://www.gnu.org", so that "https://www.gnu.org/quest/abc", 
"https://www.gnu.org/plus/abc" and so on existed.

The following code would test `wget` to see how `wget` would handle '+', which 
ERE and BRE handle differently.

[code]

links='<a href="https://www.gnu.org/plus/ac">
<a href="https://www.gnu.org/plus/abc">
<a href="https://www.gnu.org/plus/abbc">
<a href="https://www.gnu.org/plus/ab+c">'

re=".*/ab+c"

wget -rl1 --accept-regex "$re" -Fi <(echo "$links") -w1

[/code]

I thought that `wget` would request only the files whose names match the regex 
and would skip the files whose names do not match.  Hence, even though 
www.gnu.org actually lacks such directories and files as "plus/abc", 
"plus/abbc" and so on, I thought that, by looking at which files `wget` 
requested, I would be able to see whether `wget` works identically to the ERE 
of `grep`.  Unfortunately, however, `wget` requests all the files regardless of 
whether they match the regex.  `wget` may be asking the web server whether each 
file exists before applying the regex filter, which would be inefficient.

Because I am not an admin of any web site, and because `wget` requests all the 
files regardless of whether they match the regex, I gave up testing whether the 
posix regex of `wget` works identically to the ERE of `grep`.
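
Although the `wget` test was inconclusive, the outcome that each 
interpretation would predict can be sketched offline by filtering the same 
URLs with `grep`.  This only illustrates ERE versus basic matching of the 
regex; it does not exercise `wget` itself, and the variable names are made up 
for illustration.

```shell
# The same hypothetical URLs as in the wget test.
urls='https://www.gnu.org/plus/ac
https://www.gnu.org/plus/abc
https://www.gnu.org/plus/abbc
https://www.gnu.org/plus/ab+c'

re='.*/ab+c'

as_ere=$(echo "$urls" | grep -E "$re")   # + means one or more b
as_bre=$(echo "$urls" | grep "$re")      # + is a literal plus sign

echo "$as_ere"
  # https://www.gnu.org/plus/abc
  # https://www.gnu.org/plus/abbc

echo "$as_bre"
  # https://www.gnu.org/plus/ab+c
```

If the posix type of `wget` is ERE, a working --accept-regex filter would be 
expected to accept the first pair of URLs and not the last one.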

Anyway, I would like the maintainer of `wget` to verify, by looking at the 
source code and by running wget, that wget currently uses ERE (as opposed to 
BRE) and that it is the default.  If so, please add the sentence 
"posix is the default, and refers to POSIX Extended Regular Expression (ERE)." 
to the manpage and the infopage.

--- Rabvit
