Hello.

I am using wget 1.9.1. According to the documentation on the --include option:

     `-I' option accepts a comma-separated list of directories included
     in the retrieval.  Any other directories will simply be ignored.
     The directories are absolute paths.

from (wget.info)Directory-Based Limits

Now, this is not a complete explanation of how the option works (for example, it does not state that wildcards are acceptable), but a reasonable expectation is that a path without wildcards will match only that exact path. That's not what happens.

I am attempting to mirror a portion of an FTP site so I use a command similar to this:

wget --mirror -I /pub/mirrors/who/product/release ftp://my.domain.com/pub/mirrors/who/product/

There are several unwanted directories under the specified URL, so I use the -I option to limit the retrieval. What I discovered is that unwanted directories can still be retrieved.

For example, say the directory I specified above has the following entries:

        archive
        mail
        release
        release-old
        release-ancient
        tmp

All that has to happen for the entry to pass the -I filter is that the filter match the beginning of the target. So, all I want is the release directory, but what I get is release, release-old, and release-ancient. In fact, I get the same result with a value of pub/mirrors/who/product/rel for the -I option. In other words, it need not actually match *any* path completely in order for the directory to be recursively downloaded.

To see why this was happening, I stepped through the code in a debugger.

(ftp.c L1535) ftp_retrieve_dirs() is called since this is a recursive get. For each directory in the list, it calls accdir() at line 1572 to see whether the current item is an ACCEPTED directory.

(utils.c L771) accdir() strips off any leading '/' since these are supposed to be all absolute paths. It then calls proclist(), passing opt.includes as the match list, and the candidate directory.

(utils.c L746) proclist() goes through each item in the list. If there are wildcards, it uses fnmatch(), else it uses frontcmp() to determine whether the target passes the filter. My entry does not have a wildcard, so it uses frontcmp().

(utils.c L737) frontcmp() scans the strings as long as it has not reached the end of either, and the current character in each is equal. If it reaches the end of the first string (which is the entry from opt.includes) it returns 1, else 0.
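The comparison described above can be reproduced with a short sketch. It is reconstructed from the description in this message, not copied from utils.c; the name frontcmp_sketch and the parameter names are illustrative only.

```c
/* Sketch of the comparison as described in this message (not the
   verbatim wget source).  Walk both strings while neither has ended
   and the current characters agree, then report a match if the
   include entry was consumed completely.  */
int
frontcmp_sketch (const char *entry, const char *dir)
{
  while (*entry && *dir && *entry == *dir)
    {
      ++entry;
      ++dir;
    }
  /* Nothing checks what follows in DIR, so "release" also accepts
     "release-old".  */
  return *entry == '\0';
}
```

With an entry of pub/mirrors/who/product/release, this returns 1 for the release, release-old and release-ancient directories alike, and 0 for archive, which is the behaviour I see.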

Now, I can understand that the intent was to allow files in deeper subdirectories to match the Include filter without needing to isolate the path elements further. For example:

with -I /pub/mirrors/who/product/release as before, all files in

    /pub/mirrors/who/product/release/foo/
    /pub/mirrors/who/product/release/bar/
    /pub/mirrors/who/product/release/baz   etc.

will be accepted because they all begin with the given -I value.

But I would suggest that, at least for non-wildcard matches, the prefix should 'match' only if it is a path prefix which breaks at a path element separator (including the end-of-string signifying an exact match). Better would be to include wildcard matches as well, but that might be harder since it needs an implicit anchor of '/' or end-of-string, which is not something the globbing RE engine can handle. I had a look at the latest Single UNIX Specification to see if I could find any words of wisdom there about the fnmatch() function's capabilities. The specification is a bit vague in parts, but it does say that, when the FNM_PATHNAME flag is set, a slash must not be matched by a '?' or '*' wildcard, or even by a bracket expression; it must be explicitly included in the search pattern.
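To make the suggestion concrete, here is a minimal sketch of what such a check could look like for the non-wildcard case. The function name path_prefix_match is hypothetical and does not exist in wget; the sketch also assumes that any trailing '/' has already been trimmed from the -I value.

```c
/* Hypothetical helper, not part of wget.  Returns 1 only when PREFIX
   matches DIR up to a path element boundary: either DIR ends where
   PREFIX ends (exact match) or the next character in DIR is '/'.
   Assumes PREFIX carries no trailing '/'.  */
int
path_prefix_match (const char *prefix, const char *dir)
{
  while (*prefix && *prefix == *dir)
    {
      ++prefix;
      ++dir;
    }
  if (*prefix != '\0')
    return 0;                   /* PREFIX was not fully consumed.  */
  return *dir == '\0' || *dir == '/';
}
```

With this check, a value of pub/mirrors/who/product/release would accept release and release/foo but reject release-old, and a value ending in rel would accept nothing in my example.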

I tried putting a trailing '/' on the -I value, but the action function of cmd_directory_vector() invoked on the value trims any trailing '/'. So, there does not seem to be a way to force the match of a path prefix which consists only of full path elements. Use of frontcmp() makes non-wildcard -I values behave the same as if there were a trailing '*' and there is no way to retain a trailing '/' in the pattern.
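The trailing-'*' comparison, and the reason a glob cannot express the anchor I want, can be illustrated with the POSIX fnmatch() call. This is only a sketch: glob_accepts is a hypothetical wrapper, and I have not checked whether wget 1.9.1 uses the system fnmatch() or a bundled copy, nor which flags it passes.

```c
#include <fnmatch.h>

/* Hypothetical wrapper around POSIX fnmatch(): returns 1 when DIR
   matches PATTERN under the given FLAGS, 0 otherwise.  */
int
glob_accepts (const char *pattern, const char *dir, int flags)
{
  return fnmatch (pattern, dir, flags) == 0;
}
```

With flags of 0, the pattern pub/release* accepts both pub/release-old and pub/release/foo, just as the frontcmp() prefix test does. With FNM_PATHNAME it still accepts pub/release-old but now rejects pub/release/foo, so neither setting gives "release and everything beneath it, but not release-old".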

The only possible way to (at least temporarily) achieve the effect I want is to enumerate all unwanted paths, where they have a prefix which matches any of the -I matches, as values to -X. This works only as long as nobody puts a new entry on the remote site which matches the -I values, but is not in the -X values, something which I cannot control. So, again, I say this is a bug.

I see that frontcmp() is also called by (recur.c)download_child_p which is an HTTP function, so any possible patch would probably need to just create a new function in utils.c solely for use in FTP directory matching. It's only a two line function and it's only used once in utils.c so the impact will be small.

I figured I would report this first to see whether the maintainers agree with my assessment before considering writing a patch.

Thanks for a wonderful program, and thanks for taking a look at this issue.

Cheers.

--- Bill Bresler


P.S. The bug-wget AT gnu DOT org address is forwarding messages to an invalid domain of sunsite.auc.dk. This is the message I receive:


Hi. This is the qmail-send program at a.mx.sunsite.dk.
I'm afraid I wasn't able to deliver your message to the following addresses.
This is a permanent error; I've given up. Sorry it didn't work out.

<[EMAIL PROTECTED]>:
Domain obsolete. Please try [EMAIL PROTECTED] instead. (#5.1.6)

So instead, I am posting directly to the list, but I am not a subscriber at this time. I would appreciate being cc'ed in any responses, but I will also check back through the web interface to the list in a day or so. Thanks.



