I have the following points to make regarding GNU Wget 1.10.2. Please consider them and send me your feedback. I may be willing to assist in the correction/implementation of some of these...

Point 1 - Pointless/Unwanted Downloading/Recursion

It would be pointless to download ALL HTML content under all circumstances, especially when recursing. Examples:

- Where all or most pages (HTML content) contain PHP-style links, as in Apache file listings (e.g. "?N=D", "?M=A", etc.) or "/" etc.
- Known duplicates: e.g. we know that on this particular server, all index.html files are also represented as index.htm (so we could suppress either).
- Known HTML files or known HTML file patterns that we wish to suppress.

Since the -A and -R options do not seem to facilitate such exclusion/suppression correctly, perhaps a separate option could be introduced. (Examples: "--rreject", "--pre-reject", "--pre-filter")

Point 2 - Resetting Switches

Wget doesn't seem to offer a means of suppressing (resetting) switches at the command line. E.g. if "recursive = on" is set in the config file, it can't be reset/overridden through the command line: "-r-" is rejected and "-r -" is ignored. "-r=off" apparently wouldn't work either, and no similar mechanism is documented.

Point 3 - Cookie Handling

Cookie handling could be improved. Suggestions:

- Split the --no-cookies switch as follows:
    --cookies-use - Cookies can be loaded from a file. Cookies will be used in requests.
    --cookies-accept - Cookies are accepted from sites and stored. (Does NOT imply that cookies will be sent with requests! Could be used along with --save-cookies to track/log cookies without actually using them.)
- Time-stamping cookies might be useful when saving to a file.
- Naming of options/switches can be improved: see 'Command Line'.

Point 4 - Command Line

The Wget command line is not sufficiently intuitive. The following naming scheme is suggested: "--class-tag", where 'class' is the class of switch/option/param (e.g. cookies, local, remote, etc.), and 'tag' is the actual name/tag for the particular switch/param/opt.

Suggestions for classes:

- cookies - Cookie handling
- local - Class of switches/params pertaining to representation at the local end (e.g. "--local-directories-force" rather than "--force-directories", and "--remote-dirs-undercut" rather than "--cut-dirs").
- remote - Class of switches/params pertaining to representation at the remote end (server)

Additionally, some switches/options use different terms for the same thing, e.g. 'dirs' and "directories". These could be made consistent.

Please note that I'm not suggesting that existing options be removed. They may be retained for backward compatibility.

Some of the reasons are mentioned under different point titles. The command-line suggestions under Cookie Handling are also relevant to some other areas.

Point 5 - Filtering

There appear to be some inconsistencies in the implementations of the following options when recursing:

- -A, -R
- -I, -X
- --exclude-domains

(NOTE: html_extension = on)

Wget doesn't seem to parse all HTML content for recursion under some circumstances where some of the content is PHP-generated HTML (URLs of the form "any.php?a=1&b=2", which get saved to files as "any.php@a=1&b=2.html"). Links in such files aren't followed. However, when no filter is used, it seems to parse all the pages for links. This made me wonder whether Wget deletes the files before parsing, by accident...

Also, under some circumstances, "--exclude-domains" fails. Example: "-D domain.my --exclude-domains www.domain.my" isn't handled correctly. Both domains get processed.

Point 6 - Pattern Matching

There seems to be some inconsistency in pattern matching regarding the use of the symbol '?'. The manual says "Look up the manual of your shell for a description of how pattern matching works." Now I understand that perhaps this shouldn't be taken so literally, but I would suggest the following:

- Insert a clarification in the manual regarding this non-standard/inconsistent/non-intuitive behaviour.
- In addition, consider extending pattern matching to include a single-character wildcard, perhaps using an escape character. E.g. "\?" could be used to specify the wildcard, while normal '?'
would be taken as a literal, and part of the URL.

---------------------------------

Please inform me if you find any of these relevant, and possibly give me some additional info/clarifications on them, so that I may follow up and possibly get involved in the improvements when I can find the time... If any clarifications are required, I may be reached at this address.

I would also like to know if there's an official forum for Wget development...

~ Thejaka.
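
P.S. To make the Pattern Matching suggestion a little more concrete, here is a minimal sketch in Python of how the proposed scheme might behave. It is only an illustration of the idea and does not describe how Wget matches patterns today; the function names (proposed_pattern_to_regex, matches) are made up for this example, and the treatment of '*' as "any run of characters" is my assumption.

```python
import re


def proposed_pattern_to_regex(pattern):
    # Hypothetical translation rules for the proposed scheme:
    #   '*'   matches any run of characters (assumed, as in shell globs)
    #   '\?'  matches exactly one character (the proposed escaped wildcard)
    #   '?'   is a literal question mark, as it appears in URLs
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern) and pattern[i + 1] == "?":
            parts.append(".")            # escaped '?' acts as the wildcard
            i += 2
        elif ch == "*":
            parts.append(".*")
            i += 1
        else:
            parts.append(re.escape(ch))  # everything else is literal, including '?'
            i += 1
    return re.compile("".join(parts), re.DOTALL)


def matches(pattern, text):
    # True only if the whole text matches the pattern.
    return proposed_pattern_to_regex(pattern).fullmatch(text) is not None
```

With something like this, a pattern such as "*?N=D" would reject the Apache listing links literally, while r"file\?.html" would still accept "file1.html" but not "file12.html".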