FTP bugs
WGET 1.10.2, under Windows.

1) Resuming

When a download fails with:

  04:29:34 (17.41 KB/s) - Data connection: Connection reset by peer; Control connection closed. Retrying.

WGET does the following:

  ==> SYST ... done.    ==> PWD ... done.
  ==> TYPE I ... done.  ==> CWD not required.
  ==> SIZE ...          ==> PASV ...
  ==> RETR ...
  No such file ...

Running with -d shows that PWD returns '/', whereas the file is in a subdirectory, so a CWD is in fact required. This only occurs when the control connection is closed and reopened (which, in hindsight, is to be expected: the fresh connection starts back at the server root, so the old working directory is lost).

2) When using -c with remove_listing = off, WGET seems to append to the existing listing file rather than replacing it. It would seem better to rename the old listings automatically before a new listing is dumped.

There are other bugs I've come across, not to mention potential improvements that suggest themselves to me! I've posted some suggestions a couple of times before, but nobody seemed interested. I would try to fix WGET myself, but I don't have much experience in cross-platform development, and I'm not used to svn. So, if nobody is interested, I'll just keep my comments to myself in future!
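The failing sequence suggests a simple check that gets skipped after reconnecting. The sketch below (function and argument names are mine, not wget's) shows the decision the resume path should be making:

```python
from posixpath import dirname

def cwd_required(remote_path, pwd_after_reconnect):
    """After the control connection is re-opened, the server's working
    directory resets (PWD returns '/'), so a CWD is needed whenever the
    file lives anywhere other than the current directory."""
    return dirname(remote_path) not in ("", pwd_after_reconnect)

# On the fresh connection PWD is '/', but the file sits in a
# subdirectory -- so "CWD not required" is the wrong conclusion:
print(cwd_required("/pub/sub/file.bin", "/"))  # True: CWD is required
print(cwd_required("/file.bin", "/"))          # False: already at root
```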
Bugs, Inconsistencies, and Suggestions
(I have already sent a variation of this to [EMAIL PROTECTED].)

I have the following points to make regarding GNU Wget 1.10.2 (Windows binaries). Please consider them, and send me your feedback. I may be willing to assist in the correction/implementation of some of these...

Point 1 - Pointless/Unwanted Downloading/Recursion

It would be pointless to download ALL HTML content under all circumstances, especially when recursing. Examples:

- Where all or most pages (HTML content) contain PHP links, like Apache file listings (i.e. ?N=D, ?M=A, etc.).
- Known duplicates: i.e. we know that on this particular server, all index.html files are also represented as index.htm (so we could suppress either).
- Known HTML files, or known HTML file patterns, that we wish to suppress.

Since the -A and -R options do not seem to correctly facilitate such exclusion/suppression, perhaps a separate option could be introduced. (Examples: --rreject, --pre-reject, --pre-filter, --raccept)

Point 2 - Resetting Switches

WGET doesn't seem to offer a means of suppressing (resetting) switches at the command line. I.e. if recursive = on is set in the config file, it can't be reset/overridden through the command line: -r- is rejected and -r - is ignored; -r=off apparently wouldn't work either, and no similar mechanism is documented.

Point 3 - Cookie Handling

Cookie handling could be improved. Suggestions:

Split the --no-cookies switch as follows:

- --cookies-use - Cookies can be loaded from a file. Cookies will be used in requests.
- --cookies-accept - Cookies are accepted from sites and stored. (This does NOT imply that cookies will be sent with requests! It could be used along with --save-cookies to track/log cookies without actually using them.)

Time-stamping cookies might be useful when saving to a file.

The naming of options/switches could also be improved: see 'Command Line' below.

Point 4 - Command Line

The WGET command line is not sufficiently intuitive. The following naming scheme is suggested: --class-tag, where 'class' is the class of switch/option/param (i.e. cookies, local, remote, etc.), and 'tag' is the actual name/tag for the particular switch/param/option. Suggestions for classes:

- cookies - Cookie handling.
- local - Switches/params pertaining to representation at the local end (i.e. --local-directories-force rather than --force-directories, and --remote-dirs-undercut rather than --cut-dirs).
- remote - Switches/params pertaining to representation at the remote end (server).

Additionally, some switches/options use different terms to represent the same thing, i.e. 'dirs' vs. 'directories'. These could be made consistent.

Please note that I'm not suggesting that existing options be removed; they may be retained for backward compatibility. Some of the reasons are mentioned under other point titles. The command-line suggestions under Cookie Handling are also relevant to some other areas.

Point 5 - Filtering

There appear to be some inconsistencies in the implementations of the following options when recursing: -A, -R, -I, -X, --exclude-domains. (NOTE: html_extension = on)

WGET doesn't seem to parse all HTML content for recursion under some circumstances where some of the content is PHP-generated HTML (URLs of the form any.php?a=1&b=2, which get saved to files as [EMAIL PROTECTED]b=2.html). Links in such files aren't followed. However, when no filter is used, it seems to parse all the pages for links. This made me wonder if WGET deletes the files before parsing, by accident...

Also, under some circumstances, --exclude-domains fails. Example: -D domain.my --exclude-domains www.domain.my isn't handled correctly; both domains get processed.

Point 6 - Pattern Matching

There seems to be some inconsistency in pattern matching regarding the use of the symbol '?'. The manual says "Look up the manual of your shell for a description of how pattern matching works." Now I understand that perhaps this shouldn't be taken so literally, but I would suggest the following:

- Insert a clarification in the manual regarding this non-standard/inconsistent/non-intuitive behaviour.
- In addition, consider extending pattern matching to include a single-character wildcard, perhaps using an escape character: i.e. \? could be used to specify the wildcard, while a normal '?' would be taken as a literal, part of the URL.

Please inform me if you find any of these relevant, and possibly give me some additional info/clarifications on these, so I may follow up and possibly get involved in the improvements when I can find the time... If any clarifications are required, I may be reached at this address.

I would also like to know if there's an official forum for WGET development...

~ Thejaka.
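The reset mechanism Point 2 asks for is essentially "last setting wins": each boolean switch gets a negated form, and command-line arguments are applied after the config file. A minimal sketch of that idea (the option names and --no- prefix are hypothetical, not existing WGET 1.10.2 switches):

```python
# Sketch of the requested reset mechanism: every boolean switch gets a
# "--no-" counterpart, and the command line overrides the config file
# simply because it is applied last.
def resolve_options(rc_settings, argv):
    opts = dict(rc_settings)        # defaults taken from the config file
    for arg in argv:
        if arg.startswith("--no-"):
            opts[arg[5:]] = False   # e.g. --no-recursive resets recursive
        elif arg.startswith("--"):
            opts[arg[2:]] = True
    return opts

# recursive = on in the config file, reset at the command line:
print(resolve_options({"recursive": True}, ["--no-recursive"]))
```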
Feature Request: Prefiltering (applicable to recursive gets)
It seems pointless to download ALL HTML content under some circumstances... especially if all or most pages contain PHP links, like Apache file listings (i.e. ?N=D, ?M=A, etc.). Why not add an option like --rreject, --pre-reject, or maybe --pre-filter, so we can specify which types of links to completely ignore?
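The difference from -R is that the rejection would happen *before* the fetch, not by deleting the file afterwards. A sketch of what such a pre-filter might do (the option name, pattern, and function are illustrative only, not existing wget behaviour):

```python
import re

# Hypothetical --pre-filter: drop links matching a reject pattern before
# they are ever requested.  The pattern below targets Apache file-listing
# sort links such as ?N=D and ?M=A.
PRE_REJECT = re.compile(r"\?[NMSD]=[AD]$")

def should_fetch(url):
    return PRE_REJECT.search(url) is None

links = ["http://host/dir/?N=D",
         "http://host/dir/file.tar.gz",
         "http://host/dir/?M=A"]
print([u for u in links if should_fetch(u)])  # only file.tar.gz survives
```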
Feature Request: Pattern Matching - '?'
There seems to be some inconsistency in pattern matching regarding the use of the symbol '?'. The manual says "Look up the manual of your shell for a description of how pattern matching works." Now I understand that maybe this shouldn't be taken so literally, but I would suggest the following:

1) Insert a clarification in the manual regarding this non-standard/inconsistent behaviour.

2) In addition, consider extending pattern matching to include a single-character wildcard, perhaps using an escape character: i.e. \? could be used to specify the wildcard, while a normal '?' would be taken as a literal, part of the URL.
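The ambiguity is that shell-style matching treats '?' as a single-character wildcard, while in a URL '?' begins the query string. The escape scheme suggested above can be sketched by translating it onto Python's fnmatch rules (this is an illustration of the proposal, not wget behaviour):

```python
from fnmatch import fnmatch

def url_match(url, pattern):
    # Map the proposed syntax onto fnmatch: a bare '?' becomes the
    # literal character (the fnmatch set "[?]"), while the escaped
    # form \? becomes fnmatch's single-character wildcard.
    translated = pattern.replace("?", "[?]").replace("\\[?]", "?")
    return fnmatch(url, translated)

print(url_match("index.php?N=D", "index.php?N=D"))  # True: '?' literal
print(url_match("index.php?N=D", r"index.ph\?*"))   # True: \? wildcard
print(url_match("index.phpXN=D", "index.php?N=D"))  # False: no wildcard
```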