I have the following points to make regarding GNU Wget 1.10.2.
Please consider them, and send me your feedback.
I may be willing to assist in the correction/implementation of some of these...

Point 1 - Pointless/Unwanted Downloading/Recursion

It would be pointless to download ALL HTML content under all circumstances,
especially when recursing.
Examples:
  
   Where all or most pages (HTML content) contain PHP links, like Apache file
listings (i.e. "?N=D", "?M=A", etc.) or "/" etc.
   Known duplicates: i.e. we know that on this particular server, all index.html
files are also represented as index.htm (so we could suppress either).
   Known HTML files or known HTML file patterns that we wish to suppress.
As such, since the -A and -R options do not seem to correctly facilitate such
exclusion/suppression, perhaps a separate option could be introduced.
(Examples: "--rreject", "--pre-reject", "--pre-filter")
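
To make the idea concrete, here is a rough Python sketch (not Wget code) of
the kind of pre-download filter I have in mind. The function name and the
pattern list are my own assumptions for illustration only; the point is that
the URL is tested before anything is fetched.

```python
from fnmatch import fnmatchcase

# Hypothetical reject patterns for the examples above.
# "[?]" matches a literal question mark in fnmatch-style patterns.
REJECT_PATTERNS = ["*[?]N=D", "*[?]M=A", "*/index.htm"]


def should_fetch(url, reject_patterns=REJECT_PATTERNS):
    """Return False if the URL matches any reject pattern.

    The check runs on the URL alone, before any download takes place.
    """
    return not any(fnmatchcase(url, pattern) for pattern in reject_patterns)


if __name__ == "__main__":
    for url in (
        "http://example.com/pub/?N=D",
        "http://example.com/pub/index.htm",
        "http://example.com/pub/index.html",
    ):
        print(url, "fetch" if should_fetch(url) else "skip")
```

With a filter like this, the sort links of a file listing and the duplicate
index.htm would be skipped, while index.html would still be downloaded and
parsed for links.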

Point 2 - Resetting Switches

WGET doesn't seem to offer a means for suppressing (resetting) switches at the
command line. i.e. if "recursive = on" in the config-file, it can't be
reset/overridden through the command line... "-r-" is rejected and "-r -" is
ignored. -r=off apparently wouldn't work either, and no similar mechanism is
documented.
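
As an illustration of the behaviour I would expect, here is a small Python
sketch using argparse (Python 3.9 or later). It is not how Wget parses its
options; the "--no-recursive" spelling and the config dictionary are
assumptions made up for this example. The config file supplies the default,
and the command line can switch it either way.

```python
import argparse


def parse_options(argv, config):
    """Parse argv, using values from the config file as defaults."""
    parser = argparse.ArgumentParser(prog="sketch")
    # BooleanOptionalAction creates both --recursive and --no-recursive.
    parser.add_argument(
        "--recursive",
        action=argparse.BooleanOptionalAction,
        default=config.get("recursive", False),
    )
    return parser.parse_args(argv)


if __name__ == "__main__":
    config = {"recursive": True}  # stands in for "recursive = on"
    print(parse_options([], config).recursive)
    print(parse_options(["--no-recursive"], config).recursive)
```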

Point 3 - Cookie Handling

Cookie handling could be improved.

Suggestions:
  
   Split the --no-cookies switch as follows:
      --cookies-use       - Cookies can be loaded from a file. Cookies will
be used in requests.

      --cookies-accept    - Cookies are accepted from sites and stored. (Does
NOT imply that cookies will be sent with requests! Could be used along with
--save-cookies to track/log cookies without actually using them.)
  
   Time stamping cookies might be useful when saving to a file.  
   Naming of options/switches can be improved: See 'Command Line'
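
The split between using and accepting cookies could look roughly like the
following Python sketch. The class and its field names are hypothetical and
only meant to show that the two switches are independent of each other.

```python
from dataclasses import dataclass, field


@dataclass
class CookiePolicy:
    """Two independent switches: send stored cookies, and store new ones."""

    use: bool = True      # send stored cookies with requests
    accept: bool = True   # store cookies received from servers
    jar: dict = field(default_factory=dict)

    def on_response(self, host, cookies):
        """Record cookies set by a server, but only if accepting is enabled."""
        if self.accept:
            self.jar.setdefault(host, {}).update(cookies)

    def for_request(self, host):
        """Return the cookies to send to a host (none if use is disabled)."""
        return dict(self.jar.get(host, {})) if self.use else {}


if __name__ == "__main__":
    # Track cookies without ever sending them back.
    policy = CookiePolicy(use=False, accept=True)
    policy.on_response("example.com", {"sid": "abc"})
    print(policy.jar)
    print(policy.for_request("example.com"))
```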
  

Point 3 - Command Line

The WGET command line is not sufficiently intuitive.

The following naming scheme is suggested: "--class-tag" where 'class' is the
class of switch/option/param (i.e. cookies, local, remote, etc.), and 'tag' is
the actual name/tag for the particular switch/param/opt.

Suggestions for classes:
  
   cookies - Cookie handling  
   local - Class of switches/params pertaining to representation at the local
end (i.e. "--local-directories-force" rather than "--force-directories", and
"--remote-dirs-undercut" rather than "--cut-dirs").

   remote - Class of switches/params pertaining to representation at the remote
end (server)

Additionally, some (different) switches/options use different terms to represent
the same thing, i.e. 'dirs' and "directories". These could be made consistent.

Please note that I'm not suggesting that existing options are removed. They may
be retained for backward compatibility.

Some of the reasons are mentioned under different point titles. The command-line
suggestions under Cookie Handling are also relevant to some other areas.

Point 4 - Filtering

There appear to be some inconsistencies in the implementations of the following
options when recursing:
  
   -A, -R  
   -I, -X  
   --exclude-domains
  
(NOTE: html_extension = on)

WGET doesn't seem to parse all HTML content for recursion under some
circumstances where some of the content is PHP generated HTML content (URLs
of the form "any.php?a=1&b=2", which get saved to files as
"[EMAIL PROTECTED]&b=2.html").
Links in such files aren't followed. However, when no filter is used, it seems
to parse all the pages for links. This made me wonder whether WGET deletes the
files before parsing, by accident...

Also, under some circumstances, "--exclude-domains" fails.
Example: "-D domain.my --exclude-domains www.domain.my" isn't handled correctly.
Both domains get processed.
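
For clarity, this is the outcome I would expect, written as a short Python
sketch. It is my assumption of the intended semantics, not Wget's actual
matching code, and the function names are made up: an excluded domain should
win over an accepted one.

```python
def _matches(host, domain):
    """True if host equals the domain or is a subdomain of it."""
    host, domain = host.lower(), domain.lower()
    return host == domain or host.endswith("." + domain)


def host_allowed(host, accept_domains, exclude_domains):
    """Apply the exclude list first, then the accept list (if any)."""
    if any(_matches(host, d) for d in exclude_domains):
        return False
    return not accept_domains or any(_matches(host, d) for d in accept_domains)


if __name__ == "__main__":
    accept, exclude = ["domain.my"], ["www.domain.my"]
    for host in ("www.domain.my", "ftp.domain.my", "other.org"):
        print(host, host_allowed(host, accept, exclude))
```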

Point 5 - Pattern Matching

There seems to be some inconsistency in pattern matching, regarding the use of
the symbol '?'. The manual says "Look up the manual of your shell for a
description of how pattern matching works.". Now I understand that perhaps this
shouldn't be taken so literally, but I would suggest the following:
  
   Insert a clarification in the manual regarding this
non-standard/inconsistent/non-intuitive behaviour.
   In addition, consider extending pattern-matching to include a
single-character wildcard, perhaps using an escape character. i.e. "\?" could
be used to specify the wildcard, while normal '?' would be taken as a literal,
and part of the URL.
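
The suggested escape rule can be sketched in Python as below. This is only an
illustration of the proposal, with made-up function names, and it translates
the pattern into a regular expression: "\?" becomes a single-character
wildcard, "*" matches any run of characters, and a plain "?" is literal.

```python
import re


def to_regex(pattern):
    """Translate a pattern using the proposed rule into a compiled regex."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("\\?", i):
            parts.append(".")      # escaped "?" is the one-character wildcard
            i += 2
        elif pattern[i] == "*":
            parts.append(".*")     # "*" keeps its usual meaning
            i += 1
        else:
            parts.append(re.escape(pattern[i]))  # plain "?" stays literal
            i += 1
    return re.compile("".join(parts))


def url_matches(pattern, url):
    """True if the whole URL matches the pattern."""
    return to_regex(pattern).fullmatch(url) is not None


if __name__ == "__main__":
    print(url_matches(r"any.php?a=\?", "any.php?a=1"))
    print(url_matches(r"any.php?a=\?", "any.phpXa=1"))
```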


---------------------------------
Please let me know if you find any of these relevant, and if possible give me
some additional info/clarification on them, so that I can follow up and
possibly get involved in the improvements when I can find the time... If any
clarification is required, I may be reached at this email address.

I would also like to know if there's an official forum for WGET development...

~ Thejaka.

 