FTP bugs

2006-12-29 Thread Thejaka Maldeniya
WGET 1.10.2 - Under Windows
~~~

1) Resuming

When a download fails with:

 04:29:34 (17.41 KB/s) - Data connection: Connection reset by peer;
 Control connection closed.
 Retrying.

WGET does the following:
 ==> SYST ... done.    ==> PWD ... done.
 ==> TYPE I ... done.  ==> CWD not required.
 ==> SIZE ...
 ==> PASV ...
 ==> RETR ...
 No such file

Using -d shows that PWD returns '/', whereas the file is in a
subdirectory. So CWD is indeed required. This only occurs when the
control connection is closed and reopened. (Well, guess that's obvious.)
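
A minimal way to reproduce this (the host and path are placeholders; any
FTP file one directory deep should do):

 wget -c -d 'ftp://ftp.example.com/pub/subdir/file.bin'

If the data connection is reset mid-transfer, the -d output shows the
reopened control connection issuing PWD (which returns '/') and then
RETR with no intervening CWD, hence the "No such file" error.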

2) When using -c with remove_listing = off, WGET seems to append to the
existing .listing file rather than replacing it. It would seem better to
automatically rename the old listing before a new one is dumped.
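
A minimal setup that shows this (the FTP URL is a placeholder):

 # in .wgetrc:
 remove_listing = off
 # then run, interrupt, and re-run:
 wget -c -r 'ftp://ftp.example.com/pub/'
 # the .listing file reportedly grows across runs instead of being
 # replaced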

~~

There are other bugs I've come across, not to mention potential
improvements that suggest themselves to me! I've posted some suggestions
a couple of times before, but nobody seemed interested. I would try to
fix WGET myself, but I don't have much experience in cross-platform
development, and I'm unused to svn. So, if nobody seems interested, I'll
just keep my comments to myself in future!




Bugs, Inconsistencies, and Suggestions

2006-12-08 Thread Thejaka Maldeniya
(I have already sent a variation of this to [EMAIL PROTECTED])

-

I have the following points to make regarding GNU Wget 1.10.2 (Windows
binaries). Please consider them, and send me your feedback.
I may be willing to assist in the correction/implementation of some of
these...

Point 1 - Pointless/Unwanted Downloading/Recursion

It would be pointless to download ALL HTML content under all
circumstances, esp. when recursing.
Examples:

   Where all or most pages (HTML content) contain PHP links, like
   Apache file listings (i.e. ?N=D, ?M=A, etc.).
   Known duplicates: i.e. we know that on this particular server, all
   index.html files are also represented as index.htm (so we could
   suppress either).
   Known HTML files or known HTML file patterns that we wish to
   suppress.

As such, since the -A and -R options do not seem to correctly facilitate
such exclusion/suppression, perhaps a separate option could be
introduced, as sketched below.
(Examples: --rreject, --pre-reject, --pre-filter, --raccept)
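
A usage sketch of the proposed option (the option name and its pattern
syntax are only suggestions, not existing wget flags; the URL is a
placeholder):

 wget -r --pre-reject='*N=*,*M=*' 'http://server.example/files/'
 # Links matching the patterns would be ignored outright, i.e. never
 # requested, rather than downloaded and then deleted as with -R.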

Point 2 - Resetting Switches

WGET doesn't seem to offer a means for suppressing (resetting) switches
at the command line. I.e. if recursive = on in the config file, it can't
be reset/overridden through the command line... -r- is rejected and
-r - is ignored. -r=off apparently wouldn't work either, and no similar
mechanism is documented.
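
For what it's worth, wget's existing -e (--execute) option runs a
.wgetrc-style command from the command line, and since -e commands are
applied after the config file is read, it may serve as a workaround
(the URL is a placeholder):

 wget -e recursive=off 'http://example.com/page.html'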

Point 3 - Cookie Handling

Cookie handling could be improved.

Suggestions:

   Split the --no-cookies switch as follows (a usage sketch follows
   this list):

  --cookies-use    - Cookies can be loaded from a file. Cookies
  will be used in requests.

  --cookies-accept - Cookies are accepted from sites and stored.
  (Does NOT imply that cookies will be sent with requests! Could be
  used along with --save-cookies to track/log cookies without
  actually using them.)

   Time-stamping cookies might be useful when saving to a file.
   Naming of options/switches can be improved: see 'Command Line'.
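
A sketch of the proposed split (--cookies-accept is a proposed option,
not an existing wget flag; the URL is a placeholder):

 wget --cookies-accept --save-cookies cookies.log 'http://example.com/'
 # Cookies set by the site would be recorded in cookies.log but never
 # sent back in subsequent requests.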
  

Point 4 - Command Line

The WGET command line is not sufficiently intuitive.

The following naming scheme is suggested: --class-tag, where 'class' is
the class of switch/option/param (i.e. cookies, local, remote, etc.),
and 'tag' is the actual name/tag for the particular switch/param/opt.

Suggestions for classes:

   cookies - Cookie handling.
   local - Switches/params pertaining to representation at the local
   end (i.e. --local-directories-force rather than --force-directories).
   remote - Switches/params pertaining to representation at the remote
   end (server) (i.e. --remote-dirs-undercut rather than --cut-dirs).

Additionally, some switches/options use different terms to represent
the same thing, i.e. 'dirs' vs. 'directories'. These could be made
consistent.

Please note that I'm not suggesting that existing options be removed.
They may be retained for backward compatibility.

Some of the reasons are mentioned under different point titles. The
command-line suggestions under Cookie Handling are also relevant to
some other areas.
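
An illustration of the proposed scheme against the existing options
(the --local-*/--remote-* names are proposals, not real wget flags):

 # existing:
 wget -r -x --cut-dirs=2 'http://example.com/a/b/c/'
 # proposed equivalent:
 wget -r --local-directories-force --remote-dirs-undercut=2 \
      'http://example.com/a/b/c/'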

Point 5 - Filtering

There appear to be some inconsistencies in the implementations of the
following options when recursing:

   -A, -R
   -I, -X
   --exclude-domains

(NOTE: html_extension = on)

WGET doesn't seem to parse all HTML content for recursion under some
circumstances where some of the content is PHP-generated HTML content
(URLs of the form any.php?a=1&b=2, which get saved to files as
any.php@a=1&b=2.html). Links in such files aren't followed. However,
when no filter is used, it seems to parse all the pages for links. Made
me wonder if WGET deletes the files before parsing, by accident...
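
A reproduction sketch of the reported setup (the server and URL are
placeholders; html_extension = on is assumed in .wgetrc):

 wget -r -A '*.html' 'http://server.example/any.php?a=1&b=2'
 # The page is saved as any.php@a=1&b=2.html, but the links inside it
 # are reportedly not followed while the -A filter is active.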

Also, under some circumstances, --exclude-domains fails.
Example: -D domain.my --exclude-domains www.domain.my isn't handled
correctly. Both domains get processed.
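
A full command line for the reported case (domain.my comes from the
report; the starting URL is a placeholder):

 wget -r -D domain.my --exclude-domains www.domain.my 'http://domain.my/'
 # Expected: hosts under domain.my are followed, except www.domain.my.
 # Observed: both domain.my and www.domain.my get processed.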

Point 6 - Pattern Matching

There seems to be some inconsistency in pattern matching, regarding the
use of the symbol '?'. The manual says "Look up the manual of your
shell for a description of how pattern matching works." Now I
understand that perhaps this shouldn't be taken so literally, but I
would suggest the following (a sketch follows this list):

   Insert a clarification in the manual regarding this
   non-standard/inconsistent/non-intuitive behaviour.
   In addition, consider extending pattern matching to include a
   single-character wildcard, perhaps using an escape character. I.e.
   \? could be used to specify the wildcard, while a normal '?' would
   be taken as a literal, and part of the URL.
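
A sketch of the proposed escape (this syntax is a suggestion, not
implemented in wget; file names and URLs are placeholders):

 wget -r -A 'report-\?.html' 'http://example.com/'
 # '\?' would match any single character: report-1.html, report-a.html
 wget -r -A 'page.php?x=1*' 'http://example.com/'
 # a plain '?' would be taken literally, as part of the URL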


-
Please inform me if you find any of these relevant, and possibly give
me some additional info/clarifications on them, so I may follow up and
possibly get involved in the improvements when I can find the time...
If any clarifications are required, I may be reached at this e-mail
address.

I would also like to know if there's an official forum for WGET
development...

~ Thejaka.


 


Feature Request: Prefiltering (applicable to recursive gets)

2006-12-03 Thread Thejaka Maldeniya
It seems pointless to download ALL HTML content under some
circumstances... esp. if all or most pages contain PHP links, like
Apache file listings (i.e. ?N=D, ?M=A, etc.). Why not add an option
like --rreject, --pre-reject, or maybe --pre-filter, so we can specify
which types of links to completely ignore?


 



Feature Request: Pattern Matching - '?'

2006-12-03 Thread Thejaka Maldeniya
There seems to be some inconsistency in pattern matching, regarding the
use of the symbol '?'. The manual says "Look up the manual of your
shell for a description of how pattern matching works." Now I
understand that maybe this shouldn't be taken so literally, but I
would suggest the following:

1) Insert a clarification in the manual regarding this
non-standard/inconsistent behaviour.
2) In addition, consider extending pattern matching to include a
single-character wildcard, perhaps using an escape character. I.e. \?
could be used to specify the wildcard, while a normal '?' would be
taken as a literal, and part of the URL.


 
