Dear Wget maintainers,

I'd like to propose two patches that I've recently
implemented in my local copy of Wget.
Below is a brief description of the changes and their rationale.
I hope they might benefit other people too, so if you agree,
please let me know and I'll submit a "lege artis" patch
with source diffs and extensive descriptions.

1) Limiting the number of files recursively downloaded 
   from a given URL (in the "-r" mode).

   Currently, it's only possible to limit file types and
   the overall download quota, but there is no control over
   the number of files. The rationale is to allow downloading
   only a limited number of files that sufficiently
   represent the site for various purposes (I myself conduct
   research in text analysis/information retrieval, so it's useful
   to have a representative set of a site's pages).

   To this end, I suggest adding a command-line/config-file parameter
   that controls the behaviour of the function 'retrieve_tree',
   making it return once the desired number of files has been downloaded.
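
   A minimal sketch of the idea, not the actual patch: the loop below
   stands in for the download loop and stops once a file-count limit is
   reached. The names 'crawl_opts', 'max_files', 'crawl_limited' and
   'fetch_always_ok' are hypothetical and do not exist in Wget; a limit
   of 0 is assumed to mean "unlimited".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical option holder; 0 means "no limit".
   This is NOT an existing Wget setting. */
struct crawl_opts {
  size_t max_files;
};

/* Stand-in for fetching one URL: returns 1 on success, 0 on failure. */
typedef int (*fetch_fn) (const char *url);

/* Walk the URL queue in order and stop as soon as max_files
   downloads have succeeded.  Returns the number of files downloaded. */
size_t
crawl_limited (const char *const *queue, size_t n,
               const struct crawl_opts *opts, fetch_fn fetch)
{
  size_t downloaded = 0;
  size_t i;

  assert (opts != NULL && fetch != NULL);

  for (i = 0; i < n; i++)
    {
      /* Check the limit before each fetch, so we never exceed it. */
      if (opts->max_files != 0 && downloaded >= opts->max_files)
        break;
      if (fetch (queue[i]))
        downloaded++;   /* only successful downloads count */
    }
  return downloaded;
}

/* Trivial fetcher for demonstration: every URL "succeeds". */
int
fetch_always_ok (const char *url)
{
  (void) url;
  return 1;
}
```

   Counting only successful downloads is a design choice in this
   sketch; counting attempts instead would bound the number of
   requests rather than the number of files saved.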

2) Optionally deleting downloaded files that are not of type "text/html".

   Even though there is a mechanism for limiting downloaded file types,
   it's hard to configure Wget to download only files of type "text/html".
   Setting the Accept list to "htm,html" won't help, for exactly the same
   reason that option "-E" exists: not all "text/html" files
   have these extensions (e.g., "asp" files or files dynamically created by
   CGI scripts, to name but a few).
   Again, for various site analyses it is convenient to download only files
   of type "text/html". So if it's impossible to selectively download only
   such files, we can download all files and then delete those of
   other types.
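
   A minimal sketch of the post-download check, assuming the
   Content-Type header value of each response is available as a string.
   The helpers 'is_text_html' and 'delete_unless_html' are hypothetical
   names, not Wget functions; only the standard C library is used.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

/* Return 1 if the Content-Type value names text/html, ignoring case
   and any parameters such as "; charset=utf-8"; otherwise return 0. */
int
is_text_html (const char *content_type)
{
  static const char want[] = "text/html";
  size_t i;
  char c;

  if (content_type == NULL)
    return 0;

  /* Skip leading blanks in the header value. */
  while (*content_type == ' ' || *content_type == '\t')
    content_type++;

  /* Case-insensitive comparison; a shorter input fails on its
     terminating NUL before we can read past the end. */
  for (i = 0; want[i] != '\0'; i++)
    if (tolower ((unsigned char) content_type[i]) != want[i])
      return 0;

  /* The media type must end here, so "text/htmlx" is rejected. */
  c = content_type[i];
  return c == '\0' || c == ';' || c == ' ' || c == '\t';
}

/* Delete the file at PATH unless its type is text/html.
   Returns 0 if the file was kept, 1 if it was removed,
   and -1 if remove() failed. */
int
delete_unless_html (const char *path, const char *content_type)
{
  assert (path != NULL);

  if (is_text_html (content_type))
    return 0;
  return remove (path) == 0 ? 1 : -1;
}
```

   In this sketch a missing Content-Type is treated as "not HTML", so
   such files would be deleted; keeping them instead may be the safer
   default.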

Regards,

Evgeniy.

--
Evgeniy Gabrilovich
Ph.D. student in Computer Science
Department of Computer Science, Technion - Israel Institute of Technology
Technion City, Haifa 32000, Israel
E-mail: [EMAIL PROTECTED] WWW: http://www.cs.technion.ac.il/~gabr
Phone: (office) +972-4-8294948
