Re: Batch files in DOS
Hi,

[EMAIL PROTECTED] wrote:
> I'm trying to mirror about 100 servers (small fanfic sites) using
>   wget --recursive --level=inf -Dblah.com, blah.com,blah.com some_address
> However, when I run the batch file, it stops reading after a while;
> apparently my command has too many characters. Is there some other
> way I should be doing this, or a workaround?

You can put all the options for wget in a wgetrc file. Set an
environment variable called "WGETRC" which points to the full pathname
of your wgetrc file. For the available wgetrc commands, see
http://www.gnu.org/software/wget/manual/wget.html#Wgetrc-Commands

For your example the wgetrc file would read:

  recursive = 1
  reclevel = inf
  domains = blah.com,blah.com,blah.com

then start wget with:

  wget some_address

TT
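The WGETRC approach above can be sketched as a small script. The file
name `mirror.wgetrc` and the domain are placeholders for illustration,
not from the original post; wget itself is not invoked here, only the
setup it would read.

```shell
# Write the long option list into a wgetrc file instead of the command line.
cat > mirror.wgetrc <<'EOF'
recursive = 1
reclevel = inf
domains = blah.com
EOF

# Point wget at it; every subsequent wget call picks these options up,
# so the batch-file command line stays short, e.g.:  wget some_address
export WGETRC="$PWD/mirror.wgetrc"

grep -q 'reclevel = inf' "$WGETRC" && echo "wgetrc in place"
```

This sidesteps the command-line length limit entirely, since the domain
list lives in the file rather than in the batch file's command.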
Re: Windows Title Bar
Hi,

I'm using start [1]. That way I can specify the title, have it running
in the background and adjust priority and stuff. If you want to use it
in a batch file you can specify /wait.

[Derek already got this, forgot to cc the list]

TT

[1] Builtin command, usable from cmd.exe or batch files:

  START ["title"] [/Dpath] [/I] [/MIN] [/MAX] [/SEPARATE | /SHARED]
        [/LOW | /NORMAL | /HIGH | /REALTIME | /ABOVENORMAL | /BELOWNORMAL]
        [/WAIT] [/B] [command/program] [parameters]

Derek Parnell wrote:
> I'd like to be able to exactly specify the title that appears on the
> console title bar (Windows environment of course). Currently the
> application uses the URL that is being fetched, but I'd like to
> specify it myself. Is there a way to do this now or does this have to
> be an enhancement? Something like ...
>
>   wget --title="News Server #1" http://www.etc.com/latest_news.html
>
> So that "News Server #1" appears as the console title rather than the
> URL (or its possible redirect).
current wget crashes when using -c
Hi,

current trunk build crashes when trying to continue a download. Builds
from tags WGET_1_10, WGET_1_10_1 and WGET_1_10_2 run correctly.

Build environment:
  Windows XP
  Visual Studio 2005
  Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 14.00.50727.42 for 80x86
  OpenSSL 0.9.8a

The following output is produced:

---8<---8<---8<---8<---8<---8<---8<---8<---8<---8<---
F:\temp\>wget.exe --debug -vc http://download.microsoft.com/download/2/4/3/243865fc-c896-497e-9a66-bcc3f596741e/directx_feb2006_redist.exe
Setting --verbose (verbose) to 1
Setting --continue (continue) to 1
DEBUG output created by Wget 1.10+devel on Windows-MSVC.

--15:24:15--  http://download.microsoft.com/download/2/4/3/243865fc-c896-497e-9a66-bcc3f596741e/directx_feb2006_redist.exe

F:\temp\>wget.exe --version
GNU Wget 1.10+devel
---8<---8<---8<---8<---8<---8<---8<---8<---8<---8<---

Regards
Tobias
Re: Download all the necessary files and linked images
Hi,

Jean-Marc MOLINA schrieb:
> I have another opinion about that limitation. Could it be considered
> as a bug? From the "Types of Files" section of the manual we can read:
> « Note that these two options do not affect the downloading of html
> files; Wget must load all the htmls to know where to go at all --
> recursive retrieval would make no sense otherwise. ». It means the
> accept and reject options don't work on HTML files. But I think they
> should because, especially in this case, you deliberately have to
> exclude them. Excluding them makes sense. So I don't really know what
> to do... Consider the problem as a bug, as a new feature to implement
> or as an existing feature that should be redesigned. It's pretty
> tricky.

I just set up my compile environment for wget again. When I did regex
support, I had the same problem with exclusion, so I introduced a new
parameter "--follow-excluded-html" (which is of course the default),
but you can turn it off with --no-follow-excluded-html... See the
attached patch for current trunk.
TT

Index: trunk/src/init.c
===================================================================
--- trunk/src/init.c	(revision 2133)
+++ trunk/src/init.c	(working copy)
@@ -146,6 +146,7 @@
 #endif
   { "excludedirectories", &opt.excludes, cmd_directory_vector },
   { "excludedomains", &opt.exclude_domains, cmd_vector },
+  { "followexcluded", &opt.followexcluded, cmd_boolean },
   { "followftp", &opt.follow_ftp, cmd_boolean },
   { "followtags", &opt.follow_tags, cmd_vector },
   { "forcehtml", &opt.force_html, cmd_boolean },
@@ -277,6 +278,7 @@
   opt.cookies = true;
   opt.verbose = -1;
+  opt.followexcluded = 1;
   opt.ntry = 20;
   opt.reclevel = 5;
   opt.add_hostdir = true;

Index: trunk/src/main.c
===================================================================
--- trunk/src/main.c	(revision 2133)
+++ trunk/src/main.c	(working copy)
@@ -158,6 +158,7 @@
     { "exclude-directories", 'X', OPT_VALUE, "excludedirectories", -1 },
     { "exclude-domains", 0, OPT_VALUE, "excludedomains", -1 },
     { "execute", 'e', OPT__EXECUTE, NULL, required_argument },
+    { "follow-excluded-html", 0, OPT_BOOLEAN, "followexcluded", -1 },
     { "follow-ftp", 0, OPT_BOOLEAN, "followftp", -1 },
     { "follow-tags", 0, OPT_VALUE, "followtags", -1 },
     { "force-directories", 'x', OPT_BOOLEAN, "dirstruct", -1 },
@@ -611,6 +612,9 @@
   -X,  --exclude-directories=LIST  list of excluded directories.\n"),
    N_("\
   -np, --no-parent                 don't ascend to the parent directory.\n"),
+   N_("\
+        --follow-excluded-html     turns on downloading of excluded files for\n\
+                                   inspection (this is the default).\n"),
    "\n",
    N_("Mail bug reports and suggestions to <[EMAIL PROTECTED]>.\n")

Index: trunk/src/recur.c
===================================================================
--- trunk/src/recur.c	(revision 2133)
+++ trunk/src/recur.c	(working copy)
@@ -511,13 +511,14 @@
 	  && !(has_html_suffix_p (u->file)
 	       /* The exception only applies to non-leaf HTMLs (but -p
 		  always implies non-leaf because we can overstep the
-		  maximum depth to get the requisites): */
-	       && (/* non-leaf */
+		  maximum depth to get the requisites):
+		  No exception if the user specified no-follow-excluded. */
+	       && (opt.followexcluded
+		   && (/* non-leaf */
 		      opt.reclevel == INFINITE_RECURSION
 		      /* also non-leaf */
 		      || depth < opt.reclevel - 1
 		      /* -p, which implies non-leaf (see above) */
-		      || opt.page_requisites)))
+		      || opt.page_requisites))))
 	{
 	  if (!acceptable (u->file))
 	    {
Re: Get complete page?
Hi Juman,

first execute this command:

  wget --help

It is of utmost importance to read carefully the output of that
command! Then you might try:

  wget --page-requisites --convert-links --span-hosts --html-extension
       --no-directories --execute=robots=off [URL]

or

  wget -pkHEnd --execute=robots=off [URL]

TT

juman schrieb:
> When using Mozilla or IE you can right-click on a page and choose
> "Save Page As..." and then select to save the complete page, which
> creates a html file and a folder containing all the pictures for the
> page. The links in the page for the pictures are also rewritten to
> create a complete localized version of the page... Is there some smart
> way to do the same with wget?
>
> /juman
Re: How do I prevent wget from creating index.html?C=M;O=A ?
Evert Meulie schrieb:
> Hi!
> Thanks for the reply. Since I have no control over the server from
> which I'm pulling the mirror AND I do not want to live with these
> files ( 8-) ), I was wondering whether there's a way to exclude
> certain file names, so that I can exclude the index.html?* wildcard...?

AFAIK there's no way (with official releases) to do this. I have a
regex patch for 1.9.1 lying around on my system, but it's not included
in current wget releases (because it used pcre instead of gnu regex/c
library regex). Last thing I heard, regex support is planned for 1.11.

(If you mirror this site often, why not use a script and delete them
afterwards?)

Regards
TT
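The delete-afterwards script suggested above can be sketched like this.
The directory layout and file names below are fabricated stand-ins for
what such a mirror run leaves behind; the actual wget invocation is
shown only as a comment.

```shell
# Simulate what the mirror leaves behind (normally produced by e.g.
#   wget -np -nH --cut-dirs=3 --mirror http://some.domain.com/...):
mkdir -p mirror/folder
touch 'mirror/folder/index.html' \
      'mirror/folder/index.html?C=M;O=A' \
      'mirror/folder/index.html?C=N;O=D'

# Post-mirror cleanup: remove the sortable-listing artifacts.
# Note the pattern 'index.html?*' requires at least one character after
# "index.html", so the real index.html is left alone.
find mirror -name 'index.html?*' -delete

ls mirror/folder   # prints: index.html
```

Running the cleanup after every mirror keeps the tree free of the
`?C=...;O=...` files without needing any wget-side filtering.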
Re: How do I prevent wget from creating index.html?C=M;O=A ?
Evert Meulie schrieb:
> I'm using wget to mirror (part of) a site. This site contains a couple
> of directories which do not have an index.html in them, just a bunch
> of various files. When wget hits this dir, it creates:
>   index.html
>   index.html?C=M;O=A
>   index.html?C=M;O=D
>   index.html?C=N;O=A
>   index.html?C=N;O=D
>   index.html?C=S;O=A
>   index.html?C=S;O=D

It seems your server is configured to send a directory listing if no
index.html is found. By the looks of it the listing is sortable
(Modified/Name/Size, Ascending/Descending).

> How do I prevent wget from doing so? I'm currently using the following:
>   wget -np -nH --cut-dirs=3 --mirror http://some.domain.com/folder/folder/folder/folder

If you want to get the files in this directory, I think you have to
live with them. Otherwise it should suffice to use
--exclude-directories to exclude the directory.

Regards
TT
Re: regex in wget, it is dificult to implement?
Oliver Schulze L. schrieb:
> Hi,
> Would it be too difficult to implement this? I'm thinking of passing
> an argument to a regex function that returns true or false, and then
> deciding whether to download the file.
> Any pointers to where to look in the code?

Yes, I know where to look at; I did a regex patch for 1.9.1+cvs. I'm
currently not at home, but I could post an updated diff for the current
CVS version on Monday morning.

Regards
TT
Re: NTLM authentication in CVS
Herold Heiko schrieb:
> 3) As expected msvc still throws compiler error on http.c and retr.c,
> (bad) workaround: disable optimization. Anybody with a cl.exe newer
> than Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 12.00.8804
> for 80x86 can comment if this is needed with newer versions, too?

I'm using Microsoft (R) 32-bit C/C++ Optimizing Compiler Version
13.00.9466 for 80x86. The only notable output while compiling (your
other two patches applied) is:

  \Vc7\PlatformSDK\Include\WinSock.h(689) : warning C4005: 'NO_ADDRESS' : macro redefinition
          host.c(59) : see previous definition of 'NO_ADDRESS'
  http.c(514) : warning C4090: 'function' : different 'const' qualifiers
  http.c(532) : warning C4090: 'function' : different 'const' qualifiers
  http.c(710) : warning C4090: 'function' : different 'const' qualifiers

Tobias
wget regex patch
Hello,

after reading so much about regex support for wget (especially the lack
of it) and experiencing myself how annoying it can be if you have
downloaded a hundred /thumbs/ directories, I tried to implement regex
support myself.

I used the pcre library from http://www.pcre.org which was pretty easy
to use, given the fact that I never ever touched a single line of C (or
C++) code before. Unfortunately I don't know jack about autoconf,
makefiles etc. The patch in its current form is only useful with MSVC
as I didn't alter any other makefiles. I hope someone can do that for
me and include the pcre license from http://www.pcre.org/license.txt

As you can see, pcre.h and pcre.lib need to be somewhere the compiler
can find them and HAVE_REGEX needs to be defined. Files and directories
are ignored if the regexes given on the command line match. For syntax
see wget --help. The patch was made against current cvs code.

Hope this helps somehow.
Tobias

diff -ruwb wget-regex2/src/ftp.c wget-regex3/src/ftp.c
--- wget-regex2/src/ftp.c	Sat Apr 02 02:41:04 2005
+++ wget-regex3/src/ftp.c	Wed Apr 06 18:55:24 2005
@@ -1749,7 +1749,11 @@
     return res;

   /* First: weed out that do not conform the global rules given in
      opt.accepts and opt.rejects. */
+#ifdef HAVE_REGEX
+  if (opt.accepts || opt.rejects || opt.exclregfile)
+#else
   if (opt.accepts || opt.rejects)
+#endif /* HAVE_REGEX */
     {
       f = start;
       while (f)

diff -ruwb wget-regex2/src/init.c wget-regex3/src/init.c
--- wget-regex2/src/init.c	Sun Mar 20 17:07:38 2005
+++ wget-regex3/src/init.c	Wed Apr 06 19:37:13 2005
@@ -137,6 +137,10 @@
 #endif
   { "excludedirectories", &opt.excludes, cmd_directory_vector },
   { "excludedomains", &opt.exclude_domains, cmd_vector },
+#ifdef HAVE_REGEX
+  { "excluderegexdir", &opt.exclregdir, cmd_string },
+  { "excluderegexfile", &opt.exclregfile, cmd_string },
+#endif /* HAVE_REGEX */
   { "followftp", &opt.follow_ftp, cmd_boolean },
   { "followtags", &opt.follow_tags, cmd_vector },
   { "forcehtml", &opt.force_html, cmd_boolean },
@@ -1367,6 +1371,12 @@
   xfree_null (opt.sslcertkey);
   xfree_null (opt.sslcertfile);
 #endif /* HAVE_SSL */
+#ifdef HAVE_REGEX
+  xfree_null (opt.exclregdir_c);
+  xfree_null (opt.exclregfile_c);
+  xfree_null (opt.exclregdir);
+  xfree_null (opt.exclregfile);
+#endif /* HAVE_REGEX */
   xfree_null (opt.bind_address);
   xfree_null (opt.cookies_input);
   xfree_null (opt.cookies_output);

diff -ruwb wget-regex2/src/main.c wget-regex3/src/main.c
--- wget-regex2/src/main.c	Tue Mar 22 15:20:02 2005
+++ wget-regex3/src/main.c	Wed Apr 06 19:03:56 2005
@@ -68,6 +68,10 @@
 /* On GNU system this will include system-wide getopt.h. */
 #include "getopt.h"

+#ifdef HAVE_REGEX
+#include <pcre.h>
+#endif /* HAVE_REGEX */
+
 #ifndef PATH_SEPARATOR
 # define PATH_SEPARATOR '/'
 #endif
@@ -176,6 +180,10 @@
     { "egd-file", 0, OPT_VALUE, "egdfile", -1 },
     { "exclude-directories", 'X', OPT_VALUE, "excludedirectories", -1 },
     { "exclude-domains", 0, OPT_VALUE, "excludedomains", -1 },
+#ifdef HAVE_REGEX
+    { "exclude-regex-dirs", 0, OPT_VALUE, "excluderegexdir", -1 },
+    { "exclude-regex-files", 0, OPT_VALUE, "excluderegexfile", -1 },
+#endif
     { "execute", 'e', OPT__EXECUTE, NULL, required_argument },
     { "follow-ftp", 0, OPT_BOOLEAN, "followftp", -1 },
     { "follow-tags", 0, OPT_VALUE, "followtags", -1 },
@@ -591,6 +599,12 @@
   -D,  --domains=LIST              comma-separated list of accepted domains.\n"),
    N_("\
        --exclude-domains=LIST      comma-separated list of rejected domains.\n"),
+#ifdef HAVE_REGEX
+   N_("\
+        --exclude-regex-dirs=PATTERN   pattern of directories to reject.\n"),
+   N_("\
+        --exclude-regex-files=PATTERN  pattern of files to reject.\n"),
+#endif /* HAVE_REGEX */
    N_("\
        --follow-ftp                follow FTP links from HTML documents.\n"),
    N_("\
@@ -647,6 +661,7 @@
   int i, ret, longindex;
   int nurl, status;
   int append_to_log = 0;
+  const char *error;

   i18n_initialize ();
@@ -819,6 +834,40 @@
       exit (1);
     }
 #endif
+
+#ifdef HAVE_REGEX
+  if (opt.exclregdir)
+    {
+      opt.exclregdir_c = pcre_compile (
+        opt.exclregdir,   /* the pattern */
+        0,                /* default options */
+        &error,           /* for error message */
+        &i,               /* for error offset */
+        NULL);            /* use default character tables */
+
+      if (opt.exclregdir_c == NULL)
+        {
+          printf (_("Directory RegEx compilation failed at offset %d: %s\n"), i, error);
+          exit (1);
+        }
+    }
+
+  if (opt.exclregfile)
+    {
+      opt.exclregfile_c = pcre_compile (
+        opt.exclregfile,  /* the pattern */
+        0,                /* default options */
+        &error,