Add option "--convert-transparent-proxy", which converts links much like
"-k" ("--convert-links") by applying the target file name changes made by
the "--adjust-extension" option. Instead of generating relative links for
offline, "file://"-based browsing, it generates absolute, adjusted URLs
that keep working when the content is served from a transparent proxy
server, e.g. in conjunction with "--span-hosts".

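
For illustration only, the rewrite described above can be sketched as a
standalone function. The name tpu_absolute_url is hypothetical and not part
of wget. Escaping just '?' is a simplification made for this sketch, and a
'/' inside a query string would confuse the base name lookup.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not wget code: swap the last path component of
   ORIG_URL for the last path component of LOCAL_FILE, escaping '?' as
   "%3F" so the result names a file rather than a query.  Returns a
   malloc'd string (caller frees), or NULL if allocation fails.  */
char *
tpu_absolute_url (const char *orig_url, const char *local_file)
{
  const char *url_base, *file_base, *tail;
  size_t dir_len;
  char *out, *q;

  assert (orig_url != NULL && local_file != NULL);

  url_base = strrchr (orig_url, '/');
  file_base = strrchr (local_file, '/');

  /* Without a '/' in both arguments, keep the original URL unchanged.  */
  if (url_base == NULL || file_base == NULL)
    {
      url_base = orig_url + strlen (orig_url);
      file_base = "";
    }

  dir_len = (size_t) (url_base - orig_url);

  /* Worst case: every byte of the new base name expands to three.  */
  out = malloc (dir_len + 3 * strlen (file_base) + 1);
  if (out == NULL)
    return NULL;

  /* Copy the "dirname" of the URL verbatim.  */
  memcpy (out, orig_url, dir_len);
  q = out + dir_len;

  /* Append the adjusted base name, including its leading '/'.  */
  for (tail = file_base; *tail != '\0'; tail++)
    {
      if (*tail == '?')
        {
          memcpy (q, "%3F", 3);
          q += 3;
        }
      else
        *q++ = *tail;
    }
  *q = '\0';
  return out;
}
```

Called with "http://hostname-2/bar/abc.cgi?xyz_arg" and
"/download/hostname-2/bar/abc.cgi?xyz_arg.css", this returns
"http://hostname-2/bar/abc.cgi%3Fxyz_arg.css", which is the rewritten link
from the documentation example in this patch.
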
Signed-off-by: Gabriel Somlo <so...@cmu.edu>
---

I'm not sure I like "--convert-transparent-proxy" for the option name, but
don't strongly dislike it either, so any better ideas much appreciated.

Also, if there's an easy way to compare the current document host name and
protocol against the link target URL's, we could generate relative links
within the same host and only use absolute links when going across hosts,
but I couldn't immediately identify a simple and clean way of doing that.

Any other suggestions for improvement much appreciated.

Thanks much,
  Gabriel

 doc/wget.texi | 18 ++++++++++++
 src/convert.c | 91 +++++++++++++++++++++++++++++++++++++++++++++++++++-------
 src/convert.h |  3 +-
 src/init.c    |  1 +
 src/main.c    | 26 +++++++++++------
 src/options.h |  2 ++
 6 files changed, 120 insertions(+), 21 deletions(-)

diff --git a/doc/wget.texi b/doc/wget.texi
index 1e1dd36..feb9abd 100644
--- a/doc/wget.texi
+++ b/doc/wget.texi
@@ -1979,6 +1979,24 @@ Note that only at the end of the download can Wget know which links have
 been downloaded. Because of that, the work done by @samp{-k} will be
 performed at the end of all the downloads.
 
+@cindex conversion of links
+@cindex link conversion
+@item --convert-transparent-proxy
+@itemx --convert-transparent-proxy
+Similar to @samp{--convert-links}, but convert downloaded links to
+absolute URLs adjusted to match any changes made by @samp{--adjust-extension}.
+This behavior is useful when the downloaded content is intended to
+be served from a transparent proxy instead of being viewed locally.
+Behavior w.r.t. links to files that have not been downloaded is
+identical to @samp{--convert-links}.
+
+Example: if the document @file{http://@var{hostname-1}/foo/doc.html} links
+to a script @file{http://@var{hostname-2}/bar/abc.cgi?xyz_arg}, which gets
+downloaded as @file{/download/hostname-2/bar/abc.cgi?xyz_arg.css}, then the
+link in @file{/download/hostname-1/foo/doc.html} will be modified to point
+to @file{http://@var{hostname-2}/bar/abc.cgi%3Fxyz_arg.css}, which can then
+be served from the transparent proxy when a client loads @file{doc.html}.
+
 @cindex backing up converted files
 @item -K
 @itemx --backup-converted
diff --git a/src/convert.c b/src/convert.c
index c147686..df708f5 100644
--- a/src/convert.c
+++ b/src/convert.c
@@ -61,6 +61,7 @@ static void convert_links (const char *, struct urlpos *);
 static void
 convert_links_in_hashtable (struct hash_table *downloaded_set,
                             int is_css,
+                            bool is_tpu,
                             int *file_count)
 {
   int i;
@@ -132,12 +133,20 @@ convert_links_in_hashtable (struct hash_table *downloaded_set,
           /* Decide on the conversion type. */
           if (local_name)
             {
-              /* We've downloaded this URL. Convert it to relative
-                 form. We do this even if the URL already is in
-                 relative form, because our directory structure may
-                 not be identical to that on the server (think `-nd',
+              /* We've downloaded this URL. Convert it to reflect
+                 any extension adjustments. Conversion may be to either
+                 relative or absolute form, depending on whether local
+                 viewing or transparent proxying is required.
+                 We do this even if the URL already is in relative
+                 form, because our directory structure may not be
+                 identical to that on the server (think `-nd',
                  `--cut-dirs', etc.) */
-              cur_url->convert = CO_CONVERT_TO_RELATIVE;
+              cur_url->convert = is_tpu ? CO_CONVERT_TO_TPU_ABSOLUTE
+                                        : CO_CONVERT_TO_RELATIVE;
+              /* FIXME: if we could distinguish between links that stay
+               * within the same host and links going to a different
+               * host, we could limit transparent-proxy/absolute link
+               * conversion to cross-host links only */
               cur_url->local_name = xstrdup (local_name);
               DEBUGP (("will convert url %s to local %s\n", u->url, local_name));
             }
@@ -173,24 +182,25 @@ convert_links_in_hashtable (struct hash_table *downloaded_set,
    direction to convert to.
 
    The "direction" means that the URLs to the files that have been
-   downloaded get converted to the relative URL which will point to
-   that file. And the other URLs get converted to the remote URL on
-   the server.
+   downloaded get converted to a URL pointing to that file, reflecting
+   changes made by --adjust-extension. The link is relative for local
+   viewing or absolute for transparent proxying. Other URLs get
+   converted to the remote URL on the server.
 
    All the downloaded HTMLs are kept in downloaded_html_files, and
    downloaded URLs in urls_downloaded. All the information is
    extracted from these two lists. */
 
 void
-convert_all_links (void)
+convert_all_links (bool is_tpu)
 {
   double secs;
   int file_count = 0;
 
   struct ptimer *timer = ptimer_new ();
 
-  convert_links_in_hashtable (downloaded_html_set, 0, &file_count);
-  convert_links_in_hashtable (downloaded_css_set, 1, &file_count);
+  convert_links_in_hashtable (downloaded_html_set, 0, is_tpu, &file_count);
+  convert_links_in_hashtable (downloaded_css_set, 1, is_tpu, &file_count);
 
   secs = ptimer_measure (timer);
   logprintf (LOG_VERBOSE, _("Converted %d files in %s seconds.\n"),
@@ -206,6 +216,7 @@ static const char *replace_attr_refresh_hack (const char *, int, FILE *,
                                               const char *, int);
 static char *local_quote_string (const char *, bool);
 static char *construct_relative (const char *, const char *);
+static char *construct_absolute (const char *, const char *);
 
 /* Change the links in one file. LINKS is a list of links in the
    document, along with their positions and the desired direction of
@@ -320,6 +331,29 @@ convert_links (const char *file, struct urlpos *links)
             ++to_file_count;
             break;
           }
+        case CO_CONVERT_TO_TPU_ABSOLUTE:
+          /* Convert absolute URL to reflect adjusted extension. */
+          {
+            char *newname = construct_absolute (link->url->url,
+                                                link->local_name);
+            char *quoted_newname = local_quote_string (newname,
+                                                       link->link_css_p);
+
+            if (link->link_css_p)
+              p = replace_plain (p, link->size, fp, quoted_newname);
+            else if (!link->link_refresh_p)
+              p = replace_attr (p, link->size, fp, quoted_newname);
+            else
+              p = replace_attr_refresh_hack (p, link->size, fp, quoted_newname,
+                                             link->refresh_timeout);
+
+            DEBUGP (("TO_TPU_ABSOLUTE: %s to %s at position %d in %s.\n",
+                     link->url->url, newname, link->pos, file));
+            xfree (newname);
+            xfree (quoted_newname);
+            ++to_file_count;
+            break;
+          }
         case CO_CONVERT_TO_COMPLETE:
           /* Convert the link to absolute URL. */
           {
@@ -422,6 +456,41 @@ construct_relative (const char *basefile, const char *linkfile)
   return link;
 }
 
+/* Construct and return an absolute URL reflecting changes made
+   by --adjust-extension to the base name of the original URL.
+
+   Example:
+
+   ca("http://foo.com/bar.cgi?xyz",
+      "foo.com/bar.cgi?xyz.css" ) -> "http://foo.com/bar.cgi?xyz.css"
+
+   Essentially, we do s/$(basename orig_url)/$(basename linkfile)/ */
+
+static char *
+construct_absolute (const char *orig_url, const char *linkfile)
+{
+  char *orig_url_base, *linkfile_base;
+  size_t orig_url_dir_len;
+  char *adj_url;
+
+  /* strip the path component from both orig_url and linkfile */
+  orig_url_base = strrchr (orig_url, '/');
+  linkfile_base = strrchr (linkfile, '/');
+
+  /* should anything unexpected happen, fall back to the original URL */
+  if (orig_url_base == NULL || linkfile_base == NULL)
+    return xstrdup (orig_url);
+
+  /* Calculate length of orig. URL "dirname" */
+  orig_url_dir_len = strlen (orig_url) - strlen (orig_url_base);
+
+  /* Construct adjusted absolute URL */
+  adj_url = xmalloc (orig_url_dir_len + strlen (linkfile_base) + 1);
+  strncpy (adj_url, orig_url, orig_url_dir_len);
+  strcpy (adj_url + orig_url_dir_len, linkfile_base);
+  return adj_url;
+}
+
 /* Used by write_backup_file to remember which files have been
    written. */
 static struct hash_table *converted_files;
diff --git a/src/convert.h b/src/convert.h
index 23c5f0e..d41c60b 100644
--- a/src/convert.h
+++ b/src/convert.h
@@ -40,6 +40,7 @@ enum convert_options {
   CO_NOCONVERT = 0,             /* don't convert this URL */
   CO_CONVERT_TO_RELATIVE,       /* convert to relative, e.g. to
                                    "../../otherdir/foo.gif" */
+  CO_CONVERT_TO_TPU_ABSOLUTE,   /* convert to extension-adjusted absolute */
   CO_CONVERT_TO_COMPLETE,       /* convert to absolute, e.g. to
                                    "http://orighost/somedir/bar.jpg". */
   CO_NULLIFY_BASE               /* change to empty string. */
@@ -104,7 +105,7 @@ void register_redirection (const char *, const char *);
 void register_html (const char *);
 void register_css (const char *);
 void register_delete_file (const char *);
-void convert_all_links (void);
+void convert_all_links (bool);
 void convert_cleanup (void);
 
 char *html_quote_string (const char *);
diff --git a/src/init.c b/src/init.c
index 93e95f8..76de90a 100644
--- a/src/init.c
+++ b/src/init.c
@@ -156,6 +156,7 @@ static const struct {
   { "contentonerror",   &opt.content_on_error,  cmd_boolean },
   { "continue",         &opt.always_rest,       cmd_boolean },
   { "convertlinks",     &opt.convert_links,     cmd_boolean },
+  { "converttpu",       &opt.convert_tpu,       cmd_boolean },
   { "cookies",          &opt.cookies,           cmd_boolean },
   { "cutdirs",          &opt.cut_dirs,          cmd_number },
   { "debug",            &opt.debug,             cmd_boolean },
diff --git a/src/main.c b/src/main.c
index 1ada822..0c980a5 100644
--- a/src/main.c
+++ b/src/main.c
@@ -172,6 +172,7 @@ static struct cmdline_option option_data[] =
     { "connect-timeout", 0, OPT_VALUE, "connecttimeout", -1 },
     { "continue", 'c', OPT_BOOLEAN, "continue", -1 },
     { "convert-links", 'k', OPT_BOOLEAN, "convertlinks", -1 },
+    { "convert-transparent-proxy", 0, OPT_BOOLEAN, "converttpu", -1 },
     { "content-disposition", 0, OPT_BOOLEAN, "contentdisposition", -1 },
     { "content-on-error", 0, OPT_BOOLEAN, "contentonerror", -1 },
     { "cookies", 0, OPT_BOOLEAN, "cookies", -1 },
@@ -718,6 +719,13 @@ Recursive download:\n"),
   -k,  --convert-links      make links in downloaded HTML or CSS point to\n\
                             local files.\n"),
     N_("\
+       --convert-transparent-proxy  make links in downloaded HTML or CSS\n\
+                            reflect extension adjustments made by\n\
+                            --adjust-extension, but keep them absolute\n\
+                            instead of making them relative, allowing the\n\
+                            resulting download to be served from\n\
+                            e.g. a transparent proxy.\n"),
+    N_("\
        --backups=N   before writing file X, rotate up to N backup files.\n"),
 
 #ifdef __VMS
@@ -1219,7 +1227,7 @@ main (int argc, char **argv)
   /* All user options have now been processed, so it's now safe to do
      interoption dependency checks. */
 
-  if (opt.noclobber && opt.convert_links)
+  if (opt.noclobber && (opt.convert_links || opt.convert_tpu))
     {
       fprintf (stderr,
                _("Both --no-clobber and --convert-links were specified,"
@@ -1274,12 +1282,12 @@ Can't timestamp and not clobber old files at the same time.\n"));
 #endif
   if (opt.output_document)
     {
-      if (opt.convert_links
+      if ((opt.convert_links || opt.convert_tpu)
           && (nurl > 1 || opt.page_requisites || opt.recursive))
         {
           fputs (_("\
-Cannot specify both -k and -O if multiple URLs are given, or in combination\n\
-with -p or -r. See the manual for details.\n\n"), stderr);
+Cannot specify -k or --convert-transparent-proxy with -O if multiple URLs are\n\
+given, or in combination with -p or -r. See the manual for details.\n\n"), stderr);
           print_usage (1);
           exit (WGET_EXIT_GENERIC_ERROR);
         }
@@ -1579,10 +1587,10 @@ for details.\n\n"));
           if (fstat (fileno (output_stream), &st) == 0 && S_ISREG (st.st_mode))
             output_stream_regular = true;
         }
-      if (!output_stream_regular && opt.convert_links)
+      if (!output_stream_regular && (opt.convert_links || opt.convert_tpu))
         {
-          fprintf (stderr, _("-k can be used together with -O only if \
-outputting to a regular file.\n"));
+          fprintf (stderr, _("-k or --convert-transparent-proxy can be used \
+together with -O only if outputting to a regular file.\n"));
           print_usage (1);
           exit (WGET_EXIT_GENERIC_ERROR);
         }
@@ -1728,8 +1736,8 @@ outputting to a regular file.\n"));
   if (opt.cookies_output)
     save_cookies ();
 
-  if (opt.convert_links && !opt.delete_after)
-    convert_all_links ();
+  if ((opt.convert_links || opt.convert_tpu) && !opt.delete_after)
+    convert_all_links (opt.convert_tpu);
 
   cleanup ();
 
diff --git a/src/options.h b/src/options.h
index cd4e518..d4723b1 100644
--- a/src/options.h
+++ b/src/options.h
@@ -176,6 +176,8 @@ struct options
                                    NULL. */
   bool convert_links;           /* Will the links be converted
                                    locally? */
+  bool convert_tpu;             /* Will the links be converted for
+                                   transparent proxying? */
   bool remove_listing;          /* Do we remove .listing files
                                    generated by FTP? */
   bool htmlify;                 /* Do we HTML-ify the OS-dependent
-- 
1.9.3
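
On the open question of comparing the document's host name and protocol
against the link target's: one possible approach, sketched here on raw URL
strings, is to compare the "scheme://authority" prefixes case-insensitively.
The names origin_length and same_origin are hypothetical and not part of
wget. The sketch assumes absolute URLs, does not normalize default ports
("http://foo.com" and "http://foo.com:80" compare as different), and treats
any user-info part as part of the authority. Inside wget, the already-parsed
URL data would likely be a better basis than raw strings.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: length of the "scheme://authority" prefix of URL,
   or 0 if URL contains no "://".  */
static size_t
origin_length (const char *url)
{
  const char *sep = strstr (url, "://");

  if (sep == NULL)
    return 0;
  /* The authority ends at the first '/', '?' or '#', or at end of string.  */
  return (size_t) (sep + 3 - url) + strcspn (sep + 3, "/?#");
}

/* Hypothetical helper: true if A and B share scheme and authority,
   compared case-insensitively.  */
bool
same_origin (const char *a, const char *b)
{
  size_t la, lb, i;

  assert (a != NULL && b != NULL);

  la = origin_length (a);
  lb = origin_length (b);
  if (la == 0 || la != lb)
    return false;

  for (i = 0; i < la; i++)
    if (tolower ((unsigned char) a[i]) != tolower ((unsigned char) b[i]))
      return false;
  return true;
}
```

With such a predicate, the FIXME in convert_links_in_hashtable could pick
relative conversion when it returns true and absolute conversion otherwise.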