I did some digging. The problem is that the link->size written for the srcset elements is calculated from the srcset content string passed to html-url.c:tag_handle_image(), which already has HTML entities such as "&amp;" decoded (to plain "&"). But this size is then used in convert.c:convert_links() as the distance to skip over the original URL being replaced, so after the rewritten URL has been written out, the pointer into the original file does not move forward far enough. In my example it lags by 8 chars (the 4-char difference between "&amp;" and "&", for the two occurrences in each URL), and the text between URLs is then copied from the wrong place in the original file.
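To make the arithmetic concrete, here is a minimal sketch in C. It is not wget code: decoded_length() and skip_shortfall() are names I made up for illustration, and the decoder only understands "&amp;", whereas a real HTML parser handles many more entities.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Length of `raw` after decoding "&amp;" to "&".  Simplified on purpose:
   only this one entity is recognised (an assumption of the sketch). */
static size_t decoded_length(const char *raw)
{
    size_t len = 0;

    assert(raw != NULL);
    while (*raw) {
        if (strncmp(raw, "&amp;", 5) == 0)
            raw += 5;   /* five characters in the file ... */
        else
            raw += 1;
        len += 1;       /* ... always become one decoded character */
    }
    return len;
}

/* How many characters short a skip of decoded_length(raw) falls when it
   is applied to the raw, still-escaped text in the file. */
static size_t skip_shortfall(const char *raw)
{
    return strlen(raw) - decoded_length(raw);
}
```

Calling skip_shortfall() on a URL tail with two "&amp;" escapes, such as "realistfrontcover_small.jpg?w=350&amp;h=248&amp;crop=1", gives 8, which is the lag I see after the first URL.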
My "fix" is, in tag_handle_image(), to set the link->size based on a re-escaped version of the URL extracted from the srcset. (We also have to adjust base_ind so that it points to the correct start position for the remaining URLs in the value.) This patch fixes my problem:

*** html-url.c~ 2019-02-19 17:23:46.000000000 -0500
--- html-url.c 2020-12-29 15:21:59.524993035 -0500
***************
*** 726,733 ****
            {
              char *url_text = strdupdelim (srcset + url_start,
                                            srcset + url_end);
              struct urlpos *up = append_url (url_text, base_ind + url_start,
!                                             url_end - url_start, ctx);
              if (up)
                {
                  up->link_inline_p = 1;
--- 726,748 ----
            {
              char *url_text = strdupdelim (srcset + url_start,
                                            srcset + url_end);
+             /* The SIZE passed to append_url is stored with the URL and used
+                to skip over the original URL in the source file when rewriting
+                in convert_file.  Because it has to skip over the pre-decoded
+                text, it needs to be increased to reflect the length of the
+                URL before decode_entity was applied.  We don't have that
+                information (the entire srcset value was decoded at once, not
+                one URL at a time), so we guess here by re-encoding and using
+                the length of that.  Will not work if the original escaping
+                was non-canonical.  */
+             char *quoted_url_text = html_quote_string(url_text);
+             int url_undecoded_size = strlen(quoted_url_text);
+             xfree(quoted_url_text);
              struct urlpos *up = append_url (url_text, base_ind + url_start,
!                                             url_undecoded_size, ctx);
!             /* We also have to update base_ind to account for the unescaped
!                characters.  */
!             base_ind += url_undecoded_size - (url_end - url_start);
              if (up)
                {
                  up->link_inline_p = 1;

Hope this helps.

  DAn.

On Tue, Dec 29, 2020 at 11:43 AM Dan Ellis <dan.el...@gmail.com> wrote:

> I'm using wget to make a frozen, offline mirror of a wordpress.com site.
> The original HTML makes extensive use of <img srcset=...> (responsive
> design for different browser resolutions). wget is corrupting the
> comma-separated lists of images.
>
> e.g.
>
> wget --page-requisites --span-hosts https://theliteratelens.com/
>
> downloads a set of files including theliteratelens.com/index.html which
> includes the following element as the first instance of srcset (line breaks
> inserted by me and irrelevant fields omitted):
>
> <img width="350" height="248"
> src="
> https://theliteratelens.files.wordpress.com/2017/12/realistfrontcover_small.jpg?w=350&amp;h=248&amp;crop=1
> "
> class="attachment-suburbia-sticky size-suburbia-sticky wp-post-image"
> alt=""
> loading="lazy"
> srcset="
> https://theliteratelens.files.wordpress.com/2017/12/realistfrontcover_small.jpg?w=350&amp;h=248&amp;crop=1
> 350w,
> https://theliteratelens.files.wordpress.com/2017/12/realistfrontcover_small.jpg?w=150&amp;h=106&amp;crop=1
> 150w,
> https://theliteratelens.files.wordpress.com/2017/12/realistfrontcover_small.jpg?w=300&amp;h=212&amp;crop=1
> 300w"
> sizes="(max-width: 350px) 100vw, 350px"
> ... />
>
> Note the srcset field with 3 versions of the image referenced whose
> decoded URL tails look like "realistfrontcover_small.jpg?w=150&h=248&crop=1"
>
> However, if I add --convert-links, e.g.
>
> wget --page-requisites --span-hosts --convert-links
> https://theliteratelens.com/
>
> the same element in theliteratelens.com/index.html becomes:
>
> <img width="350" height="248"
> src="../
> theliteratelens.files.wordpress.com/2017/12/realistfrontcover_small.jpg?w=350&amp;h=248&amp;crop=1
> "
> class="attachment-suburbia-sticky size-suburbia-sticky wp-post-image"
> alt=""
> loading="lazy"
> srcset="../
> theliteratelens.files.wordpress.com/2017/12/realistfrontcover_small.jpg?w=350&amp;h=248&amp;crop=1p;crop=../theliteratelens.files.wordpress.com/2017/12/realistfrontcover_small.jpg?w=150&amp;h=106&amp;crop=1h=106&a../theliteratelens.files.wordpress.com/2017/12/realistfrontcover_small.jpg?w=300&amp;h=212&amp;crop=1300&amp;h=212&amp;crop=1
> 300w"
> sizes="(max-width: 350px) 100vw, 350px"
> ... />
>
> i.e. the comma-separated list in the srcset has been badly corrupted.
> For instance, the end of the first path, which was originally
>
> ...h=248&amp;crop=1 350w, https://
> theliteratelens.files.wordpress.com/2017/12...
>
> becomes
>
> ...h=248&amp;crop=1p;crop=../
> theliteratelens.files.wordpress.com/2017/12...
>
> and the second boundary between elements starts as
>
> ...h=106&amp;crop=1 150w, https://theliteratelens.files...
>
> but ends up as
>
> ...h=106&amp;crop=1h=106&a../theliteratelens.files...
>
> What seems to be happening is that the convert-links logic finds the
> absolute URLs to the second host (
> https://theliteratelens.files.wordpress.com) and correctly maps them to
> relative paths (../theliteratelens.files.wordpress.com/), but at the same
> time it reaches back one space-delimiter too far, and replaces those
> characters with a spurious sample from the preceding string.
>
> I hope this helps identify the problem.
>
> DAn.
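
The kind of corruption quoted above can be reproduced with a toy model of a copy-and-skip rewriter. This is only a sketch under my own assumptions, not wget's actual convert_links(): struct toy_link, toy_append() and toy_convert() are hypothetical names, and each link simply carries a start offset, a length to skip, and the replacement text.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a per-link record: where the old URL starts
   in the raw file, how many raw characters to skip, and the new text. */
struct toy_link {
    size_t pos;
    size_t size;
    const char *newname;
};

/* Append n bytes of src to the NUL-terminated string in out. */
static void toy_append(char *out, size_t outsz, const char *src, size_t n)
{
    size_t used = strlen(out);

    assert(used + n < outsz);   /* toy code: abort rather than overflow */
    memcpy(out + used, src, n);
    out[used + n] = '\0';
}

/* Copy raw to out, replacing each link's span with its newname.
   Links must be sorted by pos and must not overlap. */
static void toy_convert(const char *raw, const struct toy_link *links,
                        size_t nlinks, char *out, size_t outsz)
{
    const char *p = raw;
    size_t raw_len, i;

    assert(raw != NULL && out != NULL && outsz > 0);
    raw_len = strlen(raw);
    out[0] = '\0';
    for (i = 0; i < nlinks; i++) {
        const char *url_start = raw + links[i].pos;

        assert(links[i].pos + links[i].size <= raw_len);
        assert(url_start >= p);
        toy_append(out, outsz, p, (size_t)(url_start - p)); /* text before URL */
        toy_append(out, outsz, links[i].newname, strlen(links[i].newname));
        p = url_start + links[i].size;                      /* skip old URL */
    }
    toy_append(out, outsz, p, strlen(p));                   /* rest of file */
}
```

With the raw text "a&amp;b 1x, c&amp;d 2x" and links measured on the decoded string ({0, 3, "U1"} and {8, 3, "U2"}), toy_convert() produces "U1mp;b U2 c&amp;d 2x": leftover URL tails appear and the descriptors go missing, much like the srcset above. With offsets and sizes measured on the raw text ({0, 7, "U1"} and {12, 7, "U2"}) it produces the intended "U1 1x, U2 2x".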