Re: srcset lists are corrupted when converting links

Dan Ellis Tue, 29 Dec 2020 12:36:29 -0800

I did some digging.

The problem is that the link->size written for the srcset elements is
calculated based on the srcset content string passed to
html-url.c:tag_handle_image(), which already has HTML entities such as &amp;
decoded (to plain "&").  But this size is used as the basis for skipping
over the original URLs to be replaced in convert.c:convert_links(), so when
skipping over the old URL that has been rewritten, the pointer does not
move forward far enough.  In my example, it lags by 8 chars (the 4 char
difference between "&amp;" and "&", for the two occurrences in each URL),
so the intervening part is copied over from the wrong place in the original
file.
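
To make the arithmetic concrete, here is a small standalone C sketch. It is
not wget code: count_occurrences and amp_lag are names made up for
illustration, and it assumes "&amp;" is the only entity in the URL.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of wget): count non-overlapping
   occurrences of NEEDLE in HAYSTACK. */
static size_t
count_occurrences (const char *haystack, const char *needle)
{
  size_t n = 0;
  size_t len = strlen (needle);
  const char *p = haystack;

  while ((p = strstr (p, needle)) != NULL)
    {
      n++;
      p += len;
    }
  return n;
}

/* Number of characters by which a size measured on the decoded URL
   falls short of the raw (still escaped) URL, assuming "&amp;" is the
   only entity that appears in RAW_URL. */
static size_t
amp_lag (const char *raw_url)
{
  size_t per_entity = strlen ("&amp;") - strlen ("&");   /* 4 */
  return count_occurrences (raw_url, "&amp;") * per_entity;
}
```

For a URL tail like "?w=350&amp;h=248&amp;crop=1", amp_lag returns 8, which
matches the lag seen in my example.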


My "fix" is, in tag_handle_image(), to set the link->size based on a
re-escaped version of the URL extracted from the srcset.  (We also have to
adjust base_ind so that it points to the correct start position for the
remaining URLs in the value.)  This patch fixes my problem:

*** html-url.c~ 2019-02-19 17:23:46.000000000 -0500
--- html-url.c 2020-12-29 15:21:59.524993035 -0500
***************
*** 726,733 ****
              {
                char *url_text = strdupdelim (srcset + url_start,
                                              srcset + url_end);
                struct urlpos *up = append_url (url_text, base_ind + url_start,
!                                               url_end - url_start, ctx);
                if (up)
                  {
                    up->link_inline_p = 1;
--- 726,748 ----
              {
                char *url_text = strdupdelim (srcset + url_start,
                                              srcset + url_end);
+               /* The SIZE passed to append_url is stored with the URL and used
+                  to skip over the original URL in the source file when rewriting
+                  in convert_file.  Because it has to skip over the pre-decoded
+                  text, it needs to be increased to reflect the length of the
+                  URL before decode_entity was applied.  We don't have that
+                  information (the entire srcset value was decoded at once, not
+                  one URL at a time), so we guess here by re-encoding and using
+                  the length of that.  Will not work if the original escaping
+                  was non-canonical. */
+               char *quoted_url_text = html_quote_string(url_text);
+               int url_undecoded_size = strlen(quoted_url_text);
+               xfree(quoted_url_text);
                struct urlpos *up = append_url (url_text, base_ind + url_start,
!                                               url_undecoded_size, ctx);
!               /* We also have to update base_ind to account for the unescaped
!                  characters. */
!               base_ind += url_undecoded_size - (url_end - url_start);
                if (up)
                  {
                    up->link_inline_p = 1;
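
For anyone who wants to check the offset bookkeeping outside of wget, here is
a standalone C sketch of the same idea.  Everything in it is an assumption
for illustration: escaped_length is a simplified stand-in for taking the
length of a re-escaped string (it only knows about &, <, > and "), and
struct raw_span and map_url are made-up names.  Like the patch, it only
accounts for entities inside the URLs themselves, not in the text between
them.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for re-escaping: the length the first LEN
   bytes of S would have if '&', '<', '>' and '"' were written as
   entities.  Any other escaping in the original file is not modelled. */
static size_t
escaped_length (const char *s, size_t len)
{
  size_t out = 0;
  size_t i;

  for (i = 0; i < len; i++)
    switch (s[i])
      {
      case '&': out += 5; break;        /* &amp;  */
      case '<':
      case '>': out += 4; break;        /* &lt; or &gt; */
      case '"': out += 6; break;        /* &quot; */
      default:  out += 1; break;
      }
  return out;
}

/* Position and size of one URL in the raw (escaped) text. */
struct raw_span
{
  size_t pos;
  size_t size;
};

/* Map the URL at [URL_START, URL_END) of the DECODED attribute value
   to a span in the raw text.  *BASE_IND is the raw offset that
   corresponds to decoded index 0; it is bumped by the growth of this
   URL so that later URLs in the same value land in the right place. */
static struct raw_span
map_url (const char *decoded, size_t url_start, size_t url_end,
         size_t *base_ind)
{
  struct raw_span span;

  span.pos = *base_ind + url_start;
  span.size = escaped_length (decoded + url_start, url_end - url_start);
  *base_ind += span.size - (url_end - url_start);
  return span;
}
```

Calling map_url once per URL, in order, with the same base_ind variable
should give spans that line up with the escaped URLs in the raw text, as
long as the original escaping matches what escaped_length assumes.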

Hope this helps.

  DAn.


On Tue, Dec 29, 2020 at 11:43 AM Dan Ellis wrote:

> I'm using wget to make a frozen, offline mirror of a wordpress.com site.
> The original HTML makes extensive use of <img srcset=...> (responsive
> design for different browser resolutions).  wget is corrupting the
> comma-separated lists of images.
>
> e.g.
>
>
>   wget --page-requisites --span-hosts https://theliteratelens.com/
>
> downloads a set of files including theliteratelens.com/index.html which
> includes the following element as the first instance of srcset (line breaks
> inserted by me and irrelevant fields omitted):
>
> <img width="350" height="248"
>  src="
> https://theliteratelens.files.wordpress.com/2017/12/realistfrontcover_small.jpg?w=350&amp;h=248&amp;crop=1
> "
>  class="attachment-suburbia-sticky size-suburbia-sticky wp-post-image"
>  alt=""
>  loading="lazy"
>  srcset="
> https://theliteratelens.files.wordpress.com/2017/12/realistfrontcover_small.jpg?w=350&amp;h=248&amp;crop=1
> 350w,
> https://theliteratelens.files.wordpress.com/2017/12/realistfrontcover_small.jpg?w=150&amp;h=106&amp;crop=1
> 150w,
> https://theliteratelens.files.wordpress.com/2017/12/realistfrontcover_small.jpg?w=300&amp;h=212&amp;crop=1
> 300w"
>  sizes="(max-width: 350px) 100vw, 350px"
> ... />
>
> Note the srcset field with 3 versions of the image referenced, whose
> decoded URL tails look like "realistfrontcover_small.jpg?w=350&h=248&crop=1"
>
> However, if I add --convert-links, e.g.
>
>   wget --page-requisites --span-hosts --convert-links
> https://theliteratelens.com/
>
> the same element in theliteratelens.com/index.html becomes:
>
> <img width="350" height="248"
>  src="../
> theliteratelens.files.wordpress.com/2017/12/realistfrontcover_small.jpg?w=350&amp;h=248&amp;crop=1
> "
>  class="attachment-suburbia-sticky size-suburbia-sticky wp-post-image"
>  alt=""
>  loading="lazy"
>  srcset="../
> theliteratelens.files.wordpress.com/2017/12/realistfrontcover_small.jpg?w=350&amp;h=248&amp;crop=1p;crop=../theliteratelens.files.wordpress.com/2017/12/realistfrontcover_small.jpg?w=150&amp;h=106&amp;crop=1h=106&a../theliteratelens.files.wordpress.com/2017/12/realistfrontcover_small.jpg?w=300&amp;h=212&amp;crop=1300&amp;h=212&amp;crop=1
> 300w"
>  sizes="(max-width: 350px) 100vw, 350px"
> ... />
>
> i.e. the comma-separated list in the srcset has been badly corrupted.  For
> instance, the end of the first path, which was originally
>
>   ...h=248&amp;crop=1 350w, https://
> theliteratelens.files.wordpress.com/2017/12...
>
> becomes
>
>   ...h=248&amp;crop=1p;crop=../
> theliteratelens.files.wordpress.com/2017/12...
>
> and the second boundary between elements starts as
>
>   ...h=106&amp;crop=1 150w, https://theliteratelens.files...
>
> but ends up as
>
>   ...h=106&amp;crop=1h=106&a../theliteratelens.files...
>
> What seems to be happening is that the convert-links logic is finding the
> absolute URLs to the second host (
> https://theliteratelens.files.wordpress.com) and correctly maps them to
> relative paths (../theliteratelens.files.wordpress.com/), but at the same
> time it reaches back one space-delimiter too far, and replaces those
> characters with a spurious sample from the preceding string.
>
> I hope this helps identify the problem.
>
>   DAn.
>
>
>
