[bug #60287] Windows recursive download escapes utf8 URLs twice

Eli Zaretskii Sat, 27 Mar 2021 23:57:11 -0700

Follow-up Comment #10, bug #60287 (project wget):

Without converting charsets, it would be difficult to rely on certain library
functions and support certain features.


For example, locale-dependent C library functions work only with the locale's
encoding, and will produce wrong results if presented with strings encoded
differently.  The IRI support needs to work in UTF-8 internally.  And when
writing Web pages to disk, Wget needs to encode the page name so that the
local filesystem accepts it as a file name.

That is why conversion to the locale's charset is rather necessary. Using the
original bytes might work for some operations, but not for others, so keeping
the original bytes would need some logic for where they can and cannot be
used, which is a complication.  It is better to convert once, and then forget
about it.
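
To illustrate the "convert once" idea, here is a minimal Python sketch.  It is
not Wget's code: the helper names (to_internal, to_request_path,
to_local_filename) and the sample file name are hypothetical, and the charsets
are passed in explicitly rather than taken from the locale.

```python
from urllib.parse import quote


def to_internal(raw: bytes, source_charset: str) -> str:
    """Decode once, at the input boundary (hypothetical helper)."""
    return raw.decode(source_charset)


def to_request_path(text: str) -> str:
    """Encode as UTF-8 and percent-escape for use in a request path."""
    return quote(text.encode("utf-8"), safe="/")


def to_local_filename(text: str, fs_charset: str) -> bytes:
    """Encode for the local filesystem; unencodable characters become '?'."""
    return text.encode(fs_charset, errors="replace")


# Assumed input: a file name that arrived as Latin-1 bytes.
raw = b"caf\xe9.html"
name = to_internal(raw, "latin-1")

# From here on, everything works on decoded text and no longer
# needs to know what the original bytes looked like.
request_path = to_request_path(name)
local_name = to_local_filename(name, "utf-8")
```

After the single decoding step, case conversion, URL escaping and file-name
encoding all start from the same text, so none of them needs special logic
for the original byte sequence.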

The 404 error is most probably because Wget does attempt to convert the
encoding, but does so incorrectly when you don't tell it the actual encodings,
so the re-encoded URL is garbled.
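
A small Python sketch of how such garbling can arise.  This only illustrates
the general mechanism and is not a trace of what Wget does; the link name and
the choice of Latin-1 as the wrongly assumed charset are assumptions made for
the example.

```python
from urllib.parse import quote

# Assumed link text as it appears in a UTF-8 encoded page.
link = "\u00e9.html"                    # "é.html"
wire_bytes = link.encode("utf-8")       # b"\xc3\xa9.html"

# Correct handling: the bytes are already UTF-8, so escape them as they are.
correct = quote(wire_bytes)

# Incorrect handling: the bytes are taken to be Latin-1, converted to
# UTF-8 a second time, and only then escaped.
misread = wire_bytes.decode("latin-1")  # two characters instead of one
garbled = quote(misread.encode("utf-8"))
```

Here correct is "%C3%A9.html" while garbled is "%C3%83%C2%A9.html".  The
server has no resource under the second name, which would show up as a 404.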


    _______________________________________________________

Reply to this item at:

  <https://savannah.gnu.org/bugs/?60287>

_______________________________________________
  Message sent via Savannah
  https://savannah.gnu.org/
