On 16.02.22 21:04, Seymour J Metz wrote:
Given that RFCs 3490-3492 came out in 2003 and 5890-5895 came out in 2010, I would have expected IDNA support by now. Does anybody know for sure?
This issue has nothing to do with IDN support. The problem is that the input file is encoded as UTF-16 [1], a charset that is not compatible with UTF-8 or ASCII.
UTF-16 uses 2 or 4 bytes per character, so the file needs to be converted into UTF-8 before wget can read it. Also, that file starts with a BOM (byte order mark) [2], which needs to be processed.
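To illustrate the difference, here is a small Python sketch comparing the two encodings. The URL is a made-up placeholder, not an entry from the reported file, and little-endian byte order is assumed only for the sake of the example.

```python
import codecs

# Placeholder URL for illustration only (not taken from the reported list).
text = "http://example.com/"

# UTF-8: every ASCII character is exactly one byte.
utf8_bytes = text.encode("utf-8")

# UTF-16 (little-endian assumed here): a 2-byte BOM, then 2 bytes per
# character for everything in the Basic Multilingual Plane.
utf16_bytes = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

print(len(utf8_bytes), len(utf16_bytes))
# The first bytes are the BOM followed by 'h' and 't', each padded with NUL.
print(utf16_bytes[:6])
```

A reader that expects ASCII or UTF-8 sees the BOM and the interleaved NUL bytes as part of the URL, which is why the list cannot be parsed as-is.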
This does the job:

  iconv -f utf-16 -t utf-8 /tmp/url-list.txt > url-list-utf8.txt

Just a small glimpse over to Wget2 :-)

Wget2 understands `--input-encoding=utf-16`, BUT it currently doesn't handle the BOM. This is easy to implement as the code already exists to deal with HTML files encoded as UTF-16 with or without BOM.
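Where iconv is not at hand (the original report came from Windows), the same conversion, including the BOM check, could be sketched in Python roughly like this. The function names and file paths are hypothetical, and this is not how Wget or Wget2 handle it internally. Treating BOM-less input as little-endian is an assumption made here because that is what Windows tools typically write.

```python
import codecs


def utf16_to_utf8(data: bytes) -> bytes:
    """Decode UTF-16 bytes (with or without BOM) and re-encode as UTF-8."""
    if data.startswith(codecs.BOM_UTF16_LE):
        # Little-endian BOM (FF FE): strip it and decode accordingly.
        text = data[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    elif data.startswith(codecs.BOM_UTF16_BE):
        # Big-endian BOM (FE FF): strip it and decode accordingly.
        text = data[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    else:
        # ASSUMPTION: no BOM means little-endian (typical on Windows).
        text = data.decode("utf-16-le")
    return text.encode("utf-8")


def convert_file(src: str, dst: str) -> None:
    """Read a UTF-16 URL list from src and write it to dst as UTF-8."""
    with open(src, "rb") as f:
        data = f.read()
    with open(dst, "wb") as f:
        f.write(utf16_to_utf8(data))


# Hypothetical usage, with paths adapted to your setup:
#   convert_file("url-list.txt", "url-list-utf8.txt")
```

The resulting file can then be passed to wget with `-i` as before.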
I created https://gitlab.com/gnuwget/wget2/-/issues/586 for this.

Regards, Tim

[1] https://en.wikipedia.org/wiki/UTF-16
[2] https://en.wikipedia.org/wiki/Byte_order_mark
________________________________________
From: Bug-wget <bug-wget-bounces+smetz3=gmu....@gnu.org> on behalf of pythonomor...@gmail.com <pythonomor...@gmail.com>
Sent: Tuesday, February 8, 2022 1:26 PM
To: bug-wget@gnu.org
Subject: wget: unable to resolve host address

Hello,

I am trying to download from a list of files (jpeg images). The website utilizes Cyrillic in its URL. I get the following error message:

  wget: unable to resolve host address 'xn--h-xubc'

I've checked the links manually and they do work. I am enclosing a shortened version of the file list. I've tried different commands to no avail:

  wget.exe -i C:\dl_files\url-list.txt --secure-protocol=auto --remote-encoding=Windows-1251 -nc -c -P C:\dl_files\

I've used Windows-1251 as I did not see a list of encoding names in the manual https://www.gnu.org/software/wget/manual/wget.html#Wgetrc-Commands

  wget.exe -i C:\dl_files\url-list.txt --secure-protocol=auto -nc -c -P C:\dl_files\

Apparently the problem is caused by Cyrillic characters. I have an inkling that I am not using the correct options for the program. I would appreciate it if you gave me a hint on how to solve the problem.

Regards, Max