On Monday 14 December 2015 22:15:32 Tim Rühsen wrote:
> On Monday, 14 December 2015, 21:58:59 Eli Zaretskii wrote:
> > > From: Tim Rühsen <tim.rueh...@gmx.de>
> > > Date: Mon, 14 Dec 2015 20:22:41 +0100
> > >
> > > > 1. The functions that call 'iconv' (in iri.c) don't make a point of
> > > > flushing the last portion of the converted URL after 'iconv'
> > > > returns successfully having converted the input string in its
> > > > entirety. IME, you need then to call 'iconv' one last time with
> > > > either the 2nd or the 3rd argument set to NULL, otherwise
> > > > sometimes the last converted character doesn't get output. In my
> > > > case, some URLs converted from CP1255 to UTF-8 lost their last
> > > > character. It sounds like no one has actually used this
> > > > conversion in iri.c, except for trivially converting UTF-8 to
> > > > itself. Is that possible/reasonable?
> > >
> > > Possibly.
> > > Could you please give an example string? I would like to test it on
> > > GNU/Linux, BSD and Solaris to see if the output is always the same.
> >
> > This is what gave me trouble:
> >
> > https://he.wikipedia.org/wiki/%F9._%F9%F4%F8%E4
> >
> > This is https://he.wikipedia.org/wiki/ש._שפרה that Andries was using
> > in his tests, but it's encoded in CP1255 (and hex-encoded after that).
> > Try converting it into UTF-8, and you will get the last character
> > chopped off after 'iconv' returns. Or at least that's what happens
> > for me.
> >
> > > > 2. Wget assumes that the URL given on its command line is encoded in
> > > > the locale's encoding. This is a good assumption when the user
> > > > herself types the URL at the shell prompt, but not when the URL is
> > > > copy-pasted from a browser's address bar. In the latter case, the
> > > > URL tends to be in UTF-8 (sometimes hex-encoded). At least that's
> > > > what I get from Firefox.
> > > > We don't seem to have in wget any
> > > > facilities to specify a separate (3rd) encoding for the URLs on
> > > > the command line, do we?
> > >
> > > I stumbled upon this a while ago when thinking about the design of
> > > wget2. And wget2 already has a working --input-encoding option for
> > > such cases.
> > > AFAIK, nobody asked for such an option during the last years - so I
> > > assume this to be a somewhat 'expert' or 'fancy' option, at least a
> > > low-priority one. It is an optional goodie.
> >
> > IMO, it's a sorely missing feature, since copy/pasting URLs from a
> > browser is something people do very often. I do it all the time,
> > because many times wget is much better at downloading large files than
> > a browser.
>
> Arg, one step back please (my fault).
> What you are looking for is --local-encoding. That is the encoding of the
> URLs given on the command line.
> --input-encoding specifies the encoding of an (additional) input file
> and/or input from stdin.
>
> wget converts your example correctly (with --local-encoding=cp1255):
>
> converted 'https://he.wikipedia.org/wiki/%F9._%F9%F4%F8%E4' (CP1255) ->
> 'https://he.wikipedia.org/wiki/ש._שפר' (UTF-8)
>
> Also wget2, which uses iconv() differently than wget:
>
> 14.220742.058 converted 'https://he.wikipedia.org/wiki/�._����' (CP1255) ->
> 'https://he.wikipedia.org/wiki/ש._שפר' (utf-8)
> 14.220742.058 converted 'ש._שפר' (utf-8) -> '�._���' (CP1255)
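As a side note, what the full UTF-8 result of that example ought to be can be checked with the iconv(1) command-line tool (a sketch, assuming an iconv that knows CP1255, as glibc's does; the bytes below are the decoded percent-escapes from the example URL):

```shell
# CP1255 bytes of the example URL's path, converted to UTF-8
printf '\xf9._\xf9\xf4\xf8\xe4' | iconv -f CP1255 -t UTF-8
# prints: ש._שפרה
```

The final 0xE4 byte is ה, so a correct conversion is seven characters long.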
I should not write posts while doing homework with the kids and playing
with the dog :-(

You are right, ה is missing.

> IME, you need then to call 'iconv' one last time with
> either the 2nd or the 3rd argument set to NULL, otherwise
> sometimes the last converted character doesn't get output.

I'll give it a try.

Tim