> On 27 Jul 2018, at 13:58, Ben Coman <b...@openinworld.com> wrote:
>
> On 27 July 2018 at 18:39, Offray Vladimir Luna Cárdenas
> <offray.l...@mutabit.com> wrote:
>> Hi,
>>
>> I was ready to show a friend the Pharo web capabilities with the
>> classical "myString asUrl retrieveContents", but the friend gave me a
>> URL that contains non-Latin characters[1] and then I got a
>> ZnInvalidUTF8 error.
>>
>> [1]
>> http://www.bidchance.com/freesearch.do?&filetype=&channel=&currentpage=1&searchtype=zb&queryword=%BF%A6%CA%B2&displayStyle=&pstate=&field=&leftday=&province=&bidfile=&project=&heshi=&recommend=&field=&jing=&starttime=&endtime=&attachment=
>>
>> How can I process web addresses in Pharo that contain non-Latin
>> characters like the one in [1]?
>
> Just some blind digging...
>
> A few levels down the stack is a call equivalent to...
> x := '%BF%A6%CA%B2'.
> ZnPercentEncoder new decode: x.
> which fails with the same error.
>
> In #decode we have...
> bytes := #[191 166 202 178].
Correct, that fails because this byte sequence is not legal UTF-8.
> and browsing around I discovered a useful method...
> encoder := ZnCharacterEncoder detectEncoding: bytes
> "==> a ZnSimplifiedByteEncoder('iso88591' strict)"
>
> now the following works...
> (ZnPercentEncoder new characterEncoder: encoder ) decode: x.
Right, that runs, but the guess is wrong (check the resulting string).
That is to be expected: we are talking about Chinese characters, which lie
outside the range that #iso88591 (#latin1) can represent.
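
A minimal sketch of why the guess cannot be right, reusing only the calls
quoted above plus #decodeBytes: (the variable names are mine, and I am
assuming the detected encoder is the iso88591 one reported above). A
single-byte encoder maps every byte to exactly one character, so four bytes
always become four Latin-1 characters, never Chinese text:

```smalltalk
| bytes encoder decoded |
bytes := #[191 166 202 178].
"Same detection call as quoted above; it guesses a single-byte encoding."
encoder := ZnCharacterEncoder detectEncoding: bytes.
decoded := encoder decodeBytes: bytes.
"One character per byte if the guess is iso88591, so the size equals the byte count."
decoded size = bytes size.
"Each code point is simply the byte value again, i.e. Latin-1 symbols."
decoded asArray collect: [ :each | each codePoint ].
```

Inspecting the decoded string shows Latin-1 punctuation and accented letters,
which is clearly not what the website meant.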
Clearly, the original website http://www.bidchance.com knows the encoding ...
Again, to my understanding, without further context, when %BF%A6%CA%B2 is
encountered in the query part of a URL, it is first percent-decoded and then
UTF-8 decoded. That is what #asUrl assumes, and it leads to the error because
that particular byte sequence, interpreted that way, is not a legal UTF-8
encoding.
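
The failure can be reproduced in isolation. This is only a sketch, assuming
ZnInvalidUTF8 is the error class that gets signalled (as reported at the top
of this thread). The second expression shows why UTF-8 rejects the input: the
first byte, 16rBF, has the bit pattern 10xxxxxx, which UTF-8 reserves for
continuation bytes, so it can never start a character.

```smalltalk
"Percent-decode with the default (UTF-8) character encoder and trap the error."
[ ZnPercentEncoder new decode: '%BF%A6%CA%B2' ]
	on: ZnInvalidUTF8
	do: [ :error | error description ].

"16rBF = 2r10111111: the top two bits are 10, marking a continuation byte."
(16rBF bitAnd: 2r11000000) = 2r10000000. "true"
```

So the percent-encoding itself is fine. The problem is that the bytes behind
it were produced with an encoding other than UTF-8.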
> So maybe that helps explain it,
> but I don't know how to join the dots to make it work out of the box
> with "asUrl retrieveContents"
>
> cheers -ben