Re: [Pharo-users] Working with urls that contain non latin characters

Sven Van Caekenberghe Fri, 27 Jul 2018 06:27:05 -0700

> On 27 Jul 2018, at 13:58, Ben Coman <b...@openinworld.com> wrote:
> 
> On 27 July 2018 at 18:39, Offray Vladimir Luna Cárdenas
> <offray.l...@mutabit.com> wrote:
>> Hi,
>> 
>> I was ready to show a friend the Pharo web capabilities with the
>> classical "myString asUrl retrieveContents", but the friend gave me a
>> url that contains non Latin characters[1] and then I got a
>> ZnInvalidUTF8 error.
>> 
>> [1]
>> http://www.bidchance.com/freesearch.do?&filetype=&channel=&currentpage=1&searchtype=zb&queryword=%BF%A6%CA%B2&displayStyle=&pstate=&field=&leftday=&province=&bidfile=&project=&heshi=&recommend=&field=&jing=&starttime=&endtime=&attachment=
>> 
>> How can I process web addresses in Pharo that contain non latin
>> characters like the one in [1]?
> 
> Just some blind digging...
> 
> A few levels down the stack is a call equivalent to...
>    x := '%BF%A6%CA%B2'.
>    ZnPercentEncoder new decode: x.
> which fails with the same error.
> 
> In #decode we have...
>    bytes := #[191 166 202 178].

Correct (as it is not legal UTF-8)

> and browsing around I discovered a useful method...
>    encoder := ZnCharacterEncoder detectEncoding: bytes
> "==> a ZnSimplifiedByteEncoder('iso88591' strict)"
> 
> now the following works...
>    (ZnPercentEncoder new characterEncoder: encoder ) decode: x.

Right, but that guess is wrong (check the resulting string).

Since we are talking about Chinese characters, which fall outside the range 
that #iso88591 (#latin1) can represent, that is to be expected.
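
To illustrate why the Latin-1 guess "works" yet is wrong, here is a minimal
sketch in Python (standard library only, used purely for illustration outside
Pharo). ISO-8859-1 maps every byte value to some character, so decoding never
fails, but the result is not the text that was meant.

```python
from urllib.parse import unquote_to_bytes

# Percent-decode only: this gives the raw bytes, with no character decoding yet.
raw = unquote_to_bytes("%BF%A6%CA%B2")
print(list(raw))  # [191, 166, 202, 178]

# ISO-8859-1 assigns a character to every byte, so this cannot raise an error.
guess = raw.decode("iso-8859-1")
print(guess)  # ¿¦Ê²
```

The four punctuation-like characters are a symptom of a wrong guess, not a
successful decode.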

Clearly, the original website http://www.bidchance.com knows the encoding ...

Again, to my understanding, without further context, when %BF%A6%CA%B2 is 
encountered in the query part of a URL, it is first percent-decoded and the 
resulting bytes are then UTF-8 decoded. That is what #asUrl assumes, and it 
leads to the error because that particular byte sequence is not legal UTF-8.
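
The two-step interpretation can be sketched in Python (standard library only,
for illustration outside Pharo). The helper name decode_query_value is
hypothetical, and GBK is only an assumption about what the site might use;
nothing in the URL itself states the encoding.

```python
from urllib.parse import unquote

QUERY_WORD = "%BF%A6%CA%B2"


def decode_query_value(value, encoding):
    """Percent-decode value, then decode the bytes strictly with encoding.

    Returns None when the bytes are not legal in that encoding.
    (Hypothetical helper, for illustration only.)
    """
    try:
        return unquote(value, encoding=encoding, errors="strict")
    except UnicodeDecodeError:
        return None


# Default assumption: UTF-8. The byte 0xBF cannot start a UTF-8 sequence.
as_utf8 = decode_query_value(QUERY_WORD, "utf-8")
print(as_utf8)  # None

# ASSUMPTION: the site uses GBK. The same bytes then decode to two characters.
as_gbk = decode_query_value(QUERY_WORD, "gbk")
print(len(as_gbk))  # 2
```

The point of the sketch is that the encoding has to be supplied from outside;
it cannot be recovered reliably from the percent-encoded bytes alone.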

> So maybe that helps explain it,
> but I don't know how to join the dots to make it work out of the box
> with "asUrl retrieveContents"
> 
> cheers -ben