[Pharo-users] Working with urls that contain non latin characters

2018-07-27 Thread Offray Vladimir Luna Cárdenas
Hi,

I was ready to show a friend the Pharo web capabilities with the
classical "myString asUrl retrieveContents", but the friend gave me a
url that contains non-Latin characters[1] and then I got a
ZnInvalidUTF8 error.

[1]
http://www.bidchance.com/freesearch.do?&filetype=&channel=&currentpage=1&searchtype=zb&queryword=%BF%A6%CA%B2&displayStyle=&pstate=&field=&leftday=&province=&bidfile=&project=&heshi=&recommend=&field=&jing=&starttime=&endtime=&attachment=

How can I process web addresses in Pharo that contain non-Latin
characters like the one in [1]?

Thanks,

Offray






Re: [Pharo-users] Working with urls that contain non latin characters

2018-07-27 Thread Herbert Vojčík




Offray Vladimir Luna Cárdenas wrote on 27. 7. 2018 12:39:

> Hi,
>
> I was ready to show a friend the Pharo web capabilities with the
> classical "myString asUrl retrieveContents", but the friend gave me a
> url that contains non Latin characters[1] and then I got an
> ZnInvalidUTF8 error.
>
> [1]
> http://www.bidchance.com/freesearch.do?&filetype=&channel=&currentpage=1&searchtype=zb&queryword=%BF%A6%CA%B2&displayStyle=&pstate=&field=&leftday=&province=&bidfile=&project=&heshi=&recommend=&field=&jing=&starttime=&endtime=&attachment=
>
> How can I process web addresses in Pharo that contain non latin
> characters like the one in [1]?

Maybe you can turn it into additional test case(s) for Zinc and upload
them, so the authors (or anyone else willing to) can take it from there
and fix the problem.


I also faintly remember running into some problems with Zinc when I
worked with it, and needing to devise workarounds for asUrl.



> Thanks,
>
> Offray




Re: [Pharo-users] Working with urls that contain non latin characters

2018-07-27 Thread Sven Van Caekenberghe
Hi Offray,

> On 27 Jul 2018, at 12:39, Offray Vladimir Luna Cárdenas wrote:
> 
> Hi,
> 
> I was ready to show a friend the Pharo web capabilities with the
> classical "myString asUrl retrieveContents", but the friend gave me a
> url that contains non Latin characters[1] and then I got an
> ZnInvalidUTF8 error.
> 
> [1]
> http://www.bidchance.com/freesearch.do?&filetype=&channel=&currentpage=1&searchtype=zb&queryword=%BF%A6%CA%B2&displayStyle=&pstate=&field=&leftday=&province=&bidfile=&project=&heshi=&recommend=&field=&jing=&starttime=&endtime=&attachment=
> 
> How can I process web addresses in Pharo that contain non latin
> characters like the one in [1]?

I am on holiday, so I cannot go too deep into this, but AFAIU the URL is wrong 
(or it assumes a specific context with a non-standard encoding).

In a URL's query part, non-ASCII data is first UTF-8 encoded, then percent 
encoded (this is the modern way).

I don't read Chinese, so it is hard to infer much from the original site, but I 
am assuming the search is for '喀什', a city called Kashgar, 
https://en.wikipedia.org/wiki/Kashgar_(disambiguation).

The string in question can be written as (to avoid copy/paste problems):

  String with: 21888 asCharacter with: 20160 asCharacter.

The encoding in a URL has to be:

  ZnPercentEncoder new encode: (String with: 21888 asCharacter with: 20160 asCharacter).

This gives us for example the following URL:

  'https://www.google.com/search?q=%E5%96%80%E4%BB%80' asUrl.

This parses OK, and the resulting URL object holds the query value in its
correctly decoded form:

  'https://www.google.com/search?q=%E5%96%80%E4%BB%80' asUrl queryAt: #q.

If you copy/paste that URL in your browser it should resolve to stuff about 
Kashgar.
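
For reference, the two steps described above (UTF-8 encode, then percent-encode) can be traced one at a time in a playground. This is only a sketch: it assumes a Pharo image where String>>utf8Encoded is available, and the variable names are made up for illustration.

```smalltalk
| kashgar encoded |
"Build the two-character string from its code points (U+5580 and U+4EC0)"
kashgar := String with: 21888 asCharacter with: 20160 asCharacter.

"Step 1: UTF-8 encoding turns each character into three bytes"
kashgar utf8Encoded. "#[229 150 128 228 187 128], i.e. E5 96 80 E4 BB 80"

"Step 2: percent-encoding writes each of those bytes as %XX"
encoded := ZnPercentEncoder new encode: kashgar. "'%E5%96%80%E4%BB%80'"

"Decoding reverses both steps and gives back the original string"
(ZnPercentEncoder new decode: encoded) = kashgar. "true"
```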

Obviously the website www.bidchance.com does something else (non-standard ?).

HTH,

Sven

> Thanks,
> 
> Offray




Re: [Pharo-users] Working with urls that contain non latin characters

2018-07-27 Thread Ben Coman
On 27 July 2018 at 18:39, Offray Vladimir Luna Cárdenas wrote:
> Hi,
>
> I was ready to show a friend the Pharo web capabilities with the
> classical "myString asUrl retrieveContents", but the friend gave me a
> url that contains non Latin characters[1] and then I got an
> ZnInvalidUTF8 error.
>
> [1]
> http://www.bidchance.com/freesearch.do?&filetype=&channel=&currentpage=1&searchtype=zb&queryword=%BF%A6%CA%B2&displayStyle=&pstate=&field=&leftday=&province=&bidfile=&project=&heshi=&recommend=&field=&jing=&starttime=&endtime=&attachment=
>
> How can I process web addresses in Pharo that contain non latin
> characters like the one in [1]?

Just some blind digging...

A few levels down the stack is a call equivalent to...
x := '%BF%A6%CA%B2'.
ZnPercentEncoder new decode: x.
which fails with the same error.

In #decode: we have...
bytes := #[191 166 202 178].

and browsing around I discovered a useful method...
encoder := ZnCharacterEncoder detectEncoding: bytes
"==> a ZnSimplifiedByteEncoder('iso88591' strict)"

now the following works...
(ZnPercentEncoder new characterEncoder: encoder ) decode: x.


So maybe that helps explain it,
but I don't know how to join the dots to make it work out of the box
with "asUrl retrieveContents"

cheers -ben



Re: [Pharo-users] Working with urls that contain non latin characters

2018-07-27 Thread Sven Van Caekenberghe



> On 27 Jul 2018, at 13:58, Ben Coman wrote:
> 
> On 27 July 2018 at 18:39, Offray Vladimir Luna Cárdenas wrote:
>> Hi,
>> 
>> I was ready to show a friend the Pharo web capabilities with the
>> classical "myString asUrl retrieveContents", but the friend gave me a
>> url that contains non Latin characters[1] and then I got an
>> ZnInvalidUTF8 error.
>> 
>> [1]
>> http://www.bidchance.com/freesearch.do?&filetype=&channel=&currentpage=1&searchtype=zb&queryword=%BF%A6%CA%B2&displayStyle=&pstate=&field=&leftday=&province=&bidfile=&project=&heshi=&recommend=&field=&jing=&starttime=&endtime=&attachment=
>> 
>> How can I process web addresses in Pharo that contain non latin
>> characters like the one in [1]?
> 
> Just some blind digging...
> 
> A few levels down the stack is a call equivalent to...
> x := '%BF%A6%CA%B2'.
> ZnPercentEncoder new decode: x.
> which fails with the same error.
> 
> In #decode we have...
> bytes := #[191 166 202 178].

Correct (as it is not legal UTF-8)

> and browsing around I discovered a useful method...
> encoder := ZnCharacterEncoder detectEncoding: bytes
> "==> a ZnSimplifiedByteEncoder('iso88591' strict)"
> 
> now the following works...
> (ZnPercentEncoder new characterEncoder: encoder) decode: x.

Right, but that guess is wrong (check the resulting string).

Since we are talking about Chinese characters, which lie outside the range
that #iso88591 (#latin1) can represent, that is to be expected.

Clearly, the original website http://www.bidchance.com knows the encoding ...

Again, to my understanding, without further context, when %BF%A6%CA%B2 is 
encountered in the query part of a URL, it is first percent decoded, then UTF-8 
decoded. That is what #asUrl assumes, and which leads to the error since that 
particular sequence, when interpreted like that, does not constitute a legal 
UTF-8 encoding.
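
To see the failure in isolation, the decode call quoted above can be wrapped in an error handler. This is a rough sketch for a playground, not a fix. The reason the sequence is rejected is that in UTF-8 any byte in the range 128..191 (binary 10xxxxxx) is a continuation byte and cannot start a character, and the first byte here is 191.

```smalltalk
| bytes |
"The percent-decoded bytes of %BF%A6%CA%B2"
bytes := #[191 166 202 178].

"The first byte falls in the continuation-byte range, so it cannot begin a UTF-8 character"
bytes first between: 128 and: 191. "true"

"Catch the error instead of opening the debugger; the handler answers its description"
[ ZnPercentEncoder new decode: '%BF%A6%CA%B2' ]
	on: ZnInvalidUTF8
	do: [ :error | error messageText ].
```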

> So maybe that helps explain it,
> but I don't know how to join the dots to make it work out of the box
> with "asUrl retrieveContents"
> 
> cheers -ben




Re: [Pharo-users] Working with urls that contain non latin characters

2018-07-27 Thread Sven Van Caekenberghe



> On 27 Jul 2018, at 15:25, Sven Van Caekenberghe wrote:
> 
> Right, but that guess is wrong (check the resulting string).

Actually, the encoding used is GBK, 
https://en.wikipedia.org/wiki/GBK_(character_encoding)

This is a variable length encoding used in China. It is not currently 
implemented (but could be added).

But even if we implemented it, it would not solve the current issue (we would 
not know that we had to use it).
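
One way to check what a given image supports is to ask ZnCharacterEncoder for its list of known encodings. This is a sketch under assumptions: it relies on the class-side method knownEncodingIdentifiers being present in your Zinc version, and on identifiers being stored in a lower-case form without punctuation (as in the 'iso88591' printed earlier in this thread), so 'gbk' is a guess at how a GBK encoder would be named if one existed.

```smalltalk
| identifiers |
"Collect the identifiers of all character encodings known to this image"
identifiers := ZnCharacterEncoder knownEncodingIdentifiers.

"Check for a GBK encoder; 'gbk' is an assumed identifier, not a confirmed one"
identifiers includes: 'gbk'.

"For comparison, the single-byte encoding that detectEncoding: guessed"
identifiers includes: 'iso88591'.
```

Even where such a check answers true, the URL itself carries no hint about which encoding its percent-encoded bytes use, so the caller still has to know that in advance.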