Re: [rfbproto] [PATCH] Specify UTF-8 for strings

Peter Rosin Mon, 17 Aug 2009 05:43:19 -0700

On 2009-08-17 12:47, Adam Tkac wrote:
> On Mon, Aug 17, 2009 at 11:20:08AM +0200, Peter Rosin wrote:
>> On 2009-08-17 10:59, Adam Tkac wrote:
>>> On Mon, Aug 17, 2009 at 10:22:56AM +0200, Peter Rosin wrote:
>>>>>> If it is so natural with UTF-8 and if it really is the only sane choice
>>>>>> (I think it is), it's enough if our spec says (e.g.)
>>>>>>
>>>>>>     It is strongly recommended that all implementations use
>>>>>>     UTF-8 for all strings (unless explicitly stated otherwise)
>>>>>>     to ensure interoperability. But be prepared that not all
>>>>>>     implementations do, so fail gracefully if you receive
>>>>>>     something else.
>>>>>>
>>>>>> instead of (e.g.)
>>>>>>
>>>>>>     All implementations MUST use UTF-8 for all strings (unless
>>>>>>     explicitly stated otherwise). But not all implementations
>>>>>>     do, so you SHOULD fail gracefully if you receive something
>>>>>>     else.
>>>>>>
>>>>>> I just don't see why the wording with MUST/SHOULD is so superior
>>>>>> that it is worth rendering existing implementations incompatible
>>>>>> with our spec.
>>>>> This is ok with me. I don't think there's any difference in practice.
>>>> Oh, cool. Pierre previously asked if I had any alternative wording,
>>>> so here is my suggestion:
>>>>
>>>> diff --git a/rfbproto.rst b/rfbproto.rst
>>>> index 7852746..0252e4f 100644
>>>> --- a/rfbproto.rst
>>>> +++ b/rfbproto.rst
>>>> @@ -201,6 +201,26 @@ that you contact RealVNC Ltd to make sure that your encodin
>>>>  security types do not clash. Please see the RealVNC website at
>>>>   http://www.realvnc.com for details of how to contact them.
>>>>
>>>> +String Encodings
>>>> +================
>>>> +
>>>> +It is strongly recommended that strings in RFB are encoded using the
>>>> +UTF-8 encoding. This allows full unicode support, yet retains good
>>>> +compatibility with older RFB implementations.
>>>> +
>>>> +The encoding used for strings in the protocol has historically often
>>>> +been unspecified, or has changed between versions of the protocol. As a
>>>> +result, there are a lot of implementations which use different,
>>>> +incompatible encodings. Commonly those encodings have been ISO 8859-1
>>>> +(also known as Latin-1) or Windows code pages.
>>>> +
>>>> +Clients and servers are encouraged to send UTF-8 strings unless that
>>>> +particular part of the protocol mandates another encoding. They should
>>>> +however be prepared to receive invalid UTF-8 sequences at all times.
>>>> +Such sequences should be handled gracefully by e.g. stripping the
>>>> +invalid portions or trying to interpret the string using common
>>>> +encodings such as ISO 8859-1 or Windows code page 1252.
>>>> +
>>> Hm, it is easy to say "invalid portions" of a UTF-8 string, but it is
>>> _very_ hard to create an algorithm which will determine whether a part
>>> of a string is valid or invalid. If you are using UTF-8, users might
>>> create strings with "obscure" characters. I think this kind of heuristic
>>> should not be included in the protocol.
>> The only thing I changed from the original patch (by Pierre) in the
>> last three lines was to add "e.g.", so that implementors
>> would have a choice of doing something else if they liked to.
>>
>> But is it really hard to determine UTF-8 validity? I think that is
>> exactly one of the nice properties of UTF-8. Quoting from the UTF-8
>> article on Wikipedia:
>>
>>      Because the starting and continuation bytes are distinct sets,
>>      UTF-8 is "self-synchronizing". Character boundaries are easily
>>      found when searching either forwards or backwards. If bytes
>>      are lost due to error or corruption, one can always locate
>>      the beginning of the next character and thus limit the damage.
>>      Many multi-byte encodings are much harder to resynchronize.
>>
>> Or are you talking about something else?
>>
>>> If an implementation sends strings in, for example, the ISO 8859-*
>>> encoding it will end with crippled characters but we have to live
>>> with it, there is probably no algorithm to solve this problem.
>> You could have an option that says, "if a string has errors according
>> to UTF-8, treat it as ISO 8859-1" (substitute for your preferred
>> encoding).
> 
> Yes, something like that sounds better to me. I attached an improved (I
> hope it is an improvement ;)) specification of strings.


*snip*

> +All new implementations should encode strings in UTF-8 unless the

Sorry, but it's not an improvement if you reintroduce either of the
magic words SHOULD and MUST in this context. I'm obviously talking
to deaf ears.

And my suggested "option" was just some configuration option in some
implementation, I did not intend for that to go into the spec. I agree
with Peter Åstrand that the specific names of any "fallback" encodings
should probably be left out of the spec.
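
To make concrete what I mean by an implementation-level option, here is
a rough sketch in Python. It is only an illustration: the function name
decode_rfb_string and its fallback parameter are made up for this mail,
and nothing in it comes from the spec or from any existing
implementation. It relies on the fact that a strict UTF-8 decoder
rejects invalid sequences, so validity is easy to check.

```python
from typing import Optional


def decode_rfb_string(data: bytes, fallback: Optional[str] = "iso-8859-1") -> str:
    """Decode a string received over RFB (hypothetical helper).

    `fallback` stands in for a per-implementation configuration option,
    not for anything the spec would name.
    """
    try:
        # Strict decoding raises on any invalid UTF-8 sequence.
        return data.decode("utf-8")
    except UnicodeDecodeError:
        if fallback is not None:
            # "If a string has errors according to UTF-8, treat it as
            # <fallback>." Bytes the fallback cannot map are replaced.
            return data.decode(fallback, errors="replace")
        # No fallback configured: strip the invalid portions instead.
        return data.decode("utf-8", errors="ignore")
```

With the default setting, a name sent as ISO 8859-1 still comes out
readable, while valid UTF-8 is passed through untouched. Passing
fallback=None gives the "strip the invalid portions" behaviour instead.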

Cheers,
Peter
