Re: UTF8 / binary strings in dynamic languages

Jimmy Jones Wed, 21 Aug 2013 13:06:25 -0700

> 3. If the language string is an overloaded text/bytes type, as is
> regrettably quite common, what do we do then?
> 
> The current answer to this question is "send it as vbin". That's very
> safe, insofar as it won't throw any sort of encoding exception. It
> does not, however, always honor what I think is the user's more
> typical intention: produce an ascii string at the other end.


I guess the problem arises between dynamically and statically typed
languages. If you stay within the same language you don't notice anything,
but that rather defeats the object of AMQP!

> So for 3, I'd like to consider the possibility of, by default, sending
> ambiguous language strings as ascii rendered to amqp str16. This
> requires an encoding step that may produce errors. And maybe that's
> just too obnoxious! That's what I'd like to know.

I'm not convinced, but I'm prepared to be convinced. If I put a binary
value in a map and encoded it, some of the time it might be valid utf8
and other times not. Could this lead to a class of subtle bugs where a
receiver written in a statically typed language works most of the time,
when the value appears as a vbin, but fails on the occasions when it
"accidentally" appears as a str16?
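
To make the worry concrete, here is a rough sketch in Python. The helper
guess_amqp_type is hypothetical and is not the actual Qpid/Proton encoder;
it only assumes a sender that picks str16 whenever the bytes happen to
decode as UTF-8 and vbin otherwise.

```python
def guess_amqp_type(value: bytes) -> str:
    """Hypothetical heuristic: treat the value as text if it decodes as UTF-8."""
    try:
        value.decode("utf-8")
    except UnicodeDecodeError:
        return "vbin"
    return "str16"


# Two values from the same "binary" field, e.g. halves of a packed integer.
samples = [b"\x12\x34\x56\x78", b"\x9a\xbc\xde\xf0"]

# The wire type now depends on the data, not on what the sender meant.
wire_types = [guess_amqp_type(s) for s in samples]
for sample, wire_type in zip(samples, wire_types):
    print(sample.hex(), "->", wire_type)
```

The first sample is all 7-bit bytes, so it would go out as str16; the
second starts with a stray continuation byte, so it would go out as vbin.
A receiver that only expects one of the two types breaks on the other.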
 
> In summary, if we have a way to determine what the user wanted (text
> or bytes), we should try to carry that through on the wire. At the
> following URL I've tried to map out what type information we can get
> for each language. Please update it as you please.
> 
> https://cwiki.apache.org/confluence/display/qpid/Language+support+for+unambiguous+text+string+and+byte+array+types

I've just signed up, but don't seem to be able to edit the page? I'll
add the stuff about utf8::upgrade when I can edit.
 
> On Wed, Aug 21, 2013 at 8:44 AM, Jimmy Jones <jimmyjon...@gmx.co.uk> wrote:
>>> > AFAIK in perl, if you include unicode characters in a string it'll
>>> > set the utf8 flag. If you don't include any unicode characters (eg. 7
>>> > bit ascii, or raw bytes) the flag won't be set. So given a perl
>>> > scalar that doesn't contain any utf8 characters, you don't know if
>>> > its a textual string (str16) or a binary string (vbin). There is a
>>> > is_utf8_string function, but that'll only tell you if the string
>>> > would be valid utf8, but it could be a binary string that happens to
>>> > be valid utf8, so that's not really safe.
>>>
>>> You can explicitly mark it as utf8 using utf8::upgrade() though, right?
>>> Certainly I tried that in a simple test and the property in question was
>>> then sent as str16.
>>
>> Yes, if I as a user had a string that was textual, I could call 
>> utf8::upgrade() to ensure it got sent as str16. I guess this is similar in 
>> concept to calling setEncoding in C++, although maybe less natural in a 
>> dynamically typed language.
>
> It would be more reasonable to treat perl scalars as textual for our
> API if perl offered a good way to explicitly handle byte arrays. My
> (certainly insufficient) web browsing suggested that wasn't really
> available, or not in a form recommended for use. Any candidates for a
> serviceable explicitly-arbitrary-bytes-and-not-text-at-all "type" in
> perl?

Sorry, I don't know of any, although I'm no perl guru! I'll have another
look though.
