> On 14 Nov 2018, at 6:33 am, Ben Rubinstein via use-livecode
> <use-livecode@lists.runrev.com> wrote:
>
> That's really helpful - and in parts eye-opening - thanks Mark.
>
> I have a few follow-up questions.
>
> Does textEncode _always_ return a binary string? Or, if invoked with
> "CP1252", "ISO-8859-1", "MacRoman" or "Native", does it return a string?

Internally we have different types of values. MCStringRef is the one that contains either a buffer of native chars or a buffer of UTF-16 chars. There are others. For example, MCNumberRef holds either a 32-bit signed int or a double. These are returned by numeric operations, where there’s no string representation of the number. So:

    put 1.0 into tNumber # tNumber holds an MCStringRef
    put 1.0 + 0 into tNumber # tNumber holds an MCNumberRef

The return type of textEncode is an MCDataRef. This is a byte buffer, buffer size & byte count. So:

    put textEncode("foo", "UTF-8") into tFoo # tFoo holds MCDataRef

Then if we do something like:

    set the text of field "foo" to tFoo

tFoo is first converted to an MCStringRef. As it’s an MCDataRef, we just move the buffer over and say it’s a native encoded string. There’s no check to see whether it’s a UTF-8 string, no decoding as UTF-8, etc. Then the string is put into the field. If you remember that mergJSON issue you reported, where mergJSON returns UTF-8 data and it looked funny when you put it into a field, this is why.

>
> > CodepointOffset has signature 'integer codepointOffset(string)', so when you
> > pass a binary string (data) value to it, the data value gets converted to a
> > string by interpreting it as a sequence of bytes in the native encoding.
>
> OK - so one message I take are that in fact one should never invoke
> codepointOffset on a binary string. Should it actually throw an error in this
> case?

No. As mentioned above, values can move to and from different types according to the operations performed on them, and this is largely opaque to the scripter. If you do a text operation on a binary string then there’s an implicit conversion to a native encoded string. In 7+ you generally want to use codepoint where you previously used char, unless you know you are dealing with a binary string, in which case you use byte.

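To make the "looked funny" effect concrete, here is a minimal sketch of what happens at the byte level. It is written in Python purely as an illustration and is not engine code. It assumes the native encoding is CP1252 (on other platforms it could be MacRoman or ISO-8859-1), and the variable names are made up for the example.

```python
# Assumption: the platform's native encoding is CP1252 (as on Windows).
NATIVE = "cp1252"

text = "café!"

# Rough analogue of textEncode(text, "UTF-8"): an explicit byte buffer.
data = text.encode("utf-8")

# Rough analogue of the implicit data-to-string conversion: the bytes are
# simply relabelled as native text, with no attempt to decode them as UTF-8.
as_native = data.decode(NATIVE)

# Rough analogue of textDecode(data, "UTF-8"): the explicit route back to text.
decoded = data.decode("utf-8")

print(repr(as_native))  # the accented letter shows up as two native characters
print(repr(decoded))    # the original text

# Offsets differ as well: "é" is one codepoint but two UTF-8 bytes.
# (Python offsets are 0-based; LiveCode offsets are 1-based.)
codepoint_index = decoded.find("!")
byte_index = data.find(b"!")
print(codepoint_index, byte_index)
```

In LiveCode terms, the safe pattern would be to call textDecode on the data with the encoding you know it has before treating it as text, rather than relying on the implicit conversion.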
>
> By the same token, probably one should only use 'byte', 'byteOffset',
> 'byteToNum' etc with binary strings - would it be better, to avoid confusion,
> if char, offset, charToNum should refuse to operate on a binary string?

That would not be backwards compatible.

>
>> e.g. In the case of &, it can either take two data arguments, or two
>> string arguments. In this case, if both arguments are data, then the result
>> will be data. Otherwise both arguments will be converted to strings, and a
>> string returned.
> The second message I take is that one needs to be very careful, if operating
> on UTF8 or other binary strings, to avoid 'contaminating' them e.g. by
> concatenating with a simple quoted string, as this may cause it to be
> silently converted to a non-binary string. (I presume that 'put "simple
> string" after/before pBinaryString' will cause a conversion in the same way
> as "&"? What about 'put "!" into char x of pBinaryString?)

When concatenating, if both left and right are binary strings (MCDataRef) then neither is converted to a string. However, we do not currently have a way to declare a literal as a binary string (might be nice if we did!), so you would need to:

    put textEncode("simple string", "UTF-8") after pBinaryString

>
> The engine can tell whether a string is 'native' or UTF16. When the engine is
> converting a binary string to 'string', does it always interpret the source
> as the native 8-bit encoding, or does it have some heuristic to decide
> whether it would be more plausible to interpret the source as UTF16?

No, it does not try to interpret. ICU has a charset detector that will give you a list of possible charsets along with a confidence for each. It could be implemented as a separate API:

    get detectedTextEncodings(<binary string>, [<optional hint charset>]) -> array of charset/confidence pairs
    get bestDetectedTextEncoding(<binary string>, [<optional hint charset>]) -> charset

Feel free to feature request that!

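The concatenation rule discussed above (data & data stays data, anything else becomes a string) can be modelled in a few lines. This is only a sketch in Python of the rule as described in this thread, not the engine's implementation. The helpers `to_string` and `concat` are hypothetical names, and CP1252 is again assumed as the native encoding.

```python
# Assumption: the platform's native encoding is CP1252.
NATIVE = "cp1252"

def to_string(value):
    """Model of the implicit conversion: bytes are relabelled as native text."""
    return value.decode(NATIVE) if isinstance(value, bytes) else value

def concat(left, right):
    """Model of '&': data & data gives data, otherwise both become strings."""
    if isinstance(left, bytes) and isinstance(right, bytes):
        return left + right
    return to_string(left) + to_string(right)

binary = "naïve".encode("utf-8")  # a UTF-8 binary string

# Appending a plain literal "contaminates" the data: the result is a string,
# and the UTF-8 bytes have been read as native characters.
contaminated = concat(binary, " string")

# Encoding the literal first keeps everything as data.
kept = concat(binary, " string".encode("utf-8"))

print(type(contaminated).__name__, repr(contaminated))
print(type(kept).__name__, repr(kept.decode("utf-8")))
```

The second call mirrors the `put textEncode("simple string", "UTF-8") after pBinaryString` approach: both sides are data, so the result can still be decoded as UTF-8 later.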
Cheers

Monte
_______________________________________________
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode