Re: What is LC's internal text format?

Monte Goulding via use-livecode Tue, 13 Nov 2018 15:46:46 -0800


> On 14 Nov 2018, at 6:33 am, Ben Rubinstein via use-livecode 
> <use-livecode@lists.runrev.com> wrote:
> 
> That's really helpful - and in parts eye-opening - thanks Mark.
> 
> I have a few follow-up questions.
> 
> Does textEncode _always_ return a binary string? Or, if invoked with 
> "CP1252", "ISO-8859-1", "MacRoman" or "Native", does it return a string?


Internally we have different types of values. MCStringRef holds either a buffer 
of native chars or a buffer of UTF-16 chars. There are others. For example, 
MCNumberRef holds either a 32-bit signed int or a double. These are returned by 
numeric operations where there’s no string representation of a number. So:

put 1.0 into tNumber # tNumber holds an MCStringRef
put 1.0 + 0 into tNumber # tNumber holds an MCNumberRef

The return type of textEncode is an MCDataRef. This is a byte buffer, buffer 
size & byte count.

So:
put textEncode("foo", "UTF-8") into tFoo # tFoo holds MCDataRef

Then if we do something like:
set the text of field "foo" to tFoo

tFoo is first converted to MCStringRef. As it’s an MCDataRef we just move the 
buffer over and say it’s a native encoded string. There’s no check for whether 
the bytes are UTF-8, and no decoding is done.

Then the string is put into the field.

If you remember that mergJSON issue you reported, where mergJSON returns UTF-8 
data and you were putting it into a field and it looked funny, this is why.
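
A rough sketch of that effect, written in Python rather than LiveCode so it can 
be run anywhere. It is only a model of the conversion described above, not the 
engine's code, and it assumes the native encoding is CP1252 (on other platforms 
the native encoding differs):

```python
# Python stand-in for the implicit MCDataRef -> MCStringRef conversion.
# Assumption: the platform's native encoding is CP1252.
NATIVE = "cp1252"

text = "café"
data = text.encode("utf-8")        # roughly textEncode(text, "UTF-8")

# What the implicit conversion amounts to: every byte becomes one native char.
as_native = data.decode(NATIVE)

# What was actually wanted: decode the bytes as UTF-8 first,
# roughly textDecode(data, "UTF-8").
as_utf8 = data.decode("utf-8")

print(as_native)  # the two UTF-8 bytes of "é" show up as two separate chars
print(as_utf8)
```

In LiveCode terms the fix is to textDecode the data as UTF-8 before putting it 
into the field.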
> 
> > CodepointOffset has signature 'integer codepointOffset(string)', so when you
> > pass a binary string (data) value to it, the data value gets converted to a
> > string by interpreting it as a sequence of bytes in the native encoding.
> 
> OK - so one message I take are that in fact one should never invoke 
> codepointOffset on a binary string. Should it actually throw an error in this 
> case?

No. As mentioned above, values can move between types according to the 
operations performed on them, and this is largely opaque to the scripter. If 
you do a text operation on a binary string then there’s an implicit conversion 
to a native encoded string. In 7+ you generally want to use codepoint where you 
previously used char, unless you know you are dealing with a binary string, in 
which case you use byte.
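
To make the codepoint/byte distinction concrete, here is a small Python sketch 
(again only an illustration, with CP1252 assumed as the native encoding). The 
same text has one length in codepoints and another in UTF-8 bytes, and a text 
operation applied to the bytes after a native reinterpretation counts bytes, 
not characters:

```python
# Assumption: CP1252 stands in for the native encoding.
NATIVE = "cp1252"

text = "naïve €5"
data = text.encode("utf-8")

codepoint_count = len(text)                  # counting codepoints of the string
byte_count = len(data)                       # counting bytes of the binary data
misread_count = len(data.decode(NATIVE))     # text operation on misread bytes

print(codepoint_count, byte_count, misread_count)
```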
> 
> By the same token, probably one should only use 'byte', 'byteOffset', 
> 'byteToNum' etc with binary strings - would it be better, to avoid confusion, 
> if char, offset, charToNum should refuse to operate on a binary string?

That would not be backwards compatible.
> 
>> e.g. In the case of &, it can either take two data arguments, or two
>> string arguments. In this case, if both arguments are data, then the result
>> will be data. Otherwise both arguments will be converted to strings, and a
>> string returned.
> The second message I take is that one needs to be very careful, if operating 
> on UTF8 or other binary strings, to avoid 'contaminating' them e.g. by 
> concatenating with a simple quoted string, as this may cause it to be 
> silently converted to a non-binary string. (I presume that 'put "simple 
> string" after/before pBinaryString' will cause a conversion in the same way 
> as "&"? What about 'put "!" into char x of pBinaryString?)

When concatenating, if both left and right are binary strings (MCDataRef) then 
there’s no conversion of either to string. However, we do not currently have a 
way to declare a literal as a binary string (might be nice if we did!), so you 
would need to:

put textEncode("simple string", "UTF-8") after pBinaryString
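
The concatenation rule can be modelled in a few lines of Python. This is a 
sketch of the rule as described in this thread, not the engine's 
implementation; the concat helper and the CP1252 native encoding are 
assumptions made for illustration:

```python
# Assumption: CP1252 stands in for the native encoding.
NATIVE = "cp1252"

def to_string(value):
    """Convert data to a string by reading its bytes as native chars."""
    return value.decode(NATIVE) if isinstance(value, bytes) else value

def concat(left, right):
    """Hypothetical model of '&': data & data stays data, anything else is a string."""
    if isinstance(left, bytes) and isinstance(right, bytes):
        return left + right
    return to_string(left) + to_string(right)

binary = "é".encode("utf-8")

mixed = concat(binary, "!")                  # literal is a string: result is a string
clean = concat(binary, "!".encode("utf-8"))  # both data: result stays data

print(type(mixed).__name__, type(clean).__name__)
```

Encoding the literal first keeps the result binary, so it can still be decoded 
as UTF-8 later.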

> 
> The engine can tell whether a string is 'native' or UTF16. When the engine is 
> converting a binary string to 'string', does it always interpret the source 
> as the native 8-bit encoding, or does it have some heuristic to decide 
> whether it would be more plausible to interpret the source as UTF16?

No, it does not try to interpret. ICU has a charset detector that will give you 
a list of possible charsets, each with a confidence. It could be implemented as 
a separate api:

get detectedTextEncodings(<binary string>, [<optional hint charset>]) -> array 
of charset/confidence pairs

get bestDetectedTextEncoding(<binary string>, [<optional hint charset>]) -> 
charset

Feel free to feature request that!

Cheers

Monte


_______________________________________________
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode
