I don't *think* I'm confusing binary string/data with binary numbers -- I was just trying to illustrate that when a Latin Small Letter A (U+0061) gets encoded, somewhere there is stored (four bytes, one of which is) a byte 97, i.e. the bit sequence 1100001, unless computers don't work that way anymore.
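For anyone following along outside LiveCode, here is a minimal sketch of the same distinction in Python. It is only an illustration under stated assumptions: Python's "utf-32-le" codec stands in for textEncode(...,"UTF-32") (consistent with the <97,0,0,0> result quoted below), and "latin-1" stands in for LiveCode's native encoding, which actually varies by platform.

```python
# Python stand-in for the LiveCode calls discussed in this thread.
# Assumptions: "utf-32-le" ~ textEncode(..., "UTF-32"); "latin-1" ~ native encoding.
data = "a".encode("utf-32-le")

first_byte = data[0:1]                          # one-byte binary string, like "byte 1 of X"
as_number = first_byte[0]                       # analogue of byteToNum(byte 1 of X)
as_native_char = first_byte.decode("latin-1")   # the byte reinterpreted as a native char

print(list(data))                           # [97, 0, 0, 0]
print(first_byte)                           # b'a' (still binary data, not a number)
print(as_number)                            # 97
print(bin(as_number))                       # 0b1100001 (97 rendered in base 2)
print(as_native_char, ord(as_native_char))  # a 97
```

The point of the sketch is that the one-byte slice, the number 97, the character "a" and the base-2 rendering are four different values that need explicit conversions between them.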
What I now see is tripping me up is the implicit cast to a character you're saying that charToNum supports, without the corresponding cast to a number supported in numToChar -- i.e. this fails:

put textEncode("a","UTF-32") into X;put numtochar(byte 1 of X)

while this works:

put textEncode("a","UTF-32") into X;put numtochar(bytetonum(byte 1 of X))

Thanks for the insight,

Geoff

On Tue, Nov 13, 2018 at 12:03 AM Mark Waddingham via use-livecode <use-livecode@lists.runrev.com> wrote:

> On 2018-11-13 08:35, Geoff Canyon via use-livecode wrote:
> > So then why does put textEncode("a","UTF-32") into X;put chartonum(byte 1 of X) put 97?
>
> Because:
>
> 1) textEncode("a", "UTF-32") produces the byte sequence <97,0,0,0>
> 2) byte 1 of <97,0,0,0> is <97>
> 3) charToNum(<97>) first converts the byte <97> into a native string which is "a" (as the 97 is the code for 'a' in the native encoding table), then converts that (native) char to a number -> 97
>
> > That implies that "byte" 1 is "a", not 1100001.
>
> 1100001 is 97 but printed in base-2.
>
> FWIW, I think you are confusing 'binary string' with 'binary number' - these are not the same thing.
>
> A 'binary string' (internally the data type is 'Data') is a sequence of bytes (just as a 'string' is a sequence of characters/codepoints/codeunits).
>
> A 'binary number' is a number which has been rendered to a string with base-2.
>
> Bytes are like characters (and codepoints, and codeunits) in that they are 'abstract' things - they aren't numbers, and have no direct conversion to them - which is why we have byteToNum, numToByte, nativeCharToNum, numToNativeChar, codepointToNum and numToCodepoint.
>
> The charToNum and numToChar functions are actually deprecated / considered legacy - as their function (when useUnicode is set to true) depends on processing unicode text as binary data - which isn't how unicode works post-7 (indeed, there was no way to fold their behavior into the new model - hence the deprecation, and replacement with nativeCharToNum / numToNativeChar).
>
> You'll notice that there is no modern 'charToNum'/'numToChar' - just 'codepointToNum'/'numToCodepoint'. A codepoint is an index into the (large - 21-bit) Unicode code table; Unicode characters can be composed of multiple codepoints (e.g. [e,combining-acute]) and thus don't have a 'number' per-se.
>
> Warmest Regards,
>
> Mark.
>
> > I've looked in the dictionary and I don't see anything that comes close to describing this.
> >
> > gc
> >
> > On Mon, Nov 12, 2018 at 10:21 PM Mark Waddingham via use-livecode <use-livecode@lists.runrev.com> wrote:
> >
> >> On 2018-11-13 07:15, Geoff Canyon via use-livecode wrote:
> >> > On Mon, Nov 12, 2018 at 3:50 PM Monte Goulding via use-livecode <use-livecode@lists.runrev.com> wrote:
> >> > Unless I'm misunderstanding, this hasn't been my observation. Using offset on a string that has been textEncode()d to UTF-32 returns values that are 4 * (the character offset - 1) + 1 -- if it were re-encoded, wouldn't it return the actual offsets (except when it fails)? Also, 𐀁 encodes to 00010001, and routines that convert to UTF-32 and then use offset will find five instances of that character in the UTF-32 encoding because of improper boundaries.
> >> > To see this, run this code:
> >> >
> >> > on mouseUp
> >> >    put textencode("𐀁","UTF-32") into X
> >> >    put textencode("𐀁𐀁𐀁","UTF-32") into Y
> >> >    put offset(X,Y,1)
> >> > end mouseUp
> >> >
> >> > That will return 2, meaning that it found the encoding for X starting at character 2 + 1 = 3 of Y. In other words, it found X using the last half of the first "𐀁" and the first half of the second "𐀁"
> >>
> >> The textEncode function generates binary data which is composed of bytes. When you use binary data in a text function (which offset is), the engine uses a compatibility conversion which treats the sequence of bytes as a sequence of native characters (this preserves what happened pre-7.0 when strings were only ever native, and as such binary and string were essentially the same thing).
> >>
> >> So if you textEncode a 1 (native) character string as UTF-32, you will get a four byte string, which will then turn back into a 4 (native) character string when passed to offset.
> >>
> >> Warmest Regards,
> >>
> >> Mark.
> >>
> >> --
> >> Mark Waddingham ~ m...@livecode.com ~ http://www.livecode.com/
> >> LiveCode: Everyone can create apps
> >>
> >> _______________________________________________
> >> use-livecode mailing list
> >> use-livecode@lists.runrev.com
> >> Please visit this url to subscribe, unsubscribe and manage your subscription preferences:
> >> http://lists.runrev.com/mailman/listinfo/use-livecode
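The misaligned match in the offset example quoted above can be reproduced outside LiveCode. The following is a rough Python sketch, not LiveCode's implementation: "utf-32-le" is assumed to match what textEncode(...,"UTF-32") emits, and find_all is a hypothetical helper written just for this illustration. It compares a byte-by-byte search with one that only tests 4-byte boundaries.

```python
def find_all(needle, haystack, step=1):
    """Return 1-based byte positions of needle in haystack, testing every `step` bytes."""
    return [i + 1
            for i in range(0, len(haystack) - len(needle) + 1, step)
            if haystack[i:i + len(needle)] == needle]

x = "\U00010001".encode("utf-32-le")        # one character  -> bytes 01 00 01 00
y = ("\U00010001" * 3).encode("utf-32-le")  # three characters -> 12 bytes

bytewise = find_all(x, y)           # treats the data as a run of single bytes
aligned = find_all(x, y, step=4)    # only considers 4-byte code unit boundaries

print(bytewise)                                # [1, 3, 5, 7, 9]
print(aligned)                                 # [1, 5, 9]
print([(p - 1) // 4 + 1 for p in aligned])     # [1, 2, 3] as character positions
print(next(p for p in bytewise if p > 1) - 1)  # 2, like offset(X,Y,1) in the thread
```

The byte-wise search reports five hits for three characters because positions 3 and 7 straddle two adjacent code units. Restricting candidates to positions of the form 4 * (n - 1) + 1 is one way to discard those false matches.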