Oh. Now I know why I kept getting beaten up during class as a kid - because I'd ask some question and then the teacher would do a Mark - and then ALL of it would end up on the test.
On Fri, Jun 23, 2017 at 5:09 AM, Mark Waddingham via use-livecode <use-livecode@lists.runrev.com> wrote:

> On 2017-06-23 03:07, Peter W A Wood via use-livecode wrote:
>
>> Some Unicode characters, such as emojis, have to be represented by two
>> codepoints in UTF-16 (known as surrogates) so they take four bytes not
>> two. Additionally, the number of bytes for characters with accents
>> will take either one codepoint or two depending on whether they have
>> been coded in pre-composed or decomposed form (e.g. ç can be either
>> U+0063 U+0327 (decomposed) or U+00E7 (precomposed)).
>>
>> So it isn't easy to estimate the number of bytes in a UTF-16 string.
>>
>
> The number of bytes used by a string when encoded as UTF-16 is '2 * the
> number of codeunits in tString'.
>
> The number of codeunits in a string in LiveCode is a stored property of
> the string, so doesn't require any computation. (We took the decision that
> regardless of how a string is stored internally, it should always be
> possible to ask for the number of codeunits in constant time, and to be
> able to look up a codeunit in constant time).
>
> Note: codeunit is not the same as codepoint and codepoint is not the same
> as character. Both codepoint and character require scanning the string (in
> the general case) to both compute the i'th one, and to compute the length.
>
> In contrast (to UTF-16), if you want the number of bytes a string takes up
> in UTF-8 encoding then you also have to scan the string as a codepoint in
> UTF-8 can be 1-4 bytes in length.
>
>> I would guess that LiveCode will store the characters of a string in
>> single bytes if all the letters of the string conform to ISO-8859-1.
>> So if you can be certain that your text is all ISO-8859-1 encoded, you
>> can estimate at 1 byte per character. (The guess is based on the fact
>> that the first 256 Unicode code points replicate ISO-8859-1).
>>
>
> Almost true - the engine stores strings which can be fit into the running
> platform's 'legacy' (in terms of pre 7.0) encoding (ISO8859-1, Latin-1,
> MacRoman) in that encoding in memory. This means that stacks written
> pre-unicode will use the same amount of memory, same amount of processing
> time as they did before.
>
> The reason this works is because all three of those encodings have the
> property that when they are converted to Unicode, the number of codeunits
> in the Unicode version is the same as the number of codes (indeed, bytes in
> this case) in the original string.
>
> Warmest Regards,
>
> Mark.
>
> --
> Mark Waddingham ~ m...@livecode.com ~ http://www.livecode.com/
> LiveCode: Everyone can create apps
>
> _______________________________________________
> use-livecode mailing list
> use-livecode@lists.runrev.com
> Please visit this url to subscribe, unsubscribe and manage your
> subscription preferences:
> http://lists.runrev.com/mailman/listinfo/use-livecode
>

--
On the first day, God created the heavens and the Earth
On the second day, God created the oceans.
On the third day, God put the animals on hold for a few hours,
   and did a little diving.
And God said, "This is good."
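
For anyone who wants to check the arithmetic in the quoted message, here is a minimal sketch in Python (not LiveCode, and not how the engine does it internally). It assumes Python's "latin-1" and "mac_roman" codecs are reasonable stand-ins for the legacy encodings Mark mentions; the helper names utf16_code_units, utf16_bytes, utf8_bytes and fits_legacy are made up for illustration.

```python
def utf16_code_units(s: str) -> int:
    """Count UTF-16 code units: 1 per BMP code point, 2 (a surrogate pair) otherwise."""
    return sum(2 if ord(ch) > 0xFFFF else 1 for ch in s)


def utf16_bytes(s: str) -> int:
    """Size in bytes when encoded as UTF-16 without a BOM: always 2 * code units."""
    return 2 * utf16_code_units(s)


def utf8_bytes(s: str) -> int:
    """Size in bytes when encoded as UTF-8; each code point takes 1 to 4 bytes."""
    return len(s.encode("utf-8"))


def fits_legacy(s: str, encoding: str = "latin-1") -> bool:
    """True if every character of s can be stored in the given single-byte encoding."""
    try:
        s.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


precomposed = "\u00e7"   # c-cedilla as a single code point (U+00E7)
decomposed = "c\u0327"   # 'c' followed by a combining cedilla (U+0063 U+0327)
emoji = "\U0001F600"     # outside the BMP, so UTF-16 needs a surrogate pair

# Same visible character, different storage depending on normalization form.
assert (utf16_bytes(precomposed), utf8_bytes(precomposed)) == (2, 2)
assert (utf16_bytes(decomposed), utf8_bytes(decomposed)) == (4, 3)

# One code point (len() counts code points in Python), but two UTF-16 code units.
assert len(emoji) == 1 and utf16_code_units(emoji) == 2

# For text that fits a single-byte legacy encoding, byte count == code unit count.
legacy_text = "caf\u00e9"
assert fits_legacy(legacy_text)
assert len(legacy_text.encode("latin-1")) == utf16_code_units(legacy_text)
```

Note that Python has to walk the string to count code units here, whereas Mark says LiveCode keeps that count as a stored property of the string, so the UTF-16 byte size is available there without a scan.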