Re: How to find offsets in Unicode Text fast

Mark Waddingham via use-livecode Tue, 13 Nov 2018 00:29:21 -0800

On 2018-11-13 01:06, Geoff Canyon via use-livecode wrote:

On Mon, Nov 12, 2018 at 11:36 AM Ben Rubinstein via use-livecode <
use-livecode@lists.runrev.com> wrote:
I'm really confused that case-insensitive should work at all for UTF-16 or
UTF-32;

The caseSensitive (and formSensitive) properties only apply to strings, *not* binary strings.


The output of textEncode() is a binary string.

The 'is' operator is overloaded - in strict order:

  left-empty 'is' right-ANY -- returns is-empty(right-ANY)
  left-ANY 'is' right-empty -- returns is-empty(left-ANY)
  left-array 'is' right-array -- compare as array
  left-number 'is' right-number -- compare as number
  left-numeric-[binary]-string 'is' right-numeric-[binary]-string -- compare as number
  left-binary-string 'is' right-binary-string -- compare as binary strings
  left-any 'is' right-any -- compare as strings
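
A few of the rules in the list above can be exercised directly. This is only a sketch: the variable names are made up for illustration, and it assumes a LiveCode 7+ engine where textEncode() returns a binary string.

```livecode
on mouseUp
   local tNumStrings, tNumBinary, tTextBinary, tTextStrings
   set the caseSensitive to false

   -- numeric strings on both sides: compared as numbers
   put ("1.0" is "1") into tNumStrings

   -- numeric binary strings on both sides: also compared as numbers
   put (textEncode("1.0", "UTF-8") is textEncode("1", "UTF-8")) into tNumBinary

   -- non-numeric binary strings: compared byte for byte
   put (textEncode("A", "UTF-8") is textEncode("a", "UTF-8")) into tTextBinary

   -- anything else: compared as strings, honouring caseSensitive
   put ("A" is "a") into tTextStrings

   put tNumStrings && tNumBinary && tTextBinary && tTextStrings
end mouseUp
```

Going by the list, this should put: true true false true.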

Also concatenation, put after and put before are overloaded:

   binary-string & binary-string -> binary-string
   string & ANY -> string
   ANY & string -> string

   put src-data after|before dst-data -> dst-data is binary-string
   put src-ANY after|before dst-ANY -> dst-ANY is string
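
The concatenation rules above can be observed through 'is', since the result type decides how the comparison is done. A minimal sketch, with illustrative variable names and again assuming LiveCode 7+:

```livecode
on mouseUp
   local tUpper, tLower, tBinaryJoin, tStringJoin
   set the caseSensitive to false
   put textEncode("A", "UTF-8") into tUpper
   put textEncode("a", "UTF-8") into tLower

   -- binary-string & binary-string stays binary, so 'is' compares bytes
   put ((tUpper & tUpper) is (tLower & tLower)) into tBinaryJoin

   -- binary-string & string becomes a string, so 'is' honours caseSensitive
   put ((tUpper & "x") is (tLower & "x")) into tStringJoin

   put tBinaryJoin && tStringJoin
end mouseUp
```

If the rules apply as described, the first comparison is false and the second is true.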

This is so puzzling. I tried this code in a button:

on mouseUp
   put "Ѡ" into x
   put "ѡ" into y
   --put ("Ѡ" is "ѡ") && (x is y)
   --exit mouseUp
   put textencode("Ѡ","UTF-32") into xBig
   put textencode("ѡ","UTF-32") into xSmall
   repeat for each byte B in xBig
      put B after yBig
   end repeat
   repeat for each byte B in xSmall
      put B after ySmall
   end repeat
   put "Ѡ" into zBig
   put "ѡ" into zSmall
   put zBig into wBig
   put zSmall into wSmall
   put textencode(zBig,"UTF-32") into zBig
   put textencode(zSmall,"UTF-32") into zSmall
   put x into j
   put y into k
   set caseSensitive to false
   put ("Ѡ" is "ѡ") && (xBig is xSmall) && (yBig is ySmall) && \
         (zBig is zSmall) && (wBig is wSmall) && (x is y) && (j is k)
end mouseUp


That puts: true false false false true true true

Things to note:

1. "Ѡ" and "ѡ" are upper and lower case omega in Cyrillic, 00000460 and
00000461. Given the string literals, LC is happy to say they are the same
(the first true).
2. Put them in a variable, and LC is happy to say they are the same
(the second-to-last true).
3. Convert them to UTF-32 and LC no longer recognizes them as the same
(the fourth boolean, false).
4. Put the variables into other variables, and LC identifies them as the
same (the last true).


("Ѡ" is "ѡ") is true because they are both strings

(xBig is xSmall) is false because both sides are binary-strings (and so compare byte for byte)

(yBig is ySmall) is false because both sides are binary-strings

(zBig is zSmall) is false because textEncode() turns each string into a binary-string, so both sides are binary-strings

(wBig is wSmall) is true because both sides are strings
(x is y) is true because both sides are strings
(j is k) is true because both sides are strings
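
If a case-insensitive comparison of encoded text is what is wanted, one option is to turn the data back into strings with textDecode() before comparing. A sketch only: the function name encodedTextIs and its parameters are hypothetical, not something the engine provides.

```livecode
-- Hypothetical helper: compares two pieces of encoded text as strings
-- rather than as binary data. pEncoding must match how the data was encoded.
function encodedTextIs pLeftData, pRightData, pEncoding
   -- caseSensitive is local to this handler, so the caller's setting is untouched
   set the caseSensitive to false
   return textDecode(pLeftData, pEncoding) is textDecode(pRightData, pEncoding)
end encodedTextIs

on mouseUp
   local tBig, tSmall
   put textEncode("Ѡ", "UTF-32") into tBig
   put textEncode("ѡ", "UTF-32") into tSmall
   -- both sides are strings again by the time 'is' runs
   put encodedTextIs(tBig, tSmall, "UTF-32")
end mouseUp
```

Since ("Ѡ" is "ѡ") is true for strings, this should put true.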

One could argue that 'is'/'is not' should never have been overloaded to do binary string comparison - and that should have perhaps been added as a separate operator (especially since binary strings are compared as numbers if numeric). With hindsight I'd probably agree as it is a slight discontinuity in terms of comparison with pre-7.

Indeed, had we not added that overload then we would not be having this discussion - it would have been a similar discussion as used to come up a lot with comparing the output of compress() and other functions which have always produced binary data - and why comparisons seemed 'not as one would expect'.


Warmest Regards,

Mark.

--
Mark Waddingham ~ m...@livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps

