Re: First 1000 characters without loop?

2017-06-23 Thread Richard Gaskin via use-livecode

Mark Waddingham wrote:

> On 2017-06-22 23:18, Richard Gaskin wrote:
>> With many chunk expressions, I would imagine it does.  With line
>> chunks, for example, the engine needs to walk through the string,
>> comparing each character to CR, counting the found CRs as it goes.
>
> Yes - essentially that is the case (although technically it looks for
> LF, not CR as currently - for better or for worse - the engine assumes
> line means LF as the separator, and normalizes line endings
> appropriately on a per-platform basis when you 'import' things as text
> into LiveCode).

Yes, where I wrote "CR" I meant the LiveCode constant, which 
historically has been synonymous with LF - has that changed?


--
 Richard Gaskin
 Fourth World Systems
 Software Design and Development for the Desktop, Mobile, and the Web
 
 ambassa...@fourthworld.com        http://www.FourthWorld.com

___
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode


Re: First 1000 characters without loop?

2017-06-23 Thread Richard Gaskin via use-livecode

Mark Waddingham wrote:

> On 2017-06-23 03:19, Richard Gaskin via use-livecode wrote:
>> Seems murky.  I'd much rather at least have something like a byteLen
>> function, which returns the number of bytes for a given string.  With
>> that I can maintain byte offsets into a file with good performance
>> and no ambiguity.
>
> You do:
>
>    the number of bytes in textEncode(tString, <encoding>)

Ah, yes - of course.  Thanks for the reminder.  I keep forgetting about 
the byte chunk type.


--
 Richard Gaskin
 Fourth World Systems
 Software Design and Development for the Desktop, Mobile, and the Web
 
 ambassa...@fourthworld.com        http://www.FourthWorld.com



Re: First 1000 characters without loop?

2017-06-23 Thread Jerry Jensen via use-livecode
I like it when Mark Does A Mark.

> On Jun 23, 2017, at 12:29 PM, J. Landman Gay via use-livecode 
>  wrote:
> 
> "Do a Mark" :-) That will become a part of our vocabulary here.
> 
> 
> 
> On June 23, 2017 8:15:31 AM Mike Kerner via use-livecode 
>  wrote:
> 
>> Oh.
>> Now I know why I kept getting beaten up during class as a kid - because I'd
>> ask some question and then the teacher would do a Mark - and then ALL of it
>> would end up on the test.
> 
> --
> Jacqueline Landman Gay | jac...@hyperactivesw.com
> HyperActive Software   | http://www.hyperactivesw.com
> 
> 
> 


Re: First 1000 characters without loop?

2017-06-23 Thread J. Landman Gay via use-livecode

"Do a Mark" :-) That will become a part of our vocabulary here.



On June 23, 2017 8:15:31 AM Mike Kerner via use-livecode 
 wrote:



Oh.
Now I know why I kept getting beaten up during class as a kid - because I'd
ask some question and then the teacher would do a Mark - and then ALL of it
would end up on the test.


--
Jacqueline Landman Gay | jac...@hyperactivesw.com
HyperActive Software   | http://www.hyperactivesw.com





Re: First 1000 characters without loop?

2017-06-23 Thread Bob Sneidar via use-livecode
Wouldn't it be cool if you could:

repeat for each charChunk(1000) tChunk in tVar
...

Just dreaming. Use case is minimal. 
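
Something close to this can probably be approximated in script today. The sketch below is only an illustration: charChunks and processChunks are hypothetical helper names, not engine features. It splits a string into a numerically keyed array of fixed-size character chunks and then walks them by index.

```livecode
-- Hypothetical helper: returns a numerically keyed array of pSize-character chunks.
function charChunks pText, pSize
   local tChunks, tCount, tStart
   put 0 into tCount
   repeat with tStart = 1 to the number of chars in pText step pSize
      add 1 to tCount
      -- a range running past the end of pText is simply cut short
      put char tStart to (tStart + pSize - 1) of pText into tChunks[tCount]
   end repeat
   return tChunks
end charChunks

-- Example use: walk the chunks in order.
command processChunks pText
   local tChunks, tIndex
   put charChunks(pText, 1000) into tChunks
   repeat with tIndex = 1 to the number of elements in tChunks
      -- do something with tChunks[tIndex]
   end repeat
end processChunks
```

Walking by index rather than with "repeat for each element" is a deliberate choice here, since the order in which array elements are visited should not be assumed.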

Bob S


> On Jun 23, 2017, at 08:15 , Bob Sneidar via use-livecode 
>  wrote:
> 
> For a loop you would do something like:
> 
> repeat with i = 1 to the number of chars of tVar step 1000
>  put char i to i+999 of tVar into tVar2
>  -- do something with tVar2
> end repeat
> 
> bob s




Re: First 1000 characters without loop?

2017-06-23 Thread Mike Kerner via use-livecode
or, what monte was suggesting:
put var1 into var2
delete char 1001 to -1 of var2
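
If only the first 1000 characters are wanted and var1 should stay untouched, the same result can probably be had in one statement with a chunk range. A minimal sketch, reusing the variable names above (the handler name firstThousand and the sample text are made up):

```livecode
command firstThousand
   local var1, var2
   put "some long text" into var1
   -- copy just the leading range instead of copying everything and trimming
   put char 1 to 1000 of var1 into var2
end firstThousand
```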

On Fri, Jun 23, 2017 at 11:15 AM, Bob Sneidar via use-livecode <
use-livecode@lists.runrev.com> wrote:

> For a loop you would do something like:
>
> repeat with i = 1 to the number of chars of tVar step 1000
>   put char i to i+999 of tVar into tVar2
>   -- do something with tVar2
> end repeat
>
> bob s
>
>
> > On Jun 22, 2017, at 12:36 , Devin Asay via use-livecode <
> use-livecode@lists.runrev.com> wrote:
> >
> > Hi Devin & Mark,
> >
> > Thanks for this solution.
> >
> > Does that statement create an implied loop?
>
>
>



-- 
On the first day, God created the heavens and the Earth
On the second day, God created the oceans.
On the third day, God put the animals on hold for a few hours,
   and did a little diving.
And God said, "This is good."


Re: First 1000 characters without loop?

2017-06-23 Thread Bob Sneidar via use-livecode
For a loop you would do something like:

repeat with i = 1 to the number of chars of tVar step 1000
  put char i to i+999 of tVar into tVar2
  -- do something with tVar2
end repeat

bob s


> On Jun 22, 2017, at 12:36 , Devin Asay via use-livecode 
>  wrote:
> 
> Hi Devin & Mark,
> 
> Thanks for this solution.
> 
> Does that statement create an implied loop?




Re: First 1000 characters without loop?

2017-06-23 Thread Rick Harrison via use-livecode
Hi Mark,

Thank you for your verbose answers to questions.
That’s really really deep stuff!

I’m so thankful that the engine takes care of all
of this stuff so that the rest of us don’t have to!

Cheers,

Rick


Re: First 1000 characters without loop?

2017-06-23 Thread Mike Kerner via use-livecode
Oh.
Now I know why I kept getting beaten up during class as a kid - because I'd
ask some question and then the teacher would do a Mark - and then ALL of it
would end up on the test.




-- 
On the first day, God created the heavens and the Earth
On the second day, God created the oceans.
On the third day, God put the animals on hold for a few hours,
   and did a little diving.
And God said, "This is good."

Re: First 1000 characters without loop?

2017-06-23 Thread Mark Waddingham via use-livecode

On 2017-06-23 03:07, Peter W A Wood via use-livecode wrote:

> Some Unicode characters, such as emojis, have to be represented by two
> codeunits in UTF-16 (known as surrogates) so they take four bytes not
> two. Additionally, characters with accents will take either one
> codepoint or two depending on whether they have been coded in
> pre-composed or decomposed form (e.g. ç can be either
> U+0063 U+0327 (decomposed) or U+00E7 (precomposed)).
>
> So it isn’t easy to estimate the number of bytes in a UTF-16 string.


The number of bytes used by a string when encoded as UTF-16 is '2 * the 
number of codeunits in tString'.


The number of codeunits in a string in LiveCode is a stored property of 
the string, so doesn't require any computation. (We took the decision 
that regardless of how a string is stored internally, it should always 
be possible to ask for the number of codeunits in constant time, and to 
be able to look up a codeunit in constant time).
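
As a small sketch of that rule (the function name utf16ByteCount is made up for illustration), the UTF-16 size can be derived from the stored codeunit count without encoding anything:

```livecode
-- Hypothetical helper: UTF-16 size in bytes, derived from the codeunit count.
function utf16ByteCount pString
   -- two bytes per codeunit; no scan of the string should be needed
   return 2 * (the number of codeunits in pString)
end utf16ByteCount
```

If the rule holds, the result should agree with the number of bytes in textEncode(pString, "UTF-16LE"). Whether other UTF-16 variants add a byte order mark is not covered here, so treat that comparison as an assumption to check.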


Note: codeunit is not the same as codepoint and codepoint is not the 
same as character. Both codepoint and character require scanning the 
string (in the general case) to both compute the i'th one, and to 
compute the length.


In contrast (to UTF-16), if you want the number of bytes a string takes 
up in UTF-8 encoding then you also have to scan the string as a 
codepoint in UTF-8 can be 1-4 bytes in length.



> I would guess that LiveCode will store the characters of a string in
> single bytes if all the letters of the string conform to ISO-8859-1.
> So if you can be certain that your text is all ISO-8859-1 encoded, you
> can estimate at 1 byte per character. (The guess is based on the fact
> that the first 256 Unicode code points replicate ISO-8859-1).


Almost true - the engine stores strings which can be fit into the 
running platform's 'legacy' (in terms of pre 7.0) encoding (ISO8859-1, 
Latin-1, MacRoman) in that encoding in memory. This means that stacks 
written pre-unicode will use the same amount of memory, same amount of 
processing time as they did before.


The reason this works is because all three of those encodings have the 
property that when they are converted to Unicode, the number of 
codeunits in the Unicode version is the same as the number of codes 
(indeed, bytes in this case) in the original string.


Warmest Regards,

Mark.

--
Mark Waddingham ~ m...@livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps


Re: First 1000 characters without loop?

2017-06-23 Thread Mark Waddingham via use-livecode

On 2017-06-23 03:19, Richard Gaskin via use-livecode wrote:

> Seems murky.  I'd much rather at least have something like a byteLen
> function, which returns the number of bytes for a given string.  With
> that I can maintain byte offsets into a file with good performance and
> no ambiguity.


You do:

  the number of bytes in textEncode(tString, <encoding>)

The 'number of bytes in a string' makes no sense as there is no direct 
relationship between bytes and strings. I appreciate why this idea hangs 
around - it used to be true - char and byte were the same concept prior 
to 7.0, but that's only because the concept of 'char' was that of 
ISO8859-1/Latin-1, which can only represent the following written 
languages:


Western Europe and Americas: Afrikaans, Basque, Catalan, Danish, Dutch, 
English, Faeroese, Finnish, French, Galician, German, Icelandic, Irish, 
Italian, Norwegian, Portuguese, Spanish and Swedish.


If you step outside of that 'area', then it wasn't very much help (see 
https://www.terena.org/activities/multiling/ml-docs/iso-8859.html - for 
the historical encodings covering different sets of written languages).


The question you have to ask is 'how many bytes are in a string after it 
has been encoded in <encoding>' - when a string is written to disk an 
encoding *has* to be chosen. Sometimes the encoding is ASCII, sometimes 
it is UTF-8, sometimes it is UTF-16, sometimes it is something more 
exotic.


For any file format, an encoding of text always has to be defined - so 
you always 'know' if you know the file format (although for some, the 
encoding might be indicated by a byte prefixing the encoded string, or 
as a piece of information in the header of the encoded file - e.g. Byte 
Order Marks).
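
A minimal sketch of the requested byteLen written in script, following the one-liner above. The parameter names are made up, and the encoding has to be passed in explicitly because, as explained, there is no byte length without one:

```livecode
-- Hypothetical script version of 'byteLen': length of pString once encoded as pEncoding.
function byteLen pString, pEncoding
   return the number of bytes in textEncode(pString, pEncoding)
end byteLen
```

For example, byteLen(tLine, "UTF-8") would give the size a line takes up in a UTF-8 file, which is the figure needed when maintaining byte offsets into that file.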



> How do I find a substring in binary data in a way that will tell me
> the number of bytes of the offset?


If you have loaded binary data, and want to find the offset of a 
sequence of bytes within it then use 'byteOffset'.


If your binary data is actually encoded text data, then you need to 
textEncode the 'needle' (the thing you are searching for) first, making 
sure you do so with the encoding which the encoded text data requires:


  -- put the encoded/raw data you want to search into tHaystackData
  put textEncode(tNeedleText, <encoding>) into tNeedleData

  put byteOffset(tNeedleData, tHaystackData) into tOffset

However, it is important to note that this only allows an exact match - 
you can't do caseless searches like this (or searches where you want 
'e-acute' to match both 'e-acute' and 'e,combining-acute').


In the case of wanting to do caseless searches, then you need to do 
something like this:


   put textDecode(tHaystackData, <encoding>) into tHaystackText
   put offset(tNeedleText, tHaystackText) into tNeedleOffset
   put the number of bytes in textEncode(char 1 to tNeedleOffset of tHaystackText, <encoding>) into tNeedleByteOffset


i.e. The operation you are wanting to perform is 'offset of <needle> in 
<haystack> when using encoding <encoding>', which might make a useful engine 
addition - feel free to file an enhancement, although the above snippet 
should work in script with the operations we currently have. (Similarly, 
your 'byteLen' function is actually 'length of string in encoding 
<encoding>' - that also might be a useful engine addition, but can also 
be done in script now, as outlined above).
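
One way the caseless variant might be wrapped up as a reusable function is sketched below. The name textByteOffset and its parameters are hypothetical. It decodes the data, finds the character offset (caseless by default, as long as the caseSensitive property is left false), and converts that back to a 1-based byte offset.

```livecode
-- Hypothetical helper: 1-based byte offset of pNeedleText within encoded text data,
-- or 0 if the text is not found.
function textByteOffset pNeedleText, pHaystackData, pEncoding
   local tHaystackText, tCharOffset
   put textDecode(pHaystackData, pEncoding) into tHaystackText
   put offset(pNeedleText, tHaystackText) into tCharOffset
   if tCharOffset is 0 then return 0
   -- count the bytes taken up by everything before the match, then add one
   return (the number of bytes in textEncode(char 1 to (tCharOffset - 1) of tHaystackText, pEncoding)) + 1
end textByteOffset
```

Note the design choice: this counts the bytes before the match and adds one, so a multi-byte first character of the match is not included in the offset. It also assumes the data carries no byte order mark and that decoding and re-encoding the prefix reproduces the original bytes, neither of which is guaranteed for every input.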


Warmest Regards,

Mark.

--
Mark Waddingham ~ m...@livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps



Re: First 1000 characters without loop?

2017-06-23 Thread Mark Waddingham via use-livecode

On 2017-06-22 23:18, Richard Gaskin via use-livecode wrote:

> With many chunk expressions, I would imagine it does.  With line
> chunks, for example, the engine needs to walk through the string,
> comparing each character to CR, counting the found CRs as it goes.


Yes - essentially that is the case (although technically it looks for 
LF, not CR as currently - for better or for worse - the engine assumes 
line means LF as the separator, and normalizes line endings 
appropriately on a per-platform basis when you 'import' things as text 
into LiveCode).
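
A tiny sketch of what that means in script (the handler name countLines is made up): the constants return, cr and lf all stand for the same LF character, so either can act as the line separator.

```livecode
command countLines
   local tText
   -- return, cr and lf all name the same character in LiveCode (LF, code 10)
   put "one" & return & "two" & lf & "three" into tText
   -- counting lines means scanning tText for that separator
   put the number of lines in tText
end countLines
```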



> In this case, though, I believe it doesn't need a loop per se, since
> AFAIK characters are fixed-size entities internally (Mark Waddingham,
> is that true that UTF-16 gives us two-bytes per char across the
> board?).


No this is not quite true - characters are not fixed sized entities from 
the computer's point of view. In LiveCode 'character' means 'grapheme' - 
which is roughly what humans consider to be characters in terms of 
writing and editing.


Indeed, there are several concepts here:

  1) character: a character is a sequence of Unicode codepoints

  2) codepoint: a codepoint is the index into the Unicode code table 
(which has space for 1 million or so definitions)


  3) codeunit: a codeunit is an index into the Basic Multilingual Plane 
(BMP) - the first 65536 Unicode codes. The BMP contains a block of codes 
called 'surrogates' which aren't actually codes in themselves, but allow 
two codeunits to be used to express a codepoint for any code defined 
above 65536.


Some examples:

Character 'a':

This is (as you might expect) always a single codepoint, and, indeed, 
always a single codeunit (in Unicode 'a' is encoded with the same code 
as it is in ASCII).


Character 'a-acute':

This can be either represented as a single codepoint (and codeunit) 
'a-acute' (the same code as a-acute has in the ISO-8859-1 encoding, a 
strict superset of ASCII).


Or it can be represented as two codepoints 'a', 'combining-acute'. In 
both cases, these codepoints are in the BMP, so each codepoint is 
represented as a single codeunit.


Character 'smiling face with open mouth emoji':

This has code 0x1F603 - meaning it falls outside of the BMP (it is > 
65535). It is a single codepoint, but requires two codeunits to encode.
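
To make the three examples concrete, here is a small sketch that builds each string from codepoint numbers and reports its length in each chunk type. The names chunkCounts and showCounts are made up for illustration. Going by the description above, one would expect the decomposed a-acute to report 1 character but 2 codepoints, and the emoji 1 codepoint but 2 codeunits, though this has not been run against any particular engine version.

```livecode
-- Hypothetical helper: "chars,codepoints,codeunits" for a string.
function chunkCounts pString
   return (the number of chars in pString) & comma & \
         (the number of codepoints in pString) & comma & \
         (the number of codeunits in pString)
end chunkCounts

command showCounts
   local tReport
   put "a:" && chunkCounts("a") & return into tReport
   -- 225 is U+00E1, a-acute as a single codepoint
   put "a-acute, precomposed:" && chunkCounts(numToCodepoint(225)) & return after tReport
   -- 769 is U+0301, the combining acute accent
   put "a-acute, decomposed:" && chunkCounts("a" & numToCodepoint(769)) & return after tReport
   -- 128515 is U+1F603, which lies outside the BMP
   put "emoji U+1F603:" && chunkCounts(numToCodepoint(128515)) after tReport
   put tReport
end showCounts
```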


Some comparisons:

ASCII, ISO8859-1, Latin-1 and MacRoman are all 'single-codepoint' 
encodings - all characters which those encodings can express are encoded 
as a single codepoint.


Unicode is a 'multi-code' encoding - characters may require any number 
of codepoints to express. For example:


  - In Indic languages (which have a somewhat different structure than 
languages like English, French, German etc.), many codepoints are often 
needed to represent what humans might consider a 'character'.


  - You can stack any number of defined 'combining accents' onto a base 
character. You can have a character such as 
a-acute-underbar-ring-grave-cedilla-umlaut if you want.


  - Emoji codepoints can be followed by 'variation selectors' and 
modifiers which allow customization of things like presentation style 
and skin tone.


Basically, Unicode is a model for encoding writing systems with the aim 
that (over time) it can be used to represent *any* writing system which 
exists now or existed in the past. In order to do this in a tractable 
way (i.e. a way which could be implemented maintainably on modern 
systems) it uses an abstract model (sequences of codepoints which form 
characters). Due to this it can sometimes seem a little 'odd' but then 
it is trying to model things which were not designed to necessarily fit 
into a computer's viewpoint of the world - writing systems have evolved 
organically without thought on how a computer might need to process 
them.


In terms of LiveCode, then you have access to 'character', 'codepoint' 
and 'codeunit' chunks. In general:


   - character access for general strings is never constant time, as 
characters can require multiple codepoints.


   - codepoint access for general strings is never constant time, as 
codepoints can require two codeunits to encode.


   - codeunits access for general strings is always constant time.

Internally, the engine will keep things which can be represented in the 
platform's native encoding as native as much as possible (the native 
encodings have the property that 1 character = 1 codepoint = 1 
codeunit); otherwise it will (currently) store things internally as 
sequences of codeunits in the UTF-16 encoding. (How this might be done 
in future may well change in order to permit optimization, for example 
pure Greek or Russian text currently has a penalty compared to English 
text as it will always require UTF-16 internal encoding; however with 
the advent of Emoji and other such things, pure English text itself is 
becoming much less common).


Most of the time 'character' is the most appropriate thing to use for 
reading strings, whilst codepoints can be used to build up strings of 
characters.


The presence 

Re: First 1000 characters without loop?

2017-06-22 Thread Monte Goulding via use-livecode

> On 23 Jun 2017, at 11:19 am, Richard Gaskin via use-livecode 
>  wrote:
> 
> Monte Goulding wrote:
> 
> >> On 23 Jun 2017, at 10:06 am, Richard Gaskin wrote:
> >>
> >> How can we know which is in use for a given string?
> >
> > You shouldn’t need to know. The engine will use native encoding where
> > possible for efficiency. A lot of the performance improvements between
> > LC 7 and 8 were using the right code paths based on whether the string
> > is native or unicode.
> 
> Seems murky.  I'd much rather at least have something like a byteLen 
> function, which returns the number of bytes for a given string.  With that I 
> can maintain byte offsets into a file with good performance and no ambiguity.

In theory `the number of bytes of <string>` should, in my opinion, return 
whatever the byteLength function would, given that the codeunit docs state:

> The hierarchy of the new and altered chunk types is as follows: byte w of 
> codeunit x of codepoint y of char z of word …. 

However, this report was resolved as not a bug, so I guess that theory is wrong 
and maybe there’s a docs bug in there (I have asked internally on our language 
channel): http://quality.livecode.com/show_bug.cgi?id=13248 

put the number of codeunits of  “️” -> 3 

So this is actually 6 bytes, but as documented you can’t rely on the codeunit 
length being 16 bit, so I guess that means there is currently no way to get what 
you want reliably. Whether you need it is a separate discussion.
> 
> 
> >> Suppose I wanted to process a lot of text, so performance is
> >> critical. Using bytes would be optimal, since any chunk type or even
> >> Unicode characters may vary in length.
> >>
> >> So if I wanted to create an index of byte offsets into a large chunk
> >> of text, how would I know how long a character is?
> >
> > If it’s text encoded then you probably want to use character offsets
> > and let the engine worry about optimising it. If you know it’s binary
> > data then use bytes.
> 
> How do I find a substring in binary data in a way that will tell me the 
> number of bytes of the offset?


If you are dealing with bytes of binary data then use byteOffset. Is that what 
you mean here? Probably better to talk about ranges rather than substrings if 
you are discussing binary data.
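
A minimal sketch of byteOffset on raw binary data, with made-up byte values and a hypothetical handler name findMarker:

```livecode
command findMarker
   local tData, tMarker
   -- four arbitrary bytes of 'binary' data, built with numToByte
   put numToByte(0) & numToByte(255) & numToByte(216) & numToByte(16) into tData
   -- the two-byte sequence to look for
   put numToByte(255) & numToByte(216) into tMarker
   -- byteOffset reports where the byte sequence starts, counting bytes from 1
   put byteOffset(tMarker, tData)
end findMarker
```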

Cheers

Monte


Re: First 1000 characters without loop?

2017-06-22 Thread Richard Gaskin via use-livecode

Monte Goulding wrote:

>> On 23 Jun 2017, at 10:06 am, Richard Gaskin wrote:
>>
>> How can we know which is in use for a given string?
>
> You shouldn’t need to know. The engine will use native encoding where
> possible for efficiency. A lot of the performance improvements between
> LC 7 and 8 were using the right code paths based on whether the string
> is native or unicode.

Seems murky.  I'd much rather at least have something like a byteLen 
function, which returns the number of bytes for a given string.  With 
that I can maintain byte offsets into a file with good performance and 
no ambiguity.



>> Suppose I wanted to process a lot of text, so performance is
>> critical. Using bytes would be optimal, since any chunk type or even
>> Unicode characters may vary in length.
>>
>> So if I wanted to create an index of byte offsets into a large chunk
>> of text, how would I know how long a character is?
>
> If it’s text encoded then you probably want to use character offsets
> and let the engine worry about optimising it. If you know it’s binary
> data then use bytes.

How do I find a substring in binary data in a way that will tell me the 
number of bytes of the offset?


--
 Richard Gaskin
 Fourth World Systems
 Software Design and Development for the Desktop, Mobile, and the Web
 
 ambassa...@fourthworld.com        http://www.FourthWorld.com


Re: First 1000 characters without loop?

2017-06-22 Thread Peter W A Wood via use-livecode
Richard

> How can we know which is in use for a given string?
> 
> Suppose I wanted to process a lot of text, so performance is critical. Using 
> bytes would be optimal, since any chunk type or even Unicode characters may 
> vary in length.
> 
> So if I wanted to create an index of byte offsets into a large chunk of text, 
> how would I know how long a character is?

Some Unicode characters, such as emojis, have to be represented by two 
codeunits in UTF-16 (known as surrogates) so they take four bytes not two. 
Additionally, characters with accents will take either one codepoint or two 
depending on whether they have been coded in pre-composed or decomposed form 
(e.g. ç can be either U+0063 U+0327 (decomposed) or U+00E7 (precomposed)).

So it isn’t easy to estimate the number of bytes in a UTF-16 string.

I would guess that LiveCode will store the characters of a string in single 
bytes if all the letters of the string conform to ISO-8859-1. So if you can be 
certain that your text is all ISO-8859-1 encoded, you can estimate at 1 byte 
per character. (The guess is based on the fact that the first 256 Unicode code 
points replicate ISO-8859-1).
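
The byte counts described above can be checked with a small sketch. It is written in Python purely to illustrate the encoding arithmetic and says nothing about how the LiveCode engine stores strings internally. The helper name `utf16_byte_len` and the sample characters are assumptions chosen for the example.

```python
import unicodedata

def utf16_byte_len(text: str) -> int:
    """Number of bytes the text occupies as UTF-16, without a byte-order mark."""
    return len(text.encode("utf-16-le"))

precomposed = "\u00e7"    # ç as a single code point
decomposed = "c\u0327"    # c followed by a combining cedilla
emoji = "\U0001F600"      # outside the Basic Multilingual Plane

print(utf16_byte_len(precomposed))  # 2
print(utf16_byte_len(decomposed))   # 4
print(utf16_byte_len(emoji))        # 4 (a surrogate pair)

# Normalizing to NFC folds the decomposed form into the precomposed one.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```

So the same visible character can cost two or four bytes in UTF-16 depending on how it was composed, which is why a fixed bytes-per-character estimate is unreliable.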

Regards

Peter



Re: First 1000 characters without loop?

2017-06-22 Thread Monte Goulding via use-livecode

> On 23 Jun 2017, at 10:06 am, Richard Gaskin via use-livecode wrote:
> 
> How can we know which is in use for a given string?

You shouldn’t need to know. The engine will use native encoding where possible 
for efficiency. A lot of the performance improvements between LC 7 and 8 came 
from using the right code paths based on whether the string is native or Unicode.
> 
> Suppose I wanted to process a lot of text, so performance is critical. Using 
> bytes would be optimal, since any chunk type or even Unicode characters may 
> vary in length.
> 
> So if I wanted to create an index of byte offsets into a large chunk of text, 
> how would I know how long a character is?

If it’s text encoded then you probably want to use character offsets and let 
the engine worry about optimising it. If you know it’s binary data then use 
bytes.
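
For the binary-data case, the index of byte offsets asked about in the question could look like the following sketch. It is Python, not LiveCode, and it assumes the data is UTF-8 with LF line endings. The name `line_byte_index` and the sample text are hypothetical.

```python
def line_byte_index(data: bytes) -> list:
    """Return the 0-based byte offset at which each LF-delimited line starts."""
    offsets = [0]
    pos = data.find(b"\n")
    while pos != -1:
        offsets.append(pos + 1)          # the next line starts after the LF
        pos = data.find(b"\n", pos + 1)
    return offsets

# Sample data (hypothetical): "é" takes two bytes in UTF-8.
data = "één\ntwee\ndrie".encode("utf-8")
index = line_byte_index(data)
print(index)  # [0, 6, 11]

# With the index built once, any line is a direct slice of the bytes.
second = data[index[1]:index[2] - 1].decode("utf-8")
print(second)  # twee
```

Note that if the data ends with an LF, the last entry equals the length of the data and marks an empty final line.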

Cheers

Monte

Re: First 1000 characters without loop?

2017-06-22 Thread Richard Gaskin via use-livecode

Monte Goulding wrote:

>> On 23 Jun 2017, at 7:18 am, Richard Gaskin wrote:
>>
>> is it true that UTF-16 gives us two-bytes per char across the
>> board?
>
> That’s true (the 16 means 16 bit) but internally strings may be either
> native 8 bit or unicode 16 bit.

How can we know which is in use for a given string?

Suppose I wanted to process a lot of text, so performance is critical. 
Using bytes would be optimal, since any chunk type or even Unicode 
characters may vary in length.


So if I wanted to create an index of byte offsets into a large chunk of 
text, how would I know how long a character is?


--
 Richard Gaskin
 Fourth World Systems
 Software Design and Development for the Desktop, Mobile, and the Web
 
 ambassa...@fourthworld.com    http://www.FourthWorld.com


Re: First 1000 characters without loop?

2017-06-22 Thread Monte Goulding via use-livecode

> On 23 Jun 2017, at 7:18 am, Richard Gaskin via use-livecode wrote:
> 
> is it true that UTF-16 gives us two-bytes per char across the board?

That’s true (the 16 means 16 bit) but internally strings may be either native 8 
bit or unicode 16 bit.

It should just be a direct memory copy so it’s about as efficient as you are 
going to get. If you don’t need the trailing chars it would be more efficient 
from a memory perspective to delete the remaining chars.

Cheers

Monte

Re: First 1000 characters without loop?

2017-06-22 Thread Richard Gaskin via use-livecode

Rick Harrison wrote:

>> On Jun 22, 2017, at 2:17 PM, Mark Talluto wrote:
>>
>> On Jun 22, 2017, at 11:03 AM, Rick Harrison wrote:
>>>
>>> I have a string variable which contains over 2500 characters.
>>> I only want to grab the first 1000 characters of that string.
>>> Rather than looping 1000 times to grab each character
>>> is there a way to just grab the first 1000 efficiently in
>>> one big chunk?
>>
>> Hi Rick,
>>
>> put char 1 to 1000 of tOriginalVar into tNewVar
>
> Thanks for this solution.
>
> Does that statement create an implied loop?
> It’s great for a one liner though!


With many chunk expressions, I would imagine it does.  With line chunks, 
for example, the engine needs to walk through the string, comparing each 
character to CR, counting the found CRs as it goes.


But even then, better to have the engine do that in machine code than 
for us to do it in script. :)
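
To make the "implied loop" concrete, here is a minimal sketch of what resolving a line chunk involves, written in Python as an illustration only. It is not the engine's actual implementation, and the name `line_chunk` is hypothetical. It scans for LF, the character the engine treats as the line separator according to Mark Waddingham's reply in this thread.

```python
def line_chunk(text: str, n: int) -> str:
    """Return line n (1-based) by walking the string and counting LFs."""
    start = 0      # index where the current line begins
    current = 1    # number of the line being scanned
    for i, ch in enumerate(text):
        if ch == "\n":
            if current == n:
                return text[start:i]
            current += 1
            start = i + 1
    # The last line has no trailing LF; out-of-range requests yield empty.
    return text[start:] if current == n else ""

print(line_chunk("alpha\nbeta\ngamma", 2))  # beta
```

The cost grows with how far into the string the requested line sits, which is the loop the engine runs on the script's behalf.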


In this case, though, I believe it doesn't need a loop per se, since 
AFAIK characters are fixed-size entities internally (Mark Waddingham, is 
it true that UTF-16 gives us two-bytes per char across the board?).


If I'm mistaken there, any traversal of the string is still about as 
efficient as it's going to get in a general-purpose language, since it's 
relying on the well-optimized Unicode libraries many projects depend on.


All that said, as much as I enjoy benchmarking I wouldn't sweat 
use-cases involving small data.  1k chars could be sliced with any chunk 
type so quickly it probably won't matter.


--
 Richard Gaskin
 Fourth World Systems
 Software Design and Development for the Desktop, Mobile, and the Web
 
 ambassa...@fourthworld.com    http://www.FourthWorld.com


Re: First 1000 characters without loop?

2017-06-22 Thread Devin Asay via use-livecode

On Jun 22, 2017, at 1:05 PM, Rick Harrison via use-livecode wrote:

> Hi Devin & Mark,
>
> Thanks for this solution.
>
> Does that statement create an implied loop?



I don’t think so. It’s similar to substring functions in other languages. As 
Mike said, string chunk expressions are one of the best things about LiveCode.

You can do things like:

  put char 5 to 25 of line 4 of tVar into tVar2

I discuss it in my lesson on working with text in LiveCode:

http://livecode.byu.edu/textfind/TextandFind.php

Devin

Devin Asay
Director
Office of Digital Humanities
Brigham Young University


Re: First 1000 characters without loop?

2017-06-22 Thread Rick Harrison via use-livecode

Hi Devin & Mark,

Thanks for this solution.

Does that statement create an implied loop?
It’s great for a one liner though!

Rick

> On Jun 22, 2017, at 2:17 PM, Mark Talluto via use-livecode wrote:
> 
> On Jun 22, 2017, at 11:03 AM, Rick Harrison via use-livecode wrote:
>> 
>> I have a string variable which contains over 2500 characters.
>> I only want to grab the first 1000 characters of that string.
>> Rather than looping 1000 times to grab each character
>> is there a way to just grab the first 1000 efficiently in
>> one big chunk?
>> 
>> Thanks,
>> 
>> Rick
> 
> Hi Rick,
> 
> put char 1 to 1000 of tOriginalVar into tNewVar
> 
> 
> Best regards,
> 
> Mark Talluto
> livecloud.io 
> nursenotes.net 
> canelasoftware.com 
> 



Re: First 1000 characters without loop?

2017-06-22 Thread Mike Kerner via use-livecode

chunks make me happy.
actually, any time i have to parse a hornet's nest of text, this language
makes me happy.

On Thu, Jun 22, 2017 at 2:17 PM, Mark Talluto via use-livecode <
use-livecode@lists.runrev.com> wrote:

> On Jun 22, 2017, at 11:03 AM, Rick Harrison via use-livecode <
> use-livecode@lists.runrev.com> wrote:
> >
> > I have a string variable which contains over 2500 characters.
> > I only want to grab the first 1000 characters of that string.
> > Rather than looping 1000 times to grab each character
> > is there a way to just grab the first 1000 efficiently in
> > one big chunk?
> >
> > Thanks,
> >
> > Rick
>
> Hi Rick,
>
> put char 1 to 1000 of tOriginalVar into tNewVar
>
>
> Best regards,
>
> Mark Talluto
> livecloud.io 
> nursenotes.net 
> canelasoftware.com 
>
>



-- 
On the first day, God created the heavens and the Earth
On the second day, God created the oceans.
On the third day, God put the animals on hold for a few hours,
   and did a little diving.
And God said, "This is good."


Re: First 1000 characters without loop?

2017-06-22 Thread Mark Talluto via use-livecode

On Jun 22, 2017, at 11:03 AM, Rick Harrison via use-livecode wrote:
> 
> I have a string variable which contains over 2500 characters.
> I only want to grab the first 1000 characters of that string.
> Rather than looping 1000 times to grab each character
> is there a way to just grab the first 1000 efficiently in
> one big chunk?
> 
> Thanks,
> 
> Rick

Hi Rick,

put char 1 to 1000 of tOriginalVar into tNewVar


Best regards,

Mark Talluto
livecloud.io 
nursenotes.net 
canelasoftware.com 



Re: First 1000 characters without loop?

2017-06-22 Thread Devin Asay via use-livecode

On Jun 22, 2017, at 12:03 PM, Rick Harrison via use-livecode wrote:
> 
> I have a string variable which contains over 2500 characters.
> I only want to grab the first 1000 characters of that string.
> Rather than looping 1000 times to grab each character
> is there a way to just grab the first 1000 efficiently in
> one big chunk?


Rick,

Does this do what you want?

  put char 1 to 1000 of tVar into tVar2

Devin

Devin Asay
Director
Office of Digital Humanities
Brigham Young University




First 1000 characters without loop?

2017-06-22 Thread Rick Harrison via use-livecode

I have a string variable which contains over 2500 characters.
I only want to grab the first 1000 characters of that string.
Rather than looping 1000 times to grab each character
is there a way to just grab the first 1000 efficiently in
one big chunk?

Thanks,

Rick
