RE: OMG text processing performance 6.7 - 9.5 - correction

2020-02-06 Thread Neville via use-livecode
Belay my claim that the offsets found from an offset search on raw text and on 
the utf-8 decoded version of that text give exactly the same offset numbers 
for corresponding hits - they don’t, of course. The offsets reported in the 
raw text are binary 8-bit byte offsets, while the offsets reported in the 
utf-8 decoded text are unicode character offsets, as they must be.
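
The divergence is easy to see in miniature. Here is a small sketch in Python 
(used only to make the byte/character arithmetic explicit; this is not 
LiveCode's API):

```python
# Byte offsets count UTF-8 bytes; character offsets count decoded
# characters. They differ by one for each multibyte character that
# precedes the hit.
text = "émue, Valjean"        # é is one character but two UTF-8 bytes
raw = text.encode("utf-8")

print(raw.find(b"Valjean"))   # 7 (0-based byte offset)
print(text.find("Valjean"))   # 6 (0-based character offset)
```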

Apologies, I was reading my data incorrectly.

Neville


___
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode


RE: OMG text processing performance 6.7 - 9.5

2020-02-05 Thread Neville via use-livecode
One further comment: when talking about long text and not using lineOffset, I 
really do mean long.

Using as source text the first 500 KB of Les Miserables, the times for a 
simple Parse 1 offset search with skip were 1 ms for *both* raw text and 
utf-8 text, and for lineOffset 10 ms and 14 ms respectively, so at that size 
it is not worth doing anything fancy.

Neville




RE: OMG text processing performance 6.7 - 9.5

2020-02-05 Thread Neville via use-livecode
Richard, here is a link to my test stack

https://www.dropbox.com/sh/bbpe12p8bf56ofe/AADbhV2LavLP4Y3CZ8Ab8NGia?dl=0 


The LesMiserables.txt file is included for convenience; it should be placed in 
your Documents directory. The algorithms are all in the script for the `Run` 
button.

I am still mystified that the character offset searches give the same number 
for each hit in the utf-8 decoded text as in the raw text. Surely `char x of 
theUTF8Text` returns the unicode character at offset x, while `char x of 
theRawText` returns the 8-bit character of the raw text? How then can x be 
the same for the corresponding hit, when I know there are some multibyte 
unicode characters in the text (e.g. the e-acute in Misérables)? Indeed, just 
what does textDecode(theRawText,"UTF-8") do: does it modify the actual text 
at all, or just set a property flag?

Another mystery: I decided to extend the search algorithms by adding 
matchChunk. In this case I use the regular expression `(?m)(?i)(Valjean)` to 
get the start and end offsets of the first match, and then truncate the 
initial section as per Parse 2. As expected this search is much slower than 
any of the others on the raw text; it has a lot more to do. I then expected 
around the same time for the search on utf-8 text, rather than a drastically 
worse one, since matchChunk is presumably encoding-blind. But it is actually 
15% faster than on the raw text; in fact it is the fastest of all the 
algorithms for finding offsets if you must* search utf-8 text! How can this 
be? I don’t believe the utf-8 text is 15% smaller than the raw text!
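
A rough sketch of this matchChunk-style loop (regex search, record the 
absolute span, truncate the initial section, repeat), written here in Python 
with a hypothetical helper name rather than the actual test-stack code:

```python
import re

# 1-based start offsets of every match, found by repeatedly matching
# against the head of the string and truncating past each hit.
PATTERN = re.compile(r"Valjean", re.IGNORECASE | re.MULTILINE)

def match_offsets(text):
    offsets, base = [], 0
    while True:
        m = PATTERN.search(text)
        if m is None:
            return offsets
        offsets.append(base + m.start() + 1)  # 1-based, as in LiveCode
        base += m.end()
        text = text[m.end():]                 # truncate the initial section

print(match_offsets("Valjean met VALJEAN"))   # [1, 13]
```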

searches on raw text
matchChunk  3059 ms
filter        16 ms
parse0        10 ms
parse1         8 ms
parse3      2244 ms
parse2       671 ms
parse4       682 ms

searches on utf-8 text
matchChunk utf8    2492 ms
filter utf8        1954 ms
parse0 utf8        3788 ms
parse1 utf8      223254 ms
parse3 utf8      634423 ms
parse2 utf8        3409 ms
parse4 utf8        7166 ms

*As I mentioned, in most cases character-offset searching of the raw text 
should be OK if you are searching for 7-bit ascii strings of length, say, 
greater than 2. But I think the lineOffset and filter operations could give 
false positives if a multibyte character contains 0x0D as a component byte.
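
For what it's worth, utf-8 seems to rule this particular false positive out 
by design: every byte of a multibyte sequence, lead and continuation bytes 
alike, has its high bit set, so a 7-bit byte such as 0x0D can only ever be a 
genuine carriage return. A quick Python check of that property:

```python
# In UTF-8, every byte of a multibyte sequence is >= 0x80, so ASCII
# bytes (< 0x80, e.g. 0x0D) never occur as component bytes of another
# character.
for ch in "é€𝄞":                  # 2-, 3- and 4-byte characters
    encoded = ch.encode("utf-8")
    assert len(encoded) in (2, 3, 4)
    assert all(b >= 0x80 for b in encoded)
print("no ASCII bytes inside multibyte sequences")
```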

Neville




Re: OMG text processing performance 6.7 - 9.5

2020-02-04 Thread Richard Gaskin via use-livecode

Super thorough work there, Neville.  Thanks.

Could I trouble you to post code listings for the various algos?

I'd like to try them on my MBOX archives, and they may also be useful 
for others looking for parsing routines in the archives.


--
 Richard Gaskin
 Fourth World Systems


Neville wrote:

Just for interest, and to see just how slow lineOffset is, I added a couple 
more tests to the search for occurrences of “Valjean” in the Gutenberg 
English translation of Les Miserables. I also wanted to find out how filter 
performs.


The searches were first applied to the raw binary text as read from the utf-8 
encoded file, without using textDecode; then on the text utf-8 decoded

Parse 0 : using itemdelimiter  ‘Valjean’ (case insensitive)
Parse 1: using offset with skips
Parse 2: using offset, truncating the text and 0 skip
Parse 3: use lineOffset with skips
Parse 4: use lineOffset, truncating the text and 0 skip
filter: use filter to find lines containing '*Valjean*'

Parse 1 and 2 produced 1120 hits. Parse 0 gave 1121 hits, the extra one being 
a false positive at the end of the file, which needs to be accounted for in an 
implementation. I was slightly surprised that the character offsets produced 
were the same for raw and for utf-8 text; I guess I was expecting the latter 
to give the unicode character offset. Parse 3, Parse 4 and filter all output 
1099 lines.


Results:

searches on raw text:

parse0    11 ms
parse1     9 ms
parse2   751 ms
parse3  2551 ms
parse4   753 ms
filter    16 ms

searches on utf-8 text:

parse0    4386 ms
parse1  224367 ms
parse2    3461 ms
parse3  636554 ms
parse4    7242 ms
filter    2258 ms

So for long texts it is best to use raw binary text and search with character 
offset(pStr, pSrc, skip) [Parse 1]. If you have to search on utf-8 encoded 
text, then use Parse 2, deleting initial sections of the text as you go. 
Never use lineOffset (except for small text), even if that means extra code 
to find line endings on either side of the character offset when you really 
want the found line. If you don’t actually need the offsets of the hits in 
the original file - for example for editing the original - then filter is the 
fastest on long text and just as fast on short text; but depending on your 
needs you probably have to do another search on the filtered text, so this is 
most viable when the number of lines produced is itself small.

Neville






Re: OMG text processing performance 6.7 - 9.5

2020-02-04 Thread Mark Wieder via use-livecode

On 2/4/20 6:43 PM, Colin Holgate via use-livecode wrote:

Would have been neat if it took 24601 milliseconds.


Chortle

--
 Mark Wieder
 ahsoftw...@gmail.com



Re: OMG text processing performance 6.7 - 9.5

2020-02-04 Thread Colin Holgate via use-livecode
Would have been neat if it took 24601 milliseconds.




RE: OMG text processing performance 6.7 - 9.5

2020-02-04 Thread Neville via use-livecode
Just for interest, and to see just how slow lineOffset is, I added a couple 
more tests to the search for occurrences of “Valjean” in the Gutenberg 
English translation of Les Miserables. I also wanted to find out how filter 
performs.

The searches were first applied to the raw binary text as read from the utf-8 
encoded file, without using textDecode; then on the text utf-8 decoded

Parse 0 : using itemdelimiter  ‘Valjean’ (case insensitive)
Parse 1: using offset with skips
Parse 2: using offset, truncating the text and 0 skip
Parse 3: use lineOffset with skips
Parse 4: use lineOffset, truncating the text and 0 skip
filter: use filter to find lines containing '*Valjean*'

Parse 1 and 2 produced 1120 hits. Parse 0 gave 1121 hits, the extra one being 
a false positive at the end of the file, which needs to be accounted for in an 
implementation. I was slightly surprised that the character offsets produced 
were the same for raw and for utf-8 text; I guess I was expecting the latter 
to give the unicode character offset. Parse 3, Parse 4 and filter all output 
1099 lines.

Results:

searches on raw text:

parse0    11 ms
parse1     9 ms
parse2   751 ms
parse3  2551 ms
parse4   753 ms
filter    16 ms

searches on utf-8 text:

parse0    4386 ms
parse1  224367 ms
parse2    3461 ms
parse3  636554 ms
parse4    7242 ms
filter    2258 ms

So for long texts it is best to use raw binary text and search with character 
offset(pStr, pSrc, skip) [Parse 1]. If you have to search on utf-8 encoded 
text, then use Parse 2, deleting initial sections of the text as you go. 
Never use lineOffset (except for small text), even if that means extra code 
to find line endings on either side of the character offset when you really 
want the found line. If you don’t actually need the offsets of the hits in 
the original file - for example for editing the original - then filter is the 
fastest on long text and just as fast on short text; but depending on your 
needs you probably have to do another search on the filtered text, so this is 
most viable when the number of lines produced is itself small.
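
The extra code to find line endings on either side of the character offset is 
modest. As a sketch of the idea (Python, with a hypothetical helper name):

```python
# Recover the whole line containing a hit from its character offset,
# without scanning the full text the way lineOffset does.
def line_at(text, off):
    start = text.rfind("\n", 0, off) + 1        # 0 when no break precedes
    end = text.find("\n", off)
    return text[start:] if end == -1 else text[start:end]

body = "first line\nJean Valjean appears\nlast line"
print(line_at(body, body.find("Valjean")))      # Jean Valjean appears
```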

Neville





Re: OMG text processing performance 6.7 - 9.5

2020-02-04 Thread Alex Tweedly via use-livecode


On 04/02/2020 22:12, Richard Gaskin via use-livecode wrote:


The code I was using was similar to Alex' itemDel solution, but 
playing with all three together shows itemDel only slightly faster 
than delete, and both much faster than traversing in-place with "start 
at".


You know I'm always happy to take credit for someone else's work :-)   - 
but I should say that the "parse0" (aka "itemDel") function was 
Hermann's, not mine.


Alex.




RE: OMG text processing performance 6.7 - 9.5

2020-02-04 Thread Neville via use-livecode
The recent testing of the Parse1 and Parse2 algorithms must, I think, have 
been done on ascii, not utf-8, text.

I tested on the English translation of Les Miserables, to ensure at least a 
sprinkling of multi-byte characters in the text, and a longish file: 3.4 MB. 
I tested for the search string ‘Valjean’, which obviously occurs very 
frequently.

The searches were first applied to the raw binary text as read from the utf-8 
encoded file, without decoding; then on the text utf-8 decoded

Parse 0 : using itemdelimiter  ‘Valjean’ (case insensitive)

Parse 1: using offset with skips

Parse 2: using offset, truncating the text and 0 skip

Results:

searches on raw text
parse0 10 ms
parse1 9 ms
parse2 708 ms

searches on utf-8 text
parse0 4402 ms
parse1 225469 ms
parse2 3453 ms


The winner for long utf-8 text is Parse 2; for raw text, Parse 1 and Parse 0 
are equivalent. The results dramatically demonstrate the collapse in 
performance on long utf-8 text.

For most searches I would think one could use the raw text as long as one is 
searching for an ascii string; false positives, where the string of single 
bytes occurs inside multibyte characters, would be extremely unlikely.
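
In fact, for utf-8 specifically such false positives appear to be not merely 
unlikely but impossible, since ascii bytes never occur inside multibyte 
sequences. A small Python check:

```python
# Searching the raw UTF-8 bytes for an ASCII needle finds exactly the
# hits that searching the decoded text finds; bytes < 0x80 cannot
# occur inside a multibyte sequence, so no false positives arise.
text = "Cosette était émue quand Valjean parla. Valjean répondit."
raw = text.encode("utf-8")
print(raw.count(b"Valjean"), text.count("Valjean"))   # 2 2
```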

Neville





Re: OMG text processing performance 6.7 - 9.5

2020-02-04 Thread Richard Gaskin via use-livecode

Ralph DiMola wrote:
> My initial timings were with an earlier v9 version. I will do some
> timings on 9.5.1. In the meanwhile I wonder if doing a "delete char
> 1 to n of myVar" is more expensive than "put char n to -1 of myVar
> into myVar" as I do.

I had thought the exercise was to obtain a list of offsets for a given 
term within a larger string.


And that's something I need to do a lot of, actually: building indexes 
for a large collection of MBOX and similarly-structured files.


Here the delete method was far faster than working in-place with "start 
at", and the speed advantage grew as the file did.


Seemed counterintuitive when you first brought it up, but I can 
understand how Unicode can slow things down, esp. UTF-8 with its 
potentially-but-not-necessarily-varying character length.


The code I was using was similar to Alex' itemDel solution, but playing 
with all three together shows itemDel only slightly faster than delete, 
and both much faster than traversing in-place with "start at".


But now that others are seeing different results I'm keen to hear from 
Mark Waddingham on what factors can affect things so differently on 
different systems.


FWIW all my tests were done in LC 9.6dp2 on Ubuntu 18.04 LTS.

--
 Richard Gaskin
 Fourth World Systems
 Software Design and Development for the Desktop, Mobile, and the Web
 
 ambassa...@fourthworld.com
 http://www.FourthWorld.com




RE: OMG text processing performance 6.7 - 9.5

2020-02-04 Thread Ralph DiMola via use-livecode
My initial timings were with an earlier v9 version. I will do some timings on
9.5.1. In the meanwhile I wonder if doing a "delete char 1 to n of myVar" is
more expensive than "put char n to -1 of myVar into myVar" as I do.

Ralph DiMola
IT Director
Evergreen Information Services
rdim...@evergreeninfo.net


-Original Message-
From: use-livecode [mailto:use-livecode-boun...@lists.runrev.com] On Behalf
Of Bob Sneidar via use-livecode
Sent: Tuesday, February 04, 2020 2:45 PM
To: How to use LiveCode
Cc: Bob Sneidar
Subject: Re: OMG text processing performance 6.7 - 9.5

Heresy! Burn the cretin!!! ;-)

Bob S


> On Feb 4, 2020, at 10:43 , Richard Gaskin via use-livecode
 wrote:
> 
> Hmmm  It may be that Mark Waddingham was wrong in the guidance he gave
earlier about Unicode vs memcopy, but I wonder if there may be something
else here.




Re: OMG text processing performance 6.7 - 9.5

2020-02-04 Thread Bob Sneidar via use-livecode
Heresy! Burn the cretin!!! ;-)

Bob S


> On Feb 4, 2020, at 10:43 , Richard Gaskin via use-livecode 
>  wrote:
> 
> Hmmm  It may be that Mark Waddingham was wrong in the guidance he gave 
> earlier about Unicode vs memcopy, but I wonder if there may be something else 
> here.




Re: OMG text processing performance 6.7 - 9.5

2020-02-04 Thread J. Landman Gay via use-livecode

On 2/4/20 12:43 PM, Richard Gaskin via use-livecode wrote:

J. Landman Gay wrote:

On 2/3/20 2:19 PM, hh via use-livecode wrote:

Parse1 is here always at least 30% faster than Parse2.


I'm seeing the same thing, only more so. I searched for "the" in a 424K text 
file:

parse1 = 11 ms
parse2 = 111 ms


Hmmm  It may be that Mark Waddingham was wrong in the guidance he gave earlier about 
Unicode vs memcopy, but I wonder if there may be something else here.


Were you using the same text I tested with, the Gutenberg text file for "Call of the 
Wild"?

https://www.gutenberg.org/files/215/215-0.txt



I just tried that file. It is shorter than the one I used before. It looks 
like the two are almost equal, with parse1 coming out slightly ahead. Here is 
a series of results; the first item in each row is parse1 and the second is 
parse2:


2,3
2,3
3,2
2,3
2,3
3,2
2,3
3,2
2,3
3,3
3,2
2,3
2,3
2,4
3,3
3,3
2,2
2,3

--
Jacqueline Landman Gay | jac...@hyperactivesw.com
HyperActive Software   | http://www.hyperactivesw.com




Re: OMG text processing performance 6.7 - 9.5

2020-02-04 Thread Richard Gaskin via use-livecode

J. Landman Gay wrote:

On 2/3/20 2:19 PM, hh via use-livecode wrote:

Parse1 is here always at least 30% faster than Parse2.


I'm seeing the same thing, only more so. I searched for "the" in a 424K 
text file:


parse1 = 11 ms
parse2 = 111 ms


Hmmm  It may be that Mark Waddingham was wrong in the guidance he 
gave earlier about Unicode vs memcopy, but I wonder if there may be 
something else here.


Were you using the same text I tested with, the Gutenberg text file for 
"Call of the Wild"?


https://www.gutenberg.org/files/215/215-0.txt

--
 Richard Gaskin
 Fourth World Systems
 Software Design and Development for the Desktop, Mobile, and the Web
 
 ambassa...@fourthworld.com
 http://www.FourthWorld.com



RE: OMG text processing performance 6.7 - 9.5

2020-02-03 Thread J. Landman Gay via use-livecode
I moved the text to a variable before running each test. I should have 
mentioned that.


--
Jacqueline Landman Gay | jac...@hyperactivesw.com
HyperActive Software | http://www.hyperactivesw.com
On February 3, 2020 6:14:29 PM Ralph DiMola via use-livecode 
 wrote:



I can see that if the text is in a field. When deleting the chars the
remaining text must be reformatted. Try it with the text in a variable.

Ralph DiMola
IT Director
Evergreen Information Services
rdim...@evergreeninfo.net


-Original Message-
From: use-livecode [mailto:use-livecode-boun...@lists.runrev.com] On Behalf
Of J. Landman Gay via use-livecode
Sent: Monday, February 03, 2020 6:48 PM
To: How to use LiveCode
Cc: J. Landman Gay
Subject: Re: OMG text processing performance 6.7 - 9.5

On 2/3/20 2:19 PM, hh via use-livecode wrote:

Parse1 is here always at least 30% faster than Parse2.


I'm seeing the same thing, only more so. I searched for "the" in a 424K text
file:

parse1 = 11 ms
parse2 = 111 ms

The text was imported into a field using the property inspector, which I
assume converted it to UTF16 automatically.

--
Jacqueline Landman Gay | jac...@hyperactivesw.com
HyperActive Software   | http://www.hyperactivesw.com



RE: OMG text processing performance 6.7 - 9.5

2020-02-03 Thread Ralph DiMola via use-livecode
I can see that if the text is in a field. When deleting the chars the
remaining text must be reformatted. Try it with the text in a variable.

Ralph DiMola
IT Director
Evergreen Information Services
rdim...@evergreeninfo.net


-Original Message-
From: use-livecode [mailto:use-livecode-boun...@lists.runrev.com] On Behalf
Of J. Landman Gay via use-livecode
Sent: Monday, February 03, 2020 6:48 PM
To: How to use LiveCode
Cc: J. Landman Gay
Subject: Re: OMG text processing performance 6.7 - 9.5

On 2/3/20 2:19 PM, hh via use-livecode wrote:
> Parse1 is here always at least 30% faster than Parse2.

I'm seeing the same thing, only more so. I searched for "the" in a 424K text
file:

parse1 = 11 ms
parse2 = 111 ms

The text was imported into a field using the property inspector, which I
assume converted it to UTF16 automatically.

-- 
Jacqueline Landman Gay | jac...@hyperactivesw.com
HyperActive Software   | http://www.hyperactivesw.com



Re: OMG text processing performance 6.7 - 9.5

2020-02-03 Thread J. Landman Gay via use-livecode

On 2/3/20 2:19 PM, hh via use-livecode wrote:

Parse1 is here always at least 30% faster than Parse2.


I'm seeing the same thing, only more so. I searched for "the" in a 424K 
text file:


parse1 = 11 ms
parse2 = 111 ms

The text was imported into a field using the property inspector, which I 
assume converted it to UTF16 automatically.


--
Jacqueline Landman Gay | jac...@hyperactivesw.com
HyperActive Software   | http://www.hyperactivesw.com



RE: OMG text processing performance 6.7 - 9.5

2020-02-03 Thread Ralph DiMola via use-livecode
Also never forget the blazingly fast "filter" when appropriate.
 
Ralph DiMola
IT Director
Evergreen Information Services
rdim...@evergreeninfo.net


-Original Message-
From: use-livecode [mailto:use-livecode-boun...@lists.runrev.com] On Behalf
Of hh via use-livecode
Sent: Monday, February 03, 2020 3:19 PM
To: use-livecode@lists.runrev.com
Cc: hh
Subject: Re: OMG text processing performance 6.7 - 9.5

Parse1 is here always at least 30% faster than Parse2.

Yet another approach in LC 7/8/9 that I find to be very fast (especially for
a lot of hits in large strings, e.g. when searching for "and" or "the"):

-- Offset per ItemDelimiter
-- Searches for pStr in pSrc using pCase
function Parse0 pStr, pSrc, pCase
  set the caseSensitive to pCase
  set the itemDelimiter to pStr
  put the length of pStr into tN0
  put 1 into tN
  repeat for each item tI in pSrc
    add the length of tI to tN
    put tN & cr after tOut
    add tN0 to tN
  end repeat
  return line 1 to -2 of tOut
end Parse0

> Richard G. wrote:
> 
> -- Offset in place:
> function Parse1 pStr, pSrc
>   put 0 into tStart
>   repeat
>     put offset(pStr, pSrc, tStart) into tOS
>     if tOS = 0 then exit repeat
>     add tOS to tStart
>     put tStart & cr after tOut
>   end repeat
>   return tOut
> end Parse1
> 
> -- Delete as you go:
> function Parse2 pStr, pSrc
>   put 0 into tStart
>   repeat
>     put offset(pStr, pSrc) into tOS
>     if tOS = 0 then exit repeat
>     delete char 1 to tOS of pSrc
>     add tOS to tStart
>     put tStart & cr after tOut
>   end repeat
>   return tOut
> end Parse2



Re: OMG text processing performance 6.7 - 9.5

2020-02-03 Thread hh via use-livecode
Parse1 is here always at least 30% faster than Parse2.

Yet another approach in LC 7/8/9 that I find to be very
fast (especially for a lot of hits in large strings,
e.g. when searching for "and" or "the"):

-- Offset per ItemDelimiter
-- Searches for pStr in pSrc using pCase
function Parse0 pStr, pSrc, pCase
  set the caseSensitive to pCase
  set the itemDelimiter to pStr
  put the length of pStr into tN0
  put 1 into tN
  repeat for each item tI in pSrc
    add the length of tI to tN
    put tN & cr after tOut
    add tN0 to tN
  end repeat
  return line 1 to -2 of tOut
end Parse0
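
For comparison, the same delimiter trick sketched in Python: split on the 
search string, then accumulate segment lengths into 1-based offsets (function 
name hypothetical; splitting also sidesteps the trailing-delimiter edge case 
that an item loop can hit):

```python
# Split on the needle, then accumulate segment lengths to recover
# the 1-based offset of each (case-sensitive) occurrence.
def parse0(needle, src):
    offsets, pos = [], 1
    for segment in src.split(needle)[:-1]:   # last piece follows the final hit
        pos += len(segment)
        offsets.append(pos)
        pos += len(needle)
    return offsets

print(parse0("the", "the cat and the dog"))  # [1, 13]
```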

> Richard G. wrote:
> 
> -- Offset in place:
> function Parse1 pStr, pSrc
>   put 0 into tStart
>   repeat
>     put offset(pStr, pSrc, tStart) into tOS
>     if tOS = 0 then exit repeat
>     add tOS to tStart
>     put tStart & cr after tOut
>   end repeat
>   return tOut
> end Parse1
> 
> -- Delete as you go:
> function Parse2 pStr, pSrc
>   put 0 into tStart
>   repeat
>     put offset(pStr, pSrc) into tOS
>     if tOS = 0 then exit repeat
>     delete char 1 to tOS of pSrc
>     add tOS to tStart
>     put tStart & cr after tOut
>   end repeat
>   return tOut
> end Parse2



Re: OMG text processing performance 6.7 - 9.5

2020-02-03 Thread Richard Gaskin via use-livecode

Sannyasin Brahmanathaswami wrote:
hhmm…. how do you 


"truncate the string and
search from the beginning" 


??

Can you give a code snippet example?

BR

I found this as well. Another thing, it's faster to truncate the string and
search from the beginning than using a "start at" on the entire string when
searching for all occurrences of a string.



-- Offset in place:
function Parse1 pStr, pSrc
  put 0 into tStart
  repeat
    put offset(pStr, pSrc, tStart) into tOS
    if tOS = 0 then exit repeat
    add tOS to tStart
    put tStart & cr after tOut
  end repeat
  return tOut
end Parse1


-- Delete as you go:
function Parse2 pStr, pSrc
  put 0 into tStart
  repeat
    put offset(pStr, pSrc) into tOS
    if tOS = 0 then exit repeat
    delete char 1 to tOS of pSrc
    add tOS to tStart
    put tStart & cr after tOut
  end repeat
  return tOut
end Parse2


--
 Richard Gaskin
 Fourth World Systems
 Software Design and Development for the Desktop, Mobile, and the Web
 
 ambassa...@fourthworld.com
 http://www.FourthWorld.com



Re: OMG text processing performance 6.7 - 9.5

2020-02-03 Thread Sannyasin Brahmanathaswami via use-livecode
hhmm…. how do you 

"truncate the string and
search from the beginning" 

??

Can you give a code snippet example?

BR

I found this as well. Another thing, it's faster to truncate the string and
search from the beginning than using a "start at" on the entire string when
searching for all occurrences of a string.



Re: OMG text processing performance 6.7 - 9.5

2020-02-01 Thread Richard Gaskin via use-livecode

Ralph DiMola wrote:

I found this as well. Another thing: it's faster to truncate the string and
search from the beginning than to use a "start at" on the entire string when
searching for all occurrences of a string. This was counterintuitive to me
until Mark explained that skipping chars requires more work, because
repetitive skipping of Unicode chars is slower than many "memcpy"s on very
long strings.


Very interesting - thanks for noting that.  It's counterintuitive, but 
it sure shows:


I ran a quick test searching for all offsets for the string "Spitz" in 
the Gutenberg text file for Jack London's "Call of the Wild" - here are 
the results:


377 ms for 'start at'
 56 ms for deletion

--
 Richard Gaskin
 Fourth World Systems




Re: OMG text processing performance 6.7 - 9.5

2020-01-31 Thread Bob Sneidar via use-livecode
That's what I suspect as well. Thanks. 

Bob S


> On Jan 31, 2020, at 09:02 , Mark Waddingham via use-livecode 
>  wrote:
> 
> The code on the LC side is the same (engine and client drivers) so it’s 
> almost certainly hardware / OS causing the difference.
> 
> Warmest Regards,
> 
> Mark



Re: OMG text processing performance 6.7 - 9.5

2020-01-31 Thread Mark Waddingham via use-livecode
I don’t think you should read too much into differences of as little as 10 
ticks - the error in time measurement for a single run would be too high.

It seems to make sense that VMs would do the same task slower than the 
machine that they run on, so I don’t think that’s a very interesting 
comparison.

Your Windows workstation is an i3 - I suspect it only has two cores - and 
SSDs, whilst fast, can only do one thing at a time. When you are running the 
database and client on the same machine, I suspect there’s more contention 
than when client and server are separate machines... 

The code on the LC side is the same (engine and client drivers) so it’s almost 
certainly hardware / OS causing the difference.

Warmest Regards,

Mark


Sent from my iPhone

> On 31 Jan 2020, at 15:54, Bob Sneidar via use-livecode 
>  wrote:
> 
> I take that back, it's a 32-bit Windows OS (dunno why I even still have this 
> PC). But bitness is not going to affect a single network thread. Also the 
> processor is an i3 running at 3.1 GHz; my Mac is an i7 running at 2.3 GHz. 
> And my Mac is clamped to 100 Mb networking due to the nature of our VoIP 
> phones. The standalone workstation is gigabit and, as I mentioned, IS the 
> SQL server itself. 
> 
> Bob S
> 
> 
>> On Jan 31, 2020, at 07:47 , Bob Sneidar via use-livecode 
>>  wrote:
>> 
>> Not so fast. On a standalone workstation Windows 7 64bit 16gig memory and an 
>> SSD: 64 ticks. Compared with 14 ticks on my Mac OS X. My Parallels VM is 
>> outperforming a workstation. Oh, and the Windows workstation? It's the 
>> workstation running the mySQL instance!!!
>> 
>> That's exactly my point. It is decidedly NOT like-with-like, but it has 
>> nothing to do with virtualization. The Windows LC is running very much 
>> slower than the Mac OS, and I'm not blaming LC. It may be Windows itself. I 
>> disabled my Antivirus. Got it down to 57 ticks. 
>> 
>> Bob S
>> 
>> 
 On Jan 31, 2020, at 03:06 , Mark Waddingham via use-livecode 
  wrote:
>>> 
>>> That’s not comparing like-with-like though - you are comparing VMs running 
>>> Windows on your Mac with your Mac by the sound of it... VMs introduce a 
>>> fair bit of overhead for all I/O (and also for some code - depending on the 
>>> age of your CPU and the virtualisation support it has).
>>> 
>>> Mark
>>> 
>>> Sent from my iPhone
> 


Re: OMG text processing performance 6.7 - 9.5

2020-01-31 Thread Bob Sneidar via use-livecode
I take that back - it's a 32-bit Windows OS (dunno why I even still have this 
PC). But bitness is not going to affect a single network thread. Also the 
processor is an i3 running at 3.1 GHz; my Mac is an i7 running at 2.3 GHz. Also my 
Mac is clamped to 100 Mbit networking due to the nature of our VoIP phones. The 
standalone workstation is gigabit and, as I mentioned, IS the SQL server itself. 

Bob S


> On Jan 31, 2020, at 07:47 , Bob Sneidar via use-livecode 
>  wrote:
> 
> Not so fast. On a standalone workstation Windows 7 64bit 16gig memory and an 
> SSD: 64 ticks. Compared with 14 ticks on my Mac OS X. My Parallels VM is 
> outperforming a workstation. Oh, and the Windows workstation? It's the 
> workstation running the mySQL instance!!!
> 
> That's exactly my point. It is decidedly NOT like-with-like, but it has 
> nothing to do with virtualization. The Windows LC is running very much slower 
> than the Mac OS, and I'm not blaming LC. It may be Windows itself. I disabled 
> my Antivirus. Got it down to 57 ticks. 
> 
> Bob S
> 
> 
>> On Jan 31, 2020, at 03:06 , Mark Waddingham via use-livecode 
>>  wrote:
>> 
>> That’s not comparing like-with-like though - you are comparing VMs running 
>> Windows on your Mac with your Mac by the sound of it... VMs introduce a fair 
>> bit of overhead for all I/O (and also for some code - depending on the age 
>> of your CPU and the virtualisation support it has).
>> 
>> Mark
>> 
>> Sent from my iPhone



Re: OMG text processing performance 6.7 - 9.5

2020-01-31 Thread Bob Sneidar via use-livecode
Not so fast. On a standalone workstation - Windows 7 64-bit, 16 GB memory and an 
SSD: 64 ticks. Compared with 14 ticks on my Mac OS X. My Parallels VM is 
outperforming a workstation. Oh, and the Windows workstation? It's the 
workstation running the MySQL instance!!!

That's exactly my point. It is decidedly NOT like-with-like, but it has nothing 
to do with virtualization. The Windows LC is running very much slower than the 
Mac OS one, and I'm not blaming LC. It may be Windows itself. I disabled my 
Antivirus. Got it down to 57 ticks. 

Bob S


> On Jan 31, 2020, at 03:06 , Mark Waddingham via use-livecode 
>  wrote:
> 
> That’s not comparing like-with-like though - you are comparing VMs running 
> Windows on your Mac with your Mac by the sound of it... VMs introduce a fair 
> bit of overhead for all I/O (and also for some code - depending on the age of 
> your CPU and the virtualisation support it has).
> 
> Mark
> 
> Sent from my iPhone



Re: OMG text processing performance 6.7 - 9.5

2020-01-31 Thread Niggemann, Bernd via use-livecode
Ben,

If you have access to a business-license you could use "script profiling" on a 
small but representative sample of your data and see where the bottlenecks are. 
If you find any you could try to optimize that part.

"script profiling" adds its own overhead to the processing time (roughly 
doubles it) but if you keep the sample small it is well worth it in my 
experience.

Kind regards
Bernd


Re: OMG text processing performance 6.7 - 9.5

2020-01-31 Thread Ben Rubinstein via use-livecode

Ooh, that's very likely to be a good tip.

(I think the database stuff is a red herring, by the way: the bulk of the 
delay is in the processing stage, which is just reading text files in and out.)


Thanks everyone for their input. I'll report back with what I can find (unless 
I can persuade IT to let me use the 32-bit drivers!)


Ben

On 30/01/2020 22:41, Ralph DiMola via use-livecode wrote:

I found this as well. Another thing: it's faster to truncate the string and
search from the beginning than to use a "start at" on the entire string when
searching for all occurrences of a string. This was counterintuitive to me
until Mark explained that skipping chars requires more work, because
repetitive skipping of Unicode chars is slower than many "memcpy"s on very
long strings.

Ralph DiMola
IT Director
Evergreen Information Services
rdim...@evergreeninfo.net


-Original Message-
From: use-livecode [mailto:use-livecode-boun...@lists.runrev.com] On Behalf
Of Neville via use-livecode
Sent: Thursday, January 30, 2020 4:49 PM
To: use-livecode@lists.runrev.com
Cc: Neville
Subject: Re: OMG text processing performance 6.7 - 9.5

Are you perchance using lineOffset searches? I have found that lineOffset
performance on utf8 text degrades exponentially with the length of the file,
presumably as it searches for line breaks. Use offset instead which remains
fast (and much faster still if you can search on the raw text before
textencoding, then utf8 encode the found chunks)







Re: OMG text processing performance 6.7 - 9.5

2020-01-31 Thread Mark Waddingham via use-livecode
That’s not comparing like-with-like though - you are comparing VMs running 
Windows on your Mac with your Mac by the sound of it... VMs introduce a fair 
bit of overhead for all I/O (and also for some code - depending on the age of 
your CPU and the virtualisation support it has).

Mark

Sent from my iPhone

> On 31 Jan 2020, at 01:02, Bob Sneidar via use-livecode 
>  wrote:
> 
> Hi Mark. 
> 
> I have to chime in here that the difference between OS X apps accessing a sql 
> database and Windows doing the same thing in the same app is substantial, and 
> I cannot think why, unless it is Windows itself causing the problem. Querying 
> my customer database for all my customer records takes 14 ticks on OSX. On 
> Windows running in my Parallels it takes 59. On a server 2012 running in 
> VMWare it takes 71! 
> 
> This has made my port to Windows almost unviable. If I had some kind of real 
> answer as to what was causing this kind of lag, it would help. 
> 
> Bob S
> 
> 
> 
>> On Jan 30, 2020, at 06:04 , Mark Waddingham via use-livecode 
>>  wrote:
>> 
>> 
>> Warmest Regards,
>> 
>> Mark.
>> 
>> -- 
>> Mark Waddingham ~ m...@livecode.com ~ http://www.livecode.com/
>> LiveCode: Everyone can create apps
> 
> 




Re: OMG text processing performance 6.7 - 9.5

2020-01-30 Thread Bob Sneidar via use-livecode
Hi Mark. 

I have to chime in here that the difference between OS X apps accessing a SQL 
database and Windows doing the same thing in the same app is substantial, and I 
cannot think why, unless it is Windows itself causing the problem. Querying my 
customer database for all my customer records takes 14 ticks on OS X. On Windows 
running in my Parallels VM it takes 59. On a Server 2012 running in VMWare it 
takes 71! 

This has made my port to Windows almost unviable. If I had some kind of real 
answer as to what was causing this kind of lag, it would help. 

Bob S



> On Jan 30, 2020, at 06:04 , Mark Waddingham via use-livecode 
>  wrote:
> 
> 
> Warmest Regards,
> 
> Mark.
> 
> -- 
> Mark Waddingham ~ m...@livecode.com ~ http://www.livecode.com/
> LiveCode: Everyone can create apps




RE: OMG text processing performance 6.7 - 9.5

2020-01-30 Thread Ralph DiMola via use-livecode
I found this as well. Another thing: it's faster to truncate the string and
search from the beginning than to use a "start at" on the entire string when
searching for all occurrences of a string. This was counterintuitive to me
until Mark explained that skipping chars requires more work, because
repetitive skipping of Unicode chars is slower than many "memcpy"s on very
long strings.
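
The truncate-and-search pattern described above might look like this (a sketch, untested; tNeedle, tText and the other variable names are illustrative, not from anyone's actual code):

```livecode
-- After each hit, delete everything up to and including it, so the
-- next offset() call scans a fresh, shorter string from char 1.
local tNeedle, tText, tPos, tAbsolute, tFound
put 0 into tAbsolute
repeat forever
   put offset(tNeedle, tText) into tPos
   if tPos = 0 then exit repeat
   add tPos to tAbsolute -- convert the relative hit to an absolute position
   put tAbsolute & comma after tFound
   delete char 1 to tPos of tText -- truncate instead of skipping
end repeat
```

Each iteration searches from the start of a shorter string, which avoids the repeated skip-over-Unicode-chars work that Mark describes.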

Ralph DiMola
IT Director
Evergreen Information Services
rdim...@evergreeninfo.net


-Original Message-
From: use-livecode [mailto:use-livecode-boun...@lists.runrev.com] On Behalf
Of Neville via use-livecode
Sent: Thursday, January 30, 2020 4:49 PM
To: use-livecode@lists.runrev.com
Cc: Neville
Subject: Re: OMG text processing performance 6.7 - 9.5

Are you perchance using lineOffset searches? I have found that lineOffset
performance on utf8 text degrades exponentially with the length of the file,
presumably as it searches for line breaks. Use offset instead which remains
fast (and much faster still if you can search on the raw text before
textencoding, then utf8 encode the found chunks)




Re: OMG text processing performance 6.7 - 9.5

2020-01-30 Thread Neville via use-livecode
Are you perchance using lineOffset searches? I have found that lineOffset 
performance on utf8 text degrades exponentially with the length of the file, 
presumably as it searches for line breaks. Use offset instead, which remains 
fast (and much faster still if you can search on the raw text before 
textEncoding, then utf8-encode the found chunks)
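
A minimal sketch of that offset-with-skip approach (untested; tNeedle and tRawText are illustrative - offset's third parameter is the number of chars to skip, so each hit's absolute position is the skip count plus the returned value):

```livecode
-- Collect the absolute positions of every occurrence of tNeedle,
-- searching the raw text before any textEncode call.
local tNeedle, tRawText, tSkip, tHit, tHits
put 0 into tSkip
repeat forever
   put offset(tNeedle, tRawText, tSkip) into tHit
   if tHit = 0 then exit repeat
   add tHit to tSkip -- tSkip is now the absolute position of this hit
   put tSkip & comma after tHits
end repeat
if tHits is not empty then delete the last char of tHits -- drop trailing comma
```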


Re: OMG text processing performance 6.7 - 9.5

2020-01-30 Thread Matthias Rebbe via use-livecode
Ben,

what DB are you connecting to?

We are running here a VM with Windows 2019 and MS SQL 2017. On a Windows 10 
64-bit VM we are using the 32-bit Microsoft ODBC Driver 11 for SQL Server to 
connect from our 32-bit LC standalone to the MSSQL server, although the 64-bit 
ODBC Driver 11 is installed. But I cannot remember if the 32-bit driver was 
installed separately or was installed automatically along with the 64-bit ODBC 
Driver 11.

-
Matthias Rebbe
Life Is Too Short For Boring Code

> Am 30.01.2020 um 14:20 schrieb Ben Rubinstein via use-livecode 
> :
> 
> I'm looking for hints about where the speed has gone in the current 
> (Unicode era) LiveCode text processing. I've been vaguely aware that text 
> processing performance suffered in the transition, but haven't needed to 
> focus on it before.
> 
> The context is that I'm finally forced to replace an app that's been 
> processing data for a client for well over a decade. To date the standalone 
> has been built on LC 6.7.11; but now we need to put it on a new platform with 
> 64-bit database drivers. The performance has gone through the floor, through 
> the floors below, through the foundations, and is on its way to the centre of 
> the earth.
> 
> The first stage of the app - which retrieves a load of data from various 
> databases and online sources, does minimal processing on it, and dumps it to 
> cache files - is approx 2x slower. The main core of the app, which loads this 
> data in and does a vast amount of processing on it to generate various output 
> data and reports, has gone from 12 minutes to over *six hours*.
> 
> (The server itself is different, and running Windows Server 2016 rather than 
> Windows Server 2008, but they're both VMs, quite likely on the 
> same underlying hardware, and if anything I'd expect the new server to be 
> more performant. Of course I assume that a new version of the OS will always 
> be slower.)
> 
> The coding is gnarly - the oldest parts are probably at least 15 years old - 
> and I've no doubt it could be made more efficient; but we don't have time or 
> budget to rewrite it all. So, are there known gotchas, functions which have 
> taken a much greater hit than others, that I could concentrate on to get the 
> most ROI in speeding this up?
> 







Re: OMG text processing performance 6.7 - 9.5

2020-01-30 Thread Mark Waddingham via use-livecode

On 2020-01-30 14:38, Ben Rubinstein via use-livecode wrote:

Hi Mark,

Thanks for taking the time to reply!

I'm indeed currently in the process of seeing whether I can persuade
the client's IT department to install the 32-bit drivers on the new
VM. I'm optimistic that will buy me some time, but it won't be a
complete solution because they outsource support to a third company,
which has warned that it doesn't intend to support the 32-bit drivers
much longer (apparently they're just waiting for Crystal Reports to be
updated!).


Ah! From that I'm guessing you are using the ODBC revdb driver - which needs a 
third-party ODBC connector :)


And if that fails, one of my options is as you suggest to use the LC
9.5-built app to retrieve the data through the 64-bit drivers, and the
LC 6.7-built app to process and (probably) build it. It will be
shonky.


It doesn't have to be 'shonky' - if the fetch-from-database part is already 
separated from the data-processing part through cache files (i.e. fetch writes 
to files on disk, data-process reads said files and processes) then you could 
build a 64-bit Windows standalone which is the fetch-from-database part, which 
is then called by the data-process part using shell (or open process).

Of course, it would be slightly cleaner to all be one app :)
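
That split might be sketched like this (illustrative only - the standalone name, path and cache-folder argument are assumptions, not anything from the app under discussion):

```livecode
-- The 32-bit processing app invokes a separate 64-bit fetch standalone,
-- which writes the cache files; processing then continues as before.
local tCacheFolder, tCmd
put specialFolderPath("documents") & "/cache" into tCacheFolder
put quote & "C:\apps\FetchData.exe" & quote && quote & tCacheFolder & quote into tCmd
get shell(tCmd)
if the result is not empty then
   answer error "Fetch step failed with exit code" && the result
end if
-- ...now read the cache files and process as the 6.7-era code always did...
```

After shell() returns, the result holds the process exit status when it is non-zero, which gives the caller a simple success check.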


However, what you say certainly makes me feel more optimistic that
something should be possible. There's really very little going on in
the way of binary<->text conversion; there probably is a fair amount
of word chunking - although half the work is about tracing
cross-references etc, there's also a fair amount of processing of
'prose' and prose-like text. However, the nature of the text is that
although 99% of it is probably ASCII, in any given table of text there
will be just a few 'extended' characters - does that mean it all gets
treated as four-byte data?


Binary<->text can be quite subtle - as it isn't something you had to think 
about before 6.7. For example, if you are fetching using *b via revDB from the 
database, then *that* will now be binary data - not text. (Indeed, what 
accessors are you using to get the data?)

Also, things like binfile and reading for binary (from a file) will also
produce binary rather than text.

You can test for binary data using 'is strictly a binary string'.

Native encoding means (on Windows at least) anything which fits into Latin-1, 
so any text you are getting out of revDB from the database should come through 
as native strings.
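
For example, the conversion can be done explicitly, once, up front (a sketch - tData here is just an illustrative variable standing for whatever was fetched, not any particular app's accessor):

```livecode
-- Convert an implicit binary string to text explicitly, once,
-- rather than paying repeated implicit conversions inside a loop.
if tData is strictly a binary string then
   put textDecode(tData, "UTF-8") into tData
end if
-- ...chunk operations on tData now work on text, not binary...
```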

Native strings get converted to unicode internally when they are combined with 
a string which contains unicode, and in two other places:
  1) Using matchText / replaceText (because we use the utf-16 variant of PCRE)
  2) When put into a field (because all text layout APIs on all platforms use UTF-16)


What sort of text operations are you using for 'tracing cross-references etc' 
and 'processing of prose and prose-like text'?


I'll see how the negotiations with IT get on...


Good luck!

Warmest Regards,

Mark.

--
Mark Waddingham ~ m...@livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps



Re: OMG text processing performance 6.7 - 9.5

2020-01-30 Thread Ben Rubinstein via use-livecode

Hi Mark,

Thanks for taking the time to reply!

I'm indeed currently in the process of seeing whether I can persuade the 
client's IT department to install the 32-bit drivers on the new VM. I'm 
optimistic that will buy me some time, but it won't be a complete solution 
because they outsource support to a third company, which has warned that it 
doesn't intend to support the 32-bit drivers much longer (apparently they're 
just waiting for Crystal Reports to be updated!).


And if that fails, one of my options is as you suggest to use the LC 9.5-built 
app to retrieve the data through the 64-bit drivers, and the LC 6.7-built 
app to process and (probably) build it. It will be shonky.


However, what you say certainly makes me feel more optimistic that something 
should be possible. There's really very little going on in the way of 
binary<->text conversion; there probably is a fair amount of word chunking - 
although half the work is about tracing cross-references etc, there's also a 
fair amount of processing of 'prose' and prose-like text. However, the nature 
of the text is that although 99% of it is probably ASCII, in any given table 
of text there will be just a few 'extended' characters - does that mean it all 
gets treated as four-byte data?


I'll see how the negotiations with IT get on...

Ben

On 30/01/2020 14:04, Mark Waddingham via use-livecode wrote:

On 2020-01-30 13:20, Ben Rubinstein via use-livecode wrote:

The context is that I'm finally forced to replace an app that's been
processing data for a client for well over a decade. To date the
standalone has been built on LC 6.7.11; but now we need to put it on a
new platform with 64-bit database drivers. The performance has gone
through the floor, through the floors below, through the foundations,
and is on its way to the centre of the earth.


What's the need for 64-bit database drivers? i.e. What are you currently
using to talk to the database and why can you not continue to use a 32-bit
Windows standalone?


The first stage of the app - which retrieves a load of data from
various databases and online sources, does minimal processing on it,
and dumps it to cache files - is approx 2x slower. The main core of
the app, which loads this data in and does a vast amount of processing
on it to generate various output data and reports, has gone from 12
minutes to over *six hours*.


I suspect it is probably a couple of things which are being done uniformly
causing the problem rather than lots of things all over the place...

Where exactly is the data coming from? (at a high-level) what sorts
of operations are being performed on it? what sort of I/O is being performed?

The main one I can think of is implicit binary<->text conversions. In 6.7
and below binary data and text were the same thing - in 7+ they are distinct
types which require a conversion operation. The functions which were always
really returning/taking binary data now actually do.

e.g. textEncode / Decode, compress / decompress, binaryEncode / binaryDecode,
the byte chunk, repeat for each byte, numToByte

Given the app is coming from 6.7 vintage, it is unlikely that any of the new
unicode text codepaths would be hit (unless there's something odd going on
somewhere) as binary data converts to native encoded text - unless of course
the means by which the data is getting into the app is being taken as unicode
strings (without knowing the exact I/O going on I can't really see how this
could happen, but I can't rule it out).

In general, native text processing (item detection, comparison, containment
and such) is all as fast if not faster in the post-7 engines than 6.7 as I
spent quite a while specializing a lot of lower level routines to make sure
it was.

I do know the word chunk has been somewhat adversely affected, however, as
that was never optimized in the same way.


The coding is gnarly - the oldest parts are probably at least 15 years
old - and I've no doubt it could be made more efficient; but we don't
have time or budget to rewrite it all. So, are there known gotchas,
functions which have taken a much greater hit than others, that I
could concentrate on to get the most ROI in speeding this up?


Given that you don't have time nor budget to really touch the code in any
depth, would it not be best to avoid touching it at all and keep
it in 6.7.11? i.e. Do you really need to move to 9?

Could you split the app into the bit which does the database communication
and caching (assuming that *really* needs to be 64-bit) and the bit which
does the data processing (which could remain as 32-bit in 6.7.11)?

Note I should say that the reason I ask the above is not because of a lack
of confidence in getting your code to run as fast as it did before but
because of pure business reasoning - why spend time and money on something
which isn't necessarily really needed?

There's a difference between needing to update user-facing apps and true
back-office server apps after all - banks and 

Re: OMG text processing performance 6.7 - 9.5

2020-01-30 Thread Mark Waddingham via use-livecode

On 2020-01-30 13:20, Ben Rubinstein via use-livecode wrote:

The context is that I'm finally forced to replace an app that's been
processing data for a client for well over a decade. To date the
standalone has been built on LC 6.7.11; but now we need to put it on a
new platform with 64-bit database drivers. The performance has gone
through the floor, through the floors below, through the foundations,
and is on its way to the centre of the earth.


What's the need for 64-bit database drivers? i.e. What are you currently
using to talk to the database and why can you not continue to use a 32-bit
Windows standalone?


The first stage of the app - which retrieves a load of data from
various databases and online sources, does minimal processing on it,
and dumps it to cache files - is approx 2x slower. The main core of
the app, which loads this data in and does a vast amount of processing
on it to generate various output data and reports, has gone from 12
minutes to over *six hours*.


I suspect it is probably a couple of things which are being done uniformly 
causing the problem rather than lots of things all over the place...

Where exactly is the data coming from? (at a high level) what sorts of 
operations are being performed on it? what sort of I/O is being performed?


The main one I can think of is implicit binary<->text conversions. In 6.7 and 
below binary data and text were the same thing - in 7+ they are distinct types 
which require a conversion operation. The functions which were always really 
returning/taking binary data now actually do.

e.g. textEncode / Decode, compress / decompress, binaryEncode / binaryDecode, 
the byte chunk, repeat for each byte, numToByte

Given the app is coming from 6.7 vintage, it is unlikely that any of the new 
unicode text codepaths would be hit (unless there's something odd going on 
somewhere) as binary data converts to native encoded text - unless of course 
the means by which the data is getting into the app is being taken as unicode 
strings (without knowing the exact I/O going on I can't really see how this 
could happen, but I can't rule it out).

In general, native text processing (item detection, comparison, containment 
and such) is all as fast if not faster in the post-7 engines than 6.7, as I 
spent quite a while specializing a lot of lower-level routines to make sure 
it was.

I do know the word chunk has been somewhat adversely affected, however, as 
that was never optimized in the same way.


The coding is gnarly - the oldest parts are probably at least 15 years
old - and I've no doubt it could be made more efficient; but we don't
have time or budget to rewrite it all. So, are there known gotchas,
functions which have taken a much greater hit than others, that I
could concentrate on to get the most ROI in speeding this up?


Given that you don't have time nor budget to really touch the code in any 
depth, would it not be best to avoid touching it at all and keep it in 6.7.11? 
i.e. Do you really need to move to 9?

Could you split the app into the bit which does the database communication 
and caching (assuming that *really* needs to be 64-bit) and the bit which 
does the data processing (which could remain as 32-bit in 6.7.11)?

Note I should say that the reason I ask the above is not because of a lack of 
confidence in getting your code to run as fast as it did before, but because 
of pure business reasoning - why spend time and money on something which isn't 
necessarily really needed?

There's a difference between needing to update user-facing apps and true 
back-office server apps after all - banks and insurance companies still have 
software written on and running on machines which are decades old, because 
they work and the cost of keeping them running is vastly less than the cost 
to rewrite and replace!

Warmest Regards,

Mark.

--
Mark Waddingham ~ m...@livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps



OMG text processing performance 6.7 - 9.5

2020-01-30 Thread Ben Rubinstein via use-livecode
I'm looking for hints about where the speed has gone in the current (Unicode 
era) LiveCode text processing. I've been vaguely aware that text processing 
performance suffered in the transition, but haven't needed to focus on it before.


The context is that I'm finally forced to replace an app that's been 
processing data for a client for well over a decade. To date the standalone 
has been built on LC 6.7.11; but now we need to put it on a new platform with 
64-bit database drivers. The performance has gone through the floor, through 
the floors below, through the foundations, and is on its way to the centre of 
the earth.


The first stage of the app - which retrieves a load of data from various 
databases and online sources, does minimal processing on it, and dumps it to 
cache files - is approx 2x slower. The main core of the app, which loads this 
data in and does a vast amount of processing on it to generate various output 
data and reports, has gone from 12 minutes to over *six hours*.


(The server itself is different, and running Windows Server 2016 rather than 
Windows Server 2008, but they're both VMs, quite likely on the same underlying 
hardware, and if anything I'd expect the new server to be more performant. Of 
course I assume that a new version of the OS will always be slower.)


The coding is gnarly - the oldest parts are probably at least 15 years old - 
and I've no doubt it could be made more efficient; but we don't have time or 
budget to rewrite it all. So, are there known gotchas, functions which have 
taken a much greater hit than others, that I could concentrate on to get the 
most ROI in speeding this up?

