I have been engaged in an off-list conversation with Dar Scott, mostly around his solution to the Yavelow test. In the course of it we found out or re-confirmed a few things about Rev characteristics and I pass them on here for anyone who is interested. The information comprises a somewhat random collection.

On a file of decent size it is faster to strip all non-alphanumerics (the set ".,!#$%&'()*+-/:;<=>[EMAIL PROTECTED]|}~]" & quote) and use -replace- than it is to use a single -matchText- on every word in a -repeat for each- loop. The code would be something like:

constant unwanted=".,!#$%&'()*+-/:;<=>[EMAIL PROTECTED]|}~]"
put quote & unwanted into stripThisLot
repeat for each char i in stripThisLot
  replace i with space in whatever
end repeat

If you can limit to a subset (most text uses less than half that punctuation) then of course you save some more time. Stripping non-alphanumeric characters saves a lot of trouble with word handling as otherwise constructions like "elsewhere!--Oh!" might confuse matters by appearing as one word.

Handling items is about 20% faster than handling words. This advantage may be lost where you have to account for empty items, whereas words are never empty. For example, "elsewhere,,,Oh," vs "elsewhere Oh " have five items and two words respectively. Three of the items are empty.

The -replace- command (replace x with y in z) slows down with larger data sets in a highly predictable way, defined as:
log(time) is a linear function of log(length(data))
To make the point, the log formula fits a regression of the experimental data to a level of 99.96% on data of length 1000 to 300,000 characters. Note that in this instance the log formula is not the same statement as "time is a linear function of length(data)".


Array insertion time is almost constant with array size, in arrays from ten to a million unique keys. Therefore, it is clearly faster to work on one large array than on multiple smaller arrays and then use union at the end. Sometimes, breaking a job into smaller pieces slows it down.

Stripping alphanumerics gives different answers. This may or may not be what you want. For example, below is Dar's code with minor modifications by me and some changes for consistency with Brian Yennie's code. It runs about 20% slower than Brian's solution and produces 15% more matches ;-). Not setting caseSensitive true would add another match. It all depends on what you want.

regards
David

constant nonWords=".,!#$%&'()*+-/:;<=>[EMAIL PROTECTED]|}~]" --Just so I remember the full set

on mouseUp
local sTicks, sMS, tt, stList, unwanted, i, wordNumber, prevWord, w, stL
local match, currentPos, numberMatches


  put empty into field "TheResults"
  set cursor to watch
  put the milliseconds into sMS
  lock screen

  -- Using direct data as we already know the data load takes 18mS
  put field "TargetText" into tt
  put field "SearchTextList" into stList

set caseSensitive to true
-- Stripping this sub-set takes 6% of run time.
put quote & ".,!&'()*-:;?" into unwanted
repeat for each char i in unwanted
replace i with space in tt
replace i with space in stList
end repeat
-- Create set of single words and set of position sets for double words
-- Following repeat loop takes just over 80% of run time
put 1 into wordNumber
put "$$$" into prevWord
repeat for each word w in tt
put true into boolA[w]
put wordNumber & space after ttA[prevWord,w]
add 1 to wordNumber
put w into prevWord
end repeat


-- Search
-- Just over 13% of run time
repeat for each line stLine in stList
put empty into stL
repeat for each word w in stLine
put w & space after stL
end repeat
if the number of words in stL = 1 then
put boolA[word 1 of stL] into match
else
repeat with i = 2 to the number of words in stL
put (word (i-1) of stL),(word i of stL) into w
if i=2 then
put ttA[w] into currentPos
else
-- This function absorbs 11 points of the 13%
put intersectPosSet(currentPos, ttA[w]) into currentPos
end if
put the number of words in currentPos into numberMatches
if numberMatches is 0 then exit repeat
end repeat
put numberMatches is not zero into match
end if
if match then
put stLine & cr after MatchList
end if
end repeat
put matchList into field "TheResults"
put timeit(sMS, "Check Matches") into sMS
-- Sort lines and counting them is outside the timing
sort lines of field "TheResults" ascending text
put (the number of lines of field "TheResults") & lf before field "TheResults"
end mouseUp


-- modified intersection function that adds one to elements of the first
function intersectPosSet cPos ttPos
  local newPos, p1, p2
  put empty into newPos
  put (word 1 of cPos)+1 into p1
  put word 1 of ttPos into p2
  repeat forever -- until (cPos is empty) or (ttPos is empty)
    if p1 < p2 then
      delete word 1 of cPos
      if cPos is empty then return newPos
      put (word 1 of cPos)+1 into p1
    else
      if p1 = p2 then
        put p1 & space after newPos
        delete word 1 of cPos
        if cPos is empty then return newPos
        delete word 1 of ttPos
        if ttPos is empty then return newPos
        put (word 1 of cPos)+1 into p1
        put word 1 of ttPos into p2
      else
        delete word 1 of ttPos
        if ttPos is empty then return newPos
        put word 1 of ttPos into p2
      end if
    end if
  end repeat
  return newPos
end intersectPosSet

function timeit tMS,tMSG
get the milliseconds - tMS
put it && "ms:" && tMSG && "(" & round(it/16.67) && "ticks)" & return before field "SpeedRecords"
return the milliseconds
end timeit


_______________________________________________
use-revolution mailing list
[EMAIL PROTECTED]
http://lists.runrev.com/mailman/listinfo/use-revolution

Reply via email to