Something about speed

David Vaughan Sun, 25 Apr 2004 01:16:59 -0700

I have been engaged in an off-list conversation with Dar Scott, mostly around his solution to the Yavelow test. In the course of it we found out or re-confirmed a few things about Rev characteristics and I pass them on here for anyone who is interested. The information comprises a somewhat random collection.

On a file of decent size it is faster to strip all non-alphanumerics (the set ".,!#$%&'()*+-/:;<=>[EMAIL PROTECTED]|}~]" & quote) and use -replace- than it is to use a single -matchText- on every word in a -repeat for each- loop. The code would be something like:

constant unwanted=".,!#$%&'()*+-/:;<=>[EMAIL PROTECTED]|}~]"
put quote & unwanted into stripThisLot
repeat for each char i in stripThisLot
  replace i with space in whatever
end repeat

If you can limit to a subset (most text uses less than half that punctuation) then of course you save some more time. Stripping non-alphanumeric characters saves a lot of trouble with word handling as otherwise constructions like "elsewhere!--Oh!" might confuse matters by appearing as one word.

Handling items is about 20% faster than handling words. This advantage may be lost where you have to account for empty items, whereas words are never empty. For example, "elsewhere,,,Oh," vs "elsewhere Oh " have five items and two words respectively. Three of the items are empty.

The -replace- command (replace x with y in z) slows down with larger data sets in a highly predictable way, defined as: log(time) is a linear function of log(length(data)) To make the point, the log formula fits a regression of the experimental data to a level of 99.96% on data of length 1000 to 300,000 characters. Note that in this instance the log formula is not the same statement as "time is a linear function of length(data)".

Array insertion time is almost constant with array size, in arrays from ten to a million unique keys. Therefore, it is clearly faster to work on one large array than on multiple smaller arrays and then use union at the end. Sometimes, breaking a job into smaller pieces slows it down.

Stripping alphanumerics gives different answers. This may or may not be what you want. For example, below is Dar's code with minor modifications by me and some changes for consistency with Brian Yennie's code. It runs about 20% slower than Brian's solution and produces 15% more matches ;-). Not setting caseSensitive true would add another match. It all depends on what you want.

regards
David

constant nonWords=".,!#$%&'()*+-/:;<=>[EMAIL PROTECTED]|}~]" --Just so I remember the full set

on mouseUp local sTicks, sMS, tt, stList, unwanted, i, wordNumber, prevWord, w, stL local match, currentPos, numberMatches

  put empty into field "TheResults"
  set cursor to watch
  put the milliseconds into sMS
  lock screen

  -- Using direct data as we already know the data load takes 18mS
  put field "TargetText" into tt
  put field "SearchTextList" into stList

set caseSensitive to true -- Stripping this sub-set takes 6% of run time. put quote & ".,!&'()*-:;?" into unwanted repeat for each char i in unwanted replace i with space in tt replace i with space in stList end repeat -- Create set of single words and set of position sets for double words -- Following repeat loop takes just over 80% of run time put 1 into wordNumber put "$$$" into prevWord repeat for each word w in tt put true into boolA[w] put wordNumber & space after ttA[prevWord,w] add 1 to wordNumber put w into prevWord end repeat

-- Search -- Just over 13% of run time repeat for each line stLine in stList put empty into stL repeat for each word w in stLine put w & space after stL end repeat if the number of words in stL = 1 then put boolA[word 1 of stL] into match else repeat with i = 2 to the number of words in stL put (word (i-1) of stL),(word i of stL) into w if i=2 then put ttA[w] into currentPos else -- This function absorbs 11 points of the 13% put intersectPosSet(currentPos, ttA[w]) into currentPos end if put the number of words in currentPos into numberMatches if numberMatches is 0 then exit repeat end repeat put numberMatches is not zero into match end if if match then put stLine & cr after MatchList end if end repeat put matchList into field "TheResults" put timeit(sMS, "Check Matches") into sMS -- Sort lines and counting them is outside the timing sort lines of field "TheResults" ascending text put (the number of lines of field "TheResults") & lf before field "TheResults" end mouseUp

-- modified intersection function that adds one to elements of the first
function intersectPosSet cPos ttPos
  local newPos, p1, p2
  put empty into newPos
  put (word 1 of cPos)+1 into p1
  put word 1 of ttPos into p2
  repeat forever -- until (cPos is empty) or (ttPos is empty)
    if p1 < p2 then
      delete word 1 of cPos
      if cPos is empty then return newPos
      put (word 1 of cPos)+1 into p1
    else
      if p1 = p2 then
        put p1 & space after newPos
        delete word 1 of cPos
        if cPos is empty then return newPos
        delete word 1 of ttPos
        if ttPos is empty then return newPos
        put (word 1 of cPos)+1 into p1
        put word 1 of ttPos into p2
      else
        delete word 1 of ttPos
        if ttPos is empty then return newPos
        put word 1 of ttPos into p2
      end if
    end if
  end repeat
  return newPos
end intersectPosSet

function timeit tMS,tMSG get the milliseconds - tMS put it && "ms:" && tMSG && "(" & round(it/16.67) && "ticks)" & return before field "SpeedRecords" return the milliseconds end timeit

_______________________________________________
use-revolution mailing list
[EMAIL PROTECTED]
http://lists.runrev.com/mailman/listinfo/use-revolution

Something about speed

Reply via email to