On a file of decent size it is faster to strip all non-alphanumerics (the set ".,!#$%&'()*+-/:;<=>[EMAIL PROTECTED]|}~]" & quote) and use -replace- than it is to use a single -matchText- on every word in a -repeat for each- loop. The code would be something like:
constant unwanted=".,!#$%&'()*+-/:;<=>[EMAIL PROTECTED]|}~]" put quote & unwanted into stripThisLot repeat for each char i in stripThisLot replace i with space in whatever end repeat
If you can limit to a subset (most text uses less than half that punctuation) then of course you save some more time. Stripping non-alphanumeric characters saves a lot of trouble with word handling as otherwise constructions like "elsewhere!--Oh!" might confuse matters by appearing as one word.
Handling items is about 20% faster than handling words. This advantage may be lost where you have to account for empty items, whereas words are never empty. For example, "elsewhere,,,Oh," vs "elsewhere Oh " have five items and two words respectively. Three of the items are empty.
The -replace- command (replace x with y in z) slows down with larger data sets in a highly predictable way, defined as:
log(time) is a linear function of log(length(data))
To make the point, the log formula fits a regression of the experimental data to a level of 99.96% on data of length 1000 to 300,000 characters. Note that in this instance the log formula is not the same statement as "time is a linear function of length(data)".
Array insertion time is almost constant with array size, in arrays from ten to a million unique keys. Therefore, it is clearly faster to work on one large array than on multiple smaller arrays and then use union at the end. Sometimes, breaking a job into smaller pieces slows it down.
Stripping alphanumerics gives different answers. This may or may not be what you want. For example, below is Dar's code with minor modifications by me and some changes for consistency with Brian Yennie's code. It runs about 20% slower than Brian's solution and produces 15% more matches ;-). Not setting caseSensitive true would add another match. It all depends on what you want.
regards David
constant nonWords=".,!#$%&'()*+-/:;<=>[EMAIL PROTECTED]|}~]" --Just so I remember the full set
on mouseUp
local sTicks, sMS, tt, stList, unwanted, i, wordNumber, prevWord, w, stL
local match, currentPos, numberMatches
put empty into field "TheResults" set cursor to watch put the milliseconds into sMS lock screen
-- Using direct data as we already know the data load takes 18mS put field "TargetText" into tt put field "SearchTextList" into stList
set caseSensitive to true
-- Stripping this sub-set takes 6% of run time.
put quote & ".,!&'()*-:;?" into unwanted
repeat for each char i in unwanted
replace i with space in tt
replace i with space in stList
end repeat
-- Create set of single words and set of position sets for double words
-- Following repeat loop takes just over 80% of run time
put 1 into wordNumber
put "$$$" into prevWord
repeat for each word w in tt
put true into boolA[w]
put wordNumber & space after ttA[prevWord,w]
add 1 to wordNumber
put w into prevWord
end repeat
-- Search
-- Just over 13% of run time
repeat for each line stLine in stList
put empty into stL
repeat for each word w in stLine
put w & space after stL
end repeat
if the number of words in stL = 1 then
put boolA[word 1 of stL] into match
else
repeat with i = 2 to the number of words in stL
put (word (i-1) of stL),(word i of stL) into w
if i=2 then
put ttA[w] into currentPos
else
-- This function absorbs 11 points of the 13%
put intersectPosSet(currentPos, ttA[w]) into currentPos
end if
put the number of words in currentPos into numberMatches
if numberMatches is 0 then exit repeat
end repeat
put numberMatches is not zero into match
end if
if match then
put stLine & cr after MatchList
end if
end repeat
put matchList into field "TheResults"
put timeit(sMS, "Check Matches") into sMS
-- Sort lines and counting them is outside the timing
sort lines of field "TheResults" ascending text
put (the number of lines of field "TheResults") & lf before field "TheResults"
end mouseUp
-- modified intersection function that adds one to elements of the first function intersectPosSet cPos ttPos local newPos, p1, p2 put empty into newPos put (word 1 of cPos)+1 into p1 put word 1 of ttPos into p2 repeat forever -- until (cPos is empty) or (ttPos is empty) if p1 < p2 then delete word 1 of cPos if cPos is empty then return newPos put (word 1 of cPos)+1 into p1 else if p1 = p2 then put p1 & space after newPos delete word 1 of cPos if cPos is empty then return newPos delete word 1 of ttPos if ttPos is empty then return newPos put (word 1 of cPos)+1 into p1 put word 1 of ttPos into p2 else delete word 1 of ttPos if ttPos is empty then return newPos put word 1 of ttPos into p2 end if end if end repeat return newPos end intersectPosSet
function timeit tMS,tMSG
get the milliseconds - tMS
put it && "ms:" && tMSG && "(" & round(it/16.67) && "ticks)" & return before field "SpeedRecords"
return the milliseconds
end timeit
_______________________________________________ use-revolution mailing list [EMAIL PROTECTED] http://lists.runrev.com/mailman/listinfo/use-revolution