Obviously, when considering names of places such as Colchester, Rochester and Chester one has
to search for the longer names first and exclude them from later searches.

Richmond.

On 1/9/2018 12:59 pm, Mark Waddingham via use-livecode wrote:
On 2018-09-01 06:48, Stephen MacLean via use-livecode wrote:
Hi All,

First, followed Keith Clarke’s thread and got a lot out of it, thank
you all. That’s gone into my code snippets!

Now I know, the title is not technically true, if it’s 2 words, they
are distinct and different. Maybe it’s because I’ve been banging my
head against this and some other things too long and need to step
back, but I’m having issues getting this all to work reliably.

I’m searching for town names in various text from a list of towns.
Most names are one word, easy to find and count. Some names are 2 or 3
words, like East Hartford or West Palm Beach. Those go against
distinct towns like Hartford and Palm Beach. Others have their names
inside of other town names like Colchester and Chester.

So the problem you are trying to solve sounds like this:

Given a source text TEXT, and a list of multi-word phrases PHRASES, find the longest elements of PHRASES which occur in TEXT when reading from left to right.

One way to do this is to preprocess the source TEXT and PHRASES, and then iterate over the text with back-tracking, attempting to match each phrase in the list at each position.

Preprocessing can be done like this:

  // pText is arbitrary language text, where it is presumed 'trueWord' will
  // extract the words we can match against those in PHRASES
  command preprocessText pText, @rWords
    local tWords
    repeat for each trueWord tWord in pText
      -- normalize word variants - e.g. turn Chester's into Chester
      if tWord ends with "'s" then
        put char 1 to -3 of tWord into tWord
      else if ... then
        ...
      else if ... then
        ...
      end if
      put tWord into tWords[the number of elements in tWords + 1]
    end repeat
    put tWords into rWords
  end preprocessText

This gives a sequence of words, in order - where word variants have been normalized to the 'root' word (the general operation here is called 'stemming' - in your case, as you are dealing with fragments of proper nouns, stripping 's / s suffixes is probably good enough).
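For anyone who wants to prototype the idea outside LiveCode, the same normalization step can be sketched in Python (the function names, the tokenizing regex, and the possessive-only stemming rule are my own assumptions, not anything from the engine):

```python
import re

def normalize_word(word):
    # Strip a possessive suffix so "Chester's" matches "Chester"
    if word.endswith("'s"):
        return word[:-2]
    return word

def preprocess_text(text):
    # Extract word-like tokens, then normalize each one
    return [normalize_word(w) for w in re.findall(r"[A-Za-z']+", text)]
```

The regex here stands in for 'trueWord'; a real tokenizer would need to handle hyphens, digits and so on.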

The processing for PHRASES is needed to ensure that they all follow a consistent form:

  // pPhrases is presumed to be a return-delimited list of phrases
  command preprocessPhrases pPhrases, @rPhrases
    -- We accumulate phrases as the keys of tPhrasesA to eliminate duplicates
    local tPhrasesA
    put empty into tPhrasesA

    repeat for each line tLine in pPhrases
      local tPhrase
      put empty into tPhrase
      repeat for each trueWord tWord in tLine
        put tWord & space after tPhrase
      end repeat
      delete the last char of tPhrase
      put true into tPhrasesA[tPhrase]
    end repeat

    put the keys of tPhrasesA into rPhrases
  end preprocessPhrases

This produces a return-delimited list of phrases, where the individual words in each phrase are separated by a *single* space with all punctuation stripped, and no phrase appears twice.

With this pre-processing in place (note that the PHRASES pre-processing only needs to be done once for any set of PHRASES to match), a naive search algorithm would be:

  // pText should be a sequence array of words to search (we use an array
  // here because we need fast random access)
  // pPhrases should be a line-delimited string-list of multi-word phrases to find
  // rMatches will be a string-list of phrases which have been found
  command searchTextForPhrases pText, pPhrases, @rMatches
    local tMatchesA
    put empty into tMatchesA

    -- Our phrases are single-space delimited, so set the item delimiter
    set the itemDelimiter to space

    -- Loop through pText, by default we bump tIndex by one each time -
    -- however, if a match is found, then we can skip the words constituting
    -- the matched phrase.
    local tIndex
    put 1 into tIndex
    repeat until pText[tIndex] is empty
      -- Store the current longest match we have found starting at tIndex
      local tCurrentMatch
      put empty into tCurrentMatch

      -- Check each phrase in turn for a match.
      repeat for each line tPhrase in pPhrases
        -- Assume a match succeeds until it doesn't
        local tPhraseMatched
        put true into tPhraseMatched

        -- Iterate through the items (words) in each phrase; if the sequence of
        -- words in the phrase is not the same as the sequence of words in the
        -- text starting at tIndex, then tPhraseMatched will be false on exit
        -- of the loop.
        local tSubIndex
        put tIndex into tSubIndex
        repeat for each item tWord in tPhrase
          -- Failure to match the word at tSubIndex is failure to match the phrase
          if pText[tSubIndex] is not tWord then
            put false into tPhraseMatched
            exit repeat
          end if

          -- The current word of the phrase matches, so move to the next
          add 1 to tSubIndex
        end repeat

        -- We are only interested in the longest match at any point, so only
        -- update the current match if it is longer.
        if tPhraseMatched and the number of items in tPhrase > the number of items in tCurrentMatch then
          put tPhrase into tCurrentMatch
        end if
      end repeat

      -- If a match was found, then we have used up those words in pText, otherwise
      -- we start the search again at the next word in pText.
      if tCurrentMatch is not empty then
        add the number of items in tCurrentMatch to tIndex
        put true into tMatchesA[tCurrentMatch]
      else
        add 1 to tIndex
      end if
    end repeat

    -- At this point, the matched phrases are simply the keys of tMatchesA
    put the keys of tMatchesA into rMatches
  end searchTextForPhrases

Complexity-wise, the above requires roughly N*M steps - where N is the number of words in pText and M is the number of words in pPhrases.
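For reference, here is a rough Python translation of the same naive longest-match scan (function and variable names are my own; it assumes the words and phrases have already been normalized as above):

```python
def search_text_for_phrases(words, phrases):
    # words: list of normalized words; phrases: list of space-separated phrases
    phrase_words = [p.split(" ") for p in phrases]
    matches = set()
    i = 0
    while i < len(words):
        current = None
        for pw in phrase_words:
            # Keep only the longest phrase matching at position i
            if words[i:i + len(pw)] == pw and (current is None or len(pw) > len(current)):
                current = pw
        if current is not None:
            matches.add(" ".join(current))
            i += len(current)  # skip the words consumed by the match
        else:
            i += 1
    return matches
```

Because matching is word-by-word, "Chester" does not match inside "Colchester", and "East Hartford" wins over "Hartford" when both match at the same position.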

An immediate improvement can be made by sorting pPhrases descending by the number of items - this eliminates the need to check the phrase match length: the first match will always be the longest, meaning that once you have matched, you don't need to keep iterating through the phrases.

    ...
    put the keys of tPhrasesA into rPhrases
    -- items are space-delimited words, so the item count is the word count
    set the itemDelimiter to space
    sort lines of rPhrases numeric descending by the number of items in each
  end preprocessPhrases

        ...
        -- We are only interested in the longest match at any point; as the
        -- phrases are now sorted longest-first, the first match is the longest.
        if tPhraseMatched then
          put tPhrase into tCurrentMatch
          exit repeat
        end if
      end repeat

There's a lot more that can be done here to make the above a great deal more efficient, algorithmically speaking. Indeed, the best you can achieve is probably N*K steps for a source text containing N words - where K is the maximum difference in length between any two phrases which share a common prefix.
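One way to approach that bound is to index the phrases in a trie keyed by word, so that at each text position all phrases sharing a common prefix are checked in a single walk. This is my own suggestion rather than anything from the post above; a hedged Python sketch (the trie layout and names are invented for illustration):

```python
END = object()  # sentinel key marking a complete phrase in the trie

def build_trie(phrases):
    root = {}
    for phrase in phrases:
        node = root
        for word in phrase.split(" "):
            node = node.setdefault(word, {})
        node[END] = phrase
    return root

def search_with_trie(words, trie):
    matches = set()
    i = 0
    while i < len(words):
        node, j, longest = trie, i, None
        # Walk the trie as far as the text allows, remembering the last
        # complete phrase seen - that is the longest match at position i.
        while j < len(words) and words[j] in node:
            node = node[words[j]]
            j += 1
            if END in node:
                longest = (node[END], j)
        if longest is not None:
            matches.add(longest[0])
            i = longest[1]
        else:
            i += 1
    return matches
```

Each walk stops as soon as the text diverges from every phrase, so shared prefixes such as "West Palm" in "West Palm Beach" are compared only once per position.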

Warmest Regards,

Mark.


_______________________________________________
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode
