Update : the following code should be more efficient for getting the actual substring(s) replacing the loop starting "For($i;1;$Size)".
The following code, uses the built up 2D array to determine where the longest substring(s) end, this gives a position in the shorter string where the common string ends. Then we 'walk backwards' over the shorter string by common string length -1 to build the common substring(s). This avoids a lot of thrashing with position and substring. ARRAY TEXT($aMatch;0) For ($i;1;$m) //$m - is length of shorter string/dimension of 2D array $Found:=Find in array($LCSub{$i};$longest) If ($Found>0) $string:="" For ($j;1;$longest) //Build up the common substring $String:=$shorter[[$Found-$j+1]]+$string End for append to array($aMatch;$String) // add to array to return End if End for On Mon, 26 Aug 2019 13:55:02 -0400, Chip Scheide via 4D_Tech wrote: > Keith, > Thanks for you input!! > we seem to have a nice collaboration on text utilities :) > > I ran the code you provided, and found an issue. > Given: > "This is my dog" & "My Dog does not have fleas" > > - Your initial code generates NO matches - you use Position with Case > sensitivity ON, and as for position > 1, missing matches starting at > position 1. > - With out case sensitivity, & Position > 0, Your code generates the > longest common string as length 5, and "my do" > This is because you end up truncating the last character comparison due > to using [[$i-1]] > > Follows, modified code to handle requests for both case sensitivity, > and not, as well as removing the truncation issue. > > > I am thinking that it might be slightly more efficient to: > - compare the first 2 characters -- removing 1 iteration over the > longer string. But it might not be worth the housekeeping to code that. > - to build the final longest string as you go, rather then reiterate > back over the strings looking for matches of the longest length. > Where X = the number of times the longest common string length can be > found in the shortest string... it would save X substrings, 2X > positions X find in array, 2X assignments, and 2X comparisons + the > loop counter incrementation and comparison. > > > For relatively short strings, there would likely be no significant > difference in execution time, however, very long strings (what length > is 'long' I do not know), or repeated execution of this code in some > sort of loop the difference might be large. > > > Code start:: > > // ---------------------------------------------------- > // Method: Suffix_Tree - from Wiki examples > // - > // INPUT1: Text - to compare > // INPUT2: Text - to compare > // INPUT3: Pointer - array of longest matches > // INPUT4: Boolean - Do case sensitive comparisons (default = false) > // OUTPUT: Longint - length > // ---------------------------------------------------- > C_LONGINT($i;$size;$longest;$0) > C_TEXT($longer;$shorter;$temp) > C_POINTER($3;$Return_Array) > C_BOOLEAN($4;$Case_Sensitive;$Same) > $str1:=$1 > $str2:=$2 > $Return_Array:=$3 > > If (Count parameters>=4) > $Case_Sensitive:=$4 > End if > > If (Length($str1)>Length($str2)) > $longer:=$str1 > $shorter:=$str2 > Else > $longer:=$str2 > $shorter:=$str1 > End if > > $m:=Length($longer) // is the longer string > $n:=Length($shorter) > > ARRAY LONGINT($LCSub;$m;$n) > $longest:=0 > > For ($i;1;$m) > For ($j;1;$n) > > If ($Case_Sensitive) //doing case sensitive comparision - yes > $Same:=(Character code($longer[[$i]])=Character code($shorter[[$j]])) > Else //- no > $Same:=($longer[[$i]]=$shorter[[$j]]) > End if > > If ($Same) //are the current strings the same? - yes > > //depending on the comparision point add this character count > Case of > : ($i=1) > $LCSub{$i}{$j}:=$LCSub{$i}{$j}+1 > : ($j=1) > $LCSub{$i}{$j}:=$LCSub{$i-1}{$j}+1 > Else > $LCSub{$i}{$j}:=$LCSub{$i-1}{$j-1}+1 > End case > > //longer then the previous longest? > If ($longest<$LCSub{$i}{$j}) // > $longest:=$LCSub{$i}{$j} //update the longest substring found > End if > Else > $LCSub{$i}{$j}:=0 > End if > > End for > End for > > // search for matches of $longest length > ARRAY TEXT($aMatch;0) > $size:=Length($shorter)-$longest+1 > > //for every possible substring of the longest length > //try to find substring matches > For ($i;1;$size) > $temp:=Substring($shorter;$i;$longest) > > If ($Case_Sensitive) //case sensitive comparison - Yes > $Found:=Position($temp;$longer;*) > Else //-no > $Found:=Position($temp;$longer) > End if > > If ($Found>0) //note was >1 > > //hve we found this substrig before? > If (Find in array($aMatch;$temp)=-1) > APPEND TO ARRAY($aMatch;$temp) //- no > End if > End if > End for > > COPY ARRAY($aMatch;$Return_Array->) > $0:=$longest > > > On Sat, 24 Aug 2019 16:23:56 -0500, Keith Culotta via 4D_Tech wrote: >> That's great. It's fast, and reduces the number of searches. >> >> Keith - CDI >> >> >> // ---------------------------------------------------- >> // Method: Suffix_Tree - from Wiki examples >> // - >> // INPUT1: Text - to compare >> // INPUT2: Text - to compare >> // INPUT3: Pointer - array of longest matches >> // OUTPUT: Longint - length >> // ---------------------------------------------------- >> C_LONGINT($i;$size;$longest;$0) >> C_TEXT($longer;$shorter;$temp) >> $str1:=$1 >> $str2:=$2 >> >> If (Length($str1)>Length($str2)) >> $longer:=$str1 >> $shorter:=$str2 >> Else >> $longer:=$str2 >> $shorter:=$str1 >> End if >> >> $m:=Length($longer) // is the longer string >> $n:=Length($shorter) >> >> ARRAY LONGINT($LCSub;$m;$n) >> $longest:=0 >> >> For ($i;1;$m) >> For ($j;1;$n) >> If ($i=1) | ($j=1) >> $LCSub{$i}{$j}:=0 >> >> Else >> If ($longer[[$i-1]]=$shorter[[$j-1]]) >> $LCSub{$i}{$j}:=$LCSub{$i-1}{$j-1}+1 >> If ($longest<$LCSub{$i}{$j}) >> $longest:=$LCSub{$i}{$j} >> End if >> Else >> $LCSub{$i}{$j}:=0 >> End if >> End if >> >> End for >> End for >> >> // search for matches of $longest length >> ARRAY TEXT($aMatch;0) >> $size:=Length($shorter)-$longest+1 >> For ($i;1;$size) >> $temp:=Substring($shorter;$i;$longest) >> If (Position($temp;$longer;*)>1) >> If (Find in array($aMatch;$temp)=-1) >> APPEND TO ARRAY($aMatch;$temp) >> End if >> End if >> End for >> >> COPY ARRAY($aMatch;$3->) >> $0:=$longest >> >> >> >> >>> On Aug 24, 2019, at 2:41 PM, John DeSoi via 4D_Tech >>> <4d_tech@lists.4d.com> wrote: >>> >>> See >>> >>> https://en.wikipedia.org/wiki/Longest_common_substring_problem >>> >>> John DeSoi, Ph.D. >>> >>> >>>> On Aug 24, 2019, at 10:48 AM, Keith Culotta via 4D_Tech >>>> <4d_tech@lists.4d.com> wrote: >>>> >>>> This version stands alone, and runs a little more efficiently. >>>> It's not been tested in every way, but the results are >>>> encouraging. Interesting problem. I can't think of a way to do it >>>> without comparing every character combination. The new "Split >>>> string" command would speed part of this up if Collections could be >>>> used. >>> >> >> ********************************************************************** >> 4D Internet Users Group (4D iNUG) >> Archive: http://lists.4d.com/archives.html >> Options: https://lists.4d.com/mailman/options/4d_tech >> Unsub: mailto:4d_tech-unsubscr...@lists.4d.com >> ********************************************************************** > --------------- > Gas is for washing parts > Alcohol is for drinkin' > Nitromethane is for racing > ********************************************************************** > 4D Internet Users Group (4D iNUG) > Archive: http://lists.4d.com/archives.html > Options: https://lists.4d.com/mailman/options/4d_tech > Unsub: mailto:4d_tech-unsubscr...@lists.4d.com > ********************************************************************** --------------- Gas is for washing parts Alcohol is for drinkin' Nitromethane is for racing ********************************************************************** 4D Internet Users Group (4D iNUG) Archive: http://lists.4d.com/archives.html Options: https://lists.4d.com/mailman/options/4d_tech Unsub: mailto:4d_tech-unsubscr...@lists.4d.com **********************************************************************