Hi Henrik, Your function is really faster, but I tested it to solve my problem. And I found it is too time consuming yet for me. This happens because I need to compare strings from two very large vectors. Bellow I wrote the syntax I have to use:
########################## Ls1<-length(s1) Ls2<-length(s2) for ( p in 1:ls1){ for (q in 1:ls2){ t1<-levenshteinFast(s1[p],s2[q]) if (t1<s12[p]){ s12[p]<-s2[q] n12[q]<-t1} } } ############################# If you want to have na idea, the size of my loop are: Ls1=42000 Ls2=70000 I think I will wait for months untill this program ends. Do you have any sugestion to increase the speed? The AGREP function is much faster, and I am using it, but I don't have a efficient comparation because AGREP(a,b) is not equal AGREP(b,a). Thanks, Marcos -----Mensagem original----- De: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Em nome de Henrik Bengtsson Enviada em: quinta-feira, 12 de fevereiro de 2004 13:12 Para: [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED] Assunto: RE: [R] AGREP > -----Original Message----- > From: [EMAIL PROTECTED] > [mailto:[EMAIL PROTECTED] On Behalf Of Gabor > Grothendieck > Sent: den 12 februari 2004 16:07 > To: [EMAIL PROTECTED]; [EMAIL PROTECTED] > Subject: Re: [R] AGREP > > One could shorten it slightly with these minor improvements. > Unfortunately, the key performance problem, the double loop > at the end which implements the dynamic programming calculation, > is still there. > > levenshtein<-function(s1,s2) { > # Make sure args are strings > a <- as.character(s1); an <- nchar(s1)+1 > b <- as.character(s2); bn <- nchar(s2)+1 > > # If one arg is an empty string, returns the length of the other > if (nchar(a)==0) return(nchar(b)) > if (nchar(b)==0) return(nchar(a)) > > # Initialize matrix for calculations > m <- matrix(0, nrow=an, ncol=bn) > m[1,] <- 1:bn > m[,1] <- 1:an > > # Cost calculation - line beginning (substr... is 0-1 cost f'n > for (i in 2:an) > for (j in 2:bn) > m[i,j] <- min( m[i-1,j]+1, m[i,j-1]+1, m[i-1,j-1]+ > (substr(a,i-1,i-1)!=substr(b,j-1,j-1)) ) > > # Returns the distance > m[an,bn]-1 > } > But a very expensive part of the code though is the substr() calls. Instead of doing this nchar(a)*nchar(b) times it's enough to do it nchar(a)+nchar(b). Even better is to use strsplit() first as below: levenshteinFast <- function(s1,s2) { # Make sure args are strings a <- as.character(s1) b <- as.character(s2) # Split strings into vectors a <- strsplit(a, split="")[[1]] b <- strsplit(b, split="")[[1]] # If one arg is an empty string, returns the length of the other an <- length(a) bn <- length(b) if (an==0) return(bn) if (bn==0) return(an) # Initialize matrix for calculations m <- matrix(0, nrow=an+1, ncol=bn+1) m[1,] <- 1:(bn+1) m[,1] <- 1:(an+1) # Cost calculation - line beginning (substr... is 0-1 cost f'n for (i in 2:(an+1)) { for (j in 2:(bn+1)) { m[i,j] <- min(m[i-1,j ] + 1, m[i ,j-1] + 1, m[i-1,j-1] + !identical(a[i-1],b[j-1])) } } # Returns the distance m[an+1,bn+1]-1; } # levenshteinFast() # Example N <- 500 s1 <- sample(letters, size=N, replace=TRUE) s1 <- paste(s1, collapse="") s2 <- sample(letters, size=N, replace=TRUE) s2 <- paste(s2, collapse="") t1 <- system.time(dist1 <- levenshtein(s1,s2)) print(c(t1,dist1)) # [1] 46.83 0.23 54.24 NA NA 443.00 t2 <- system.time(dist2 <- levenshteinFast(s1,s2)) print(c(t2,dist2)) # [1] 18.82 0.07 20.90 NA NA 443.00 /Henrik > --- > Date: Thu, 12 Feb 2004 11:01:19 -0300 > From: Rodrigo Abt <[EMAIL PROTECTED]> > To: 'Lista de Correo de R' <[EMAIL PROTECTED]> > Subject: Re: [R] AGREP > > > "Marcos Sanches" <[EMAIL PROTECTED]> writes: > > >Hi listers > > > >If you don't know what is the Edit Distance beetwen two > strings, I will > >try to explain, in fact it is very simple to understund but not to > >calculate througth a program. It is simplilly the minimum number of > >operations you must perform to transform string A on string B, by > >operations I mean delete letters, insert letters or > substitute letter. > > > >If you need to do few operations, it means string A is > almost the same > >as string B. Otherwise they are more differente as the number of > >operations increase. > > > >If you have a idea of how to make a function to calculate this > >distance, it would be very important for me. > > > >Thanks very much, > > > >Marcos > > I guess you're looking for Levenshtein distance, so try this: > > levenshtein<-function(s1,s2) { > # Make sure args are strings > a<-as.character(s1);an=nchar(s1)+1 > b<-as.character(s2);bn=nchar(s2)+1 > > # Initialize matrix for calculations > m<-matrix(nrow=an,ncol=bn) > > # If one arg is an empty string, returns the length of the other > if (nchar(a)==0) > return(nchar(b)) > if (nchar(b)==0) > return(nchar(a)) > > # Matrix initialization > for (i in 1:an) { > for (j in 1:bn) { > m[i,j]<-0 > m[1,j]<-j > } > m[i,1]<-i > } > > # Cost calculation > for (i in 2:an) { > for (j in 2:bn) { > if (substr(a,i-1,i-1)==substr(b,j-1,j-1)) > cost<-0 > else > cost<-1 > m[i,j]=min(m[i-1,j]+1,m[i,j-1]+1,m[i-1,j-1]+cost) > } > } > # Returns the distance > m[an,bn]-1 > } > > Examples: > > > levenshtein("Great","Grreat") <-- One addition > [1] 1 > > levenshtein("mahrcoz","Marcos") <-- One substitution,one deletion > and one substitution > [1] 3 > > Note that this function IS case sensitive. If you want to > apply this on vectors of strings you'll have to write the > corresponding wrapper function. > > Hope that helps, > > Rodrigo Abt B, > Analyst, > Dept. Economic Studies, > SII, Chile. > > ______________________________________________ > [EMAIL PROTECTED] mailing list > https://www.stat.math.ethz.ch/mailma> n/listinfo/r-help > PLEASE > do read the posting guide! > http://www.R-project.org/posting-guide.html ______________________________________________ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html ______________________________________________ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html