Re: [R] Comparison of two very large strings
On Jul 12, 2010, at 6:46 PM, David Winsemius wrote:

> On Jul 12, 2010, at 6:03 PM, harsh yadav wrote:
>
>> Hi,
>>
>> I have a function in R that compares two very large strings for about
>> 1 million records. The strings are very long URLs like:
>>
>> http://query.nytimes.com/gst/sitesearch_selector.html?query=US+Visa+Laws&type=nyt&x=25&y=8
>>
>> [...]
>>
>> Here is the function that I am using to compare the two strings:
>>
>> stringCompare <- function(currentURL, currentId){
>>   j <- currentId - 1
>>   while(j >= 1){
>>     previousURL <- urlDataFrame[j, "url"]
>>     previousURLLength <- nchar(previousURL)
>>     # compare the smaller string with the bigger one
>>     if(nchar(currentURL) <= previousURLLength){
>>       matchPhrase <- substr(previousURL, 1, nchar(currentURL))
>>       if(matchPhrase == currentURL){
>>         return(TRUE)
>>       }
>>     }else{
>>       matchPhrase <- substr(currentURL, 1, previousURLLength)
>>       if(matchPhrase == previousURL){
>>         return(TRUE)
>>       }
>>     }
>>     j <- j - 1
>>   }
>>   return(FALSE)
>> }

Couldn't you just store the "url" vector after running it through nchar()
and then do the comparison in a vectorized manner?
test <- rd.txt('id url
1 "http://query.nytimes.com/gst/sitesearch_selector.html?query=US+Visa+Laws&type=nyt&x=25&y=8 "
2 "http://query.nytimes.com/search/sitesearch?query=US+Visa+Laws&srchst=cse "
3 "http://www.google.com/search?hl=en&q=us+student+visa+changes+9/11+washington+post&start=10&sa=N "
4 "http://www.google.com/search?hl=en&q=us+student+visa+changes+9/11+washington+post&start=10&sa=N "
5 "http://www.google.com/url?sa=U&start=11&q=http://app1.chinadaily.com.cn/star/2004/0610/fo4-1.html&ei=uUKwSe7XN9CCt "',
               stringsAsFactors = FALSE)
copyUrls <- test[, "url"]
sizeUrls <- nchar(copyUrls)
lengU <- length(sizeUrls)
sizidx <- pmax(sizeUrls[1:(lengU - 1)], sizeUrls[2:lengU])
substr(copyUrls[2:lengU], 1, sizidx) == substr(copyUrls[1:(lengU - 1)], 1, sizidx)
#[1] FALSE FALSE  TRUE FALSE

Let me hasten to admit that when I tried to fix what I thought was an
error in that program, I got the same result. It seemed as though I
should have been getting errors by choosing the maximum string length.
Changing the pmax to pmin did not alter the results ... to my
puzzlement ... until I further noticed that URLs #3 and #4 are of the
same length. When I extend the lengths, then only the version using
pmin works properly.

-- David.

>> Here, I compare the URL at a given row with all the previous URLs in
>> the data-frame. I compare the smaller of the two given URLs with the
>> larger one (up to the length of the smaller).
>>
>> When I run the above function for about 1 million records, execution
>> becomes really slow; it is otherwise fast if I remove the
>> string-comparison step. Any ideas how it can be implemented in a fast
>> and efficient way?
>>
>> Thanks and Regards,
>> Harsh Yadav

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
David Winsemius, MD
West Hartford, CT
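A side note on the pmin/pmax point above: the sketch below (using made-up example.com URLs, not data from the thread) truncates each adjacent pair to the shorter of the two lengths with pmin, so a genuine prefix pair compares equal; substring() is used because it explicitly recycles its first/last arguments across elements. With pmax, the longer truncation can never equal the shorter string, so once the lengths differ a true prefix pair tests FALSE.

```r
## Hypothetical URLs: #2 extends #1 (a prefix pair), #3 is unrelated;
## all three lengths differ, unlike URLs #3 and #4 in the thread.
urls <- c("http://example.com/a",
          "http://example.com/a?q=1",
          "http://example.com/b")
sizes <- nchar(urls)
n <- length(urls)
## truncate each adjacent pair to the SHORTER of the two lengths
k <- pmin(sizes[-n], sizes[-1])
substring(urls[-1], 1, k) == substring(urls[-n], 1, k)
#> [1]  TRUE FALSE
```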
Re: [R] Comparison of two very large strings
On Jul 12, 2010, at 6:03 PM, harsh yadav wrote:

> Hi,
>
> I have a function in R that compares two very large strings for about
> 1 million records. The strings are very long URLs like:
>
> http://query.nytimes.com/gst/sitesearch_selector.html?query=US+Visa+Laws&type=nyt&x=25&y=8
>
> [...]
>
> Here is the function that I am using to compare the two strings:
>
> stringCompare <- function(currentURL, currentId){
>   j <- currentId - 1
>   while(j >= 1){
>     previousURL <- urlDataFrame[j, "url"]
>     previousURLLength <- nchar(previousURL)
>     # compare the smaller string with the bigger one
>     if(nchar(currentURL) <= previousURLLength){
>       matchPhrase <- substr(previousURL, 1, nchar(currentURL))
>       if(matchPhrase == currentURL){
>         return(TRUE)
>       }
>     }else{
>       matchPhrase <- substr(currentURL, 1, previousURLLength)
>       if(matchPhrase == previousURL){
>         return(TRUE)
>       }
>     }
>     j <- j - 1
>   }
>   return(FALSE)
> }

Couldn't you just store the "url" vector after running it through nchar()
and then do the comparison in a vectorized manner?
test <- rd.txt('id url
1 "http://query.nytimes.com/gst/sitesearch_selector.html?query=US+Visa+Laws&type=nyt&x=25&y=8 "
2 "http://query.nytimes.com/search/sitesearch?query=US+Visa+Laws&srchst=cse "
3 "http://www.google.com/search?hl=en&q=us+student+visa+changes+9/11+washington+post&start=10&sa=N "
4 "http://www.google.com/search?hl=en&q=us+student+visa+changes+9/11+washington+post&start=10&sa=N "
5 "http://www.google.com/url?sa=U&start=11&q=http://app1.chinadaily.com.cn/star/2004/0610/fo4-1.html&ei=uUKwSe7XN9CCt "',
               stringsAsFactors = FALSE)
copyUrls <- test[, "url"]
sizeUrls <- nchar(copyUrls)
lengU <- length(sizeUrls)
sizidx <- pmax(sizeUrls[1:(lengU - 1)], sizeUrls[2:lengU])
substr(copyUrls[2:lengU], 1, sizidx) == substr(copyUrls[1:(lengU - 1)], 1, sizidx)
#[1] FALSE FALSE  TRUE FALSE

> Here, I compare the URL at a given row with all the previous URLs in
> the data-frame. I compare the smaller of the two given URLs with the
> larger one (up to the length of the smaller).
>
> When I run the above function for about 1 million records, execution
> becomes really slow; it is otherwise fast if I remove the
> string-comparison step. Any ideas how it can be implemented in a fast
> and efficient way?
>
> Thanks and Regards,
> Harsh Yadav

David Winsemius, MD
West Hartford, CT
[R] Comparison of two very large strings
Hi,

I have a function in R that compares two very large strings for about
1 million records. The strings are very long URLs like:

http://query.nytimes.com/gst/sitesearch_selector.html?query=US+Visa+Laws&type=nyt&x=25&y=8

or longer. The data-frame looks like:

id url
1  http://query.nytimes.com/gst/sitesearch_selector.html?query=US+Visa+Laws&type=nyt&x=25&y=8
2  http://query.nytimes.com/search/sitesearch?query=US+Visa+Laws&srchst=cse
3  http://www.google.com/search?hl=en&q=us+student+visa+changes+9/11+washington+post&start=10&sa=N
4  http://www.google.com/search?hl=en&q=us+student+visa+changes+9/11+washington+post&start=10&sa=N
5  http://www.google.com/url?sa=U&start=11&q=http://app1.chinadaily.com.cn/star/2004/0610/fo4-1.html&ei=uUKwSe7XN9CCt

and so on for about 1 million records. Here is the function that I am
using to compare the two strings:

stringCompare <- function(currentURL, currentId){
  j <- currentId - 1
  while(j >= 1){
    previousURL <- urlDataFrame[j, "url"]
    previousURLLength <- nchar(previousURL)
    # compare the smaller string with the bigger one
    if(nchar(currentURL) <= previousURLLength){
      matchPhrase <- substr(previousURL, 1, nchar(currentURL))
      if(matchPhrase == currentURL){
        return(TRUE)
      }
    }else{
      matchPhrase <- substr(currentURL, 1, previousURLLength)
      if(matchPhrase == previousURL){
        return(TRUE)
      }
    }
    j <- j - 1
  }
  return(FALSE)
}

Here, I compare the URL at a given row with all the previous URLs in the
data-frame. I compare the smaller of the two given URLs with the larger
one (up to the length of the smaller).

When I run the above function for about 1 million records, execution
becomes really slow; it is otherwise fast if I remove the
string-comparison step. Any ideas how it can be implemented in a fast
and efficient way?
Thanks and Regards,
Harsh Yadav
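One possible way to avoid the quadratic loop, offered as a sketch rather than anything proposed in the thread: sort the URLs byte-wise first. In lexicographic order, a string that is a prefix of any other string in the set is necessarily a prefix of its immediate successor, so a single vectorized startsWith() pass (base R >= 3.3.0) flags every such URL in O(n log n) overall. The urls vector below is a made-up stand-in for urlDataFrame$url; note that, unlike stringCompare(), this checks each URL against the whole set rather than only earlier rows.

```r
## Flag every URL that is a prefix of some other URL in the set.
urls <- c("http://a.com/x?y=1", "http://a.com/x", "http://b.com")
ord <- order(urls, method = "radix")  # byte-wise order, locale-independent
s   <- urls[ord]
n   <- length(s)
## in sorted order a prefix sits immediately before one of its
## extensions, so comparing each element with its successor suffices
flag <- logical(n)
flag[ord] <- c(startsWith(s[-1], s[-n]), FALSE)
flag   # TRUE at positions whose URL is a prefix of another URL
#> [1] FALSE  TRUE FALSE
```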