Re: [R] Comparison of two very large strings

2010-07-12 Thread David Winsemius


On Jul 12, 2010, at 6:46 PM, David Winsemius wrote:



On Jul 12, 2010, at 6:03 PM, harsh yadav wrote:


Hi,

I have a function in R that compares two very large strings for about 1
million records.

The strings are very large URLs like:

http://query.nytimes.com/gst/sitesearch_selector.html?query=US+Visa+Laws&type=nyt&x=25&y=8...

or longer.

The data-frame looks like:

id  url
1   http://query.nytimes.com/gst/sitesearch_selector.html?query=US+Visa+Laws&type=nyt&x=25&y=8...
2   http://query.nytimes.com/search/sitesearch?query=US+Visa+Laws&srchst=cse
3   http://www.google.com/search?hl=en&q=us+student+visa+changes+9/11+washington+post&start=10&sa=N...
4   http://www.google.com/search?hl=en&q=us+student+visa+changes+9/11+washington+post&start=10&sa=N
5   http://www.google.com/url?sa=U&start=11&q=http://app1.chinadaily.com.cn/star/2004/0610/fo4-1.html&ei=uUKwSe7XN9CCt

and so on for about 1 million records.

Here is the function that I am using to compare the two strings:

stringCompare <- function(currentURL, currentId){
  j <- currentId - 1
  while(j >= 1){
    previousURL <- urlDataFrame[j, "url"]
    previousURLLength <- nchar(previousURL)
    # Compare the shorter URL against the same-length prefix of the longer one
    if(nchar(currentURL) <= previousURLLength){
      matchPhrase <- substr(previousURL, 1, nchar(currentURL))
      if(matchPhrase == currentURL){
        return(TRUE)
      }
    } else {
      matchPhrase <- substr(currentURL, 1, previousURLLength)
      if(matchPhrase == previousURL){
        return(TRUE)
      }
    }
    j <- j - 1
  }
  return(FALSE)
}


Couldn't you just store the "url" vector after running through nchar
and then do the comparison in a vectorized manner?


# rd.txt: a personal convenience wrapper, roughly
# read.table(textConnection(txt), header=TRUE, ...)
test <- rd.txt('id url
1 "http://query.nytimes.com/gst/sitesearch_selector.html?query=US+Visa+Laws&type=nyt&x=25&y=8"
2 "http://query.nytimes.com/search/sitesearch?query=US+Visa+Laws&srchst=cse"
3 "http://www.google.com/search?hl=en&q=us+student+visa+changes+9/11+washington+post&start=10&sa=N"
4 "http://www.google.com/search?hl=en&q=us+student+visa+changes+9/11+washington+post&start=10&sa=N"
5 "http://www.google.com/url?sa=U&start=11&q=http://app1.chinadaily.com.cn/star/2004/0610/fo4-1.html&ei=uUKwSe7XN9CCt"', stringsAsFactors=FALSE)


copyUrls <- test[, "url"]
sizeUrls <- nchar(copyUrls)
lengU <- length(sizeUrls)
sizidx <- pmax(sizeUrls[1:(lengU-1)], sizeUrls[2:lengU])
substr(copyUrls[2:lengU], 1, sizidx) == substr(copyUrls[1:(lengU-1)], 1, sizidx)


#[1] FALSE FALSE  TRUE FALSE



Let me hasten to admit that when I tried to fix what I thought was an
error in that program, I got the same result. It seemed as though I
should have been getting errors by choosing the maximum string
length. Changing the pmax to pmin did not alter the results ... to my
puzzlement ... until I further noticed that URLs #3 and #4 are of the
same length. When I make the lengths differ, only the version using
pmin works properly.
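To illustrate (a sketch with made-up URLs, not the thread's data): substr() silently stops at the end of the string, so comparing adjacent pairs up to the longer length (pmax) degenerates to an exact-equality test, while comparing up to the shorter length (pmin) detects the prefix relation.

```r
# Hypothetical URLs; url 4 extends url 3, so only the pmin version flags it.
urls <- c("http://example.com/a?q=1",
          "http://example.com/b?q=2",
          "http://example.com/search?q=visa",
          "http://example.com/search?q=visa&start=10",
          "http://example.com/url?sa=U")
n <- nchar(urls)
k <- length(urls)
shorter <- pmin(n[-k], n[-1])  # shorter length in each adjacent pair
longer  <- pmax(n[-k], n[-1])  # longer length in each adjacent pair
# substr() never reads past the end of a string, so the pmax comparison
# can only be TRUE when the two strings are identical:
with_pmin <- substr(urls[-1], 1, shorter) == substr(urls[-k], 1, shorter)
with_pmax <- substr(urls[-1], 1, longer)  == substr(urls[-k], 1, longer)
with_pmin  # FALSE FALSE  TRUE FALSE
with_pmax  # FALSE FALSE FALSE FALSE
```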



--
David.


Here, I compare the URL at a given row with all the previous URLs in the
data-frame. I compare the smaller of the two given URLs with the larger
one (up to the length of the smaller).

When I run the above function for about 1 million records, the execution
becomes really slow, which otherwise is fast if I remove the string
comparison step.

Any ideas how it can be implemented in a fast and efficient way?
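One way to avoid the quadratic all-previous-rows scan (a sketch of a different technique, not something proposed in the thread): sort the URLs first. In byte-wise lexicographic order, a string that is a prefix of any other string in the vector is necessarily a prefix of its immediate successor, so a single vectorized pass over adjacent sorted pairs flags every URL that prefixes another. This answers a slightly simpler question than the row-order-dependent loop above; the function and data are made up.

```r
# Flag every URL that is a prefix of some other URL in the vector
# (duplicates count). Sorting is O(n log n); the scan is one vectorized pass.
isPrefixOfAnother <- function(urls) {
  n <- length(urls)
  if (n < 2) return(logical(n))
  o <- order(urls, method = "radix")  # byte-wise C-locale order (R >= 3.3),
  s <- urls[o]                        # so a prefix sorts just before its extensions
  pre <- substr(s[-1], 1, nchar(s[-n])) == s[-n]  # does s[i] prefix s[i+1]?
  out <- logical(n)
  out[o] <- c(pre, FALSE)             # map flags back to the input order
  out
}

urls <- c("http://a.com/q?x=1",
          "http://a.com/q",    # prefix of the URL above
          "http://b.com/")
isPrefixOfAnother(urls)        # FALSE  TRUE FALSE
```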

Thanks and Regards,
Harsh Yadav


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


David Winsemius, MD
West Hartford, CT





[R] Comparison of two very large strings

2010-07-12 Thread harsh yadav
Hi,

I have a function in R that compares two very large strings for about 1
million records.

The strings are very large URLs like:

http://query.nytimes.com/gst/sitesearch_selector.html?query=US+Visa+Laws&type=nyt&x=25&y=8...

or longer.

The data-frame looks like:

id  url
1   http://query.nytimes.com/gst/sitesearch_selector.html?query=US+Visa+Laws&type=nyt&x=25&y=8...
2   http://query.nytimes.com/search/sitesearch?query=US+Visa+Laws&srchst=cse
3   http://www.google.com/search?hl=en&q=us+student+visa+changes+9/11+washington+post&start=10&sa=N...
4   http://www.google.com/search?hl=en&q=us+student+visa+changes+9/11+washington+post&start=10&sa=N
5   http://www.google.com/url?sa=U&start=11&q=http://app1.chinadaily.com.cn/star/2004/0610/fo4-1.html&ei=uUKwSe7XN9CCt

and so on for about 1 million records.

Here is the function that I am using to compare the two strings:

stringCompare <- function(currentURL, currentId){
  j <- currentId - 1
  while(j >= 1){
    previousURL <- urlDataFrame[j, "url"]
    previousURLLength <- nchar(previousURL)
    # Compare the shorter URL against the same-length prefix of the longer one
    if(nchar(currentURL) <= previousURLLength){
      matchPhrase <- substr(previousURL, 1, nchar(currentURL))
      if(matchPhrase == currentURL){
        return(TRUE)
      }
    } else {
      matchPhrase <- substr(currentURL, 1, previousURLLength)
      if(matchPhrase == previousURL){
        return(TRUE)
      }
    }
    j <- j - 1
  }
  return(FALSE)
}

Here, I compare the URL at a given row with all the previous URLs in the
data-frame. I compare the smaller of the two given URLs with the larger
one (up to the length of the smaller).

When I run the above function for about 1 million records, the execution
becomes really slow, which otherwise is fast if I remove the string
comparison step.

Any ideas how it can be implemented in a fast and efficient way?
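For what it's worth, part of the slowdown may not be the string comparison itself but the repeated urlDataFrame[j, "url"] data-frame subsetting inside the loop; extracting the column to a character vector once is much cheaper. A sketch of the same prefix test written that way (sample data made up for illustration):

```r
# Same prefix test as stringCompare, but indexing a plain character vector
# instead of subsetting the data frame on every loop iteration.
urlDataFrame <- data.frame(
  id  = 1:3,
  url = c("http://a.com/q", "http://b.com/", "http://a.com/q?x=1"),
  stringsAsFactors = FALSE)

urlVec <- urlDataFrame[["url"]]  # extract the column once
lens   <- nchar(urlVec)          # precompute all lengths once

stringCompareFast <- function(i) {
  for (j in seq_len(i - 1L)) {
    m <- min(lens[i], lens[j])   # compare up to the shorter length
    if (substr(urlVec[i], 1, m) == substr(urlVec[j], 1, m)) return(TRUE)
  }
  FALSE
}

sapply(seq_along(urlVec), stringCompareFast)  # FALSE FALSE  TRUE
```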

Thanks and Regards,
Harsh Yadav

