On Sat, 13 Mar 2010, Adrian Johnson wrote:
Hi:
I have two large files (over 300K lines each).
file 1:
Name X
UK 199
UK 230
UK 139
......
UAE 194
UAE 94
File 2:
Name X Y
UK 140 180
UK 195 240
UK 304 340
....
I want to select X of File 1 and search if it falls in range of X and
Y of File 2 and Print only those lines of File 1 that are in range of
File 2 X and Y
I'd probably use findOverlaps() from the IRanges Bioconductor package.
If you want to do the UK search apart from the UAE search and so on, the
RangedData objects provided by IRanges are a nice, clean way to go.
Something like:
library(IRanges)
file1 <- read.table("File1", header=TRUE)
file2 <- read.table("File2", header=TRUE)
# one width-1 range per X in file1, grouped ("space") by Name
file1.rl <- RangedData( IRanges(start=file1$X, width=1),
                        space = file1$Name )
# file2 ranges run from X to Y, so Y is the end, not the width
file2.rl <- RangedData( IRanges(start=file2$X, end=file2$Y),
                        space = file2$Name )
# one row per (file1 row, file2 row) overlap
find.1.in.2 <- as.matrix( findOverlaps( file1.rl , file2.rl ) )
new.rl <- cbind( file1.rl[ find.1.in.2[,1], ],
                 file2X = start(file2.rl)[ find.1.in.2[,2] ],
                 file2Y = end(file2.rl)[ find.1.in.2[,2] ])
find.1.in.2 will be a matrix with one row for every match. The first
column will be the index of the row in file1.rl and the second that of
file2.rl.
new.rl will have one row for each match.
The order of the rows in the RangedData objects may not match the original
data frames, so beware.
For 300K rows, this would run pretty fast, I think.
(caveat: This is all untested code.)
Otherwise, without the IRanges package something like
gt.x <- findInterval( file1$X, file2$X )
gt.y <- findInterval( file1$X, file2$Y )
is.in.interval <- gt.x == gt.y + 1
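will work iff the intervals defined in file2 do not overlap one another
and file2 is sorted by X (findInterval() needs a non-decreasing vector).
Note that an X exactly equal to an interval's Y counts as outside it.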
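To see why the comparison works, here is a small sketch on the UK rows quoted in the question. The variable names x, starts and ends are mine, standing in for file1$X, file2$X and file2$Y, and the brute-force check is only there to confirm the result on this toy data:

```r
# Toy data: the UK rows quoted in the original post.
x      <- c(199, 230, 139)          # stands in for file1$X
starts <- c(140, 195, 304)          # stands in for file2$X, sorted ascending
ends   <- c(180, 240, 340)          # stands in for file2$Y, same order

gt.x <- findInterval(x, starts)     # number of interval starts <= x
gt.y <- findInterval(x, ends)       # number of interval ends   <= x
# x is inside an interval when it has passed one more start than end
is.in.interval <- gt.x == gt.y + 1

# Brute-force cross-check, treating intervals as half-open [X, Y)
brute <- sapply(x, function(v) any(v >= starts & v < ends))
stopifnot(identical(is.in.interval, brute))

x[is.in.interval]                   # 199 230
```

Both 199 and 230 fall in the 195-240 interval, while 139 lies before the first start, so it is dropped.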
If you need to keep the 'Name's separate, you would need to roll this
into mapply().
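A rough sketch of what that per-Name version might look like, again on the sample rows from the question rather than the real files. The helper in_range() and the intermediate names are made up for illustration, and a Name with no intervals in file2 is simply treated as never in range:

```r
# Toy stand-ins for the two files (sample rows from the original post).
file1 <- data.frame(Name = c("UK", "UK", "UK", "UAE", "UAE"),
                    X = c(199, 230, 139, 194, 94),
                    stringsAsFactors = FALSE)
file2 <- data.frame(Name = c("UK", "UK", "UK"),
                    X = c(140, 195, 304),
                    Y = c(180, 240, 340),
                    stringsAsFactors = FALSE)

# Hypothetical helper: TRUE where x lies in one of the non-overlapping
# half-open intervals [starts, ends).
in_range <- function(x, starts, ends) {
  if (length(starts) == 0L) return(rep(FALSE, length(x)))
  o <- order(starts)                # findInterval() needs sorted input
  findInterval(x, starts[o]) == findInterval(x, ends[o]) + 1L
}

# Split both tables by Name and run the test within each Name.
x.by.name  <- split(file1$X, file1$Name)
f2.by.name <- split(file2[c("X", "Y")], file2$Name)
keep.by.name <- mapply(function(x, nm) {
  iv <- f2.by.name[[nm]]            # NULL if this Name is absent from file2
  in_range(x, iv$X, iv$Y)
}, x.by.name, names(x.by.name), SIMPLIFY = FALSE)

# Put the per-Name results back in the original row order of file1.
keep <- unsplit(keep.by.name, file1$Name)
file1[keep, ]
```

On these rows only the UK lines with X of 199 and 230 are kept. Unlike the RangedData route, this keeps file1 in its original row order.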
HTH,
Chuck
How can it be done in R?
thanks
Adrian
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Charles C. Berry (858) 534-2098
Dept of Family/Preventive Medicine
E mailto:cbe...@tajo.ucsd.edu UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/ La Jolla, San Diego 92093-0901