On Sat, 13 Mar 2010, Adrian Johnson wrote:

Hi:

I have two large files (over 300K lines).

file 1:

Name    X
UK       199
UK       230
UK       139
......
UAE    194
UAE     94




File 2:

Name   X    Y
UK    140   180
UK    195    240
UK    304    340
....


I want to select X of File 1 and search if it falls in range of X and
Y of File 2 and Print only those lines of File 1 that are in range of
File 2 X and Y

Probably, I'd use findOverlaps() in the IRanges Bioconductor package.

If you want to do the UK search apart from the UAE search and so on, the RangedData objects provided by IRanges are a nice, clean way to go.

Something like:

library(IRanges)

file1 <- read.table("File1", header=TRUE)
file2 <- read.table("File2", header=TRUE)

file1.rl <- RangedData( IRanges(start=file1$X, width=1), space = file1$Name )
file2.rl <- RangedData( IRanges(start=file2$X, end=file2$Y),
                        space = file2$Name )

find.1.in.2 <- as.matrix( findOverlaps( file1.rl , file2.rl ) )

new.rl <- cbind( file1.rl[ find.1.in.2[,1], ],
                        file2X = start(file2.rl)[ find.1.in.2[,2] ],
                        file2Y = end(file2.rl)[ find.1.in.2[,2] ])

find.1.in.2 will be a matrix with one row for every match. The first column will be the index of the row in file1.rl and the second that of file2.rl.

new.rl will have one row for each match.

The order of the rows in the RangedData objects may not match the original data frames, so beware.

For 300K rows, this would run pretty fast, I think.

(caveat: This is all untested code.)

Otherwise, without the IRanges package something like


gt.x <- findInterval( file1$X, file2$X )
gt.y <- findInterval( file1$X, file2$Y )

is.in.interval <- gt.x == gt.y + 1

will work iff the intervals defined in file2 are sorted by X (findInterval() requires sorted break points) and do not overlap one another.
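
As a rough sketch of the findInterval() idea on a toy example: the two data frames below are typed in by hand from the UK rows shown in the question (they stand in for the real files), and file2 is sorted by X first. Note that with this test a value exactly equal to Y counts as outside the interval.

```r
# Toy stand-ins for the two files (UK rows from the question only).
file1 <- data.frame(Name = "UK", X = c(199, 230, 139))
file2 <- data.frame(Name = "UK", X = c(140, 195, 304), Y = c(180, 240, 340))

# findInterval() needs sorted break points, so order file2 by X.
file2 <- file2[order(file2$X), ]

gt.x <- findInterval(file1$X, file2$X)  # number of interval starts <= X
gt.y <- findInterval(file1$X, file2$Y)  # number of interval ends   <= X

# A value is inside an interval when it has passed one more start than end.
is.in.interval <- gt.x == gt.y + 1

# Keep only the lines of file1 that fall in some interval of file2.
file1[is.in.interval, ]
```

Here 199 and 230 both fall in the 195-240 interval and are kept, while 139 lies before the first interval and is dropped.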

If you need to keep the 'Name's separate, you would need to roll this into mapply().
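
One possible way to do the per-'Name' version, again only a sketch: split both data frames by Name and run the same test group by group with mapply(). The helper in.range() and the UAE interval (90 to 100) are made up for illustration; the question does not show any UAE rows of file 2. Names that appear in only one file are dropped.

```r
# Hypothetical helper: TRUE where x lies in one of the intervals [lo, hi).
# Assumes the intervals within one group do not overlap.
in.range <- function(x, lo, hi) {
  ord <- order(lo)
  findInterval(x, lo[ord]) == findInterval(x, hi[ord]) + 1
}

# Toy data; the UAE row of file2 is invented for the example.
file1 <- data.frame(Name = c("UK", "UK", "UK", "UAE", "UAE"),
                    X    = c(199, 230, 139, 194, 94))
file2 <- data.frame(Name = c("UK", "UK", "UK", "UAE"),
                    X    = c(140, 195, 304, 90),
                    Y    = c(180, 240, 340, 100))

# One data frame per Name, restricted to names present in both files.
f1 <- split(file1, file1$Name)
f2 <- split(file2, file2$Name)
common <- intersect(names(f1), names(f2))

# Apply the interval test within each Name and keep the matching rows.
hits <- mapply(function(a, b) a[in.range(a$X, b$X, b$Y), , drop = FALSE],
               f1[common], f2[common], SIMPLIFY = FALSE)

result <- do.call(rbind, hits)
result
```

The rows of result come out grouped by Name rather than in the original file order.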

HTH,

Chuck



How can it be done in R?

thanks
Adrian

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Charles C. Berry                            (858) 534-2098
                                            Dept of Family/Preventive Medicine
E mailto:cbe...@tajo.ucsd.edu               UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901
