Typo corrected below.

On Sat, 13 Mar 2010, Charles C. Berry wrote:

On Sat, 13 Mar 2010, Adrian Johnson wrote:

 Hi:

 I have two large files (over 300K lines).

 file 1:

 Name    X
 UK       199
 UK       230
 UK       139
 ......
 UAE    194
 UAE     94




 File 2:

 Name   X    Y
 UK    140   180
 UK    195    240
 UK    304    340
 ....


 I want to take each X of File 1 and check whether it falls in the range
 X to Y of File 2, and print only those lines of File 1 that are in range
 of File 2 X and Y

Probably, I'd use findOverlaps() in the IRanges BioConductor package.

If you want to do the UK search apart from the UAE search and so on, the use of RangedData objects provided by IRanges is a nice, clean way to go.

Something like:

library(IRanges)

file1 <- read.table("File1", header=TRUE)
file2 <- read.table("File2", header=TRUE)

file1.rl <- RangedData( IRanges(start=file1$X, width=1), space = Name )
file2.rl <- RangedData( IRanges(start=file2$X, end=file2$Y),
                         space = Name )


Correct the above to:

file1.rl <- RangedData( IRanges(start=file1$X, width=1),
                        space = file1$Name )
file2.rl <- RangedData( IRanges(start=file2$X, end=file2$Y),
                         space = file2$Name )

Chuck


find.1.in.2 <- as.matrix( findOverlaps( file1.rl , file2.rl ) )

new.rl <- cbind( file1.rl[ find.1.in.2[,1], ],
                         file2X = start(file2.rl)[ find.1.in.2[,2] ],
                         file2Y = end(file2.rl)[ find.1.in.2[,2] ])

find.1.in.2 will be a matrix with one row for every match. The first column will be the index of the row in file1.rl and the second that of file2.rl.

new.rl will have one row for each match.

The order of the rows in the RangedData objects may not match the original data frames, so beware.

For 300K rows, this would run pretty fast, I think.

(caveat: This is all untested code.)

Otherwise, without the IRanges package something like


gt.x <- findInterval( file1$X, file2$X )
gt.y <- findInterval( file1$X, file2$Y )

is.in.interval <- gt.x == gt.y + 1

will work iff the intervals defined in file2 do not overlap one another and file2 is sorted by X (findInterval() requires sorted break points).
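
A small worked example may make the findInterval() logic concrete. This is only a sketch on made-up data: the vectors x, X and Y below are hypothetical stand-ins for file1$X, file2$X and file2$Y, and the intervals are assumed to be sorted and non-overlapping. Note that a value exactly equal to Y counts as outside the interval here.

```r
# Hypothetical toy data: three sorted, non-overlapping intervals [X, Y)
X <- c(140, 195, 304)
Y <- c(180, 240, 340)

# Made-up query values standing in for file1$X
x <- c(139, 150, 180, 199, 230, 400)

gt.x <- findInterval(x, X)  # number of interval starts <= x
gt.y <- findInterval(x, Y)  # number of interval ends   <= x

# x lies inside an interval when exactly one more start than end has been passed
is.in.interval <- gt.x == gt.y + 1

x[is.in.interval]
# 150 199 230
```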

If you need to keep 'Name's separate, rolling this into mapply() would be needed.
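
One way the per-Name version might look is sketched below. It uses base R only, runs on made-up data frames shaped like the two files, and the helper name in.range is hypothetical. It assumes that, within each Name, the file2 intervals do not overlap.

```r
# Hypothetical toy data standing in for the two files
file1 <- data.frame(Name = c("UK", "UK", "UK", "UAE", "UAE"),
                    X    = c(199, 230, 139, 194, 94))
file2 <- data.frame(Name = c("UK", "UK", "UK", "UAE"),
                    X    = c(140, 195, 304, 90),
                    Y    = c(180, 240, 340, 100))

# findInterval() needs increasing break points, so sort file2 within each Name
file2 <- file2[order(file2$Name, file2$X), ]

# Split both data frames by Name, keeping only Names present in both
common   <- intersect(as.character(file1$Name), as.character(file2$Name))
f1.split <- split(file1, file1$Name)[common]
f2.split <- split(file2, file2$Name)[common]

# Hypothetical helper: rows of f1 whose X falls in some [X, Y) of f2
in.range <- function(f1, f2) {
  f1[findInterval(f1$X, f2$X) == findInterval(f1$X, f2$Y) + 1, ]
}

# Apply the helper to matching pairs of pieces, then stack the results
hits <- do.call(rbind, mapply(in.range, f1.split, f2.split, SIMPLIFY = FALSE))
hits
```

Names that occur in only one of the files are dropped by the intersect() step, which seems to be what the original question wants.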

HTH,

Chuck



 How can this be done in R?

 thanks
 Adrian

 ______________________________________________
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.


Charles C. Berry                            (858) 534-2098
Dept of Family/Preventive Medicine
E mailto:cbe...@tajo.ucsd.edu               UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901




