One way to make it faster is to avoid indexing the data frame inside
the loop.  If you convert the values you need to compare to matrices,
the comparisons will be faster.  Try the following to see if there is
a speedup:

a1 <- data.frame(id=c(1:6),
    cat=c('cat 1','cat 1','cat 2','cat 2','cat 2','cat 3'),
    st=c(1,7,30,40,59,91), en=c(5,25,39,55,70,120));
a2 <- data.frame(id=paste('probe',c(1:8)),
    cat=c('cat 1','cat 1','cat 2','cat 2','cat 2','cat 3','cat 3','cat 3'),
    st=c(1,9,20,38,53,70,80,95), en=c(6,15,36,43,58,75,85,98));


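# convert to matrices for faster access than data frames
# make sure that the 'cat' has the same numeric values in both matrices
# (as.character() works whether 'cat' is stored as a factor or as character;
# note the c(): unique() treats a second argument as 'incomparables')
uniqCAT <- unique(c(as.character(a1$cat), as.character(a2$cat)))
a1.m <- cbind(a1$st, a1$en, as.numeric(factor(a1$cat, levels=uniqCAT)))
a2.m <- cbind(a2$st, a2$en, as.numeric(factor(a2$cat, levels=uniqCAT)))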

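# now do the comparison
# columns: 1 = st, 2 = en, 3 = numeric 'cat'
# <= and >= match the boundary handling of the original loop
a1$coverage <- apply(a1.m, 1, function(.row){
    sum((.row[1] <= a2.m[,2]) & (.row[2] >= a2.m[,1]) & (.row[3] == a2.m[,3]))
})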
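
As a quick sanity check, here is a self-contained sketch that rebuilds
the sample data, runs the loop from the original question, and compares
its result with the matrix version.  The names loop.cov and apply.cov
are made up for this check, and I am assuming the inclusive comparisons
(<= and >=) from the original loop are what is wanted.

```r
# Sample data from the question
a1 <- data.frame(id = 1:6,
    cat = c('cat 1','cat 1','cat 2','cat 2','cat 2','cat 3'),
    st = c(1,7,30,40,59,91), en = c(5,25,39,55,70,120))
a2 <- data.frame(id = paste('probe', 1:8),
    cat = c('cat 1','cat 1','cat 2','cat 2','cat 2','cat 3','cat 3','cat 3'),
    st = c(1,9,20,38,53,70,80,95), en = c(6,15,36,43,58,75,85,98))

# Reference result: the loop from the question, counting with sum()
# (loop.cov is a hypothetical name used only for this check)
a1.cat <- as.character(a1$cat)
a2.cat <- as.character(a2$cat)
loop.cov <- integer(nrow(a1))
for (i in seq_len(nrow(a1))) {
    loop.cov[i] <- sum(a2$st <= a1$en[i] & a2$en >= a1$st[i] &
                       a2.cat == a1.cat[i])
}

# Matrix version: columns are st, en, numeric category code
uniqCAT <- unique(c(a1.cat, a2.cat))
a1.m <- cbind(a1$st, a1$en, as.numeric(factor(a1.cat, levels = uniqCAT)))
a2.m <- cbind(a2$st, a2$en, as.numeric(factor(a2.cat, levels = uniqCAT)))
apply.cov <- apply(a1.m, 1, function(.row) {
    sum((.row[1] <= a2.m[, 2]) & (.row[2] >= a2.m[, 1]) &
        (.row[3] == a2.m[, 3]))
})

# Both approaches should agree on the sample data
stopifnot(all(loop.cov == apply.cov))
print(apply.cov)
```

On the sample data this should print the same counts as the loop in
the question (1 1 2 2 0 1).  Wrapping each version in system.time() on
a larger subset of the real data would show whether the speedup is
worthwhile.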





On Sat, Aug 2, 2008 at 2:04 AM, Anh Tran <[EMAIL PROTECTED]> wrote:
> Hi all, I know this topic has come up multiple times, but I've never fully
> understood the apply() function.
>
> Anyway, I'm here asking for your help again to convert this loop to apply().
>
> I have 2 data frames with the following information: a1 holds the fragments
> that need to be covered, a2 holds the probes that cover a specific fragment.
>
> I need to count the number of probes that cover each given fragment (they
> need to have the same cat ID to be on the same fragment)
>
> a1<-data.frame(id=c(1:6),
>   cat=c('cat 1','cat 1','cat 2','cat 2','cat 2','cat 3'),
>   st=c(1,7,30,40,59,91), en=c(5,25,39,55,70,120));
> a2<-data.frame(id=paste('probe',c(1:8)),
>   cat=c('cat 1','cat 1','cat 2','cat 2','cat 2','cat 3','cat 3','cat 3'),
>   st=c(1,9,20,38,53,70,80,95), en=c(6,15,36,43,58,75,85,98));
> a1$coverage<-NULL;
>
> I came up with this for loop (basically, if a probe starts before the
> fragment ends, and ends after the fragment starts, it covers that fragment)
>
> for (i in 1:length(a1$id))
> {
> a1$coverage[i]<-length(a2[a2$st<=a1$en[i]&a2$en>=a1$st[i]&a2$cat==a1$cat[i],]$id);
> }
>
>> a1$coverage
> [1] 1 1 2 2 0 1
>
>
> This loop runs awfully slowly when I have 200,000 probes and 30,000
> fragments. Is there any way I can speed this up with apply()?
>
> This is the time for my for loop to scan through the first 20 records of my
> dataset:
>   user  system elapsed
>  2.264   0.501   2.770
>
> I think there is room for improvement here. Any idea?
>
> Thanks
> --
> Regards,
> Anh Tran
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?
