Hi all,

This is not strictly a Bioconductor question, but I hope some of the experts here can help me understand what's going on with a performance issue I've found working on a package.

It has to do with selecting elements from a named vector.

Suppose we have a named vector mapping chromosome names to their order:

    chrs <- setNames(1:24, paste0("chr", c(1:22, "X", "Y")))
    chrs

 chr1  chr2  chr3  chr4  chr5  chr6  chr7  chr8  chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17
    1     2     3     4     5     6     7     8     9    10    11    12    13    14    15    16    17
chr18 chr19 chr20 chr21 chr22  chrX  chrY
   18    19    20    21    22    23    24

And we have a second vector of chromosomes (in this case, the chromosomes from SNP-array probes), and we want to use this second vector to select from the first one by name:

    cc <- c(rep("chr17", 19891), rep("chr18", 21353), rep("chr19", 14726),
        rep("chr20", 18135), rep("chr21", 10068), rep("chr22", 10252),
        rep("chrX", 17498), rep("chrY", 1296))
    print(system.time(replicate(10, chrs[cc])))

user  system elapsed
0.136   0.004   0.141

It's fast.

However, if I get the names of the last two chromosomes wrong (chr23 and chr24 instead of chrX and chrY):

    cc2 <- c(rep("chr17", 19891), rep("chr18", 21353), rep("chr19", 14726),
        rep("chr20", 18135), rep("chr21", 10068), rep("chr22", 10252),
        rep("chr23", 17498), rep("chr24", 1296))
    print(system.time(replicate(10, chrs[cc2])))

user  system elapsed
144.672   0.012 144.675


It is MUCH slower (roughly 1000x).


BUT, if I shuffle the elements in the second vector

    cc3 <- sample(cc2, length(cc2), replace = FALSE)
    print(system.time(replicate(10, chrs[cc3])))

user  system elapsed
0.096   0.004   0.102

It's fast again!!!
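
To see what the shuffle actually changes, here is a minimal sketch that finds the position of the first name that is not in names(chrs), for the ordered and the shuffled vector. The helper first_bad() and the seed are only for this illustration; they are not part of the package code.

```r
chrs <- setNames(1:24, paste0("chr", c(1:22, "X", "Y")))
cc2 <- c(rep("chr17", 19891), rep("chr18", 21353), rep("chr19", 14726),
         rep("chr20", 18135), rep("chr21", 10068), rep("chr22", 10252),
         rep("chr23", 17498), rep("chr24", 1296))

set.seed(1)          # arbitrary seed, only so the sketch is reproducible
cc3 <- sample(cc2)   # same elements as cc2, in random order

# Hypothetical helper: index of the first element with no match in names(chrs)
first_bad <- function(x) which(!(x %in% names(chrs)))[1]

first_bad(cc2)   # 94426: all 94425 valid names come before the first failure
first_bad(cc3)   # much earlier after shuffling
```

In the ordered vector every failing name has all the valid names in front of it, while in the shuffled one the first failing name shows up almost immediately.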



The elapsed time seems to be related to the number of elements BEFORE the failing names:

    cc4 <- c(rep("chr22", 10252), rep("chr23", 17498), rep("chr24", 1296))
    print(system.time(replicate(10, chrs[cc4])))

user  system elapsed
17.332   0.004  17.336

    cc5 <- c(rep("chr23", 17498), rep("chr24", 1296))
    print(system.time(replicate(10, chrs[cc5])))

user  system elapsed
1.872   0.000   1.901
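
The same comparison can be run more systematically. The sketch below times the lookup for a fixed number of failing names while varying how many valid names come before them. The helper time_lookup() and the sizes are made up for this illustration (kept small so it runs quickly), and the timings will depend on the machine and the R version.

```r
chrs <- setNames(1:24, paste0("chr", c(1:22, "X", "Y")))

# Hypothetical helper: elapsed time of chrs[x], where x has n_valid matching
# names ("chr22") followed by n_bad non-matching names ("chr23").
time_lookup <- function(n_valid, n_bad = 2000) {
    x <- c(rep("chr22", n_valid), rep("chr23", n_bad))
    unname(system.time(chrs[x])["elapsed"])
}

n_valid <- c(0, 2000, 4000, 8000)
timings <- vapply(n_valid, time_lookup, numeric(1))
print(data.frame(n_valid = n_valid, elapsed = timings))
```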


So my guess is that it might come from moving the vector around in memory for each "failed" selection, or something similar...

Is that correct? Is there anything I'm missing?
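
As a point of comparison, here is a minimal sketch of the same selection done in two steps: an explicit match() against names(chrs), then indexing by position. I'm only assuming this is a fair equivalent for the values (the names of the result are dropped before comparing), and the vector x is a smaller made-up version of cc4 so it runs quickly.

```r
chrs <- setNames(1:24, paste0("chr", c(1:22, "X", "Y")))

# Smaller stand-in for cc4: valid names first, then failing ones
x <- c(rep("chr22", 5000), rep("chr23", 3000), rep("chr24", 500))

by_name <- chrs[x]               # character subscript, as in the examples above
idx <- match(x, names(chrs))     # NA where the name is not found
by_pos <- chrs[idx]              # integer subscript

# Both approaches return the same values (NA for the failing names)
stopifnot(identical(unname(by_name), unname(by_pos)))

print(system.time(chrs[x]))
print(system.time(chrs[match(x, names(chrs))]))
```

If the two-step version stays fast with the failing names, that would suggest the time goes into the name matching of the character subscript rather than into building the result.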

Thanks a lot

Bernat

--

*Bernat Gel Moreno*
Bioinformatician

Hereditary Cancer Program
Program of Predictive and Personalized Medicine of Cancer (PMPPC)
Germans Trias i Pujol Research Institute (IGTP)

Campus Can Ruti
Carretera de Can Ruti, Camí de les Escoles s/n
08916 Badalona, Barcelona, Spain

Tel: (+34) 93 554 3068
Fax: (+34) 93 497 8654
b...@igtp.cat <mailto:b...@igtp.cat>
www.germanstrias.org <http://www.germanstrias.org/>

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
