Preliminary analysis suggests that this is due to hash misses. When a lookup misses, R ends up doing costly string comparisons, on the order of n^2 in total, where n is the length of the subscript. I'm looking into it.
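In the meantime, one possible workaround (a sketch, not the eventual fix) is to resolve the names to integer positions once with match() and then subset by position. This assumes match() does not hit the same slow path on unmatched names; fast_lookup is a hypothetical helper name used only for illustration, not an existing function.

```r
# Hypothetical helper: look up elements of a named vector by name,
# resolving the names to integer positions once with match().
fast_lookup <- function(x, nms) {
  idx <- match(nms, names(x))  # NA where a name is not found in names(x)
  x[idx]                       # NA positions yield NA values
}

chrs <- setNames(1:24, paste0("chr", c(1:22, "X", "Y")))

# Same shape as the slow case: valid names followed by names that do not exist.
cc2 <- c(rep("chr22", 10252), rep("chr23", 17498), rep("chr24", 1296))

res <- fast_lookup(chrs, cc2)
table(is.na(res))  # the chr23/chr24 entries come back as NA
```

Unmatched names come back as NA, so checking anyNA(res) afterwards is a cheap way to catch wrong chromosome names such as chr23/chr24.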
On Thu, Jun 29, 2017 at 10:43 AM, Bernat Gel <b...@igtp.cat> wrote:
> Hi all,
>
> This is not strictly a Bioconductor question, but I hope some of the experts
> here can help me understand what's going on with a performance issue I've
> found working on a package.
>
> It has to do with selecting elements from a named vector.
>
> If we have a vector with the names of the chromosomes and their order
>
> chrs <- setNames(1:24, paste0("chr", c(1:22, "X", "Y")))
> chrs
>
>  chr1  chr2  chr3  chr4  chr5  chr6  chr7  chr8  chr9 chr10 chr11 chr12
>     1     2     3     4     5     6     7     8     9    10    11    12
> chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22  chrX  chrY
>    13    14    15    16    17    18    19    20    21    22    23    24
>
> And we have a second vector of chromosomes (in this case, the chromosomes
> from SNP-array probes)
> And we want to use the second vector to select from the first one by name
>
> cc <- c(rep("chr17", 19891), rep("chr18", 21353), rep("chr19", 14726),
>         rep("chr20", 18135), rep("chr21", 10068), rep("chr22", 10252),
>         rep("chrX", 17498), rep("chrY", 1296))
> print(system.time(replicate(10, chrs[cc])))
>
>    user  system elapsed
>   0.136   0.004   0.141
>
> It's fast.
>
> However, if I get the wrong names for the last two chromosomes (chr23 and
> chr24 instead of chrX and chrY)
>
> cc2 <- c(rep("chr17", 19891), rep("chr18", 21353), rep("chr19", 14726),
>          rep("chr20", 18135), rep("chr21", 10068), rep("chr22", 10252),
>          rep("chr23", 17498), rep("chr24", 1296))
> print(system.time(replicate(10, chrs[cc2])))
>
>    user  system elapsed
> 144.672   0.012 144.675
>
> It is MUCH slower. (1000x)
>
> BUT, if I shuffle the elements in the second vector
>
> cc3 <- sample(cc2, length(cc), replace = FALSE)
> print(system.time(replicate(10, chrs[cc3])))
>
>    user  system elapsed
>   0.096   0.004   0.102
>
> It's fast again!!!
>
> The elapsed time is related to the number of elements BEFORE the failing
> names,
>
> cc4 <- c(rep("chr22", 10252), rep("chr23", 17498), rep("chr24", 1296))
> print(system.time(replicate(10, chrs[cc4])))
>
>    user  system elapsed
>  17.332   0.004  17.336
>
> cc5 <- c(rep("chr23", 17498), rep("chr24", 1296))
> print(system.time(replicate(10, chrs[cc5])))
>
>    user  system elapsed
>   1.872   0.000   1.901
>
> so my guess is that it might come from moving around the vector in memory
> for each "failed" selection or something similar...
>
> Is it correct? Is there anything I'm missing?
>
> Thanks a lot
>
> Bernat
>
> --
>
> *Bernat Gel Moreno*
> Bioinformatician
>
> Hereditary Cancer Program
> Program of Predictive and Personalized Medicine of Cancer (PMPPC)
> Germans Trias i Pujol Research Institute (IGTP)
>
> Campus Can Ruti
> Carretera de Can Ruti, Camí de les Escoles s/n
> 08916 Badalona, Barcelona, Spain
>
> Tel: (+34) 93 554 3068
> Fax: (+34) 93 497 8654
> b...@igtp.cat
> www.germanstrias.org <http://www.germanstrias.org/>
>
> _______________________________________________
> Bioc-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel