The reason it's faster when shuffled vs. all that end is that when a miss happens R compares the string to all strings before it in the subscript. So it's a lot worse to have a miss towards the end.
As Martin wrote, there are basically two possible improvements that are somewhat complementary: 1) Tell stringSubscript() that it is not replacing so there is no need to do that scan. This would require passing an argument down the call stack. 2) Do a self match on the subscript like in Martin's patch, although it should probably be done lazily on the first miss. Michael On Fri, Jun 30, 2017 at 3:32 AM, Bernat Gel <b...@igtp.cat> wrote: > Ok, so it seems more like a bug somewhere than something I falied to > understand, then. > > One of the surprises for me is that shuffling the data so the misses do not > happen one after the other seems to solve the issue... > > Thanks, > > Bernat > > *Bernat Gel Moreno* > Bioinformatician > > Hereditary Cancer Program > Program of Predictive and Personalized Medicine of Cancer (PMPPC) > Germans Trias i Pujol Research Institute (IGTP) > > Campus Can Ruti > Carretera de Can Ruti, Camí de les Escoles s/n > 08916 Badalona, Barcelona, Spain > > Tel: (+34) 93 554 3068 > Fax: (+34) 93 497 8654 > 08916 Badalona, Barcelona, Spain > b...@igtp.cat <mailto:b...@igtp.cat> > www.germanstrias.org <http://www.germanstrias.org/> > > <http://www.germanstrias.org/> > > > > > > > > > El 06/30/2017 a las 11:21 AM, Hervé Pagès escribió: >> >> Hi Bernat, Michael, >> >> FWIW I reported this issue on R-devel a couple of times. Last time was >> in 2013: >> >> https://stat.ethz.ch/pipermail/r-devel/2013-May/066616.html >> >> Cheers, >> H. >> >> On 06/29/2017 11:58 PM, Bernat Gel wrote: >>> >>> Yes, that would explain part of the situation. But example cc5 shows >>> that hash misses would account only for part of the time. >>> >>> Thanks for taking a look into it >>> >>> Bernat >>> >>> *Bernat Gel Moreno* >>> Bioinformatician >>> >>> Hereditary Cancer Program >>> Program of Predictive and Personalized Medicine of Cancer (PMPPC) >>> Germans Trias i Pujol Research Institute (IGTP) >>> >>> Campus Can Ruti >>> Carretera de Can Ruti, Camí de les Escoles s/n >>> 08916 Badalona, Barcelona, Spain >>> >>> Tel: (+34) 93 554 3068 >>> Fax: (+34) 93 497 8654 >>> 08916 Badalona, Barcelona, Spain >>> b...@igtp.cat <mailto:b...@igtp.cat> >>> www.germanstrias.org >>> >>> <https://urldefense.proofpoint.com/v2/url?u=http-3A__www.germanstrias.org_&d=DwIGaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=J5Gs0N5MH_g9sSCZ6jNoZm_Dkc0EcHLbOVPcNwdqZ_4&s=xNWXpfkTzxBoF_c0HoPoyQ0c3v6DA9_xY2WLtwleFlA&e= >>> > >>> >>> >>> <https://urldefense.proofpoint.com/v2/url?u=http-3A__www.germanstrias.org_&d=DwIGaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=J5Gs0N5MH_g9sSCZ6jNoZm_Dkc0EcHLbOVPcNwdqZ_4&s=xNWXpfkTzxBoF_c0HoPoyQ0c3v6DA9_xY2WLtwleFlA&e= >>> > >>> >>> >>> >>> >>> >>> >>> >>> El 06/29/2017 a las 08:48 PM, Michael Lawrence escribió: >>>> >>>> Preliminary analysis suggests that this is due to hash misses. When >>>> that happens, R ends up doing costly string comparisons that are on >>>> the order of n^2 where 'n' is the length of the subscript. Looking >>>> into it. >>>> >>>> On Thu, Jun 29, 2017 at 10:43 AM, Bernat Gel <b...@igtp.cat> wrote: >>>>> >>>>> Hi all, >>>>> >>>>> This is not strictly a Bioconductor question, but I hope some of the >>>>> experts >>>>> here can help me understand what's going on with a performance issue >>>>> I've >>>>> found working on a package. >>>>> >>>>> It has to do with selecting elements from a named vector. >>>>> >>>>> If we have a vector with the names of the chromosomes and their order >>>>> >>>>> chrs <- setNames(1:24, paste0("chr", c(1:22, "X", "Y"))) >>>>> chrs >>>>> >>>>> chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 >>>>> chr12 chr13 >>>>> chr14 chr15 chr16 chr17 >>>>> 1 2 3 4 5 6 7 8 9 10 11 >>>>> 12 13 >>>>> 14 15 16 17 >>>>> chr18 chr19 chr20 chr21 chr22 chrX chrY >>>>> 18 19 20 21 22 23 24 >>>>> >>>>> And we have a second vector of chromosomes (in this case, the >>>>> chromosomes >>>>> from SNP-array probes) >>>>> And we want to use the second vector to select from the first one by >>>>> name >>>>> >>>>> cc <- c(rep("chr17", 19891), rep("chr18", 21353), rep("chr19", >>>>> 14726), >>>>> rep("chr20", 18135), rep("chr21", 10068), rep("chr22", 10252), >>>>> rep("chrX", 17498), rep("chrY", 1296)) >>>>> print(system.time(replicate(10, chrs[cc]))) >>>>> >>>>> user system elapsed >>>>> 0.136 0.004 0.141 >>>>> >>>>> It's fast. >>>>> >>>>> However, if I get the wrong names for the last two chromosomes (chr23 >>>>> and >>>>> chr24 instead of chrX and chrY) >>>>> >>>>> cc2 <- c(rep("chr17", 19891), rep("chr18", 21353), rep("chr19", >>>>> 14726), >>>>> rep("chr20", 18135), rep("chr21", 10068), rep("chr22", 10252), >>>>> rep("chr23", 17498), rep("chr24", 1296)) >>>>> print(system.time(replicate(10, chrs[cc2]))) >>>>> >>>>> user system elapsed >>>>> 144.672 0.012 144.675 >>>>> >>>>> >>>>> It is MUCH slower. (1000x) >>>>> >>>>> >>>>> BUT, if I shuffle the elements in the second vector >>>>> >>>>> cc3 <- sample(cc2, length(cc), replace = FALSE) >>>>> print(system.time(replicate(10, chrs[cc3]))) >>>>> >>>>> user system elapsed >>>>> 0.096 0.004 0.102 >>>>> >>>>> It's fast again!!! >>>>> >>>>> >>>>> >>>>> The elapsed time is related to the number of elements BEFORE the >>>>> failing >>>>> names, >>>>> >>>>> cc4 <- c(rep("chr22", 10252), rep("chr23", 17498), rep("chr24", >>>>> 1296)) >>>>> print(system.time(replicate(10, chrs[cc4]))) >>>>> >>>>> user system elapsed >>>>> 17.332 0.004 17.336 >>>>> >>>>> cc5 <- c(rep("chr23", 17498), rep("chr24", 1296)) >>>>> print(system.time(replicate(10, chrs[cc5]))) >>>>> >>>>> user system elapsed >>>>> 1.872 0.000 1.901 >>>>> >>>>> >>>>> so my guess is that it might come from moving around the vector in >>>>> memory >>>>> for each "failed" selection or something similar... >>>>> >>>>> Is it correct? Is there anything I'm missing? >>>>> >>>>> Thanks a lot >>>>> >>>>> Bernat >>>>> >>>>> -- >>>>> >>>>> *Bernat Gel Moreno* >>>>> Bioinformatician >>>>> >>>>> Hereditary Cancer Program >>>>> Program of Predictive and Personalized Medicine of Cancer (PMPPC) >>>>> Germans Trias i Pujol Research Institute (IGTP) >>>>> >>>>> Campus Can Ruti >>>>> Carretera de Can Ruti, Camí de les Escoles s/n >>>>> 08916 Badalona, Barcelona, Spain >>>>> >>>>> Tel: (+34) 93 554 3068 >>>>> Fax: (+34) 93 497 8654 >>>>> 08916 Badalona, Barcelona, Spain >>>>> b...@igtp.cat <mailto:b...@igtp.cat> >>>>> www.germanstrias.org >>>>> >>>>> <https://urldefense.proofpoint.com/v2/url?u=http-3A__www.germanstrias.org_&d=DwIGaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=J5Gs0N5MH_g9sSCZ6jNoZm_Dkc0EcHLbOVPcNwdqZ_4&s=xNWXpfkTzxBoF_c0HoPoyQ0c3v6DA9_xY2WLtwleFlA&e= >>>>> > >>>>> >>>>> >>>>> <https://urldefense.proofpoint.com/v2/url?u=http-3A__www.germanstrias.org_&d=DwIGaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=J5Gs0N5MH_g9sSCZ6jNoZm_Dkc0EcHLbOVPcNwdqZ_4&s=xNWXpfkTzxBoF_c0HoPoyQ0c3v6DA9_xY2WLtwleFlA&e= >>>>> > >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> _______________________________________________ >>>>> Bioc-devel@r-project.org mailing list >>>>> >>>>> https://urldefense.proofpoint.com/v2/url?u=https-3A__stat.ethz.ch_mailman_listinfo_bioc-2Ddevel&d=DwIGaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=J5Gs0N5MH_g9sSCZ6jNoZm_Dkc0EcHLbOVPcNwdqZ_4&s=4AkjVXY9i8VhAZjQ5gpQD1gtNh2arVzMoNoadhtUUbY&e= >>> >>> >>> >>> _______________________________________________ >>> Bioc-devel@r-project.org mailing list >>> >>> https://urldefense.proofpoint.com/v2/url?u=https-3A__stat.ethz.ch_mailman_listinfo_bioc-2Ddevel&d=DwIGaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=J5Gs0N5MH_g9sSCZ6jNoZm_Dkc0EcHLbOVPcNwdqZ_4&s=4AkjVXY9i8VhAZjQ5gpQD1gtNh2arVzMoNoadhtUUbY&e= >>> >> > > _______________________________________________ > Bioc-devel@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/bioc-devel _______________________________________________ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel