Ok, that makes senseIn my current use case I think I'll be able to filter out first the elements that will miss, so this behaviour is not triggered.
But it's good to know this happens so I can try to avoid it in the future. Thanks. Bernat *Bernat Gel Moreno* Bioinformatician Hereditary Cancer Program Program of Predictive and Personalized Medicine of Cancer (PMPPC) Germans Trias i Pujol Research Institute (IGTP) Campus Can Ruti Carretera de Can Ruti, Camí de les Escoles s/n 08916 Badalona, Barcelona, Spain Tel: (+34) 93 554 3068 Fax: (+34) 93 497 8654 08916 Badalona, Barcelona, Spain b...@igtp.cat <mailto:b...@igtp.cat> www.germanstrias.org <http://www.germanstrias.org/> <http://www.germanstrias.org/> El 06/30/2017 a las 03:20 PM, Michael Lawrence escribió:
The reason it's faster when shuffled vs. all that end is that when a miss happens R compares the string to all strings before it in the subscript. So it's a lot worse to have a miss towards the end. As Martin wrote, there are basically two possible improvements that are somewhat complementary: 1) Tell stringSubscript() that it is not replacing so there is no need to do that scan. This would require passing an argument down the call stack. 2) Do a self match on the subscript like in Martin's patch, although it should probably be done lazily on the first miss. Michael On Fri, Jun 30, 2017 at 3:32 AM, Bernat Gel <b...@igtp.cat> wrote:Ok, so it seems more like a bug somewhere than something I falied to understand, then. One of the surprises for me is that shuffling the data so the misses do not happen one after the other seems to solve the issue... Thanks, Bernat *Bernat Gel Moreno* Bioinformatician Hereditary Cancer Program Program of Predictive and Personalized Medicine of Cancer (PMPPC) Germans Trias i Pujol Research Institute (IGTP) Campus Can Ruti Carretera de Can Ruti, Camí de les Escoles s/n 08916 Badalona, Barcelona, Spain Tel: (+34) 93 554 3068 Fax: (+34) 93 497 8654 08916 Badalona, Barcelona, Spain b...@igtp.cat <mailto:b...@igtp.cat> www.germanstrias.org <http://www.germanstrias.org/> <http://www.germanstrias.org/> El 06/30/2017 a las 11:21 AM, Hervé Pagès escribió:Hi Bernat, Michael, FWIW I reported this issue on R-devel a couple of times. Last time was in 2013: https://stat.ethz.ch/pipermail/r-devel/2013-May/066616.html Cheers, H. On 06/29/2017 11:58 PM, Bernat Gel wrote:Yes, that would explain part of the situation. But example cc5 shows that hash misses would account only for part of the time. Thanks for taking a look into it Bernat *Bernat Gel Moreno* Bioinformatician Hereditary Cancer Program Program of Predictive and Personalized Medicine of Cancer (PMPPC) Germans Trias i Pujol Research Institute (IGTP) Campus Can Ruti Carretera de Can Ruti, Camí de les Escoles s/n 08916 Badalona, Barcelona, Spain Tel: (+34) 93 554 3068 Fax: (+34) 93 497 8654 08916 Badalona, Barcelona, Spain b...@igtp.cat <mailto:b...@igtp.cat> www.germanstrias.org <https://urldefense.proofpoint.com/v2/url?u=http-3A__www.germanstrias.org_&d=DwIGaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=J5Gs0N5MH_g9sSCZ6jNoZm_Dkc0EcHLbOVPcNwdqZ_4&s=xNWXpfkTzxBoF_c0HoPoyQ0c3v6DA9_xY2WLtwleFlA&e= > <https://urldefense.proofpoint.com/v2/url?u=http-3A__www.germanstrias.org_&d=DwIGaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=J5Gs0N5MH_g9sSCZ6jNoZm_Dkc0EcHLbOVPcNwdqZ_4&s=xNWXpfkTzxBoF_c0HoPoyQ0c3v6DA9_xY2WLtwleFlA&e= > El 06/29/2017 a las 08:48 PM, Michael Lawrence escribió:Preliminary analysis suggests that this is due to hash misses. When that happens, R ends up doing costly string comparisons that are on the order of n^2 where 'n' is the length of the subscript. Looking into it. On Thu, Jun 29, 2017 at 10:43 AM, Bernat Gel <b...@igtp.cat> wrote:Hi all, This is not strictly a Bioconductor question, but I hope some of the experts here can help me understand what's going on with a performance issue I've found working on a package. It has to do with selecting elements from a named vector. If we have a vector with the names of the chromosomes and their order chrs <- setNames(1:24, paste0("chr", c(1:22, "X", "Y"))) chrs chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 chr18 chr19 chr20 chr21 chr22 chrX chrY 18 19 20 21 22 23 24 And we have a second vector of chromosomes (in this case, the chromosomes from SNP-array probes) And we want to use the second vector to select from the first one by name cc <- c(rep("chr17", 19891), rep("chr18", 21353), rep("chr19", 14726), rep("chr20", 18135), rep("chr21", 10068), rep("chr22", 10252), rep("chrX", 17498), rep("chrY", 1296)) print(system.time(replicate(10, chrs[cc]))) user system elapsed 0.136 0.004 0.141 It's fast. However, if I get the wrong names for the last two chromosomes (chr23 and chr24 instead of chrX and chrY) cc2 <- c(rep("chr17", 19891), rep("chr18", 21353), rep("chr19", 14726), rep("chr20", 18135), rep("chr21", 10068), rep("chr22", 10252), rep("chr23", 17498), rep("chr24", 1296)) print(system.time(replicate(10, chrs[cc2]))) user system elapsed 144.672 0.012 144.675 It is MUCH slower. (1000x) BUT, if I shuffle the elements in the second vector cc3 <- sample(cc2, length(cc), replace = FALSE) print(system.time(replicate(10, chrs[cc3]))) user system elapsed 0.096 0.004 0.102 It's fast again!!! The elapsed time is related to the number of elements BEFORE the failing names, cc4 <- c(rep("chr22", 10252), rep("chr23", 17498), rep("chr24", 1296)) print(system.time(replicate(10, chrs[cc4]))) user system elapsed 17.332 0.004 17.336 cc5 <- c(rep("chr23", 17498), rep("chr24", 1296)) print(system.time(replicate(10, chrs[cc5]))) user system elapsed 1.872 0.000 1.901 so my guess is that it might come from moving around the vector in memory for each "failed" selection or something similar... Is it correct? Is there anything I'm missing? Thanks a lot Bernat -- *Bernat Gel Moreno* Bioinformatician Hereditary Cancer Program Program of Predictive and Personalized Medicine of Cancer (PMPPC) Germans Trias i Pujol Research Institute (IGTP) Campus Can Ruti Carretera de Can Ruti, Camí de les Escoles s/n 08916 Badalona, Barcelona, Spain Tel: (+34) 93 554 3068 Fax: (+34) 93 497 8654 08916 Badalona, Barcelona, Spain b...@igtp.cat <mailto:b...@igtp.cat> www.germanstrias.org <https://urldefense.proofpoint.com/v2/url?u=http-3A__www.germanstrias.org_&d=DwIGaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=J5Gs0N5MH_g9sSCZ6jNoZm_Dkc0EcHLbOVPcNwdqZ_4&s=xNWXpfkTzxBoF_c0HoPoyQ0c3v6DA9_xY2WLtwleFlA&e= <https://urldefense.proofpoint.com/v2/url?u=http-3A__www.germanstrias.org_&d=DwIGaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=J5Gs0N5MH_g9sSCZ6jNoZm_Dkc0EcHLbOVPcNwdqZ_4&s=xNWXpfkTzxBoF_c0HoPoyQ0c3v6DA9_xY2WLtwleFlA&e= _______________________________________________ Bioc-devel@r-project.org mailing list https://urldefense.proofpoint.com/v2/url?u=https-3A__stat.ethz.ch_mailman_listinfo_bioc-2Ddevel&d=DwIGaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=J5Gs0N5MH_g9sSCZ6jNoZm_Dkc0EcHLbOVPcNwdqZ_4&s=4AkjVXY9i8VhAZjQ5gpQD1gtNh2arVzMoNoadhtUUbY&e=_______________________________________________ Bioc-devel@r-project.org mailing list https://urldefense.proofpoint.com/v2/url?u=https-3A__stat.ethz.ch_mailman_listinfo_bioc-2Ddevel&d=DwIGaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=J5Gs0N5MH_g9sSCZ6jNoZm_Dkc0EcHLbOVPcNwdqZ_4&s=4AkjVXY9i8VhAZjQ5gpQD1gtNh2arVzMoNoadhtUUbY&e=_______________________________________________ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
_______________________________________________ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel