Hi Bernat, Michael,

FWIW I reported this issue on R-devel a couple of times. Last time was
in 2013:

  https://stat.ethz.ch/pipermail/r-devel/2013-May/066616.html

Cheers,
H.

On 06/29/2017 11:58 PM, Bernat Gel wrote:
Yes, that would explain part of the situation. But example cc5 shows
that hash misses would account only for part of the time.

Thanks for taking a look into it

Bernat

*Bernat Gel Moreno*
Bioinformatician

Hereditary Cancer Program
Program of Predictive and Personalized Medicine of Cancer (PMPPC)
Germans Trias i Pujol Research Institute (IGTP)

Campus Can Ruti
Carretera de Can Ruti, Camí de les Escoles s/n
08916 Badalona, Barcelona, Spain

Tel: (+34) 93 554 3068
Fax: (+34) 93 497 8654
b...@igtp.cat <mailto:b...@igtp.cat>
www.germanstrias.org
On 06/29/2017 at 08:48 PM, Michael Lawrence wrote:
Preliminary analysis suggests that this is due to hash misses. When
that happens, R ends up doing costly string comparisons that are on
the order of n^2 where 'n' is the length of the subscript. Looking
into it.
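
A possible workaround in the meantime, offered only as a sketch and not as a fix to R itself: resolve the names to positions with match() and subset by integer position, so the character-subscript code path is never taken. The sizes below are made up and kept small so the example runs quickly; absent names still come back as NA.

```r
# Named lookup table, as in the original report
chrs <- setNames(1:24, paste0("chr", c(1:22, "X", "Y")))

# Subscript with a long run of names that are absent from names(chrs)
# (illustrative sizes, smaller than in the report)
cc2 <- c(rep("chr22", 1000), rep("chr23", 2000), rep("chr24", 500))

# match() gives the position of each name in names(chrs), or NA if absent
pos <- match(cc2, names(chrs))

# Integer subscript: no name lookup happens inside `[`
res <- chrs[pos]
```

Whether this is faster on a given R version is something to measure rather than assume; it only sidesteps the lookup by name.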

On Thu, Jun 29, 2017 at 10:43 AM, Bernat Gel <b...@igtp.cat> wrote:
Hi all,

This is not strictly a Bioconductor question, but I hope some of the
experts here can help me understand what's going on with a performance
issue I've found working on a package.

It has to do with selecting elements from a named vector.

Suppose we have a named vector mapping chromosome names to their order:

     chrs <- setNames(1:24, paste0("chr", c(1:22, "X", "Y")))
     chrs

 chr1  chr2  chr3  chr4  chr5  chr6  chr7  chr8  chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17
    1     2     3     4     5     6     7     8     9    10    11    12    13    14    15    16    17
chr18 chr19 chr20 chr21 chr22  chrX  chrY
   18    19    20    21    22    23    24

And we have a second vector of chromosomes (in this case, the
chromosomes from SNP-array probes), which we want to use to select
from the first one by name:

     cc <- c(rep("chr17", 19891), rep("chr18", 21353), rep("chr19", 14726),
             rep("chr20", 18135), rep("chr21", 10068), rep("chr22", 10252),
             rep("chrX", 17498), rep("chrY", 1296))
     print(system.time(replicate(10, chrs[cc])))

user  system elapsed
0.136   0.004   0.141

It's fast.

However, if I get the wrong names for the last two chromosomes (chr23
and chr24 instead of chrX and chrY)

     cc2 <- c(rep("chr17", 19891), rep("chr18", 21353), rep("chr19", 14726),
              rep("chr20", 18135), rep("chr21", 10068), rep("chr22", 10252),
              rep("chr23", 17498), rep("chr24", 1296))
     print(system.time(replicate(10, chrs[cc2])))

user  system elapsed
144.672   0.012 144.675


It is MUCH slower (about 1000x).
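
Since the slowdown is triggered by names that do not exist in the lookup vector, one way to catch the problem early is to check for unknown names before subsetting. This is only a sketch with a tiny made-up subscript; the variable name `unknown` is not from the package.

```r
# Named lookup table, as in the example above
chrs <- setNames(1:24, paste0("chr", c(1:22, "X", "Y")))

# Small illustrative subscript containing two wrong names
cc2 <- c(rep("chr22", 5), rep("chr23", 3), rep("chr24", 2))

# Names used in the subscript that are missing from names(chrs)
unknown <- setdiff(unique(cc2), names(chrs))

if (length(unknown) > 0) {
  message("unknown chromosome names: ", paste(unknown, collapse = ", "))
}
```

The check works on the unique names only, so its cost does not depend on how many times each wrong name is repeated.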


BUT, if I shuffle the elements in the second vector

     cc3 <- sample(cc2, length(cc2), replace = FALSE)
     print(system.time(replicate(10, chrs[cc3])))

user  system elapsed
0.096   0.004   0.102

It's fast again!!!



The elapsed time is related to the number of elements BEFORE the
failing names:

     cc4 <- c(rep("chr22", 10252), rep("chr23", 17498), rep("chr24", 1296))
     print(system.time(replicate(10, chrs[cc4])))

user  system elapsed
17.332   0.004  17.336

     cc5 <- c(rep("chr23", 17498), rep("chr24", 1296))
     print(system.time(replicate(10, chrs[cc5])))

user  system elapsed
1.872   0.000   1.901
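
To see how the cost grows with the number of failed lookups, a rough measurement like the one below could help. It is a sketch under assumptions: `time_misses` is a hypothetical helper, the sizes are arbitrary and kept small, and the timings will depend on the machine and the R version.

```r
# Named lookup table, as in the examples above
chrs <- setNames(1:24, paste0("chr", c(1:22, "X", "Y")))

# Hypothetical helper: elapsed time of one subset whose subscript is
# n copies of a name that is absent from names(chrs)
time_misses <- function(n) {
  sub <- rep("chr23", n)
  unname(system.time(chrs[sub])["elapsed"])
}

sizes <- c(1000, 2000, 4000, 8000)
timings <- vapply(sizes, time_misses, numeric(1))

print(data.frame(n = sizes, elapsed = timings))
```

If the cost were quadratic in the number of misses, doubling n should roughly quadruple the elapsed time; at these small sizes the timer resolution may hide the trend, so larger sizes may be needed.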


So my guess is that it might come from moving the vector around in
memory for each "failed" selection, or something similar...

Is that correct? Is there anything I'm missing?

Thanks a lot

Bernat


--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
