On 08/05/2014 07:46 AM, Michael Lawrence wrote:
Hi guys (Val, Martin, Herve):

Anyone have an itch for optimization? The writeVcf function is currently a
bottleneck in our WGS genotyping pipeline. For a typical 50 million row
gVCF, it was taking 2.25 hours prior to yesterday's improvements
(pasteCollapseRows) that brought it down to about 1 hour, which is still
too long by my standards (> 0). Only takes 3 minutes to call the genotypes
(and associated likelihoods etc) from the variant calls (using 80 cores and
450 GB RAM on one node), so the output is an issue. Profiling suggests that
the running time scales non-linearly in the number of rows.

Digging a little deeper, it seems to be something with R's string/memory
allocation. Below, pasting 1 million strings takes 6 seconds, but 10
million strings takes over 2 minutes. It gets way worse with 50 million. I
suspect it has something to do with R's string hash table.

set.seed(1000)
end <- sample(1e8, 1e6)
system.time(paste0("END", "=", end))
    user  system elapsed
   6.396   0.028   6.420

end <- sample(1e8, 1e7)
system.time(paste0("END", "=", end))
    user  system elapsed
134.714   0.352 134.978
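
The scaling claim can be checked at smaller sizes before committing to a multi-minute run. The following is a minimal sketch, not part of the original exchange: time_paste() is a hypothetical helper, and the sizes are arbitrary illustrative choices kept small so it finishes quickly. If the per-million figure grows with n, the running time is scaling worse than linearly.

```r
# Hypothetical helper (not from the original thread): time paste0() on n
# random integers and return the elapsed wall-clock seconds.
time_paste <- function(n) {
  set.seed(1000)
  end <- sample(1e8, n)
  unname(system.time(paste0("END", "=", end))["elapsed"])
}

# Illustrative sizes only; extend to 1e7 to reproduce the slow case above.
sizes <- c(1e4, 1e5, 1e6)
elapsed <- vapply(sizes, time_paste, numeric(1))

# Normalize to seconds per million strings so the rows are comparable.
scaling <- data.frame(n = sizes,
                      elapsed = elapsed,
                      per_million = elapsed / sizes * 1e6)
print(scaling)
```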

Indeed, even this takes a long time (in a fresh session):

set.seed(1000)
end <- sample(1e8, 1e6)
end <- sample(1e8, 1e7)
system.time(as.character(end))
    user  system elapsed
  57.224   0.156  57.366

My usual trick is to start R with larger initial heap settings, R --no-save --quiet --min-vsize=2048M --min-nsize=45M, which changes the example above from

> system.time(as.character(end))
   user  system elapsed
 82.835   0.343  83.195

to

> system.time(as.character(end))
   user  system elapsed
  9.245   0.169   9.424

but I think it's a one-time gain. What is the writeVcf command that you're running?

Martin


But running it a second time is faster (about what one would expect?):

system.time(levels <- as.character(end))
    user  system elapsed
  23.582   0.021  23.589

I did some simple profiling of R to find that the resizing of the string
hash table is not a significant component of the time. So maybe something
to do with the R heap/gc? No time right now to go deeper. But I know Martin
likes this sort of thing ;)
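
One way to test the heap/gc suspicion is to compare the total elapsed time with the time R reports spending in garbage collection. This is only a sketch under assumptions: it uses base R's gc.time(), a smaller vector than the one above so that it runs quickly, and it assumes GC timing is enabled in the session (the default). The names gc_before and gc_spent are illustrative.

```r
set.seed(1000)
end <- sample(1e8, 1e6)  # smaller than the 1e7 case above, for a quick run

# gc.time() returns cumulative GC time for the session as
# c(user, system, elapsed, child user, child system).
gc_before <- gc.time()
total <- system.time(chr <- as.character(end))
gc_spent <- gc.time() - gc_before

# Compare overall elapsed time with the elapsed time attributed to GC.
cat("total elapsed:", total[["elapsed"]], "\n")
cat("gc elapsed:   ", gc_spent[3], "\n")
```

If most of the elapsed time shows up as GC time, that points at the collector rather than the string hash table.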

Michael


_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel



--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793
