Hi guys (Val, Martin, Herve):

Anyone have an itch for optimization? The writeVcf function is currently a bottleneck in our WGS genotyping pipeline. For a typical 50 million row gVCF, it was taking 2.25 hours prior to yesterday's improvements (pasteCollapseRows), which brought it down to about 1 hour. That is still too long by my standards (> 0). It only takes 3 minutes to call the genotypes (and associated likelihoods etc.) from the variant calls (using 80 cores and 450 GB RAM on one node), so writing the output is the issue. Profiling suggests that the running time scales non-linearly in the number of rows.
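For anyone who wants to check the scaling without running the full pipeline, here is a rough sketch. It assumes that paste0() on random integers is a fair stand-in for the string building inside writeVcf, which is an assumption and not something measured on writeVcf itself. The helper name time_paste() and the column names are made up for illustration. If the time per row grows with n, the scaling is worse than linear; the GC column gives a hint of how much of that is garbage collection.

```r
# Hypothetical helper (not part of VariantAnnotation): time paste0() on n
# random integers and record how much of the elapsed time was spent in GC.
time_paste <- function(n) {
  end <- sample(1e8, n)
  gc_before <- gc.time(on = TRUE)   # also makes sure GC timing is enabled
  elapsed <- system.time(paste0("END", "=", end))[["elapsed"]]
  gc_after <- gc.time(on = TRUE)
  c(n = n,
    elapsed = elapsed,
    gc_elapsed = (gc_after - gc_before)[[3]],  # element 3 is elapsed GC time
    per_row_us = 1e6 * elapsed / n)            # microseconds per string
}

set.seed(1000)
sizes <- c(1e4, 1e5, 1e6)           # small sizes so the check runs quickly
timings <- t(sapply(sizes, time_paste))
print(timings)
```

Larger values in sizes (1e7 and up) are where the slowdown showed up for me, but those take minutes rather than seconds.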
Digging a little deeper, it seems to be something with R's string/memory allocation. Below, pasting 1 million strings takes 6 seconds, but 10 million strings takes over 2 minutes. It gets way worse with 50 million. I suspect it has something to do with R's string hash table.

    set.seed(1000)
    end <- sample(1e8, 1e6)
    system.time(paste0("END", "=", end))
       user  system elapsed
      6.396   0.028   6.420

    end <- sample(1e8, 1e7)
    system.time(paste0("END", "=", end))
       user  system elapsed
    134.714   0.352 134.978

Indeed, even this takes a long time (in a fresh session):

    set.seed(1000)
    end <- sample(1e8, 1e6)
    end <- sample(1e8, 1e7)
    system.time(as.character(end))
       user  system elapsed
     57.224   0.156  57.366

But running it a second time is faster (about what one would expect?):

    system.time(levels <- as.character(end))
       user  system elapsed
     23.582   0.021  23.589

I did some simple profiling of R to find that the resizing of the string hash table is not a significant component of the time. So maybe something to do with the R heap/gc? No time right now to go deeper. But I know Martin likes this sort of thing ;)

Michael