Hi Sean [hope you don't mind if I cc Bioc-devel]
On 04/15/2014 11:47 PM, Maintainer wrote:
Hi The Bioconductor Dev Team,

A new package called BSgenome.Hsapiens.NCBI.GRCh38 has been available for one week. But the single-sequence.fa.gz file in this package is too big. The sequences of all the chromosomes are put together. It is difficult to load. Why do you remake this package as BSgenome.Hsapiens.UCSC.hg19 do? In BSgenome.Hsapiens.UCSC.hg19, sequence files for each chromosome are separated.
Yeah I know... sigh!! ;-)

All the BSgenome data packages now use this new on-disk storage format, not just GRCh38. All the chromosomes are put together in a single FASTA file compressed in the RAZip format (the extension is rz, not gz). The file is indexed, which makes direct random access faster. So, for example, extracting only a few small portions of the genome with getSeq() is much faster than with the previous on-disk storage format, where the chromosomes were serialized into individual files. Some people have been asking for a while to have getSeq() do fast random access to the genome sequences. See for example this post from Michael in 2011:

  https://stat.ethz.ch/pipermail/bioc-devel/2011-May/002601.html

I'm aware that the new on-disk storage format slows down operations where you actually need to compute on the entire genome, which is typically done in a loop where one chromosome is loaded at a time. It's hard to beat the speed (and simplicity) of just loading the individual serialized DNAString objects in that case. This is unfortunate, and I was a little bit worried about it when I pushed the new BSgenome packages to the public repos, because I think this is a common use case.

Note that the switch to this new format happened in devel in January and was announced on the Bioc-devel mailing list:

  https://stat.ethz.ch/pipermail/bioc-devel/2014-January/005150.html

I was hoping people would test this and maybe start a discussion about whether this change was worth it or not.

The good news is that I think we can have the best of both worlds, i.e. fast direct random access and fast loading of the full sequences. I did some testing with the 2bit format and it looks very promising:

- Random access is much faster than with the current RAZip'ed FASTA format.

- Loading a full chromosome into memory is also very fast because there isn't the overhead of decompressing the data. It could even be faster than loading the chromosome from a serialized DNAString object, or maybe not, but at least it should be very close.

- The sequences take less space on disk, and the resulting BSgenome package will also be slightly smaller.

It's something that I was planning to work on early in this new devel cycle. Seems like I should start even earlier and maybe backport to BioC release?

Thanks,
H.
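For anyone wondering why a 2-bit encoding gives both compactness and cheap random access, here is a rough, self-contained sketch in Python. It is only a toy under stated assumptions: it handles plain A/C/G/T, it has no file header, no N blocks and no soft-mask blocks, so it is not the real 2bit file layout and not what BSgenome does internally. The names pack, fetch, CODE and BASE are made up for this illustration. The base-to-code mapping (T=0, C=1, A=2, G=3, first base in the high bits of each byte) follows the convention described in the UCSC 2bit documentation, as far as I recall it.

```python
# Toy 2-bit packing of DNA. NOT the real .2bit file layout: no header,
# no N blocks, no soft-mask blocks. All names below are hypothetical.
CODE = {"T": 0, "C": 1, "A": 2, "G": 3}
BASE = "TCAG"  # inverse of CODE: index -> base letter


def pack(seq: str) -> bytes:
    """Pack an A/C/G/T string into bytes: 4 bases per byte, first base in the high bits."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        shift = 6 - 2 * (i % 4)          # bit offset of base i within its byte
        out[i // 4] |= CODE[base] << shift
    return bytes(out)


def fetch(packed: bytes, start: int, end: int) -> str:
    """Return the bases in the 0-based half-open range [start, end).

    Only the bytes covering the requested range are read; locating them
    is plain offset arithmetic, with no decompression step.
    """
    chars = []
    for i in range(start, end):
        shift = 6 - 2 * (i % 4)
        chars.append(BASE[(packed[i // 4] >> shift) & 0b11])
    return "".join(chars)


genome = "ACGTTGCAACGTAC"        # 14 bases
packed = pack(genome)            # stored in 4 bytes instead of 14
window = fetch(packed, 3, 9)     # random access to a small region
```

The point of the sketch is that position i always lives in byte i // 4, so a small window and a whole chromosome are read the same way, and the packed data is a quarter of the size of one-byte-per-base text.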
Best,
Sean

2014-04-16
--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel