On 5/28/19 07:57, Pariksheet Nanda wrote: > Hi Hervé, > > Indeed, an IRanges with 2^31 elements is 17.1 GB. > The reason I was interested in IRanges, was GRanges are needed to create > the BSgenome::BSgenomeViews. > More broadly, my use case is chopping up a large genome into a fixed > kmer size so that repetitive "unmappable" regions can be removed. > https://github.com/coregenomics/kmap > <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_coregenomics_kmap&d=DwMFaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=ypDrHBrsinoLiVwLABxbsNgXtWaBVZChcPk1j0Iocf8&s=ZFJpfYD2xGsRQNowAQ7KlQqnyhf48rrnSUgWJ0E7aQs&e=> > My interest in long vectors is to address issue #8 > https://github.com/coregenomics/kmap/issues/8 > <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_coregenomics_kmap_issues_8&d=DwMFaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=ypDrHBrsinoLiVwLABxbsNgXtWaBVZChcPk1j0Iocf8&s=HWlBSmaEWmAyMcQs3tQKVXUlebXBiKeO7zbn22r3A9M&e=> > > The workaround I've imagined so far is to have my kmap::kmerize function > return an iterator that creates GRanges less than length 2^31. > Using iterators doesn't even need any additional packages: they're > implemented in the BiocParallel bpiterator unit tests as returning a > function that keeps returning objects until it returns NULL. > > But looking at how much more efficient your GPos, etc functions are, > perhaps maybe BSgenomeViews requiring a GRanges is not as reasonable?
Good point. I opened an issue on GitHub for this: https://github.com/Bioconductor/BSgenome/issues/2 > I don't even know of a sane way to mock a BSgenome object for writing > tests. It's irritating to have to use actual small genomes for tests. Yeah, no user-level BSgenome constructor at the moment. Here is one you could use: BSgenome <- function(dna, circ_seqs=NA, genome=NA, organism=NA, common_name=NA, provider=NA, release_date=NA, release_name=NA, source_url=NA) { ## Some sanity checks. if (!is(dna, "DNAStringSet")) dna <- as(dna, "DNAStringSet") seqnames <- names(dna) if (is.null(seqnames)) stop("'dna' must have names") if (!is.character(circ_seqs)) circ_seqs <- as.character(circ_seqs) if (!identical(circ_seqs, NA_character_)) { if (anyNA(circ_seqs) || anyDuplicated(circ_seqs) || !all(circ_seqs %in% seqnames)) stop(wmsg("when not set to NA, 'circ_seqs' must ", "contain unique sequence names that are ", "present in 'names(dna)'")) } stopifnot(isSingleStringOrNA(genome), isSingleStringOrNA(organism), isSingleStringOrNA(common_name), isSingleStringOrNA(provider), isSingleStringOrNA(release_date), isSingleStringOrNA(release_name), isSingleStringOrNA(source_url)) ## Write the sequences to disk. seqs_dirpath <- tempfile(pattern="BSgenome_seqs_dir") dir.create(seqs_dirpath) twobit_filepath <- file.path(seqs_dirpath, "single_sequences.2bit") rtracklayer::export(dna, twobit_filepath) ## Create the BSgenome object. BSgenome:::BSgenome(organism=as.character(organism), common_name=as.character(organism), provider=as.character(provider), provider_version=as.character(genome), release_date=as.character(release_date), release_name=as.character(release_name), source_url=as.character(source_url), seqnames=seqnames, circ_seqs=circ_seqs, mseqnames=NULL, seqs_pkgname=NA_character_, seqs_dirpath=seqs_dirpath) } Then: genome <- BSgenome(c(chr1="TCCTTGAAAAC", chrM="GGAATAT"), circ_seqs="chrM", genome="hg00") genome # organism: NA (NA) # provider: NA # provider version: hg00 # release date: NA # release name: NA # 2 sequences: # chr1 chrM # (use 'seqnames()' to see all the sequence names, use the '$' or '[[' operator # to access a given sequence) seqinfo(genome) # Seqinfo object with 2 sequences (1 circular) from hg00 genome: # seqnames seqlengths isCircular genome # chr1 11 FALSE hg00 # chrM 7 TRUE hg00 genome$chr1 # 11-letter "DNAString" instance # seq: TCCTTGAAAAC We should probably have something like that in the BSgenome package. Also opened an issue for that one: https://github.com/Bioconductor/BSgenome/issues/3 Best, H. > > Pariksheet > > On Tue, May 28, 2019 at 3:35 AM Pages, Herve <hpa...@fredhutch.org > <mailto:hpa...@fredhutch.org>> wrote: > > Hi Pariksheet, > > On 5/25/19 12:49, Pariksheet Nanda wrote: > >> Hello, >> >> R 3.0 added support for long vectors, but it's not yet possible to use >> them >> with IRanges. Without long vector support it's not possible to construct >> an IRanges object with more than 2^31 elements: >> >> >>> ir <- IRanges(start = 1:(2^31 - 1), width = 1) >>> ir <- IRanges(start = 1:2^31, width = 1) >> Error in .Call2("solve_user_SEW0", start, end, width, PACKAGE = >> "IRanges") >> : >> long vectors not supported yet: memory.c:3715 >> In addition: Warning message: >> In .normargSEW0(start, "start") : >> NAs introduced by coercion to integer range > > Right. This is a known limitation of IRanges objects and Vector > derivatives in general. > > I wonder what's your use case? > > FWIW supporting long Vector derivatives (including long IRanges) has > been on the TODO list for a while. Unfortunately it seems that we > keep getting distracted by other things. > > Note that even when long IRanges objects are supported, computing on > them will not be very efficient because the memory footprint of > these objects will be very big (> 16Gb). It is much more interesting > (and fun) to use long Vector derivatives that have a **small** > memory footprint like long Rle's or long StitchedIPos/StitchedGPos > objects: > > library(S4Vectors) > > x <- Rle(1:15, 1e9) > x > # integer-Rle of length 15000000000 with 15 runs > # Lengths: 1000000000 1000000000 1000000000 ... 1000000000 1000000000 > # Values : 1 2 3 ... 14 15 > > object.size(x) > # 1288 bytes > > library(IRanges) > > ipos <- IPos(IRanges(1, 2e9)) > ipos > # StitchedIPos object with 2000000000 positions and 0 metadata columns: > # pos > # <integer> > # [1] 1 > # [2] 2 > # [3] 3 > # [4] 4 > # [5] 5 > # ... ... > # [1999999996] 1999999996 > # [1999999997] 1999999997 > # [1999999998] 1999999998 > # [1999999999] 1999999999 > # [2000000000] 2000000000 > > object.size(ipos) > # 2736 bytes > > library(GenomicRanges) > > gpos <- GPos("chr1:1-5e8") # not a real organism ;-) > gpos > # StitchedGPos object with 500000000 positions and 0 metadata columns: > # seqnames pos strand > # <Rle> <integer> <Rle> > # [1] chr1 1 * > # [2] chr1 2 * > # [3] chr1 3 * > # [4] chr1 4 * > # [5] chr1 5 * > # ... ... ... ... > # [499999996] chr1 499999996 * > # [499999997] chr1 499999997 * > # [499999998] chr1 499999998 * > # [499999999] chr1 499999999 * > # [500000000] chr1 500000000 * > # ------- > # seqinfo: 1 sequence from an unspecified genome; no seqlengths > > object.size(gpos) > # 10552 bytes > > We're not here yet but the goal would be to have light-weight > objects that can represent all the genomic positions in the Human > genome. > > H. > > >> This is true when using the latest version from GitHub >> >> >>> BiocManager::install("Bioconductor/IRanges") >>> sessionInfo() >> R version 3.6.0 (2019-04-26) >> Platform: x86_64-pc-linux-gnu (64-bit) >> Running under: Red Hat Enterprise Linux Server release 6.7 (Santiago) >> >> Matrix products: default >> BLAS: >> >> /home/pan14001/spack/opt/spack/linux-rhel6-x86_64/gcc-7.4.0/r-3.6.0-r7m53dthhqtxyrrdghjuiw2otasowvbl/rlib/R/lib/libRblas.so >> LAPACK: >> >> /home/pan14001/spack/opt/spack/linux-rhel6-x86_64/gcc-7.4.0/r-3.6.0-r7m53dthhqtxyrrdghjuiw2otasowvbl/rlib/R/lib/libRlapack.so >> >> locale: >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] stats4 parallel stats graphics grDevices utils datasets >> [8] methods base >> >> other attached packages: >> [1] IRanges_2.19.5 S4Vectors_0.22.0 BiocGenerics_0.30.0 >> >> loaded via a namespace (and not attached): >> [1] ps_1.3.0 prettyunits_1.0.2 withr_2.1.2 >> crayon_1.3.4 >> >> [5] rprojroot_1.3-2 assertthat_0.2.1 R6_2.4.0 >> backports_1.1.4 >> [9] magrittr_1.5 cli_1.1.0 curl_3.3 >> remotes_2.0.4 >> >> [13] callr_3.2.0 tools_3.6.0 compiler_3.6.0 >> processx_3.3.1 >> [17] pkgbuild_1.0.3 BiocManager_1.30.4 >> Pariksheet >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.org> mailing list >> >> https://urldefense.proofpoint.com/v2/url?u=https-3A__stat.ethz.ch_mailman_listinfo_bioc-2Ddevel&d=DwICAg&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=n-ClvxxGJJ0dHFwPMExjAYre_kqKvi-YPrVMP5Oyhqw&s=pkNJuBKcSYIy8xLk4Sao82m4w_GhgjEsoffdW0jgzIc&e= >> >> <https://urldefense.proofpoint.com/v2/url?u=https-3A__nam01.safelinks.protection.outlook.com_-3Furl-3Dhttps-253A-252F-252Furldefense.proofpoint.com-252Fv2-252Furl-253Fu-253Dhttps-2D3A-5F-5Fstat.ethz.ch-5Fmailman-5Flistinfo-5Fbioc-2D2Ddevel-2526d-253DDwICAg-2526c-253DeRAMFD45gAfqt84VtBcfhQ-2526r-253DBK7q3XeAvimeWdGbWY-5FwJYbW0WYiZvSXAJJKaaPhzWA-2526m-253Dn-2DClvxxGJJ0dHFwPMExjAYre-5FkqKvi-2DYPrVMP5Oyhqw-2526s-253DpkNJuBKcSYIy8xLk4Sao82m4w-5FGhgjEsoffdW0jgzIc-2526e-253D-26data-3D02-257C01-257Cpariksheet.nanda-2540uconn.edu-257C6eae687ace5f4c0340cd08d6e33f128d-257C17f1a87e2a254eaab9df9d439034b080-257C0-257C0-257C636946257374964712-26sdata-3DejesWIst1vuOrzlL6s-252BPA6MkgXnSoHQuZIDDCDV6dkM-253D-26reserved-3D0&d=DwMFaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=ypDrHBrsinoLiVwLABxbsNgXtWaBVZChcPk1j0Iocf8&s=_lnw4Bd1DrIhzYtVKDb6KAeuJyjxcc3yw8y59kirk7Y&e=> >> > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail:hpa...@fredhutch.org <mailto:hpa...@fredhutch.org> > Phone: (206) 667-5791 > Fax: (206) 667-1319 > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fredhutch.org Phone: (206) 667-5791 Fax: (206) 667-1319 _______________________________________________ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel