Hi Hervé,

Indeed, an IRanges with 2^31 elements is 17.1 GB. The reason I was
interested in IRanges is that GRanges are needed to create
BSgenome::BSgenomeViews. More broadly, my use case is chopping up a
large genome into kmers of a fixed size so that repetitive "unmappable"
regions can be removed.

https://github.com/coregenomics/kmap

My interest in long vectors is to address issue #8:

https://github.com/coregenomics/kmap/issues/8

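For a sense of scale, the figures above can be checked with a few lines of base R. This is only a rough sketch: it assumes each range costs two 4-byte integers (start and width) and ignores any other overhead. The names `n_kmers` and `genome_len`, the genome length, and `k = 36` are made up for illustration and are not taken from kmap.

```r
# Memory for 2^31 ranges, assuming start and width are each a 4-byte integer.
n_ranges <- 2^31
bytes <- n_ranges * 2 * 4
bytes / 1e9    # decimal GB, a little over 17
bytes / 2^30   # binary GiB
# 16

# Number of overlapping k-mers (step of 1) in a sequence of length len.
n_kmers <- function(len, k) max(len - k + 1, 0)

# Hypothetical numbers: a ~3.1e9 bp genome treated as one sequence, k = 36.
genome_len <- 3.1e9
k <- 36
total <- n_kmers(genome_len, k)

# Pieces needed so that each piece holds at most .Machine$integer.max ranges.
n_chunks <- ceiling(total / .Machine$integer.max)
n_chunks
# 2
```

Under these assumptions a single genome-wide set of kmer ranges exceeds the 2^31 - 1 limit and has to be split into at least two pieces.
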
The workaround I've imagined so far is to have my kmap::kmerize function
return an iterator that creates GRanges objects of length less than 2^31.
Using iterators doesn't even need any additional packages: in the
BiocParallel bpiterate unit tests they're implemented as a function that
keeps returning objects until it returns NULL.

But looking at how much more efficient your GPos etc. objects are,
perhaps BSgenomeViews requiring a GRanges is not that reasonable?

I don't even know of a sane way to mock a BSgenome object for writing
tests. It's irritating to have to use actual small genomes for tests.

Pariksheet

On Tue, May 28, 2019 at 3:35 AM Pages, Herve <hpa...@fredhutch.org> wrote:

> Hi Pariksheet,
>
> On 5/25/19 12:49, Pariksheet Nanda wrote:
>
> Hello,
>
> R 3.0 added support for long vectors, but it's not yet possible to use them
> with IRanges. Without long vector support it's not possible to construct
> an IRanges object with more than 2^31 elements:
>
> ir <- IRanges(start = 1:(2^31 - 1), width = 1)
> ir <- IRanges(start = 1:2^31, width = 1)
>
> Error in .Call2("solve_user_SEW0", start, end, width, PACKAGE = "IRanges") :
> long vectors not supported yet: memory.c:3715
> In addition: Warning message:
> In .normargSEW0(start, "start") :
> NAs introduced by coercion to integer range
>
> Right. This is a known limitation of IRanges objects and Vector
> derivatives in general.
>
> I wonder what's your use case?
>
> FWIW supporting long Vector derivatives (including long IRanges) has been
> on the TODO list for a while. Unfortunately it seems that we keep getting
> distracted by other things.
>
> Note that even when long IRanges objects are supported, computing on them
> will not be very efficient because the memory footprint of these objects
> will be very big (> 16Gb).
> It is much more interesting (and fun) to use
> long Vector derivatives that have a **small** memory footprint like long
> Rle's or long StitchedIPos/StitchedGPos objects:
>
> library(S4Vectors)
>
> x <- Rle(1:15, 1e9)
> x
> # integer-Rle of length 15000000000 with 15 runs
> #   Lengths: 1000000000 1000000000 1000000000 ... 1000000000 1000000000
> #   Values :          1          2          3 ...         14         15
>
> object.size(x)
> # 1288 bytes
>
> library(IRanges)
>
> ipos <- IPos(IRanges(1, 2e9))
> ipos
> # StitchedIPos object with 2000000000 positions and 0 metadata columns:
> #                     pos
> #               <integer>
> #          [1]          1
> #          [2]          2
> #          [3]          3
> #          [4]          4
> #          [5]          5
> #          ...        ...
> # [1999999996] 1999999996
> # [1999999997] 1999999997
> # [1999999998] 1999999998
> # [1999999999] 1999999999
> # [2000000000] 2000000000
>
> object.size(ipos)
> # 2736 bytes
>
> library(GenomicRanges)
>
> gpos <- GPos("chr1:1-5e8") # not a real organism ;-)
> gpos
> # StitchedGPos object with 500000000 positions and 0 metadata columns:
> #             seqnames       pos strand
> #                <Rle> <integer>  <Rle>
> #         [1]     chr1         1      *
> #         [2]     chr1         2      *
> #         [3]     chr1         3      *
> #         [4]     chr1         4      *
> #         [5]     chr1         5      *
> #         ...      ...       ...    ...
> # [499999996]     chr1 499999996      *
> # [499999997]     chr1 499999997      *
> # [499999998]     chr1 499999998      *
> # [499999999]     chr1 499999999      *
> # [500000000]     chr1 500000000      *
> # -------
> # seqinfo: 1 sequence from an unspecified genome; no seqlengths
>
> object.size(gpos)
> # 10552 bytes
>
>
> We're not here yet but the goal would be to have light-weight objects that
> can represent all the genomic positions in the Human genome.
>
> H.
>
> This is true when using the latest version from GitHub
>
> BiocManager::install("Bioconductor/IRanges")
> sessionInfo()
>
> R version 3.6.0 (2019-04-26)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Running under: Red Hat Enterprise Linux Server release 6.7 (Santiago)
>
> Matrix products: default
> BLAS: /home/pan14001/spack/opt/spack/linux-rhel6-x86_64/gcc-7.4.0/r-3.6.0-r7m53dthhqtxyrrdghjuiw2otasowvbl/rlib/R/lib/libRblas.so
> LAPACK: /home/pan14001/spack/opt/spack/linux-rhel6-x86_64/gcc-7.4.0/r-3.6.0-r7m53dthhqtxyrrdghjuiw2otasowvbl/rlib/R/lib/libRlapack.so
>
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats4    parallel  stats     graphics  grDevices utils     datasets
> [8] methods   base
>
> other attached packages:
> [1] IRanges_2.19.5      S4Vectors_0.22.0    BiocGenerics_0.30.0
>
> loaded via a namespace (and not attached):
>  [1] ps_1.3.0           prettyunits_1.0.2  withr_2.1.2        crayon_1.3.4
>  [5] rprojroot_1.3-2    assertthat_0.2.1   R6_2.4.0           backports_1.1.4
>  [9] magrittr_1.5       cli_1.1.0          curl_3.3           remotes_2.0.4
> [13] callr_3.2.0        tools_3.6.0        compiler_3.6.0     processx_3.3.1
> [17] pkgbuild_1.0.3     BiocManager_1.30.4
>
> Pariksheet
>
> _______________________________________________
> bioc-de...@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpa...@fredhutch.org
> Phone: (206) 667-5791
> Fax: (206) 667-1319

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel