Re: [Bioc-devel] Efficient Random Sampling of Positions in GRanges
Hi, On 02/18/2016 03:00 AM, Dario Strbenac wrote: Good day, Thank you for these two suggestions. I have questions about both options. GPos seems to have a limit on the number of bases it can represent. I get an error: Error in GPos(samplingAreas) : too many genomic positions in 'pos_runs'. What exactly is this limit ? Could it be added to the documentation ? It is documented. The man page for GPos says: Note: Like for any Vector derivative, the length of a GPos object cannot exceed ‘.Machine$integer.max’ (i.e. 2^31 on most platforms). GPos() will return an error if 'pos_runs' contains too many genomic positions. I've started to believe that with GPos we might have a use case that is strong enough to justify adding support for long Vector objects. This is a big change to our infrastructure though and it won't happen before the next release. I used all of the human chromosomes. There is also no documentation of the sample function for GPos objects. That's because there is no sample() function for GPos objects: > sample function (x, size, replace = FALSE, prob = NULL) { if (length(x) == 1L && is.numeric(x) && x >= 1) { if (missing(size)) size <- x sample.int(x, size, replace, prob) } else { if (missing(size)) size <- length(x) x[sample.int(length(x), size, replace, prob)] } } As you can see, sample() works on anything that has a length() and is subsettable (a.k.a. "vector-like" object). See ?sample for more information. H. Could regioneR be improved to consider strand ? It generates regions with no strand. I would like the regions to have a strand, since I have a strand-specific sequencing dataset, and for regions to be possibly chosen on the opposite strand to a masked region, such as when the genome is masked by transcripts for the purpose of choosing intergenic sequences. -- Dario Strbenac PhD Student University of Sydney Camperdown NSW 2050 Australia ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fredhutch.org Phone: (206) 667-5791 Fax:(206) 667-1319 ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] Efficient Random Sampling of Positions in GRanges
Good day, Thank you for these two suggestions. I have questions about both options. GPos seems to have a limit on the number of bases it can represent. I get an error: Error in GPos(samplingAreas) : too many genomic positions in 'pos_runs'. What exactly is this limit ? Could it be added to the documentation ? I used all of the human chromosomes. There is also no documentation of the sample function for GPos objects. Could regioneR be improved to consider strand ? It generates regions with no strand. I would like the regions to have a strand, since I have a strand-specific sequencing dataset, and for regions to be possibly chosen on the opposite strand to a masked region, such as when the genome is masked by transcripts for the purpose of choosing intergenic sequences. -- Dario Strbenac PhD Student University of Sydney Camperdown NSW 2050 Australia ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] Efficient Random Sampling of Positions in GRanges
Hi Dario, On 02/16/2016 03:00 AM, Dario Strbenac wrote: Hello, There is no convenience function to sample nucleotide positions from a GRanges object. My approach is to generate a GRanges of every chromosomal position with a width of 1, then find the overlaps with the desired ranges (admissible regions), then sample the positions that overlapped. You can replace these 3 steps with the following 1-liner by using a GPos object (new in GenomicRanges devel): sample(GPos(admissible_regions), ...) It should also be more efficient. From the GPos man page: The GPos class is a container for storing a set of genomic _positions_, that is, genomic ranges of width 1. Even though a GRanges object can be used for that, using a GPos object can be much more memory-efficient, especially when the object contains long runs of adjacent positions. So you might improve efficiency a little bit by making sure that the sampling preserves the order of the genomic positions e.g. with something like this: gpos <- GPos(admissible_regions) ## Extract 1000 random genomic positions: gpos[order(sample(length(gpos), 1000))] H. The construction of the GRanges object containing every chromosome position is inefficient, as is finding its overlaps with another GRanges object. Could an optimised function for this task be added to GenomicRanges ? -- Dario Strbenac PhD Student University of Sydney Camperdown NSW 2050 Australia ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpa...@fredhutch.org Phone: (206) 667-5791 Fax:(206) 667-1319 ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] Efficient Random Sampling of Positions in GRanges
Hi Dario, You could use a package called regioneR for that. It has functions to create random regions and to randomize existing sets of regions along the genome and it can do it taking into account a possible set of masked regions. In your case, you could create a mask for the genome that is the genome minus your GRanges object, so the random regions will be placed only on your regions. For example, with a GRanges (A) with only two regions in chromosome 1: library(regioneR) A <- toGRanges(data.frame(chr=c("chr1", "chr1"), start=c(20, 100), end=c(30, 150))) hg19.genome <- getGenome("hg19") #You need the hg19 BSgenome package installed to do this hg19.mask <- subtractRegions(hg19.genome, A) random.regions <- createRandomRegions(nregions=10, length.mean=1, length.sd=0, genome=hg19.genome, mask=hg19.mask) Although the randomization process can be a bit slow when dealing with thousands of regions, it will be way faster than the approach you propose. Hope this helps Bernat Bioinformatician, PhD. Genetic Variation and Cancer Genetic Diagnostic Unit of Hereditary Cancer (UDGCH-IMPPC) Institut de Medicina Predictiva i Personalitzada del Càncer (IMPPC) Campus de Can Ruti Ctra de Can Ruti, Camí de les Escoles s/n 08916 Badalona, Spain Tel (+34) 93 557 28 36 b...@imppc.org http://www.imppc.org/en/ On 02/16/16 12:00, Dario Strbenac wrote: Hello, There is no convenience function to sample nucleotide positions from a GRanges object. My approach is to generate a GRanges of every chromosomal position with a width of 1, then find the overlaps with the desired ranges (admissible regions), then sample the positions that overlapped. The construction of the GRanges object containing every chromosome position is inefficient, as is finding its overlaps with another GRanges object. Could an optimised function for this task be added to GenomicRanges ? -- Dario Strbenac PhD Student University of Sydney Camperdown NSW 2050 Australia ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
[Bioc-devel] Efficient Random Sampling of Positions in GRanges
Hello, There is no convenience function to sample nucleotide positions from a GRanges object. My approach is to generate a GRanges of every chromosomal position with a width of 1, then find the overlaps with the desired ranges (admissible regions), then sample the positions that overlapped. The construction of the GRanges object containing every chromosome position is inefficient, as is finding its overlaps with another GRanges object. Could an optimised function for this task be added to GenomicRanges ? -- Dario Strbenac PhD Student University of Sydney Camperdown NSW 2050 Australia ___ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel