Re: [Bioc-devel] Efficient Random Sampling of Positions in GRanges

2016-02-18 Thread Hervé Pagès

Hi,

On 02/18/2016 03:00 AM, Dario Strbenac wrote:

Good day,

Thank you for these two suggestions. I have questions about both options.

GPos seems to have a limit on the number of bases it can represent. I get an 
error: Error in GPos(samplingAreas) : too many genomic positions in 'pos_runs'. 
What exactly is this limit ? Could it be added to the documentation ?


It is documented. The man page for GPos says:

  Note:

 Like for any Vector derivative, the length of a GPos object cannot
 exceed ‘.Machine$integer.max’ (i.e. 2^31 on most platforms).
 GPos() will return an error if 'pos_runs' contains too many
 genomic positions.

I've started to believe that with GPos we might have a use case that is
strong enough to justify adding support for long Vector objects. This
is a big change to our infrastructure though and it won't happen before
the next release.


I used all of the human chromosomes. There is also no documentation of the 
sample function for GPos objects.


That's because there is no sample() function for GPos objects:

  > sample
  function (x, size, replace = FALSE, prob = NULL)
  {
if (length(x) == 1L && is.numeric(x) && x >= 1) {
if (missing(size))
size <- x
sample.int(x, size, replace, prob)
}
else {
if (missing(size))
size <- length(x)
x[sample.int(length(x), size, replace, prob)]
}
  }

As you can see, sample() works on anything that has a length() and
is subsettable (a.k.a. "vector-like" object). See ?sample for more
information.

H.



Could regioneR be improved to consider strand ? It generates regions with no 
strand. I would like the regions to have a strand, since I have a 
strand-specific sequencing dataset, and for regions to be possibly chosen on 
the opposite strand to a masked region, such as when the genome is masked by 
transcripts for the purpose of choosing intergenic sequences.

--
Dario Strbenac
PhD Student
University of Sydney
Camperdown NSW 2050
Australia
___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel



--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fredhutch.org
Phone:  (206) 667-5791
Fax:(206) 667-1319

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] Efficient Random Sampling of Positions in GRanges

2016-02-18 Thread Dario Strbenac
Good day,

Thank you for these two suggestions. I have questions about both options.

GPos seems to have a limit on the number of bases it can represent. I get an 
error: Error in GPos(samplingAreas) : too many genomic positions in 'pos_runs'. 
What exactly is this limit ? Could it be added to the documentation ? I used 
all of the human chromosomes. There is also no documentation of the sample 
function for GPos objects.

Could regioneR be improved to consider strand ? It generates regions with no 
strand. I would like the regions to have a strand, since I have a 
strand-specific sequencing dataset, and for regions to be possibly chosen on 
the opposite strand to a masked region, such as when the genome is masked by 
transcripts for the purpose of choosing intergenic sequences.

--
Dario Strbenac
PhD Student
University of Sydney
Camperdown NSW 2050
Australia
___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] Efficient Random Sampling of Positions in GRanges

2016-02-16 Thread Hervé Pagès

Hi Dario,

On 02/16/2016 03:00 AM, Dario Strbenac wrote:

Hello,

There is no convenience function to sample nucleotide positions from a GRanges 
object. My approach is to generate a GRanges of every chromosomal position with 
a width of 1, then find the overlaps with the desired ranges (admissible 
regions), then sample the positions that overlapped.


You can replace these 3 steps with the following 1-liner by using a
GPos object (new in GenomicRanges devel):

  sample(GPos(admissible_regions), ...)

It should also be more efficient.

From the GPos man page:

 The GPos class is a container for storing a set of genomic
 _positions_, that is, genomic ranges of width 1. Even though a
 GRanges object can be used for that, using a GPos object can be
 much more memory-efficient, especially when the object contains
 long runs of adjacent positions.

So you might improve efficiency a little bit by making sure that the
sampling preserves the order of the genomic positions e.g. with
something like this:

  gpos <- GPos(admissible_regions)
  ## Extract 1000 random genomic positions:
  gpos[order(sample(length(gpos), 1000))]

H.


The construction of the GRanges object containing every chromosome position is 
inefficient, as is finding its overlaps with another GRanges object. Could an 
optimised function for this task be added to GenomicRanges ?

--
Dario Strbenac
PhD Student
University of Sydney
Camperdown NSW 2050
Australia

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel



--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fredhutch.org
Phone:  (206) 667-5791
Fax:(206) 667-1319

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] Efficient Random Sampling of Positions in GRanges

2016-02-16 Thread Bernat Gel

Hi Dario,

You could use a package called regioneR for that. It has functions to 
create random regions and to randomize existing sets of regions along 
the genome and it can do it taking into account a possible set of masked 
regions.


In your case, you could create a mask for the genome that is the genome 
minus your GRanges object, so the random regions will be placed only on 
your regions.


For example, with a GRanges (A) with only two regions in chromosome 1:

library(regioneR)

A <- toGRanges(data.frame(chr=c("chr1", "chr1"), start=c(20, 100), 
end=c(30, 150)))


hg19.genome <- getGenome("hg19") #You need the hg19 BSgenome package 
installed to do this

hg19.mask <- subtractRegions(hg19.genome, A)

random.regions <- createRandomRegions(nregions=10, length.mean=1, 
length.sd=0, genome=hg19.genome, mask=hg19.mask)




Although the randomization process can be a bit slow when dealing with 
thousands of regions, it will be way faster than the approach you propose.


Hope this helps

Bernat

Bioinformatician, PhD.
Genetic Variation and Cancer
Genetic Diagnostic Unit of Hereditary Cancer (UDGCH-IMPPC)
Institut de Medicina Predictiva i Personalitzada del Càncer (IMPPC)
Campus de Can Ruti
Ctra de Can Ruti, Camí de les Escoles s/n
08916 Badalona, Spain

Tel (+34) 93 557 28 36
b...@imppc.org

http://www.imppc.org/en/

On 02/16/16 12:00, Dario Strbenac wrote:

Hello,

There is no convenience function to sample nucleotide positions from a GRanges 
object. My approach is to generate a GRanges of every chromosomal position with 
a width of 1, then find the overlaps with the desired ranges (admissible 
regions), then sample the positions that overlapped. The construction of the 
GRanges object containing every chromosome position is inefficient, as is 
finding its overlaps with another GRanges object. Could an optimised function 
for this task be added to GenomicRanges ?

--
Dario Strbenac
PhD Student
University of Sydney
Camperdown NSW 2050
Australia

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


[Bioc-devel] Efficient Random Sampling of Positions in GRanges

2016-02-16 Thread Dario Strbenac
Hello,

There is no convenience function to sample nucleotide positions from a GRanges 
object. My approach is to generate a GRanges of every chromosomal position with 
a width of 1, then find the overlaps with the desired ranges (admissible 
regions), then sample the positions that overlapped. The construction of the 
GRanges object containing every chromosome position is inefficient, as is 
finding its overlaps with another GRanges object. Could an optimised function 
for this task be added to GenomicRanges ?

--
Dario Strbenac
PhD Student
University of Sydney
Camperdown NSW 2050
Australia

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel