Re: [Bioc-devel] Block bootstrap for GenomicRanges

2018-08-14 Thread Hervé Pagès

I see. If the blocks need to be completely random (but still
with the constraint that once rearranged they form a partition
of the chromosome), then a slightly modified solution would be:

## Add a feature id to the ranges in y. This is not required
## but will help see what happens to the features:

  mcols(y)$feature_id <- head(letters, length(y))
  y
  # IRanges object with 12 ranges and 1 metadata column:
  #  start   end width |  feature_id
  # | 
  #  [1]5155 5 |   a
  #  [2]6165 5 |   b
  #  [3]7175 5 |   c
  #  [4]   111   115 5 |   d
  #  [5]   121   125 5 |   e
  #  ...   ...   ...   ... . ...
  #  [8]   511   515 5 |   h
  #  [9]   521   525 5 |   i
  # [10]   921   925 5 |   j
  # [11]   931   935 5 |   k
  # [12]   941   945 5 |   l

## Generate the random blocks:

  random_blocks <- IRanges(start=round(runif(10,1,901)), width=100)

## Add a block id. Again, not needed for the algo below, but will
## help understand the final object y_prime:

  mcols(random_blocks)$block_id <- head(LETTERS, length(random_blocks))
  random_blocks
  # IRanges object with 10 ranges and 1 metadata column:
  #  start   end width |block_id
  # | 
  #  [1]   283   382   100 |   A
  #  [2]   898   997   100 |   B
  #  [3]   298   397   100 |   C
  #  [4]   680   779   100 |   D
  #  [5]   722   821   100 |   E
  #  [6]   632   731   100 |   F
  #  [7]   594   693   100 |   G
  #  [8]   689   788   100 |   H
  #  [9]   886   985   100 |   I
  # [10]   673   772   100 |   J

## Compute the shift involved in rearranging each block:

  rearranged_blocks <- successiveIRanges(width(random_blocks))
  block_shift <- start(rearranged_blocks) - start(random_blocks)

## Compute y':

  y_prime <- do.call(c,
lapply(seq_along(random_blocks),
  function(b) {
features_to_shift <- subsetByOverlaps(y, random_blocks[b])
block_id <- mcols(random_blocks)$block_id[b]
mcols(features_to_shift)$block_id <- rep(block_id, 
length(features_to_shift))

shift(features_to_shift, block_shift[b])
  }
)
  )

  y_prime
  # IRanges object with 6 ranges and 2 metadata columns:
  # start   end width |  feature_idblock_id
  #|  
  # [1]   124   128 5 |   j   B
  # [2]   134   138 5 |   k   B
  # [3]   144   148 5 |   l   B
  # [4]   836   840 5 |   j   I
  # [5]   846   850 5 |   k   I
  # [6]   856   860 5 |   l   I

Still based on shift(), which avoids all the little annoyances
of using Rle's as an intermediate representation of the ranges.

It uses a loop which might be problem if the number of blocks is
big (say more than 5). There might be a way to avoid the loop
though, but it's probably not trivial...

H.


On 08/14/2018 05:26 AM, Michael Love wrote:

dear Hervé,

Thanks again for the quick and useful reply!

I think that the theory behind the block bootstrap [Kunsch (1989), Liu
and Singh (1992), Politis and Romano (1994)], needs that the blocks be
drawn with replacement (you can get some features twice) and that the
blocks can be overlapping. In a hand-waving way, I think, it's "good"
for the variance estimation on any statistic of interest that y' may
have more or less features than y.

I will explore a bit using the solutions you've laid out.

Now that I think about it, the start-position based solution that I
was thinking about will break if two features in y share the same
start position, so that's not good.

On Mon, Aug 13, 2018 at 11:58 PM, Hervé Pagès  wrote:

That helps. I think I start to understand what you are after.

See below...


On 08/13/2018 06:07 PM, Michael Love wrote:


dear Hervé,

Thanks for the quick reply about directions to take this.

I'm sorry for not providing sufficient detail about the goal of block
bootstrapping in my initial post. Let me try again. For a moment, let
me ignore multiple chromosomes/seqs and just focus on a single set of
IRanges.

The point of the block bootstrap is: Let's say we want to find the
number of overlaps of x and y, and then assess how surprised we are at
how large that overlap is. Both of them may have features that tend to
cluster together along the genome (independently). One method would
just be to move the features in y around to random start sites, making
y', say B times, and then calculate 

Re: [Bioc-devel] Block bootstrap for GenomicRanges

2018-08-14 Thread Bernat Gel

Hi all,

Since you are talking about genomic ranges permutation let me just say 
that the regioneR package already does that. It does not have block 
bootstrapping as defined here, but it implements two different 
randomization models (one that randomizes each region independently and 
another one based on "chromosome 'spinning' " that conserves the 
internal structure of the set of ranges) and both can take into account 
a mask defining regions where ranges cannot be randomized. It's quite 
customizable and so it would be possible to add new permutation 
strategies if needed.


It's probably not the most efficient implementation, but the 
randomization process scales decently and it has proven useful over the 
years.



Bernat

*Bernat Gel Moreno*
Bioinformatician

Hereditary Cancer Program
Program of Predictive and Personalized Medicine of Cancer (PMPPC)
Germans Trias i Pujol Research Institute (IGTP)

Campus Can Ruti
Carretera de Can Ruti, Camí de les Escoles s/n
08916 Badalona, Barcelona, Spain

Tel: (+34) 93 554 3068
Fax: (+34) 93 497 8654
08916 Badalona, Barcelona, Spain
b...@igtp.cat 
www.germanstrias.org 









El 08/14/2018 a las 03:06 PM, Kasper Daniel Hansen escribió:

I agree this is super important.  I think there may be multiple ways of
thinking about a decent bootstrapping or permutation of ranges, in
genomics. I am quite interested in the topic. I think it might belong in a
new package. I would be interesting in extending the conversation and have
a couple of different approaches (theoretical) that we could work on being
efficient.

Best,
Kasper

On Tue, Aug 14, 2018 at 8:27 AM Michael Love 
wrote:


dear Hervé,

Thanks again for the quick and useful reply!

I think that the theory behind the block bootstrap [Kunsch (1989), Liu
and Singh (1992), Politis and Romano (1994)], needs that the blocks be
drawn with replacement (you can get some features twice) and that the
blocks can be overlapping. In a hand-waving way, I think, it's "good"
for the variance estimation on any statistic of interest that y' may
have more or less features than y.

I will explore a bit using the solutions you've laid out.

Now that I think about it, the start-position based solution that I
was thinking about will break if two features in y share the same
start position, so that's not good.

On Mon, Aug 13, 2018 at 11:58 PM, Hervé Pagès 
wrote:

That helps. I think I start to understand what you are after.

See below...


On 08/13/2018 06:07 PM, Michael Love wrote:

dear Hervé,

Thanks for the quick reply about directions to take this.

I'm sorry for not providing sufficient detail about the goal of block
bootstrapping in my initial post. Let me try again. For a moment, let
me ignore multiple chromosomes/seqs and just focus on a single set of
IRanges.

The point of the block bootstrap is: Let's say we want to find the
number of overlaps of x and y, and then assess how surprised we are at
how large that overlap is. Both of them may have features that tend to
cluster together along the genome (independently). One method would
just be to move the features in y around to random start sites, making
y', say B times, and then calculate each of the B times the number of
overlaps between x and y'. Or we might make this better by having
blacklisted sites where the randomly shuffled features in y cannot go.

The block bootstrap is an alternative to randomly moving the start
sites, where instead we create random data, by taking big "blocks" of
features in y. Each block is a lot like a View. And the ultimate goal
is to make B versions of the data y where the features have been
shuffled around, but by taking blocks, we preserve the clumpiness of
the features in y.

Let me give some numbers to make this more concrete, so say we're
making a single block bootstrap sample of a chromosome that is 1000 bp
long. Here is the original y:

y <- IRanges(c(51,61,71,111,121,131,501,511,521,921,931,941),width=5)

If I go with my coverage approach, I should extend it all the way to
the end of the chromosome. Here I lose information if there are
overlaps of features in y, and I'm thinking of a fix I'll describe
below.

cov <- c(coverage(y), Rle(rep(0,55)))

I could make one block bootstrap sample of y (this is 1 out of B in
the ultimate procedure) by taking 10 blocks of width 100. The blocks
have random start positions from 1 to 901.

y.boot.1 <- unlist(Views(cov, start=round(runif(10,1,901)), width=100))


Choosing blocks that can overlap with each others could make y' appear
to have more features than y (by repeating some of the original
features). Also choosing blocks that can leave big gaps in the
chromosome could make y' appear to have less features than y
(by dropping some of the original ranges). Isn't that a problem?

Have you considered choosing a set of blocks that represent a
partitioning of the chromosome? This would make y' look more like y
by 

Re: [Bioc-devel] Block bootstrap for GenomicRanges

2018-08-14 Thread Michael Love
Maybe we can break off to a #blockboot channel on the Bioc-community Slack.

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] Block bootstrap for GenomicRanges

2018-08-14 Thread Kasper Daniel Hansen
I agree this is super important.  I think there may be multiple ways of
thinking about a decent bootstrapping or permutation of ranges, in
genomics. I am quite interested in the topic. I think it might belong in a
new package. I would be interesting in extending the conversation and have
a couple of different approaches (theoretical) that we could work on being
efficient.

Best,
Kasper

On Tue, Aug 14, 2018 at 8:27 AM Michael Love 
wrote:

> dear Hervé,
>
> Thanks again for the quick and useful reply!
>
> I think that the theory behind the block bootstrap [Kunsch (1989), Liu
> and Singh (1992), Politis and Romano (1994)], needs that the blocks be
> drawn with replacement (you can get some features twice) and that the
> blocks can be overlapping. In a hand-waving way, I think, it's "good"
> for the variance estimation on any statistic of interest that y' may
> have more or less features than y.
>
> I will explore a bit using the solutions you've laid out.
>
> Now that I think about it, the start-position based solution that I
> was thinking about will break if two features in y share the same
> start position, so that's not good.
>
> On Mon, Aug 13, 2018 at 11:58 PM, Hervé Pagès 
> wrote:
> > That helps. I think I start to understand what you are after.
> >
> > See below...
> >
> >
> > On 08/13/2018 06:07 PM, Michael Love wrote:
> >>
> >> dear Hervé,
> >>
> >> Thanks for the quick reply about directions to take this.
> >>
> >> I'm sorry for not providing sufficient detail about the goal of block
> >> bootstrapping in my initial post. Let me try again. For a moment, let
> >> me ignore multiple chromosomes/seqs and just focus on a single set of
> >> IRanges.
> >>
> >> The point of the block bootstrap is: Let's say we want to find the
> >> number of overlaps of x and y, and then assess how surprised we are at
> >> how large that overlap is. Both of them may have features that tend to
> >> cluster together along the genome (independently). One method would
> >> just be to move the features in y around to random start sites, making
> >> y', say B times, and then calculate each of the B times the number of
> >> overlaps between x and y'. Or we might make this better by having
> >> blacklisted sites where the randomly shuffled features in y cannot go.
> >>
> >> The block bootstrap is an alternative to randomly moving the start
> >> sites, where instead we create random data, by taking big "blocks" of
> >> features in y. Each block is a lot like a View. And the ultimate goal
> >> is to make B versions of the data y where the features have been
> >> shuffled around, but by taking blocks, we preserve the clumpiness of
> >> the features in y.
> >>
> >> Let me give some numbers to make this more concrete, so say we're
> >> making a single block bootstrap sample of a chromosome that is 1000 bp
> >> long. Here is the original y:
> >>
> >> y <- IRanges(c(51,61,71,111,121,131,501,511,521,921,931,941),width=5)
> >>
> >> If I go with my coverage approach, I should extend it all the way to
> >> the end of the chromosome. Here I lose information if there are
> >> overlaps of features in y, and I'm thinking of a fix I'll describe
> >> below.
> >>
> >> cov <- c(coverage(y), Rle(rep(0,55)))
> >>
> >> I could make one block bootstrap sample of y (this is 1 out of B in
> >> the ultimate procedure) by taking 10 blocks of width 100. The blocks
> >> have random start positions from 1 to 901.
> >>
> >> y.boot.1 <- unlist(Views(cov, start=round(runif(10,1,901)), width=100))
> >
> >
> > Choosing blocks that can overlap with each others could make y' appear
> > to have more features than y (by repeating some of the original
> > features). Also choosing blocks that can leave big gaps in the
> > chromosome could make y' appear to have less features than y
> > (by dropping some of the original ranges). Isn't that a problem?
> >
> > Have you considered choosing a set of blocks that represent a
> > partitioning of the chromosome? This would make y' look more like y
> > by preserving the number of original ranges.
> >
> > In other words, if you can assign each range in y to one block and
> > one block only, then you could generate y' by just shifting the
> > ranges in y. The exact amount of shifting (positive or negative)
> > would only depend on the block that the range belongs to. This would
> > give you an y' that is parallel to y (i.e. same number of ranges
> > and y'[i] corresponds to y[i]).
> >
> > Something like this:
> >
> > 1) Define the blocks (must be a partitioning of the chromosome):
> >
> >   blocks <- successiveIRanges(rep(100, 10))
> >
> > 2) Assign each range in y to the block it belongs to:
> >
> >   mcols(y)$block <- findOverlaps(y, blocks, select="first")
> >
> > 1) and 2) are preliminary steps that you only need to do once,
> > before you generate B versions of the shuffled data (what you
> > call y'). The next steps are for generating one version of the
> > shuffled data so would need to be repeated B times.
> >
> > 3) 

Re: [Bioc-devel] Block bootstrap for GenomicRanges

2018-08-14 Thread Michael Love
dear Hervé,

Thanks again for the quick and useful reply!

I think that the theory behind the block bootstrap [Kunsch (1989), Liu
and Singh (1992), Politis and Romano (1994)], needs that the blocks be
drawn with replacement (you can get some features twice) and that the
blocks can be overlapping. In a hand-waving way, I think, it's "good"
for the variance estimation on any statistic of interest that y' may
have more or less features than y.

I will explore a bit using the solutions you've laid out.

Now that I think about it, the start-position based solution that I
was thinking about will break if two features in y share the same
start position, so that's not good.

On Mon, Aug 13, 2018 at 11:58 PM, Hervé Pagès  wrote:
> That helps. I think I start to understand what you are after.
>
> See below...
>
>
> On 08/13/2018 06:07 PM, Michael Love wrote:
>>
>> dear Hervé,
>>
>> Thanks for the quick reply about directions to take this.
>>
>> I'm sorry for not providing sufficient detail about the goal of block
>> bootstrapping in my initial post. Let me try again. For a moment, let
>> me ignore multiple chromosomes/seqs and just focus on a single set of
>> IRanges.
>>
>> The point of the block bootstrap is: Let's say we want to find the
>> number of overlaps of x and y, and then assess how surprised we are at
>> how large that overlap is. Both of them may have features that tend to
>> cluster together along the genome (independently). One method would
>> just be to move the features in y around to random start sites, making
>> y', say B times, and then calculate each of the B times the number of
>> overlaps between x and y'. Or we might make this better by having
>> blacklisted sites where the randomly shuffled features in y cannot go.
>>
>> The block bootstrap is an alternative to randomly moving the start
>> sites, where instead we create random data, by taking big "blocks" of
>> features in y. Each block is a lot like a View. And the ultimate goal
>> is to make B versions of the data y where the features have been
>> shuffled around, but by taking blocks, we preserve the clumpiness of
>> the features in y.
>>
>> Let me give some numbers to make this more concrete, so say we're
>> making a single block bootstrap sample of a chromosome that is 1000 bp
>> long. Here is the original y:
>>
>> y <- IRanges(c(51,61,71,111,121,131,501,511,521,921,931,941),width=5)
>>
>> If I go with my coverage approach, I should extend it all the way to
>> the end of the chromosome. Here I lose information if there are
>> overlaps of features in y, and I'm thinking of a fix I'll describe
>> below.
>>
>> cov <- c(coverage(y), Rle(rep(0,55)))
>>
>> I could make one block bootstrap sample of y (this is 1 out of B in
>> the ultimate procedure) by taking 10 blocks of width 100. The blocks
>> have random start positions from 1 to 901.
>>
>> y.boot.1 <- unlist(Views(cov, start=round(runif(10,1,901)), width=100))
>
>
> Choosing blocks that can overlap with each others could make y' appear
> to have more features than y (by repeating some of the original
> features). Also choosing blocks that can leave big gaps in the
> chromosome could make y' appear to have less features than y
> (by dropping some of the original ranges). Isn't that a problem?
>
> Have you considered choosing a set of blocks that represent a
> partitioning of the chromosome? This would make y' look more like y
> by preserving the number of original ranges.
>
> In other words, if you can assign each range in y to one block and
> one block only, then you could generate y' by just shifting the
> ranges in y. The exact amount of shifting (positive or negative)
> would only depend on the block that the range belongs to. This would
> give you an y' that is parallel to y (i.e. same number of ranges
> and y'[i] corresponds to y[i]).
>
> Something like this:
>
> 1) Define the blocks (must be a partitioning of the chromosome):
>
>   blocks <- successiveIRanges(rep(100, 10))
>
> 2) Assign each range in y to the block it belongs to:
>
>   mcols(y)$block <- findOverlaps(y, blocks, select="first")
>
> 1) and 2) are preliminary steps that you only need to do once,
> before you generate B versions of the shuffled data (what you
> call y'). The next steps are for generating one version of the
> shuffled data so would need to be repeated B times.
>
> 3) Shuffle the blocks:
>
>   perm <- sample(length(blocks))
>   perm
>   # [1]  6  5  8  3  2  7  1  9  4 10
>
>   permuted_blocks <- successiveIRanges(width(blocks)[perm])
>   permuted_blocks[perm] <- permuted_blocks
>
> 4) Compute the shift along the chromosome that each block went
> thru:
>
>   block_shift <- start(permuted_blocks) - start(blocks)
>
> 5) Apply this shift to the ranges in y:
>
>   shift(y, block_shift[mcols(y)$block])
>   # IRanges object with 12 ranges and 1 metadata column:
>   #  start   end width | block
>   # | 
>   #  [1]   651   655 5 | 1
>   #  [2]   661