Re: [Bioc-sig-seq] `+` for GenomeData and coverage from several lanes

Patrick Aboyoun Mon, 06 Jul 2009 11:05:03 -0700

Simon,

As you noted, there are some discrepancies between Map and gdreduce.I'll explain what I have done so far and start by taking a step back.The IRanges package contains a number of S4 class definitions startingwith Sequence that mimic the S3 conventions developed for vector in baseR. (The Sequence man page in IRanges contains all the methods that havebeen developed for Sequence and its subclasses. Most of these methodsare for "generics" originally developed for S3 vector objects.) Giventhe situation that you were encountering, I added Reduce, Map, Find,Filter, Position, and mapply methods for the Sequence class. Given thatthe names attributes for S3 vectors generally serve as labels ratherthan true metadata attributes, functions like Map and mapply don't usethe names attributes to align elements of vectors (including lists)before operating on them. Here is an example using S3 list objects:


> a <- list(chr1 = 1:10, chr2 = 11:20, chr3 = 21:30)
> b <- list(chr1 = -(1:10), chr3 = -(21:30), chr2 = -(11:20))
> Map("+", a, b)
$chr1
[1] 0 0 0 0 0 0 0 0 0 0

$chr2
[1] -10 -10 -10 -10 -10 -10 -10 -10 -10 -10

$chr3
[1] 10 10 10 10 10 10 10 10 10 10

> as.data.frame(a) + as.data.frame(b)
  chr1 chr2 chr3
1     0  -10   10
2     0  -10   10
3     0  -10   10
4     0  -10   10
5     0  -10   10
6     0  -10   10
7     0  -10   10
8     0  -10   10
9     0  -10   10
10    0  -10   10
> sessionInfo()
R version 2.10.0 Under development (unstable) (2009-06-28 r48863)
i386-apple-darwin9.7.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:

[1] stats graphics grDevices utils datasets methods base

If the elements of a and b were aligned by the names attributes, thenall of the values would have been 0. There are pros and cons to makingSequence treat names differently than vector. On the pro side, promotingnames to a true metadata attribute would avoid the problem you describe.On the con side, this complicates the user's software design becausethey can no longer simply implement code for Sequence and its subclassesas they would for vector and expect to obtain the same conceptual result.

The GenomeData class is an interesting beast because it was conceptuallydesigned as a list with some metadata laid on top to indicate whatgenome the data are related to. There is currently nothing in the classthat guarantees the names for the list correspond to the names ofchromosomes for the particular genome in question or even if the listelements have names at all:


> GenomeData(list(1:10, 1:100))
A GenomeData instance
chromosomes(2): 1 2
> names(GenomeData(list(1:10, 1:100)))
NULL
> validObject(GenomeData(list(1:10, 1:100)))
[1] TRUE

Therefore methods for GenomeData that assume the elements are named arenot guaranteed to work for all valid GenomeData objects. I could changethis by requiring GenomeData to have element names, but I need to talkto the class designers to see if this is what they intended and if itwould break workflows they have designed.

Based on what I have been seeing so far, it appears that GenomeData andRangedData are dually serving in a role similar to whatAnnotatedDataFrame has in the microarray space. We just need to find theright mix of flexibility (so people can stores different data types) andrestrictions (so we can define methods that perform non-trivialoperations) for these two classes so the users don't have to spendneedless amounts of time creating clever workarounds if we are toorestrictive or reinventing the wheel if we are not restrictive enough.



Patrick



Simon Anders wrote:

Hi Patrick

Patrick Aboyoun wrote:

I worked on beefing up the low-level IRanges Sequence class, which has
around 80 subclasses including GenomeData, GenomeDataList, RleList, to
include functional programming tools like Reduce, Filter, Find, Map,

[...]

Thanks, this looks quite useful.


I've just played with it and found one issue with chromsome names.

It seems that you Map function finds corresponding Rle vectors in two
GenomeData objects by looking at the list indices but not at the names.

Here you have two GenomeData objects which have the same values but
swapped chromosome names:

gd1 <- GenomeData( list(

    chrA = Rle( rep( c(0,3,0), each=20 ) ),
    chrB = Rle( rep( c(0,2,0), each=30 ) ) ) )

gd2 <- GenomeData( list(

    chrB = Rle( rep( c(0,3,0), each=20 ) ),
    chrA = Rle( rep( c(0,2,0), each=30 ) ) ) )

If I add them up, I do not add chromosome A to chromosome A, but the
first chromsome to the first one:

Map( "+", gd1, gd2 )

$chrA
  'numeric' Rle instance of length 60 with 3 runs
  Lengths:  20 20 20
  Values :  0 6 0

$chrB
  'numeric' Rle instance of length 90 with 3 runs
  Lengths:  30 30 30
  Values :  0 4 0

This is of course consistent with what base::Map does but violated the
semantics of the GenomeData object. Imagine you use 'coverage' on two
sets of aligned reads, and for some reasons, one chromosome never
appears in the first set, and another one never appears in the second
one. You will have two GenomeData objects with the same number of
chromosomes, and their sum will be completely wrong.


Martin's gdreduce function, as I've just noticed, does not have this issue:

str( gdreduce( "+", gd1, gd2 ) )

Formal class 'GenomeData' [package "BSgenome"] with 4 slots
  ..@ listData       :List of 2
  .. ..$ chrA:Formal class 'Rle' [package "IRanges"] with 5 slots
  .. .. .. ..@ values         : num [1:6] 0 3 5 2 0 3
  .. .. .. ..@ lengths        : int [1:6] 20 10 10 20 20 10
  .. .. .. ..@ elementMetadata: NULL
  .. .. .. ..@ elementType    : chr "ANYTHING"
  .. .. .. ..@ metadata       : list()
  .. ..$ chrB:Formal class 'Rle' [package "IRanges"] with 5 slots
  .. .. .. ..@ values         : num [1:6] 0 3 5 2 0 3
  .. .. .. ..@ lengths        : int [1:6] 20 10 10 20 20 10
  .. .. .. ..@ elementMetadata: NULL
  .. .. .. ..@ elementType    : chr "ANYTHING"
  .. .. .. ..@ metadata       : list()
  ..@ elementMetadata: NULL
  ..@ elementType    : chr "ANYTHING"
  ..@ metadata       :List of 3
  .. ..$ organism       : NULL
  .. ..$ provider       : NULL
  .. ..$ providerVersion: NULL
Warning messages:
1: In f(init, x[[i]]) :
  longer object length is not a multiple of shorter object length
2: In f(init, x[[i]]) :
  longer object length is not a multiple of shorter object length


Finally: Why did you call it 'map' and Martin called it 'reduce'?
Actually, I think, 'map' is correct for the two-argument case. See the
following use-case here for an example of reducing with a map function.
Assuming that 'lanes' is a list of 'AlignedRead' objects, I expect to
get the coverage, summed over all these lanes (which are from the same
condition), by writing something like

coverageOfAllLanes <-
   Reduce(
      function( gd1, gd2 ) Map( "+", gd1, gd2 ),
      lapply( lanes, coverage, width = chromLengths ) )


Cheers
  Simon


_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

Re: [Bioc-sig-seq] `+` for GenomeData and coverage from several lanes

Reply via email to