Simon,
As you noted, there are some discrepancies between Map and gdreduce. I'll explain what I have done so far and start by taking a step back. The IRanges package contains a number of S4 class definitions starting with Sequence that mimic the S3 conventions developed for vector in base R. (The Sequence man page in IRanges contains all the methods that have been developed for Sequence and its subclasses. Most of these methods are for "generics" originally developed for S3 vector objects.) Given the situation that you were encountering, I added Reduce, Map, Find, Filter, Position, and mapply methods for the Sequence class. Given that the names attributes for S3 vectors generally serve as labels rather than true metadata attributes, functions like Map and mapply don't use the names attributes to align elements of vectors (including lists) before operating on them. Here is an example using S3 list objects:

> a <- list(chr1 = 1:10, chr2 = 11:20, chr3 = 21:30)
> b <- list(chr1 = -(1:10), chr3 = -(21:30), chr2 = -(11:20))
> Map("+", a, b)
$chr1
[1] 0 0 0 0 0 0 0 0 0 0

$chr2
[1] -10 -10 -10 -10 -10 -10 -10 -10 -10 -10

$chr3
[1] 10 10 10 10 10 10 10 10 10 10

> as.data.frame(a) + as.data.frame(b)
  chr1 chr2 chr3
1     0  -10   10
2     0  -10   10
3     0  -10   10
4     0  -10   10
5     0  -10   10
6     0  -10   10
7     0  -10   10
8     0  -10   10
9     0  -10   10
10    0  -10   10
> sessionInfo()
R version 2.10.0 Under development (unstable) (2009-06-28 r48863)
i386-apple-darwin9.7.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

If the elements of a and b were aligned by the names attributes, then all of the values would have been 0. There are pros and cons to making Sequence treat names differently than vector. On the pro side, promoting names to a true metadata attribute would avoid the problem you describe. On the con side, this complicates the user's software design because they can no longer simply implement code for Sequence and its subclasses as they would for vector and expect to obtain the same conceptual result.

The GenomeData class is an interesting beast because it was conceptually designed as a list with some metadata laid on top to indicate what genome the data are related to. There is currently nothing in the class that guarantees the names for the list correspond to the names of chromosomes for the particular genome in question or even if the list elements have names at all:

> GenomeData(list(1:10, 1:100))
A GenomeData instance
chromosomes(2): 1 2
> names(GenomeData(list(1:10, 1:100)))
NULL
> validObject(GenomeData(list(1:10, 1:100)))
[1] TRUE

Therefore methods for GenomeData that assume the elements are named are not guaranteed to work for all valid GenomeData objects. I could change this by requiring GenomeData to have element names, but I need to talk to the class designers to see if this is what they intended and if it would break workflows they have designed.

Based on what I have been seeing so far, it appears that GenomeData and RangedData are dually serving in a role similar to what AnnotatedDataFrame has in the microarray space. We just need to find the right mix of flexibility (so people can stores different data types) and restrictions (so we can define methods that perform non-trivial operations) for these two classes so the users don't have to spend needless amounts of time creating clever workarounds if we are too restrictive or reinventing the wheel if we are not restrictive enough.


Patrick



Simon Anders wrote:
Hi Patrick

Patrick Aboyoun wrote:
I worked on beefing up the low-level IRanges Sequence class, which has
around 80 subclasses including GenomeData, GenomeDataList, RleList, to
include functional programming tools like Reduce, Filter, Find, Map,
[...]

Thanks, this looks quite useful.


I've just played with it and found one issue with chromsome names.

It seems that you Map function finds corresponding Rle vectors in two
GenomeData objects by looking at the list indices but not at the names.

Here you have two GenomeData objects which have the same values but
swapped chromosome names:

gd1 <- GenomeData( list(
    chrA = Rle( rep( c(0,3,0), each=20 ) ),
    chrB = Rle( rep( c(0,2,0), each=30 ) ) ) )

gd2 <- GenomeData( list(
    chrB = Rle( rep( c(0,3,0), each=20 ) ),
    chrA = Rle( rep( c(0,2,0), each=30 ) ) ) )

If I add them up, I do not add chromosome A to chromosome A, but the
first chromsome to the first one:

Map( "+", gd1, gd2 )
$chrA
  'numeric' Rle instance of length 60 with 3 runs
  Lengths:  20 20 20
  Values :  0 6 0

$chrB
  'numeric' Rle instance of length 90 with 3 runs
  Lengths:  30 30 30
  Values :  0 4 0

This is of course consistent with what base::Map does but violated the
semantics of the GenomeData object. Imagine you use 'coverage' on two
sets of aligned reads, and for some reasons, one chromosome never
appears in the first set, and another one never appears in the second
one. You will have two GenomeData objects with the same number of
chromosomes, and their sum will be completely wrong.


Martin's gdreduce function, as I've just noticed, does not have this issue:

str( gdreduce( "+", gd1, gd2 ) )
Formal class 'GenomeData' [package "BSgenome"] with 4 slots
  ..@ listData       :List of 2
  .. ..$ chrA:Formal class 'Rle' [package "IRanges"] with 5 slots
  .. .. .. ..@ values         : num [1:6] 0 3 5 2 0 3
  .. .. .. ..@ lengths        : int [1:6] 20 10 10 20 20 10
  .. .. .. ..@ elementMetadata: NULL
  .. .. .. ..@ elementType    : chr "ANYTHING"
  .. .. .. ..@ metadata       : list()
  .. ..$ chrB:Formal class 'Rle' [package "IRanges"] with 5 slots
  .. .. .. ..@ values         : num [1:6] 0 3 5 2 0 3
  .. .. .. ..@ lengths        : int [1:6] 20 10 10 20 20 10
  .. .. .. ..@ elementMetadata: NULL
  .. .. .. ..@ elementType    : chr "ANYTHING"
  .. .. .. ..@ metadata       : list()
  ..@ elementMetadata: NULL
  ..@ elementType    : chr "ANYTHING"
  ..@ metadata       :List of 3
  .. ..$ organism       : NULL
  .. ..$ provider       : NULL
  .. ..$ providerVersion: NULL
Warning messages:
1: In f(init, x[[i]]) :
  longer object length is not a multiple of shorter object length
2: In f(init, x[[i]]) :
  longer object length is not a multiple of shorter object length


Finally: Why did you call it 'map' and Martin called it 'reduce'?
Actually, I think, 'map' is correct for the two-argument case. See the
following use-case here for an example of reducing with a map function.
Assuming that 'lanes' is a list of 'AlignedRead' objects, I expect to
get the coverage, summed over all these lanes (which are from the same
condition), by writing something like

coverageOfAllLanes <-
   Reduce(
      function( gd1, gd2 ) Map( "+", gd1, gd2 ),
      lapply( lanes, coverage, width = chromLengths ) )


Cheers
  Simon


_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

Reply via email to