Re: [Bioc-sig-seq] ExpressionSet alikes for next-gen data

Michael Lawrence Fri, 02 Apr 2010 08:43:48 -0700

I've recently taken over the maintenance/development of the chipseq package
and have plans for a lot of refactoring, including some new formal classes
for ChIP-seq data. I'm wondering though if 'chipseq' is the best place,
given that it also includes some specific analytical methods. That's not a
huge deal, but might GenomicRanges be the place for these high-level
structures?
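
To make the kind of high-level structure under discussion concrete, here is a
minimal sketch in base R (only the 'methods' package), assuming a container
that ties a region-by-sample count matrix to per-region and per-sample tables.
The class name SeqCountSet, its slots, the helper subsetSamples() and the
'group' column are all hypothetical and belong to no existing package. The
counts and coordinates are the ones from the tabulateReads() output quoted
below.

```r
library(methods)

# Hypothetical class for illustration only; not part of any Bioconductor package.
setClass("SeqCountSet",
  representation(
    counts  = "matrix",      # read counts, regions in rows, samples in columns
    regions = "data.frame",  # one row per region: seqname, start, end, strand
    samples = "data.frame"   # one row per sample, playing the role of phenoData
  ),
  validity = function(object) {
    msg <- character()
    if (nrow(object@counts) != nrow(object@regions))
      msg <- c(msg, "counts must have one row per region")
    if (!identical(colnames(object@counts), rownames(object@samples)))
      msg <- c(msg, "column names of counts must match row names of samples")
    if (length(msg)) msg else TRUE
  })

sampleNames <- c("isowt.5", "isowt.6", "rlp.5", "rlp.6",
                 "ssr.1", "ssr.2", "xrn.1", "xrn.2")

# Counts copied from the tabulateReads(bs1, "+") output in this thread.
counts <- matrix(
  c(3673L, 3770L, 1532L, 1567L, 4304L, 4627L, 2841L, 3477L,
    2692L, 2650L, 1045L, 1139L, 3052L, 3381L, 1693L, 2197L),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("intv1", "intv2"), sampleNames))

regions <- data.frame(seqname = "Scchr13",
                      start = c(861250L, 863000L),
                      end = c(862750L, 864000L),
                      strand = "+",
                      row.names = c("intv1", "intv2"),
                      stringsAsFactors = FALSE)

# 'group' is an assumed annotation derived from the sample-name prefix.
samples <- data.frame(group = sub("\\.[0-9]+$", "", sampleNames),
                      row.names = sampleNames,
                      stringsAsFactors = FALSE)

cs <- new("SeqCountSet", counts = counts, regions = regions, samples = samples)

# Subset samples while keeping counts and sample annotation in sync.
subsetSamples <- function(x, keep) {
  new("SeqCountSet",
      counts  = x@counts[, keep, drop = FALSE],
      regions = x@regions,
      samples = x@samples[keep, , drop = FALSE])
}

wt <- subsetSamples(cs, cs@samples$group == "isowt")
colSums(wt@counts)
```

The validity function is the point of the sketch: it rejects objects whose
sample table and count columns disagree, which is the sort of discipline
ExpressionSet enforces for microarray data.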


On Fri, Apr 2, 2010 at 8:31 AM, Vincent Carey <[email protected]> wrote:

>
>
> On Fri, Apr 2, 2010 at 11:21 AM, Michael Lawrence <
> [email protected]> wrote:
>
>>
>>
>> On Fri, Apr 2, 2010 at 7:55 AM, Vincent Carey <[email protected]
>> > wrote:
>>
>>> To get a bit more concrete regarding these notions, the leeBamViews
>>> package is in the experimental data archive, a VERY rudimentary illustration
>>> of a workflow rooted in BAM archive files through region specification and
>>> read counting.  For the very latest checkin, after running
>>>
>>> example(bs1)
>>>
>>> we have an ad hoc tabulation of read counts:
>>>
>>> bs1> tabulateReads(bs1, "+")
>>>          intv1  intv2
>>> start   861250 863000
>>> end     862750 864000
>>> isowt.5   3673   2692
>>> isowt.6   3770   2650
>>> rlp.5     1532   1045
>>> rlp.6     1567   1139
>>> ssr.1     4304   3052
>>> ssr.2     4627   3381
>>> xrn.1     2841   1693
>>> xrn.2     3477   2197
>>>
>>> or, by setting as.GRanges=TRUE, a GRanges-based representation:
>>>
>>> > tabulateReads(bs1, "+", as.GRanges=TRUE)
>>> GRanges with 2 ranges and 9 elementMetadata values
>>>     seqnames           ranges strand |        name   isowt.5   isowt.6
>>>        <Rle>        <IRanges>  <Rle> | <character> <integer> <integer>
>>> [1]  Scchr13 [861250, 862750]      + |       intv1      3673      3770
>>> [2]  Scchr13 [863000, 864000]      + |       intv2      2692      2650
>>>         rlp.5     rlp.6     ssr.1     ssr.2     xrn.1     xrn.2
>>>     <integer> <integer> <integer> <integer> <integer> <integer>
>>> [1]      1532      1567      4304      4627      2841      3477
>>> [2]      1045      1139      3052      3381      1693      2197
>>>
>>> seqlengths
>>> Scchr13
>>>      NA
>>> > tabulateReads(bs1, "+", as.GRanges=TRUE) -> OO
>>> > metadata(OO)
>>> list()
>>>
>>> It seems that we would want more structure in the metadata component to get
>>> closer to the discipline that ExpressionSet provides.  We would also want
>>> downstream packages like edgeR and DESeq to accommodate this kind of
>>> representation.
>>>
>>>
>> The actual 'metadata' slot was meant to be general, in order to
>> accommodate all needs. If a particular type of data requires a certain
>> structure, then additional formal classes may be necessary.  For example,
>> gene expression RNA-seq may want a featureData equivalent annotating each
>> transcript, whereas with ChIP-seq data, that sort of structure would make
>> less sense, short of some additional assumptions.
>>
>
> I agree completely.  Our task is to think/experiment about how to suitably
> specialize these structures for most effective downstream use.  Reuse by
> multiple downstream toolchains would be great.
>
>

>> Michael
>>
>>> > sessionInfo()
>>> R version 2.11.0 Under development (unstable) (2010-03-24 r51388)
>>> x86_64-apple-darwin10.2.0
>>>
>>> locale:
>>> [1] C
>>>
>>> attached base packages:
>>> [1] stats     graphics  grDevices datasets  tools     utils     methods
>>> [8] base
>>>
>>> other attached packages:
>>>  [1] leeBamViews_0.99.3  BSgenome_1.15.18    Rsamtools_0.2.1
>>>  [4] Biostrings_2.15.25  GenomicRanges_0.1.3 IRanges_1.5.74
>>>  [7] Biobase_2.7.5       weaver_1.13.0       codetools_0.2-2
>>> [10] digest_0.4.1
>>>
>>>
>>> On Thu, Apr 1, 2010 at 10:15 AM, Martin Morgan <[email protected]> wrote:
>>>
>>>> On 03/31/2010 04:06 AM, Michael Lawrence wrote:
>>>> > On Wed, Mar 31, 2010 at 3:55 AM, David Rossell <
>>>> > [email protected]> wrote:
>>>> >
>>>> >> Following a recent thread, I have also found it convenient to store
>>>> >> nextgen data as RangedData instead of ShortRead objects. They require
>>>> >> far less memory and make it feasible to work with several samples at
>>>> >> the same time (on my 8 GB RAM desktop I can load 2 ShortRead objects
>>>> >> at most; with RangedData I haven't hit the upper limit yet).
>>>> >>
>>>> >> I am thinking about taking this idea a step further: RangedDataList
>>>> >> allows storing info from several samples (e.g. IP and control) in a
>>>> >> single object. The only problem is that RangedDataList does not store
>>>> >> information about the samples, e.g. the phenoData we're used to in
>>>> >> ExpressionSet objects. My idea is to define something like a
>>>> >> "SequenceSet" class, which would contain a RangedDataList with the
>>>> >> ranges, a phenoData with sample information, and possibly also
>>>> >> information about the experiment (e.g. with the MIAME analog for
>>>> >> sequencing, MINSEQE).
>>>> >>
>>>> >> The thing is I don't want to re-invent the wheel. I haven't seen this
>>>> >> implemented yet, but is someone working on it? Any criticism/ideas?
>>>> >>
>>>> >>
>>>> > RangedDataList already supports this. See the 'elementMetadata' and
>>>> > 'metadata' slots in the Sequence class.
>>>>
>>>> Hi David et al.,
>>>>
>>>> I've also found the elementMetadata slot excellent for this purpose.
>>>> The ShortRead data objects retain sequence and quality information, but
>>>> this information is often not needed after a certain point in the analysis.
>>>>
>>>> Wanted to point to the GenomicRanges package in Bioc-devel, which has a
>>>> GRanges class that is more fastidious about strand information (maybe a
>>>> plus?) and conforms more to an 'I am a rectangular data structure' world
>>>> view. Also the GappedAlignments class for efficiently representing large
>>>> numbers of reads.
>>>>
>>>> Martin
>>>>
>>>> >
>>>> > Michael
>>>> >
>>>> >
>>>> >
>>>> >> Best,
>>>> >>
>>>> >> David
>>>> >>
>>>> >> --
>>>> >> David Rossell, PhD
>>>> >> Manager, Bioinformatics and Biostatistics unit
>>>> >> IRB Barcelona
>>>> >> Tel (+34) 93 402 0217
>>>> >> Fax (+34) 93 402 0257
>>>> >> http://www.irbbarcelona.org/bioinformatics
>>>> >>
>>>> >> _______________________________________________
>>>> >> Bioc-sig-sequencing mailing list
>>>> >> [email protected]
>>>> >> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>>> >>
>>>> >
>>>>
>>>>
>>>> --
>>>> Martin Morgan
>>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>>> 1100 Fairview Ave. N.
>>>> PO Box 19024 Seattle, WA 98109
>>>>
>>>> Location: Arnold Building M1 B861
>>>> Phone: (206) 667-2793
>>>>
>>>
>>>
>>
>
