I've recently taken over maintenance and development of the chipseq package and plan a lot of refactoring, including some new formal classes for ChIP-seq data. I'm wondering, though, whether 'chipseq' is the best place for those classes, given that the package also includes some specific analytical methods. That's not a huge deal, but might GenomicRanges be a better home for these high-level structures?
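
To make the question a bit more concrete, a high-level structure of this kind might look roughly like the sketch below. It is only a minimal base-R illustration: the class name `SeqSampleSet` and its slots are hypothetical, plain data.frames stand in for real range objects, and it is not the design of any existing class in chipseq or GenomicRanges.

```r
library(methods)

# Hypothetical container (not an existing Bioconductor class):
# per-sample ranges, sample annotation, and experiment-level metadata.
setClass("SeqSampleSet",
  slots = c(
    ranges    = "list",        # named list of data.frames (chrom, start, end)
    phenoData = "data.frame",  # one row per sample, row names = sample names
    metadata  = "list"         # free-form experiment information
  ),
  validity = function(object) {
    # Sample names must agree between the ranges and the sample annotation.
    if (!identical(names(object@ranges), rownames(object@phenoData)))
      return("names(ranges) must match rownames(phenoData)")
    TRUE
  }
)

# Toy data: two IP reads and one control read.
ip  <- data.frame(chrom = "chr1", start = c(100L, 500L), end = c(135L, 535L))
ctl <- data.frame(chrom = "chr1", start = 300L, end = 335L)

x <- new("SeqSampleSet",
  ranges    = list(ip = ip, control = ctl),
  phenoData = data.frame(type = c("IP", "control"),
                         row.names = c("ip", "control")),
  metadata  = list(experiment = "toy example")
)

# Number of ranges per sample.
sapply(x@ranges, nrow)
```

Keeping the container separate from any analysis code is the point of the sketch: a class like this could live in an infrastructure package while peak-calling methods stay in chipseq.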

On Fri, Apr 2, 2010 at 8:31 AM, Vincent Carey <[email protected]> wrote:

> On Fri, Apr 2, 2010 at 11:21 AM, Michael Lawrence <[email protected]> wrote:
>
>> On Fri, Apr 2, 2010 at 7:55 AM, Vincent Carey <[email protected]> wrote:
>>
>>> To get a bit more concrete regarding these notions, the leeBamViews
>>> package is in the experimental data archive, a VERY rudimentary
>>> illustration of a workflow rooted in BAM archive files through region
>>> specification and read counting. For the very latest checkin, after
>>> running
>>>
>>> example(bs1)
>>>
>>> we have an ad hoc tabulation of read counts:
>>>
>>> bs1> tabulateReads(bs1, "+")
>>>          intv1  intv2
>>> start   861250 863000
>>> end     862750 864000
>>> isowt.5   3673   2692
>>> isowt.6   3770   2650
>>> rlp.5     1532   1045
>>> rlp.6     1567   1139
>>> ssr.1     4304   3052
>>> ssr.2     4627   3381
>>> xrn.1     2841   1693
>>> xrn.2     3477   2197
>>>
>>> or, by setting as.GRanges, a GRanges-based representation
>>>
>>> > tabulateReads(bs1, "+", as.GRanges=TRUE)
>>> GRanges with 2 ranges and 9 elementMetadata values
>>>     seqnames           ranges strand |        name   isowt.5   isowt.6
>>>        <Rle>        <IRanges>  <Rle> | <character> <integer> <integer>
>>> [1]  Scchr13 [861250, 862750]      + |       intv1      3673      3770
>>> [2]  Scchr13 [863000, 864000]      + |       intv2      2692      2650
>>>         rlp.5     rlp.6     ssr.1     ssr.2     xrn.1     xrn.2
>>>     <integer> <integer> <integer> <integer> <integer> <integer>
>>> [1]      1532      1567      4304      4627      2841      3477
>>> [2]      1045      1139      3052      3381      1693      2197
>>>
>>> seqlengths
>>>  Scchr13
>>>       NA
>>> > tabulateReads(bs1, "+", as.GRanges=TRUE) -> OO
>>> > metadata(OO)
>>> list()
>>>
>>> It seems that we would want more structure in a metadata component to
>>> get closer to the values of ExpressionSet discipline. We would also
>>> want some accommodation of this kind of representation in the
>>> downstream packages like edgeR, DEseq.
>>
>> The actual 'metadata' slot was meant to be general, in order to
>> accommodate all needs. If a particular type of data requires a certain
>> structure, then additional formal classes may be necessary. For example,
>> gene expression RNA-seq may want a featureData equivalent annotating each
>> transcript, whereas with ChIP-seq data, that sort of structure would make
>> less sense, short of some additional assumptions.
>
> I agree completely. Our task is to think/experiment about how to suitably
> specialize these structures for most effective downstream use. Reuse by
> multiple downstream toolchains would be great.
>
>> Michael
>>
>>> > sessionInfo()
>>> R version 2.11.0 Under development (unstable) (2010-03-24 r51388)
>>> x86_64-apple-darwin10.2.0
>>>
>>> locale:
>>> [1] C
>>>
>>> attached base packages:
>>> [1] stats     graphics  grDevices datasets  tools     utils     methods
>>> [8] base
>>>
>>> other attached packages:
>>>  [1] leeBamViews_0.99.3  BSgenome_1.15.18    Rsamtools_0.2.1
>>>  [4] Biostrings_2.15.25  GenomicRanges_0.1.3 IRanges_1.5.74
>>>  [7] Biobase_2.7.5       weaver_1.13.0       codetools_0.2-2
>>> [10] digest_0.4.1
>>>
>>> On Thu, Apr 1, 2010 at 10:15 AM, Martin Morgan <[email protected]> wrote:
>>>
>>>> On 03/31/2010 04:06 AM, Michael Lawrence wrote:
>>>> > On Wed, Mar 31, 2010 at 3:55 AM, David Rossell <[email protected]> wrote:
>>>> >
>>>> >> Following a recent thread, I also have found convenient to store
>>>> >> nextgen data as RangedData instead of ShortRead objects. They
>>>> >> require far less memory and make feasible working with several
>>>> >> samples at the same time (in my 8Gb RAM desktop I can load 2
>>>> >> ShortRead objects at the most, with RangedData I haven't struck the
>>>> >> upper limit yet).
>>>> >>
>>>> >> I am thinking about taking this idea a step forward: RangedDataList
>>>> >> allows storing info from several samples (e.g. IP and control) in a
>>>> >> single object. The only problem is RangedDataList does not store
>>>> >> information about the samples, e.g. the phenoData we're used to in
>>>> >> ExpressionSet objects. My idea is to define something like a
>>>> >> "SequenceSet" class, which would contain a RangedDataList with the
>>>> >> ranges, a phenoData with sample information, and possibly also
>>>> >> information about the experiment (e.g. with the MIAME analog for
>>>> >> sequencing, MIASEQE).
>>>> >>
>>>> >> The thing is I don't want to re-invent the wheel. I haven't seen
>>>> >> that this is implemented yet, but is someone working on it? Any
>>>> >> criticism/ideas?
>>>> >
>>>> > RangedDataList already supports this. See the 'elementMetadata' and
>>>> > 'metadata' slots in the Sequence class.
>>>>
>>>> Hi David et al.,
>>>>
>>>> I've also found the elementMetadata slot excellent for this purpose.
>>>> The ShortRead data objects retain sequence and quality information, this
>>>> information is often not needed after a certain point in the analysis.
>>>>
>>>> Wanted to point to the GenomicRanges package in Bioc-devel, which has a
>>>> GRanges class that is more fastidious about strand information (maybe a
>>>> plus?) and conforms more to an 'I am a rectangular data structure' world
>>>> view. Also the GappedAlignments class for efficiently representing large
>>>> numbers of reads.
>>>>
>>>> Martin
>>>>
>>>> > Michael
>>>> >
>>>> >> Best,
>>>> >>
>>>> >> David
>>>> >>
>>>> >> --
>>>> >> David Rossell, PhD
>>>> >> Manager, Bioinformatics and Biostatistics unit
>>>> >> IRB Barcelona
>>>> >> Tel (+34) 93 402 0217
>>>> >> Fax (+34) 93 402 0257
>>>> >> http://www.irbbarcelona.org/bioinformatics
>>>> >>
>>>> >> _______________________________________________
>>>> >> Bioc-sig-sequencing mailing list
>>>> >> [email protected]
>>>> >> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>>>
>>>> --
>>>> Martin Morgan
>>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>>> 1100 Fairview Ave. N.
>>>> PO Box 19024 Seattle, WA 98109
>>>>
>>>> Location: Arnold Building M1 B861
>>>> Phone: (206) 667-2793
