Re: [Bioc-sig-seq] as.data.frame on GRanges object with DNAStringSet in values

Michael Lawrence Thu, 29 Sep 2011 14:18:08 -0700

I saw that all coercions to atomic vectors from AtomicList are now
deprecated. You had proposed deprecating as.vector(), because it should not
unlist, and I agreed. Really as.vector() should return an ordinary R list.
However, as.character(), as.numeric(), etc., in base R will unlist. I'd like
to keep consistency with base R. Do we really need to deprecate those, as
well?
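
A minimal base R sketch of the behavior referred to above, using an ordinary
list rather than an AtomicList (the variable names are illustrative only).
In base R this flattening applies when every element has length one; with
longer elements as.numeric() signals an error instead.

```r
# Ordinary base R list whose elements all have length one
x <- list(1L, 2L, 3L)

# as.vector() leaves the list alone: same length, still a list
v <- as.vector(x)

# as.numeric() and as.character() return atomic vectors
num <- as.numeric(x)
chr <- as.character(x)

# With longer elements, as.numeric() refuses to coerce the list
res <- tryCatch(as.numeric(list(1:4, 0:-2)),
                error = function(e) "error")
```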


Michael

2011/6/15 Michael Lawrence <micha...@gene.com>

>
>
> 2011/6/15 Hervé Pagès <hpa...@fhcrc.org>
>
>> On 11-06-15 03:38 PM, Michael Lawrence wrote:
>>
>>>
>>>
>>> 2011/6/15 Hervé Pagès <hpa...@fhcrc.org>
>>>
>>>
>>>    Hi Michael, Janet,
>>>
>>>    I just added an "as.vector" method for XStringSet objects to
>>>    Biostrings 2.21.6:
>>>
>>>     > library(Biostrings)
>>>     > x <- DNAStringSet(c("aaatg", "gt"))
>>>     > as.vector(x)
>>>      [1] "AAATG" "GT"
>>>
>>>    But that doesn't solve Janet's problem:
>>>
>>>     > df <- DataFrame(id=c("ID1", "ID2"), seqs=x)
>>>     > df
>>>      DataFrame with 2 rows and 2 columns
>>>                 id           seqs
>>>        <character> <DNAStringSet>
>>>      1          ID1          AAATG
>>>      2          ID2             GT
>>>     > as.data.frame(df)
>>>
>>>      Error in as.data.frame.default(y, optional = TRUE, ...) :
>>>        cannot coerce class 'structure("DNAStringSet", package =
>>>    "Biostrings")' into a data.frame
>>>
>>>    Michael?
>>>
>>>
>>> Well, sorry for that. I just added a coercion from Vector to data.frame
>>> through as.vector, so this works.
>>>
>>
>> Thanks!
>>
>>
>>> But someone might add a coercion from
>>> List to data.frame that would treat the elements as columns. Would this
>>> make sense?
>>>
>>
>> Hard to tell. Maybe sometimes it would make sense, but sometimes it
>> definitely does not (e.g. DNAStringSet).
>>
>>
>>> AtomicList to data.frame does something even stranger: it
>>> creates a two column data frame with the unlisted values and
>>> names/indices rep'd out as a factor. Actually, that's kind of cool,
>>> since usually one does not have a list with equal element lengths, but
>>> it's somewhat unintuitive. But why does it apply only to AtomicList?
>>>
>>
>> Glad you brought this up.
>>
>> For the record, "as.vector" also unrolls an AtomicList:
>>
>>  > as.vector(IntegerList(1:4, 0:-2))
>>  [1]  1  2  3  4  0 -1 -2
>>
>> IMO, we should not do things like that. Because:
>>
>>  1) The same can be achieved with unlist():
>>
>>    > unlist(IntegerList(1:4, 0:-2))
>>    [1]  1  2  3  4  0 -1 -2
>>
>>  2) It's totally unintuitive to use as.vector for unlisting
>>     a list (as.vector on a standard list does not do that).
>>
>>  3) There is a strong expectation that as.vector() will preserve
>>     the length of its input.
>>
>> So I propose to deprecate those "as.vector" and "as.data.frame"
>> methods for AtomicList objects.
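
The three points above can be checked against an ordinary base R list. This
is only a sketch with a plain list standing in for an IntegerList, so it
shows the base R behavior being used as the reference, not the IRanges
methods themselves.

```r
# Plain list with the same contents as the IntegerList example
x <- list(1:4, 0:-2)

# as.vector() on a standard list does not unlist; the length is preserved
v <- as.vector(x)

# unlist() is the call that flattens the elements into one vector
u <- unlist(x)
```

Here length(v) equals length(x), while u has one entry per underlying value.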
>>
>>
> Sounds good to me. In fact, the stack method on List is almost identical to
> as.data.frame on AtomicList (and the stack method actually makes sense). You
> could make as.vector return an ordinary list, since list is a vector.
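
For comparison, a small sketch of what base R's stack() does with a named
ordinary list: it returns a two-column data frame holding the unlisted
values and the element names repeated as a factor. The list and its names
are made up for illustration, and the stack method on List is assumed, not
shown, to behave analogously.

```r
# Named plain list with elements of unequal length
x <- list(a = 1:4, b = 0:-2)

# stack() returns a data frame with columns "values" and "ind"
df <- utils::stack(x)
```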
>
>
>> H.
>>
>>
>>> Anyway, given the special correspondence between an XStringSet and a
>>> character vector, we could always add an as.data.frame method for
>>> XStringSet, just to make sure stuff behaves as expected.
>>>
>>>    Thanks,
>>>    H.
>>>
>>>
>>>     > sessionInfo()
>>>    R version 2.14.0 Under development (unstable) (2011-05-30 r56024)
>>>    Platform: x86_64-unknown-linux-gnu (64-bit)
>>>
>>>    locale:
>>>      [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C
>>>      [3] LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8
>>>      [5] LC_MONETARY=en_CA.UTF-8    LC_MESSAGES=en_CA.UTF-8
>>>      [7] LC_PAPER=C                 LC_NAME=C
>>>      [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>    [11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
>>>
>>>
>>>    attached base packages:
>>>    [1] stats     graphics  grDevices utils     datasets  methods   base
>>>
>>>    other attached packages:
>>>    [1] Biostrings_2.21.6 IRanges_1.11.10
>>>
>>>
>>>
>>>    On 11-06-15 12:49 PM, Janet Young wrote:
>>>
>>>        yes - as.character seems a good choice, I think
>>>
>>>        thanks,
>>>
>>>        Janet
>>>
>>>        On Jun 15, 2011, at 12:46 PM, Michael Lawrence wrote:
>>>
>>>            So you would expect that the DNAStringSet is converted to a
>>>            character vector? DNAStringSet (technically XStringSet) then
>>>            just needs an as.vector method that delegates to as.character.
>>>
>>>            Michael
>>>
>>>
>>>            On Wed, Jun 15, 2011 at 12:37 PM, Janet Young
>>>            <jayo...@fhcrc.org> wrote:
>>>
>>>            Hi there,
>>>
>>>            I'm trying to use as.data.frame on a GRanges object. On
>>>            regular GRanges objects it works fine but I have some
>>>            objects that contain a DNAStringSet in the values column,
>>>            which isn't built in to the as.data.frame method.  Is it
>>>            possible to add the ability to coerce the DNAStringSet too,
>>>            please?
>>>
>>>            Here's some code that demonstrates the issue:
>>>
>>>            ################
>>>            library(GenomicRanges)
>>>            library(Biostrings)
>>>
>>>            gr1 <- GRanges(seqnames=rep("chr1",3),
>>>                           ranges=IRanges(start=c(1,101,201),width=50),
>>>                           strand=c("+","-","+"),
>>>                           genenames=c("seq1","seq2","seq3"))
>>>
>>>            as.data.frame(gr1)
>>>            # works
>>>
>>>            gr2 <- gr1
>>>            values(gr2)[,"myseqs"] <- DNAStringSet(c("AACGTG",
>>>            "ACGGTGGTGTT", "GAGGCTG"))
>>>
>>>            as.data.frame(gr2)
>>>            # Error in as.data.frame.default(y, optional = TRUE, ...) :
>>>            #   cannot coerce class 'structure("DNAStringSet", package =
>>>            #   "Biostrings")' into a data.frame
>>>            ################
>>>
>>>            and here's the sessionInfo() output:
>>>
>>>            R version 2.13.0 (2011-04-13)
>>>            Platform: i386-apple-darwin9.8.0/i386 (32-bit)
>>>
>>>            locale:
>>>            [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>>
>>>            attached base packages:
>>>            [1] stats     graphics  grDevices utils     datasets  methods   base
>>>
>>>            other attached packages:
>>>            [1] Biostrings_2.20.1   GenomicRanges_1.4.6 IRanges_1.10.4
>>>
>>>            ################
>>>
>>>
>>>            You might wonder why I'm storing sequences in the GRanges
>>>            values - in my real data they're sequencing reads that have
>>>            mapped back to that region, but I'm still curious to
>>>            maintain the sequence itself (for the moment) because it's
>>>            not always identical to the underlying genomic sequence of
>>>            that region (investigating mapping issues).
>>>
>>>            (and my desire to use as.data.frame relates to a suggestion
>>>            from Herve to let me work around some issues with the
>>>            identical function)
>>>
>>>            thanks,
>>>
>>>            Janet
>>>
>>>            _______________________________________________
>>>            Bioc-sig-sequencing mailing list
>>>            Bioc-sig-sequencing@r-project.org
>>>
>>>            https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>>
>>>
>>>
>>>
>>>    --
>>>    Hervé Pagès
>>>
>>>    Program in Computational Biology
>>>    Division of Public Health Sciences
>>>    Fred Hutchinson Cancer Research Center
>>>    1100 Fairview Ave. N, M1-B514
>>>    P.O. Box 19024
>>>    Seattle, WA 98109-1024
>>>
>>>    E-mail: hpa...@fhcrc.org
>>>
>>>    Phone:  (206) 667-5791
>>>    Fax:    (206) 667-1319
>>>
>>>
>>>
>>
>
>

_______________________________________________
Bioc-sig-sequencing mailing list
Bioc-sig-sequencing@r-project.org
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing