Re: [Bioc-devel] cbind SummarizedExperiments containing a DNAStringSet not working
Thanks for looking into this!

Maarten

On Mon, Apr 3, 2017 at 7:00 PM, Hervé Pagès wrote:

> Hi Maarten,
>
> identical() is not reliable on DNAStringSet objects or other objects
> that contain external pointers, as it can return false negatives as well
> as false positives. We'll fix the "cbind" and "rbind" methods for
> SummarizedExperiment to work around this problem.
>
> Thanks for the report.
>
> H.
Re: [Bioc-devel] cbind SummarizedExperiments containing a DNAStringSet not working
Hi Maarten,

identical() is not reliable on DNAStringSet objects or other objects
that contain external pointers, as it can return false negatives as well
as false positives. We'll fix the "cbind" and "rbind" methods for
SummarizedExperiment to work around this problem.

Thanks for the report.

H.

On 04/03/2017 12:58 AM, Maarten van Iterson wrote:

> Dear list,
>
> Combining SummarizedExperiment objects containing a DNAStringSet in the
> rowData seems not to work properly. [...]
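As a sketch of the kind of workaround mentioned above, sequences can be compared by content instead of with identical(). This is only an illustration, not the actual change made to SummarizedExperiment; `same_sequences` is a hypothetical helper name, and the example assumes Biostrings is installed.

```r
library(Biostrings)

# Hypothetical helper (not part of Biostrings or SummarizedExperiment):
# compare two DNAStringSet objects by their content. as.character() turns
# each set into a plain character vector, so identical() then compares
# the sequences (and any names) rather than the internal pointers.
same_sequences <- function(x, y) {
  identical(as.character(x), as.character(y))
}

s1 <- DNAStringSet("GACTC")
s2 <- DNAStringSet("GAATG")
s3 <- DNAStringSet("GACTC")

differ <- same_sequences(s1, s2)  # sequences differ
match  <- same_sequences(s1, s3)  # sequences are the same
```

Converting to character copies the sequence data, so for very large sets an element-wise check such as `length(x) == length(y) && all(x == y)` may be preferable.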
[Bioc-devel] cbind SummarizedExperiments containing a DNAStringSet not working
Dear list,

Combining SummarizedExperiment objects containing a DNAStringSet in the
rowData seems not to work properly. If I cbind two SummarizedExperiment
objects, which I know are identical, an error is reported:

Error in FUN(X[[i]], ...) (from #2) :
  column(s) 'sourceSeq' in ‘mcols’ are duplicated and the data do not match

I think I traced the problem to `SummarizedExperiment:::.compare`, where
`identical` is used to compare DNAStringSets and does not behave as
expected. It should report the columns as identical, but it reports that
they are not!

Here is an example of the opposite failure (which was easier to
construct), showing that `identical` returns TRUE where it should return
FALSE.

> library(Biostrings)
> seq1 <- paste(DNA_BASES[sample(1:4,5,replace=T)], collapse="")
> seq2 <- paste(DNA_BASES[sample(1:4,5,replace=T)], collapse="")
>
> seq1
[1] "GACTC"
> seq2
[1] "GAATG"
>
> s1 <- DNAStringSet(seq1)
> s2 <- DNAStringSet(seq2)
>
> str(s1)
Formal class 'DNAStringSet' [package "Biostrings"] with 5 slots
  ..@ pool :Formal class 'SharedRaw_Pool' [package "XVector"] with 2 slots
  .. .. ..@ xp_list:List of 1
  .. .. .. ..$ :
  .. .. ..@ .link_to_cached_object_list:List of 1
  .. .. .. ..$ :
  ..@ ranges :Formal class 'GroupedIRanges' [package "XVector"] with 7 slots
  .. .. ..@ group : int 1
  .. .. ..@ start : int 1
  .. .. ..@ width : int 5
  .. .. ..@ NAMES : NULL
  .. .. ..@ elementType: chr "integer"
  .. .. ..@ elementMetadata: NULL
  .. .. ..@ metadata : list()
  ..@ elementType: chr "DNAString"
  ..@ elementMetadata: NULL
  ..@ metadata : list()
> str(s2)
Formal class 'DNAStringSet' [package "Biostrings"] with 5 slots
  ..@ pool :Formal class 'SharedRaw_Pool' [package "XVector"] with 2 slots
  .. .. ..@ xp_list:List of 1
  .. .. .. ..$ :
  .. .. ..@ .link_to_cached_object_list:List of 1
  .. .. .. ..$ :
  ..@ ranges :Formal class 'GroupedIRanges' [package "XVector"] with 7 slots
  .. .. ..@ group : int 1
  .. .. ..@ start : int 1
  .. .. ..@ width : int 5
  .. .. ..@ NAMES : NULL
  .. .. ..@ elementType: chr "integer"
  .. .. ..@ elementMetadata: NULL
  .. .. ..@ metadata : list()
  ..@ elementType: chr "DNAString"
  ..@ elementMetadata: NULL
  ..@ metadata : list()
>
> identical(seq1, seq2)
[1] FALSE
> identical(s1, s2)
[1] TRUE
> seq1 == seq2
[1] FALSE
> s1 == s2
[1] FALSE
>
> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.2 LTS

locale:
 [1] LC_CTYPE=en_US.utf8       LC_NUMERIC=C
 [3] LC_TIME=en_US.utf8        LC_COLLATE=en_US.utf8
 [5] LC_MONETARY=en_US.utf8    LC_MESSAGES=en_US.utf8
 [7] LC_PAPER=en_US.utf8       LC_NAME=C
 [9] LC_ADDRESS=C              LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] Biostrings_2.42.1          XVector_0.14.1
 [3] BBMRIomics_1.0.3           SummarizedExperiment_1.4.0
 [5] Biobase_2.34.0             GenomicRanges_1.26.4
 [7] GenomeInfoDb_1.10.3        IRanges_2.8.2
 [9] S4Vectors_0.12.2           BiocGenerics_0.20.0

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.10             AnnotationDbi_1.36.2     hms_0.3
 [4] GenomicAlignments_1.10.1 zlibbioc_1.20.0          BiocParallel_1.8.1
 [7] BSgenome_1.42.0          lattice_0.20-35          R6_2.2.0
[10] httr_1.2.1               tools_3.3.2              grid_3.3.2
[13] DBI_0.6                  assertthat_0.1           digest_0.6.12
[16] tibble_1.2               Matrix_1.2-8             readr_1.1.0
[19] rtracklayer_1.34.2       bitops_1.0-6             biomaRt_2.30.0
[22] RCurl_1.95-4.8           memoise_1.0.0            RSQLite_1.1-2
[25] compiler_3.3.2           GenomicFeatures_1.26.3   Rsamtools_1.26.1
[28] XML_3.98-1.5             jsonlite_1.3             VariantAnnotation_1.20.3
>

I don't completely understand why `identical` is not working properly.
Is it comparing the environment address? In the above example the
addresses are the same although the sequences are not. In my case, did
the two SummarizedExperiments contain the same DNAStringSets but with
different environment addresses?

Regards,
Maarten

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
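On the question of whether `identical` compares addresses: for reference-like objects such as environments, base R's `identical` checks whether both arguments are the same object, not whether their contents match. The base-R sketch below shows that behaviour with plain environments. It is only an analogy; that the pointer and environment slots of a DNAStringSet behave in a comparable way is an assumption here, not something this example demonstrates.

```r
# Two separate environments holding the same value.
e1 <- new.env()
e2 <- new.env()
assign("seq", "GACTC", envir = e1)
assign("seq", "GACTC", envir = e2)

# Same content, different objects: identical() compares the references,
# so this is a "false negative" from a content point of view.
same_ref_distinct <- identical(e1, e2)

# Comparing the contents instead gives the content-based answer.
same_content <- identical(as.list(e1), as.list(e2))

# One environment reached through two names: identical() only sees that
# both names point at the same object.
e3 <- e1
same_ref_alias <- identical(e1, e3)

print(c(distinct = same_ref_distinct,
        content  = same_content,
        alias    = same_ref_alias))
```

The printed values should be FALSE, TRUE and TRUE respectively, which is why a content-based comparison is needed when the objects being compared hold references rather than plain data.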