Hi, Marcus - thanks for the self-contained example. I probably should have included something like that too!
and Herve - thank you VERY much - that's helpful. I'm beginning to understand this, but I have a little way to go still - I'm amazed at the subtleties there are this stuff. I'm glad you explained the false negatives too: just this morning I saw one of those for the first time, and I would have spent a long time worrying about it if you hadn't explained... I'll go with your ugly workaround (nothing arong with ugly for a biologist like me). That has brought up another question (as.data.frame on GRanges), but I'll send a separate email about that, as it's a separate issue. Janet On Jun 15, 2011, at 10:45 AM, Hervé Pagès wrote: > Hi Janet, > > This is because identical() is not the appropriate tool when comparing > objects that contain external pointers. It will sometimes report that > different objects are identical, and sometimes report that identical > objects are different. > > With different objects: > > > library(Biostrings) > > x <- DNAString("AC") > > y <- DNAString("GG") > > identical(x, y) > [1] TRUE > > With "identical" objects: > > > z <- subseq(DNAString("TGG"), start=2) > > z > 2-letter "DNAString" instance > seq: GG > > identical(y, z) > [1] FALSE > > The 2nd one (false negative) is a fair one: even if 'y' and 'z' > represent the same object (conceptually), they actually have different > internal representations: > > > y@shared > SharedRaw of length 2 (data starting at address 0x4833bf0) > > y@offset > [1] 0 > > z@shared > SharedRaw of length 3 (data starting at address 0x4780350) > > z@offset > [1] 1 > > Since identical() performs low-level comparison, it will report that > the objects are different. > > For the 1st one (false positive), you should blame identical() for > reporting that 2 external pointers are *always* identical: > > > xp1 <- .Call("externalptr_new", PACKAGE="IRanges") > > xp2 <- .Call("externalptr_new", PACKAGE="IRanges") > > IRanges:::address(xp1) > [1] "0x2e20178" > > IRanges:::address(xp2) > [1] "0x2e20530" > > identical(xp1, xp2) > [1] TRUE > > Actually, the addresses shouldn't matter. What should matter is the > data that the 2 external pointers are pointing at. But identical() > doesn't try to "follow" the pointers in order to get to the data > they are pointing at. It just reports that 2 external pointers are > identical! Like here, where the 2 external pointers are pointing to > data that differ in *value*: > > > identical(x@shared@xp, y@shared@xp) > [1] TRUE > > One could try to make a case on the R-devel mailing list to change > the behavior of identical() on external pointers, but that wouldn't > solve the 2nd issue (false positives). This one could be addressed > by having a new generic e.g. equal() with methods for Vector objects > and their derivations that would know how to reliably compare 2 > objects. Most of the time, it would just do the same as identical(), > but for objects that contain external pointers (like DNAString, > DNAStringSet, PhredQuality, etc...), it would use implement a > specialized logic. > > Any suggestion on this would be appreciated. > > In the meantime, the ugly workaround that I often use is to apply > as.character on the objects I want to compare: > > > identical(as.character(x), as.character(y)) > [1] FALSE > > identical(as.character(y), as.character(z)) > [1] TRUE > > Cheers, > H. > > On 11-06-14 06:58 PM, Janet Young wrote: >> Hi there, >> >> I'm investigating a mistake I made when running BWA (a little embarrassing, >> but it happens). I had initially failed to use the -I flag on BWA to tell >> it to convert our Illumina ASCII-64 qualities appropriately. Now I'm taking >> a look to see how much differnce it makes in our downstream analysis. >> >> Anyway, that's not why I'm writing - while delving into this I've found >> something unexpected about the "identical" function, and I'd love to >> understand that better if possible. >> >> I used scanBam to read in my two BWA output files - with and without the >> "-I" option when I ran BWA (same input seqs, everything else the same). As I >> expected, with the -I option BWA converted the qualities (my >> lane1read2_bwaWithI_bamscan object), and in the other it didn't (my >> lane1read2_bwaNotI_bamscan object). That makes sense to me - here's how the >> quals look after reading BWA output in with scanBam. >> >> ### when I use the -I option with BWA: >>> head(lane1read2_bwaWithI_bamscan[[1]][["qual"]]) >> A PhredQuality instance of length 6 >> width seq >> [1] 50 HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHFHHHHFHFF >> [2] 36 GFFEEHHFHHHHHHHEFEEEHHHHHFHHHEEFEE@E >> [3] 50 ################################################## >> [4] 31 HHHHHHHHHHHHHHHHHHHHHHEHHHHFHFH >> [5] 24 HHHHHHHHHHHHHHDFHHHHHHHH >> [6] 50 FFFFCD=8BGEBHGHHHHEHGFGHHDHEHFHEHHHHGHHHHHHHHHGHHH >> >> ### when I don't use the -I option with BWA: >>> head(lane1read2_bwaNotI_bamscan[[1]][["qual"]]) >> A PhredQuality instance of length 6 >> width seq >> [1] 50 gggggggggggggggggggggggggggggggggggggggggeggggegee >> [2] 36 feeddggegggggggdedddgggggegggddedd_d >> [3] 50 BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB >> [4] 31 ggggggggggggggggggggggdggggegeg >> [5] 24 ggggggggggggggcegggggggg >> [6] 50 eeeebc\Wafdagfggggdgfefggcgdgegdggggfgggggggggfggg >> >> >> The part I don't understand is when I use identical to test whether those >> two qual objects are the same, it says they are identical: >> >>> identical( lane1read2_bwaWithI_bamscan[[1]][["qual"]], >>> lane1read2_bwaNotI_bamscan[[1]][["qual"]]) >> [1] TRUE >> >> In one sense they're the same (if you convert from Illumina qualities to >> normal Phred qualities), but how did identical know to do that? I might be >> misunderstanding identical - I thought that EVERYTHING about the object had >> to be identical. >> >> I want to understand this as it seems like a dangerous misunderstanding - I >> use identical a fair amount to make sure I'm not screwing things up. >> >> Really I should probably have converted the "qual" item in my >> scamBam-generated list to a SolexaQuality object to keep track of the fact >> they're offset by 64 and not 33, but that's OK. At this point I just want to >> take a look at what's changed about the BWA output when I use that -I flag. >> >> Any thoughts? >> >> thanks, >> >> Janet >> >> _______________________________________________ >> Bioc-sig-sequencing mailing list >> Bioc-sig-sequencing@r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpa...@fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 _______________________________________________ Bioc-sig-sequencing mailing list Bioc-sig-sequencing@r-project.org https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing