Hi Janet,
This is because identical() is not the appropriate tool when comparing
objects that contain external pointers. It will sometimes report that
different objects are identical, and sometimes report that identical
objects are different.
With different objects:
> library(Biostrings)
> x <- DNAString("AC")
> y <- DNAString("GG")
> identical(x, y)
[1] TRUE
With "identical" objects:
> z <- subseq(DNAString("TGG"), start=2)
> z
2-letter "DNAString" instance
seq: GG
> identical(y, z)
[1] FALSE
The 2nd one (false negative) is a fair one: even if 'y' and 'z'
represent the same object (conceptually), they actually have different
internal representations:
> y@shared
SharedRaw of length 2 (data starting at address 0x4833bf0)
> y@offset
[1] 0
> z@shared
SharedRaw of length 3 (data starting at address 0x4780350)
> z@offset
[1] 1
Since identical() performs low-level comparison, it will report that
the objects are different.
For the 1st one (false positive), you should blame identical() for
reporting that 2 external pointers are *always* identical:
> xp1 <- .Call("externalptr_new", PACKAGE="IRanges")
> xp2 <- .Call("externalptr_new", PACKAGE="IRanges")
> IRanges:::address(xp1)
[1] "0x2e20178"
> IRanges:::address(xp2)
[1] "0x2e20530"
> identical(xp1, xp2)
[1] TRUE
Actually, the addresses shouldn't matter. What should matter is the
data that the 2 external pointers are pointing at. But identical()
doesn't try to "follow" the pointers in order to get to the data
they are pointing at. It just reports that 2 external pointers are
identical! Like here, where the 2 external pointers are pointing to
data that differ in *value*:
> identical(x@shared@xp, y@shared@xp)
[1] TRUE
One could try to make a case on the R-devel mailing list to change
the behavior of identical() on external pointers, but that wouldn't
solve the 2nd issue (false positives). This one could be addressed
by having a new generic e.g. equal() with methods for Vector objects
and their derivations that would know how to reliably compare 2
objects. Most of the time, it would just do the same as identical(),
but for objects that contain external pointers (like DNAString,
DNAStringSet, PhredQuality, etc...), it would use implement a
specialized logic.
Any suggestion on this would be appreciated.
In the meantime, the ugly workaround that I often use is to apply
as.character on the objects I want to compare:
> identical(as.character(x), as.character(y))
[1] FALSE
> identical(as.character(y), as.character(z))
[1] TRUE
Cheers,
H.
On 11-06-14 06:58 PM, Janet Young wrote:
Hi there,
I'm investigating a mistake I made when running BWA (a little embarrassing, but
it happens). I had initially failed to use the -I flag on BWA to tell it to
convert our Illumina ASCII-64 qualities appropriately. Now I'm taking a look
to see how much differnce it makes in our downstream analysis.
Anyway, that's not why I'm writing - while delving into this I've found something
unexpected about the "identical" function, and I'd love to understand that
better if possible.
I used scanBam to read in my two BWA output files - with and without the "-I"
option when I ran BWA (same input seqs, everything else the same). As I expected, with
the -I option BWA converted the qualities (my lane1read2_bwaWithI_bamscan object), and in
the other it didn't (my lane1read2_bwaNotI_bamscan object). That makes sense to me -
here's how the quals look after reading BWA output in with scanBam.
### when I use the -I option with BWA:
head(lane1read2_bwaWithI_bamscan[[1]][["qual"]])
A PhredQuality instance of length 6
width seq
[1] 50 HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHFHHHHFHFF
[2] 36 GFFEEHHFHHHHHHHEFEEEHHHHHFHHHEEFEE@E
[3] 50 ##################################################
[4] 31 HHHHHHHHHHHHHHHHHHHHHHEHHHHFHFH
[5] 24 HHHHHHHHHHHHHHDFHHHHHHHH
[6] 50 FFFFCD=8BGEBHGHHHHEHGFGHHDHEHFHEHHHHGHHHHHHHHHGHHH
### when I don't use the -I option with BWA:
head(lane1read2_bwaNotI_bamscan[[1]][["qual"]])
A PhredQuality instance of length 6
width seq
[1] 50 gggggggggggggggggggggggggggggggggggggggggeggggegee
[2] 36 feeddggegggggggdedddgggggegggddedd_d
[3] 50 BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
[4] 31 ggggggggggggggggggggggdggggegeg
[5] 24 ggggggggggggggcegggggggg
[6] 50 eeeebc\Wafdagfggggdgfefggcgdgegdggggfgggggggggfggg
The part I don't understand is when I use identical to test whether those two
qual objects are the same, it says they are identical:
identical( lane1read2_bwaWithI_bamscan[[1]][["qual"]],
lane1read2_bwaNotI_bamscan[[1]][["qual"]])
[1] TRUE
In one sense they're the same (if you convert from Illumina qualities to normal
Phred qualities), but how did identical know to do that? I might be
misunderstanding identical - I thought that EVERYTHING about the object had to
be identical.
I want to understand this as it seems like a dangerous misunderstanding - I use
identical a fair amount to make sure I'm not screwing things up.
Really I should probably have converted the "qual" item in my scamBam-generated
list to a SolexaQuality object to keep track of the fact they're offset by 64 and not 33,
but that's OK. At this point I just want to take a look at what's changed about the BWA
output when I use that -I flag.
Any thoughts?
thanks,
Janet
_______________________________________________
Bioc-sig-sequencing mailing list
Bioc-sig-sequencing@r-project.org
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpa...@fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
_______________________________________________
Bioc-sig-sequencing mailing list
Bioc-sig-sequencing@r-project.org
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing