Re: [Bioc-sig-seq] "identical" on PhredQuality objects - I'm confused

Hervé Pagès Wed, 15 Jun 2011 10:46:18 -0700

Hi Janet,

This is because identical() is not the appropriate tool when comparing
objects that contain external pointers. It will sometimes report that
different objects are identical, and sometimes report that identical
objects are different.

With different objects:

  > library(Biostrings)
  > x <- DNAString("AC")
  > y <- DNAString("GG")
  > identical(x, y)
  [1] TRUE

With "identical" objects:

  > z <- subseq(DNAString("TGG"), start=2)
  > z
    2-letter "DNAString" instance
  seq: GG
  > identical(y, z)
  [1] FALSE

The 2nd one (false negative) is a fair one: even if 'y' and 'z'
represent the same object (conceptually), they actually have different
internal representations:

  > y@shared
  SharedRaw of length 2 (data starting at address 0x4833bf0)
  > y@offset
  [1] 0
  > z@shared
  SharedRaw of length 3 (data starting at address 0x4780350)
  > z@offset
  [1] 1

Since identical() performs low-level comparison, it will report that
the objects are different.

For the 1st one (false positive), you should blame identical() for
reporting that 2 external pointers are *always* identical:

  > xp1 <- .Call("externalptr_new", PACKAGE="IRanges")
  > xp2 <- .Call("externalptr_new", PACKAGE="IRanges")
  > IRanges:::address(xp1)
  [1] "0x2e20178"
  > IRanges:::address(xp2)
  [1] "0x2e20530"
  > identical(xp1, xp2)
  [1] TRUE

Actually, the addresses shouldn't matter. What should matter is the
data that the 2 external pointers are pointing at. But identical()
doesn't try to "follow" the pointers in order to get to the data
they are pointing at. It just reports that 2 external pointers are
identical! Like here, where the 2 external pointers are pointing to
data that differ in *value*:

  > identical(x@shared@xp, y@shared@xp)
  [1] TRUE

One could try to make a case on the R-devel mailing list to change
the behavior of identical() on external pointers, but that wouldn't
solve the 2nd issue (false positives). This one could be addressed
by having a new generic e.g. equal() with methods for Vector objects
and their derivations that would know how to reliably compare 2
objects. Most of the time, it would just do the same as identical(),
but for objects that contain external pointers (like DNAString,
DNAStringSet, PhredQuality, etc...), it would use implement a
specialized logic.

Any suggestion on this would be appreciated.

In the meantime, the ugly workaround that I often use is to apply
as.character on the objects I want to compare:

  > identical(as.character(x), as.character(y))
  [1] FALSE
  > identical(as.character(y), as.character(z))
  [1] TRUE

Cheers,
H.

On 11-06-14 06:58 PM, Janet Young wrote:

Hi there,

I'm investigating a mistake I made when running BWA (a little embarrassing, but 
it happens).  I had initially failed to use the -I flag on BWA to tell it to 
convert our Illumina ASCII-64 qualities appropriately.  Now I'm taking a look 
to see how much differnce it makes in our downstream analysis.

Anyway, that's not why I'm writing - while delving into this I've found something 
unexpected about the "identical" function, and I'd love to understand that 
better if possible.

I used scanBam to read in my two BWA output files - with and without the "-I" 
option when I ran BWA (same input seqs, everything else the same). As I expected, with 
the -I option BWA converted the qualities (my lane1read2_bwaWithI_bamscan object), and in 
the other it didn't (my lane1read2_bwaNotI_bamscan object).  That makes sense to me - 
here's how the quals look after reading BWA output in with scanBam.

### when I use the -I option with BWA:

head(lane1read2_bwaWithI_bamscan[[1]][["qual"]])

   A PhredQuality instance of length 6
     width seq
[1]    50 HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHFHHHHFHFF
[2]    36 GFFEEHHFHHHHHHHEFEEEHHHHHFHHHEEFEE@E
[3]    50 ##################################################
[4]    31 HHHHHHHHHHHHHHHHHHHHHHEHHHHFHFH
[5]    24 HHHHHHHHHHHHHHDFHHHHHHHH
[6]    50 FFFFCD=8BGEBHGHHHHEHGFGHHDHEHFHEHHHHGHHHHHHHHHGHHH

### when I don't use the -I option with BWA:

head(lane1read2_bwaNotI_bamscan[[1]][["qual"]])

   A PhredQuality instance of length 6
     width seq
[1]    50 gggggggggggggggggggggggggggggggggggggggggeggggegee
[2]    36 feeddggegggggggdedddgggggegggddedd_d
[3]    50 BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
[4]    31 ggggggggggggggggggggggdggggegeg
[5]    24 ggggggggggggggcegggggggg
[6]    50 eeeebc\Wafdagfggggdgfefggcgdgegdggggfgggggggggfggg


The part I don't understand is when I use identical to test whether those two 
qual objects are the same, it says they are identical:

identical( lane1read2_bwaWithI_bamscan[[1]][["qual"]], 
lane1read2_bwaNotI_bamscan[[1]][["qual"]])

[1] TRUE

In one sense they're the same (if you convert from Illumina qualities to normal 
Phred qualities), but how did identical know to do that?  I might be 
misunderstanding identical - I thought that EVERYTHING about the object had to 
be identical.

I want to understand this as it seems like a dangerous misunderstanding - I use 
identical a fair amount to make sure I'm not screwing things up.

Really I should probably have converted the "qual" item in my scamBam-generated 
list to a SolexaQuality object to keep track of the fact they're offset by 64 and not 33, 
but that's OK. At this point I just want to take a look at what's changed about the BWA 
output when I use that -I flag.

Any thoughts?

thanks,

Janet

_______________________________________________
Bioc-sig-sequencing mailing list
Bioc-sig-sequencing@r-project.org
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing



--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

_______________________________________________
Bioc-sig-sequencing mailing list
Bioc-sig-sequencing@r-project.org
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

Re: [Bioc-sig-seq] "identical" on PhredQuality objects - I'm confused

Reply via email to