Hi Martin,

Thank you for your response. I checked the header and it says that it is 
coordinate sorted, so that shouldn’t be the problem. Here are the results of 
the code you provided:

> r3 = applyPileups(PileupFiles(c(fl1, fl2)), function(x) x, param=testparam)
> any(duplicated(r3[[1]][["pos"]]))
[1] TRUE
> pos = r3[[1]][["pos"]]
> table(table(pos))

   1    2
4834 3115
> udpos = unique(pos[duplicated(pos)])
> head(pos[match(pos, udpos)], 20)
 [1] 135003 135006 135007 135008 135009 135010 135011 135012 135013 135014 
135015 135016 135017 135018 135019 135020 135021 135022 135023
[20] 135024
> head(match(pos, udpos), 20)
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
> table(pos)[1:50]
pos
134762 134763 134764 134765 134766 134767 134768 134769 134770 134771 134772 
134773 134774 134775 134776 134777 134778 134779 134780
     1      1      1      1      1      1      1      1      1      1      1    
  1      1      1      1      1      1      1      1
134781 134782 134783 134784 134785 134786 134787 134788 134991 134992 134993 
134994 134995 134999 135001 135002 135003 135004 135005
     1      1      1      1      1      1      1      1      1      1      1    
  1      1      1      1      1      2      1      1
135006 135007 135008 135009 135010 135011 135012 135013 135014 135015 135016 
135017
     2      2      2      2      2      2      2      2      2      2      2    
  2

I think the last one shows it well. There are duplicates throughout the file, 
anywhere that there are reads in both files. I have attached the bam files you 
requested showing the region used here.

Thanks,

Jonathon

On Apr 21, 2014, at 7:17 PM, Martin Morgan 
<mtmor...@fhcrc.org<mailto:mtmor...@fhcrc.org>> wrote:

On 04/21/2014 02:33 PM, Jonathon Hill wrote:
Hi,

I have been trying to use Rsamtools’ applyPileups function to compare two files 
position-by-position. In order to test it out, I simply ran:
minBaseQuality <- 20
minMapQuality <- 30
minDepth <- 10
maxDepth <- 1000
testparam <- PileupParam(what="seq",
                         which=GRanges(“chr20", IRanges(1, 1000000)),
                         minBaseQuality=minBaseQuality,
                         minMapQuality=minMapQuality,
                         minDepth=minDepth,
                         maxDepth=maxDepth,
)
fl1 <- "10696X1_140408_SN141_0782_BC4J3NACXX_2_STP1N90A.bam"
fl2 <- "10696X4_140408_SN141_0782_BC4J3NACXX_2_STP1N90A.bam"
r3 = applyPileups(PileupFiles(c(fl1, fl2)), function(x) x, param=testparam)

My understanding is that this should result in a three-dimensional array with 
“ACTGN Counts” in the first dimension, files in the second and position in the 
third. Positions with overlapping reads in both files should thus be collapsed 
into a single line in the third dimension. However, selecting one of these 
positions shows that they are duplicated:

r3[[1]][["seq"]][ , , r3[[1]][["pos"]] == 135003]

I think your understanding is basically correct.

The function is assuming that the BAM files are sorted by position (with, e.g., 
sortBam, but the files don't have to be sorted by Rsamtools).

Executing a similar command gives me

> str(r3[[1]])
List of 3
$ seqnames: Named int 211195
 ..- attr(*, "names")= chr "chr20"
$ pos     : int [1:211195] 60026 60027 60028 60029 60030 60031 60032 60033 
60034 60035 ...
$ seq     : int [1:5, 1:2, 1:211195] 0 0 0 0 0 0 0 0 0 0 ...
 ..- attr(*, "dimnames")=List of 3
 .. ..$ : chr [1:5] "A" "C" "G" "T" ...
 .. ..$ : chr [1:2] "normal_srx113635_sorted.bam" "tumor_srx036691_sorted.bam"
 .. ..$ : NULL

Do you get something similar, especially the identical seqnames, pos dimension, 
and third dimension of seq? 'pos' should apparently be unique; so

> any(duplicated(r3[[1]][["pos"]]))
[1] FALSE

If there are duplicates, I wonder how many there are and where they occur

 pos = r3[[1]][["pos"]]
 table(table(pos))
 udpos = unique(pos[duplicated(pos)])
 head(pos[match(pos, udpos)], 20)
 head(match(pos, udpos), 20)

If nothing is suggested by the above, can you make a subset of the BAM files 
available to me, e.g., the result of

 param = ScanBamParam(which=GRanges("chr20", IRanges(1, 1000000)))
 filterBam(fls[1], tempfile(), param=param)

Thanks,

Martin

yields:

, , 1

  10696X1_140408_SN141_0782_BC4J3NACXX_2_STP1N90A.bam 
10696X4_140408_SN141_0782_BC4J3NACXX_2_STP1N90A.bam
A                                                   0                           
                        0
C                                                   0                           
                        0
G                                                  10                           
                        0
T                                                   0                           
                        0
N                                                   0                           
                        0

, , 2

  10696X1_140408_SN141_0782_BC4J3NACXX_2_STP1N90A.bam 
10696X4_140408_SN141_0782_BC4J3NACXX_2_STP1N90A.bam
A                                                   0                           
                        0
C                                                   0                           
                        0
G                                                   0                           
                       13
T                                                   0                           
                        0
N                                                   0                           
                        0

Even though the position is the same, it is showing up twice. Each time, one of 
the files shows zeroes. This is not consistent with what happens if the files 
are identical (as in the example from the help docs).

For example,

r3 = applyPileups(PileupFiles(c(fl1, fl1)), function(x) x, param=testparam) 
#file 1 entered twice
 r3[[1]][["seq"]][ , , r3[[1]][["pos"]] == 135003]

yields:

  10696X1_140408_SN141_0782_BC4J3NACXX_2_STP1N90A.bam 
10696X1_140408_SN141_0782_BC4J3NACXX_2_STP1N90A.bam
A                                                   0                           
                        0
C                                                   0                           
                        0
G                                                  10                           
                       10
T                                                   0                           
                        0
N                                                   0                           
                        0

Is this the expected behavior? It seems like each position should only show up 
once in the output. Is there something I am missing?

Thanks,

Jonathon Hill
Postdoc
Yost Lab, University of Utah



[[alternative HTML version deleted]]



_______________________________________________
Bioc-devel@r-project.org<mailto:Bioc-devel@r-project.org> mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel



--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Reply via email to