Re: [Bioc-devel] Rsamtools applyPileups function not merging positions from multiple files if not identical

Jonathon Hill Tue, 29 Apr 2014 09:20:32 -0700

Hi Martin,

Thanks for looking into this. This is a problem that we have run into before. 
The zebrafish genome has a number of small contigs (> 1000 of them). 
Unfortunately, the novoalign aligner only includes @SQ lines if reads actually 
map to the contig. Since they are so small, each file ends up having its own 
set of small contigs included due to sporadic low-coverage alignments to the 
contigs, even when aligned using the same parameters and the same exact 
reference genome. That is what happened here. I am working on a small function 
to gather all of the contigs from each file and merge them to create a larger 
list that can be used to create a common header. However, it is taking a while 
because it looks like I have to covert the files to sam, replace the header and 
then convert back to bam, sort and index to make sure the factor levels remain 
consistent. I would love any thoughts you have on how to do this. I am not a C 
programmer, so I am resorting to system calls to samtools in my R code. It 
seems like something that would have a fairly broad appeal, as the genomes of 
many non-model organisms have numerous small contigs.


Thanks again,

Jonathon

On Apr 26, 2014, at 5:28 AM, Martin Morgan 
<mtmor...@fhcrc.org<mailto:mtmor...@fhcrc.org>> wrote:

On 04/23/2014 07:42 AM, Jonathon Hill wrote:
Thanks. I look forward to hearing from you.

Hi Jonathon --

It turns out that your BAM files have different seqlevels

> fls <- PileupFiles(dir(pattern = "_sorted.bam$", full=TRUE))
> lvls = lapply(fls, seqlevels)
> identical(lvls[[1]], lvls[[2]])
[1] FALSE

i.e., the BAM files have different reference sequences. Because of this, 
samtools thinks of 'chr20' in one file as different from 'chr20' in another, 
much as R might confuse values of factors with different level sets.

I updated Rsamtools to check that the seqlevels are identical

> applyPileups(fls, function(...) {})
Error in applyPileups(files, FUN, ..., param = plpParam(files)) :
 applyPileups 'seqlevels' must be identical(); failed when comparing
   '10696X1chr20testregion_sorted.bam' with
   '10696X4chr20testregion_sorted.bam'

so at least the problem is more apparent. I don't think you can correct your 
files using Rsamtools, maybe picard or worst-case (maybe it's appropriate 
anyway) re-aligning all samples to a common set of sequences?

Hope that sheds some light, and thanks for the report,

Martin


On Apr 23, 2014, at 6:10 AM, Martin Morgan 
<mtmor...@fhcrc.org<mailto:mtmor...@fhcrc.org>
<mailto:mtmor...@fhcrc.org>> wrote:

Thanks for the file snippets; I'm able to reproduce this bug and will let you
know of its resolution. Martin

On 04/22/2014 07:44 AM, Jonathon Hill wrote:
Hi Martin,

Thank you for your response. I checked the header and it says that it is
coordinate sorted, so that shouldnt be the problem. Here are the results of the
code you provided:

> r3 = applyPileups(PileupFiles(c(fl1, fl2)), function(x) x, param=testparam)
> any(duplicated(r3[[1]][["pos"]]))
[1] TRUE
> pos = r3[[1]][["pos"]]
> table(table(pos))

  1    2
4834 3115
> udpos = unique(pos[duplicated(pos)])
> head(pos[match(pos, udpos)], 20)
[1] 135003 135006 135007 135008 135009 135010 135011 135012 135013 135014
135015 135016 135017 135018 135019 135020 135021 135022 135023
[20] 135024
> head(match(pos, udpos), 20)
[1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
> table(pos)[1:50]
pos
134762 134763 134764 134765 134766 134767 134768 134769 134770 134771 134772
134773 134774 134775 134776 134777 134778 134779 134780
    1      1      1      1      1      1      1      1      1      1      1
  1      1      1      1      1      1      1      1
134781 134782 134783 134784 134785 134786 134787 134788 134991 134992 134993
134994 134995 134999 135001 135002 135003 135004 135005
    1      1      1      1      1      1      1      1      1      1      1
  1      1      1      1      1      2      1      1
135006 135007 135008 135009 135010 135011 135012 135013 135014 135015 135016
135017
    2      2      2      2      2      2      2      2      2      2      2
  2

I think the last one shows it well. There are duplicates throughout the file,
anywhere that there are reads in both files. I have attached the bam files you
requested showing the region used here.

Thanks,

Jonathon

On Apr 21, 2014, at 7:17 PM, Martin Morgan 
<mtmor...@fhcrc.org<mailto:mtmor...@fhcrc.org>
<mailto:mtmor...@fhcrc.org>
<mailto:mtmor...@fhcrc.org>> wrote:

On 04/21/2014 02:33 PM, Jonathon Hill wrote:
Hi,

I have been trying to use Rsamtools applyPileups function to compare two
files position-by-position. In order to test it out, I simply ran:
minBaseQuality <- 20
minMapQuality <- 30
minDepth <- 10
maxDepth <- 1000
testparam <- PileupParam(what="seq",
                       which=GRanges(chr20", IRanges(1, 1000000)),
                       minBaseQuality=minBaseQuality,
                       minMapQuality=minMapQuality,
                       minDepth=minDepth,
                       maxDepth=maxDepth,
)
fl1 <- "10696X1_140408_SN141_0782_BC4J3NACXX_2_STP1N90A.bam"
fl2 <- "10696X4_140408_SN141_0782_BC4J3NACXX_2_STP1N90A.bam"
r3 = applyPileups(PileupFiles(c(fl1, fl2)), function(x) x, param=testparam)

My understanding is that this should result in a three-dimensional array with
ACTGN Counts in the first dimension, files in the second and position in
the third. Positions with overlapping reads in both files should thus be
collapsed into a single line in the third dimension. However, selecting one
of these positions shows that they are duplicated:

r3[[1]][["seq"]][ , , r3[[1]][["pos"]] == 135003]

I think your understanding is basically correct.

The function is assuming that the BAM files are sorted by position (with,
e.g., sortBam, but the files don't have to be sorted by Rsamtools).

Executing a similar command gives me

> str(r3[[1]])
List of 3
$ seqnames: Named int 211195
..- attr(*, "names")= chr "chr20"
$ pos     : int [1:211195] 60026 60027 60028 60029 60030 60031 60032 60033
60034 60035 ...
$ seq     : int [1:5, 1:2, 1:211195] 0 0 0 0 0 0 0 0 0 0 ...
..- attr(*, "dimnames")=List of 3
.. ..$ : chr [1:5] "A" "C" "G" "T" ...
.. ..$ : chr [1:2] "normal_srx113635_sorted.bam" "tumor_srx036691_sorted.bam"
.. ..$ : NULL

Do you get something similar, especially the identical seqnames, pos
dimension, and third dimension of seq? 'pos' should apparently be unique; so

> any(duplicated(r3[[1]][["pos"]]))
[1] FALSE

If there are duplicates, I wonder how many there are and where they occur

pos = r3[[1]][["pos"]]
table(table(pos))
udpos = unique(pos[duplicated(pos)])
head(pos[match(pos, udpos)], 20)
head(match(pos, udpos), 20)

If nothing is suggested by the above, can you make a subset of the BAM files
available to me, e.g., the result of

param = ScanBamParam(which=GRanges("chr20", IRanges(1, 1000000)))
filterBam(fls[1], tempfile(), param=param)

Thanks,

Martin

yields:

, , 1

10696X1_140408_SN141_0782_BC4J3NACXX_2_STP1N90A.bam
10696X4_140408_SN141_0782_BC4J3NACXX_2_STP1N90A.bam
A                                                   0
                                                0
C                                                   0
                                                0
G                                                  10
                                                0
T                                                   0
                                                0
N                                                   0
                                                0

, , 2

10696X1_140408_SN141_0782_BC4J3NACXX_2_STP1N90A.bam
10696X4_140408_SN141_0782_BC4J3NACXX_2_STP1N90A.bam
A                                                   0
                                                0
C                                                   0
                                                0
G                                                   0
                                               13
T                                                   0
                                                0
N                                                   0
                                                0

Even though the position is the same, it is showing up twice. Each time, one
of the files shows zeroes. This is not consistent with what happens if the
files are identical (as in the example from the help docs).

For example,

r3 = applyPileups(PileupFiles(c(fl1, fl1)), function(x) x, param=testparam)
#file 1 entered twice
r3[[1]][["seq"]][ , , r3[[1]][["pos"]] == 135003]

yields:

10696X1_140408_SN141_0782_BC4J3NACXX_2_STP1N90A.bam
10696X1_140408_SN141_0782_BC4J3NACXX_2_STP1N90A.bam
A                                                   0
                                                0
C                                                   0
                                                0
G                                                  10
                                               10
T                                                   0
                                                0
N                                                   0
                                                0

Is this the expected behavior? It seems like each position should only show
up once in the output. Is there something I am missing?

Thanks,

Jonathon Hill
Postdoc
Yost Lab, University of Utah



[[alternative HTML version deleted]]



_______________________________________________
Bioc-devel@r-project.org<mailto:Bioc-devel@r-project.org>
<mailto:Bioc-devel@r-project.org><mailto:Bioc-devel@r-project.org>mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel



--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793



--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793



--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793


        [[alternative HTML version deleted]]

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Re: [Bioc-devel] Rsamtools applyPileups function not merging positions from multiple files if not identical

Reply via email to