Re: [Bioc-devel] BamTallyParam argument 'which'

Hervé Pagès Wed, 25 Feb 2015 14:15:21 -0800

FWIW, and after checking how pileup() handles this, I thought I should
report:


  library(GenomicAlignments)
  bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools")

Then:

  > stackStringsFromBam(bamfile, param="seq1:1-6")
    A DNAStringSet instance of length 4
      width seq
  [1]     6 CACTAG
  [2]     6 ++CTAG
  [3]     6 ++++AG
  [4]     6 +++++G

  > which <- GRanges("seq1", IRanges(c(1, 5, 1), c(2, 6, 3)))
  > pileup(bamfile, scanBamParam=ScanBamParam(which=which))
    seqnames pos strand nucleotide count which_label
  1     seq1   1      +          C     1    seq1:1-2
  2     seq1   2      +          A     1    seq1:1-2
  3     seq1   5      +          A     3    seq1:5-6
  4     seq1   6      +          G     4    seq1:5-6
  5     seq1   1      +          C     1    seq1:1-3
  6     seq1   2      +          A     1    seq1:1-3
  7     seq1   3      +          C     2    seq1:1-3

This output looks clean to me: (a) Nucleotide positions that belong
to more than 1 range in 'which' are treated as distinct positions,
and (b) reads that overlap with 2 non-overlapping ranges (e.g.
read #1 and ranges #1 and #2) only contribute at most once at each
position covered by these ranges.

H.


On 02/25/2015 01:21 PM, Hervé Pagès wrote:

Hi,

On 02/25/2015 06:37 AM, Gabe Becker wrote:

I think we need to be a little careful of trying to know the users
intentions better than they do here.  Reduce is a (very) easy operation
to do on a GRanges, so if the user didn't, are we really safe assuming
they "meant to" when passing the GRanges as a which?

I would argue for "the samtools way", not because samtools does it
(though consistency is good) but because it allows the user to do more
things, while not making it that painful to do the thing that they might
want most often.

I agree with Michael that an additional argument might be a good middle
ground.


Maybe another good middle ground is to issue a warning when 'which'
contains overlapping ranges. The warning could suggest to the user
to reduce() the ranges first. Maybe the warning should also point
out that reducing doesn't completely eliminate the risk of selecting
the same record more than once (at least that's the case for the
readGAlignment* functions, but I don't know if it holds for
BamTallyParam). The risk is actually higher with paired-end reads
where the same pair is selected once for each range in 'which' that
overlaps with at least one of the 2 mates in the pair.

But as already discussed here (or maybe it was on the old bioconductor
list, don't remember), better solutions to the "duplicated selection"
problem could be achieved via one of the following:

   (1) Expose some sort of unique id for the records in a BAM file.
       I agree with Michael that "duplicated selection" is incompatible
       with summarization tools like BamTallyParam or pileup(). Having
       access to a unique id would completely solve the problem.

   (2) Introduce an argument similar to 'which' but with a slightly
       different semantic e.g. select records that *start* in one of
       the specified ranges (for single-end reads), or select pairs of
       records for which the *first* mate starts in one of the specified
       ranges.
       Advantages of this semantic:

         (a) If the ranges don't overlap, then no duplicates.

         (b) It can be used in the context of tiling the genome and
             processing the reads by tile. This plays well with
             parallelization (the semantic of 'which' does not).

       Inconvenient: the arbitrary nature of the selection criteria
       and its lack of symmetry is incompatible with summarization
       tools like BamTallyParam or pileup().

Note that (1) and (2) are not exclusive.

Cheers,
H.


~G

On Tue, Feb 24, 2015 at 7:40 PM, Leonardo Collado Torres
<lcoll...@jhu.edu <mailto:lcoll...@jhu.edu>> wrote:

    Related to my post on a separate thread

(https://stat.ethz.ch/pipermail/bioc-devel/2015-February/006978.html),
    I think that if 'which' is not being reduced by default, a simple
    example showing the effects of this could be included in the
functions
    that have such an argument. Also note that 'reducing' could lead to
    unintended results.

    For example, in the help page for GenomicAlignments::readGAlignments,
    after the 'gal4' example it would be nice to add something like this:


    ## Note that if overlapping ranges are provided in 'which'
    ## reads could be selected more than once. This would artificually
    ## increase the coverage or affect other downstream results.
    ## If you 'reduce' the ranges, reads that originally overlapped
    ## two disjoint segments will be included.

    which_dups <- RangesList(seq1=rep(IRanges(1000, 2000), 2),
         seq2=IRanges(c(100, 1000), c(1000, 2000)))
    param_dups <- ScanBamParam(which=which_dups)
    param_reduced <- ScanBamParam(which=reduce(which_dups))
    gal4_dups <- readGAlignments(bamfile, param=param_dups)
    gal4_reduced <- readGAlignments(bamfile, param=param_reduced)


    length(gal4)

    ## Duplicates some reads. In this case, all the ones between
    ## bases 1000 and 2000 on seq1.
    length(gal4_dups)

    ## Includes some reads that mapped around base 1000 in seq2
    ## that were excluded in gal4.
    length(gal4_reduced)








    Here's the output:



     > library('GenomicAlignments')
     >
     > ## Code already included in ?readGAlignments
     > bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools",
    +                        mustWork=TRUE)
     > which <- RangesList(seq1=IRanges(1000, 2000),
    +                     seq2=IRanges(c(100, 1000), c(1000, 2000)))
     > param <- ScanBamParam(which=which)
     > gal4 <- readGAlignments(bamfile, param=param)
     > gal4
    GAlignments object with 2404 alignments and 0 metadata columns:
              seqnames strand       cigar    qwidth     start       end
    width     njunc
                 <Rle>  <Rle> <character> <integer> <integer> <integer>
    <integer> <integer>
          [1]     seq1      +         35M        35       970      1004
        35         0
          [2]     seq1      +         35M        35       971      1005
        35         0
          [3]     seq1      +         35M        35       972      1006
        35         0
          [4]     seq1      +         35M        35       973      1007
        35         0
          [5]     seq1      +         35M        35       974      1008
        35         0
          ...      ...    ...         ...       ...       ...       ...
       ...       ...
       [2400]     seq2      +         35M        35      1524      1558
        35         0
       [2401]     seq2      +         35M        35      1524      1558
        35         0
       [2402]     seq2      -         35M        35      1528      1562
        35         0
       [2403]     seq2      -         35M        35      1532      1566
        35         0
       [2404]     seq2      -         35M        35      1533      1567
        35         0
       -------
       seqinfo: 2 sequences from an unspecified genome
     >
     > ## Note that if overlapping ranges are provided in 'which'
     > ## reads could be selected more than once. This would artificually
     > ## increase the coverage or affect other downstream results.
     > ## If you 'reduce' the ranges, reads that originally overlapped
     > ## two disjoint segments will be included.
     >
     > which_dups <- RangesList(seq1=rep(IRanges(1000, 2000), 2),
    +     seq2=IRanges(c(100, 1000), c(1000, 2000)))
     > param_dups <- ScanBamParam(which=which_dups)
     > param_reduced <- ScanBamParam(which=reduce(which_dups))
     > gal4_dups <- readGAlignments(bamfile, param=param_dups)
     > gal4_reduced <- readGAlignments(bamfile, param=param_reduced)
     >
     >
     > length(gal4)
    [1] 2404
     >
     > ## Duplicates some reads. In this case, all the ones between
     > ## bases 1000 and 2000 on seq1.
     > length(gal4_dups)
    [1] 3014
     >
     > ## Includes some reads that mapped around base 1000 in seq2
     > ## that were excluded in gal4.
     > length(gal4_reduced)
    [1] 2343
     >
     >
     >
     >
     >
     > options(width = 120)
     > devtools::session_info()
    Session

info-----------------------------------------------------------------------------------------------------------

      setting  value
      version  R Under development (unstable) (2014-11-01 r66923)
      system   x86_64, darwin10.8.0
      ui       AQUA
      language (EN)
      collate  en_US.UTF-8
      tz       America/New_York


Packages---------------------------------------------------------------------------------------------------------------

      package           * version date       source
      base64enc           0.1.2   2014-06-26 CRAN (R 3.2.0)
      BatchJobs           1.4     2014-09-24 CRAN (R 3.2.0)
      BBmisc              1.7     2014-06-21 CRAN (R 3.2.0)
      BiocGenerics      * 0.13.4  2014-12-31 Bioconductor
      BiocParallel        1.1.13  2015-01-27 Bioconductor
      Biostrings        * 2.35.8  2015-02-14 Bioconductor
      bitops              1.0.6   2013-08-17 CRAN (R 3.2.0)
      brew                1.0.6   2011-04-13 CRAN (R 3.2.0)
      checkmate           1.5.0   2014-10-19 CRAN (R 3.2.0)
      codetools           0.2.9   2014-08-21 CRAN (R 3.2.0)
      DBI                 0.3.1   2014-09-24 CRAN (R 3.2.0)
      devtools            1.6.1   2014-10-07 CRAN (R 3.2.0)
      digest              0.6.4   2013-12-03 CRAN (R 3.2.0)
      fail                1.2     2013-09-19 CRAN (R 3.2.0)
      foreach             1.4.2   2014-04-11 CRAN (R 3.2.0)
      GenomeInfoDb      * 1.3.13  2015-02-13 Bioconductor
      GenomicAlignments * 1.3.27  2015-01-26 Bioconductor
      GenomicRanges     * 1.19.37 2015-02-13 Bioconductor
      IRanges           * 2.1.38  2015-02-08 Bioconductor
      iterators           1.0.7   2014-04-11 CRAN (R 3.2.0)
      Rsamtools         * 1.19.27 2015-02-07 Bioconductor
      RSQLite             1.0.0   2014-10-25 CRAN (R 3.2.0)
      rstudioapi          0.1     2014-03-27 CRAN (R 3.2.0)
      S4Vectors         * 0.5.20  2015-02-19 Bioconductor
      sendmailR           1.2.1   2014-09-21 CRAN (R 3.2.0)
      stringr             0.6.2   2012-12-06 CRAN (R 3.2.0)
      XVector           * 0.7.4   2015-02-08 Bioconductor
      zlibbioc            1.13.1  2015-02-11 Bioconductor
     >





    On Mon, Feb 23, 2015 at 2:38 PM, Leonard Goldstein
    <goldstein.leon...@gene.com <mailto:goldstein.leon...@gene.com>>
wrote:
     > Sounds very sensible not to double count in the context of
tallying
     > variants. I was more concerned with reducing which as the default
     > behavior for scanBam and other functions.
     >
     > I wanted to bring up the samtools behavior as - for me at least -
     > inconsistencies between Rsamtools and samtools have been another
     > source of confusion in the past (e.g. different naming
    conventions for
     > fields like isize vs TLEN etc.)
     >
     > Leonard
     >
     >
     > On Mon, Feb 23, 2015 at 11:22 AM, Michael Lawrence
     > <lawrence.mich...@gene.com <mailto:lawrence.mich...@gene.com>>
wrote:
     >> Maybe Rsamtools would want to follow this precedent. I think
    there might be
     >> a difference between fishing out alignments from a SAM/BAM, and
    deriving a
     >> summary (tallyVariants) from a BAM. It seems like an argument
    could be made
     >> for a tally set to not contain duplicates.
     >>
     >> On Mon, Feb 23, 2015 at 11:05 AM, Leonard Goldstein
     >> <goldstein.leon...@gene.com <mailto:goldstein.leon...@gene.com>>
    wrote:
     >>>
     >>> Hi Michael and Thomas,
     >>>
     >>> I ran into the same problem in the past (i.e. when I started
    working
     >>> with functions like scanBam I expected them not to return the
same
     >>> alignment multiple times)
     >>>
     >>> One thing to consider might be that returning alignments
multiple
     >>> times is consistent with the behavior of the samtools view
command.
     >>> Quoting from the samtools manual:
     >>>
     >>> “Important note: when multiple regions are given, some
    alignments may
     >>> be output multiple times if they overlap more than one of the
     >>> specified regions.”
     >>>
     >>> Maybe there is an argument for keeping things consistent with
     >>> samtools? As you said, if documented properly, the user can
decide
     >>> whether to reduce regions specified in which or not.
     >>>
     >>> Leonard
     >>>
     >>>
     >>> On Mon, Feb 23, 2015 at 10:52 AM, Michael Lawrence
     >>> <lawrence.mich...@gene.com <mailto:lawrence.mich...@gene.com>>
    wrote:
     >>> > We should at leaast try to avoid surprising the user. Seems
    like most
     >>> > people expect "which" to be a simple restriction, so I think
    for now I
     >>> > will
     >>> > just reduce the which, and if someone has a use case for
separate
     >>> > queries,
     >>> > we can address it in the future.
     >>> >
     >>> > On Mon, Feb 23, 2015 at 10:41 AM, Thomas Sandmann
     >>> > <sandmann.tho...@gene.com <mailto:sandmann.tho...@gene.com>>
     >>> > wrote:
     >>> >
     >>> >> Personally, I don't have a use case with "meaningful loci"
worth
     >>> >> tracking,
     >>> >> so keeping it simple would work for me.
     >>> >>
     >>> >> Incidentally, would it be good to deal with the 'which'
    parameter in a
     >>> >> consistent way across different methods ? I just saw this
    recent post
     >>> >> on
     >>> >> the mailing list in which a used got confused by duplicate
    counts
     >>> >> returned
     >>> >> after passing 'which' to scanBamParam:
     >>> >>
     >>> >>
    https://stat.ethz.ch/pipermail/bioc-devel/2015-February/006978.html
     >>> >>
     >>> >>
     >>> >> ---
     >>> >>
     >>> >> Thomas Sandmann, PhD
     >>> >> Computational biologist
     >>> >>
     >>> >> Genentech, Inc.
     >>> >> 1 DNA Way
     >>> >> South San Francisco, CA 94080
     >>> >> USA
     >>> >>
     >>> >> Phone: +1 650 225 6273 <tel:%2B1%20650%20225%206273>
     >>> >> Fax: +1 650 225 5389 <tel:%2B1%20650%20225%205389>
     >>> >> Email: sandmann.tho...@gene.com
    <mailto:sandmann.tho...@gene.com>
     >>> >>
     >>> >> "If a man will begin with certainties, he shall end in
    doubts; but if
     >>> >> he
     >>> >> will be content to begin with doubts he shall end in
    certainties." --
     >>> >> Sir
     >>> >> Francis Bacon
     >>> >>
     >>> >>
     >>> >> On Mon, Feb 23, 2015 at 10:37 AM, Michael Lawrence <
     >>> >> lawrence.mich...@gene.com
    <mailto:lawrence.mich...@gene.com>> wrote:
     >>> >>
     >>> >>> We just have to decide which is the more useful
    interpretation of
     >>> >>> which
     >>> >>> -- as a simple restriction, or as a vector of meaningful
    locii, which
     >>> >>> will
     >>> >>> be analyzed individually? I would actually favor the first
    one (the
     >>> >>> same as
     >>> >>> yours), just because it's simpler. To keep track of the
    query ranges,
     >>> >>> we
     >>> >>> would need to add a new column to the returned object,
    which will more
     >>> >>> often than not just be clutter. I guess we could introduce
    a new
     >>> >>> parameter,
     >>> >>> "reduceWhich" which defaults to TRUE and reduces the which.
    If FALSE,
     >>> >>> it
     >>> >>> instead adds the column mapping back to the original which
    ranges.
     >>> >>>
     >>> >>>
     >>> >>> On Sun, Feb 22, 2015 at 2:36 PM, Thomas Sandmann <
     >>> >>> sandmann.tho...@gene.com <mailto:sandmann.tho...@gene.com>>
    wrote:
     >>> >>>
     >>> >>>> Hi Michael,
     >>> >>>>
     >>> >>>> ah, I see. I hadn't realized that returning the pileups
    separately
     >>> >>>> for
     >>> >>>> each region could be a desired feature, but that makes
    sense. I
     >>> >>>> agree, as
     >>> >>>> it is easy for the user to 'reduce' the ranges beforehand
    your first
     >>> >>>> option
     >>> >>>> (e.g. returning the ID of the range) would be more
flexible.
     >>> >>>>
     >>> >>>> Perhaps you would consider adding a sentence to the
    documentation of
     >>> >>>> 'which' on BamTallyParam's help page explaining that users
    might want
     >>> >>>> to
     >>> >>>> 'reduce' their ranges beforehand if they are only
    interested in a
     >>> >>>> single
     >>> >>>> tally for each base ?
     >>> >>>>
     >>> >>>> Thanks a lot !
     >>> >>>> Thomas
     >>> >>>>
     >>> >>>>
     >>> >>>
     >>> >>
     >>> >
     >>> >         [[alternative HTML version deleted]]
     >>> >
     >>> > _______________________________________________
     >>> > Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.org>
    mailing list
     >>> > https://stat.ethz.ch/mailman/listinfo/bioc-devel
     >>
     >>
     >
     > _______________________________________________
     > Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.org>
    mailing list
     > https://stat.ethz.ch/mailman/listinfo/bioc-devel

    _______________________________________________
    Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.org> mailing
list
    https://stat.ethz.ch/mailman/listinfo/bioc-devel




--
Gabriel Becker, Ph.D
Computational Biologist
Genentech Research


--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Re: [Bioc-devel] BamTallyParam argument 'which'

Reply via email to