[Bioc-devel] BamTallyParam argument 'which'

2015-02-20 Thread Thomas Sandmann
Hi Michael,

I noticed that when the tallyVariants function receives a 'which' arguments
(via BamTallyParam), that contains overlapping or duplicated regions,
duplicated rows are returned.

(See below for an example.)

It took me a little while to understand where I was picking duplicates.

Would it be useful to 'reduce' the 'which' GRanges/RangesList object by
default, e.g. before tallying variants, to make sure each base is only
tallied once ?

Best,
Thomas

library(VariantTools)

## 'which' is a set of non-overlapping regions
tally.param <- TallyVariantsParam(gmapR::TP53Genome(),
  high_base_quality = 23L,
  which = gmapR::TP53Which())
bams <- LungCancerLines::LungCancerBamFiles()
raw.variants <- tallyVariants(bams$H1993, tally.param)
any(duplicated( raw.variants ))  ## FALSE

## 'which' is a set of duplicated regions
tally.param <- TallyVariantsParam(
  gmapR::TP53Genome(),
  high_base_quality = 23L,
  which = c(
gmapR::TP53Which(),
gmapR::TP53Which()
)
)
raw.variants <- tallyVariants(bams$H1993, tally.param)
any(duplicated( raw.variants )) ## TRUE

sort(raw.variants)[1:4]


### SessionInfo()

R version 3.1.2 (2014-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4stats graphics  grDevices
[6] utils datasets  methods   base

other attached packages:
 [1] VariantTools_1.8.0   VariantAnnotation_1.12.9
 [3] Rsamtools_1.18.2 Biostrings_2.34.1
 [5] XVector_0.6.0GenomicRanges_1.18.4
 [7] GenomeInfoDb_1.2.4   IRanges_2.0.1
 [9] S4Vectors_0.4.0  BiocGenerics_0.12.1
[11] BiocInstaller_1.16.1 roxygen2_4.1.0
[13] devtools_1.7.0

loaded via a namespace (and not attached):
 [1] AnnotationDbi_1.28.1base64enc_0.1-2
 [3] BatchJobs_1.5   BBmisc_1.9
 [5] Biobase_2.26.0  BiocParallel_1.0.3
 [7] biomaRt_2.22.0  bitops_1.0-6
 [9] brew_1.0-6  BSgenome_1.34.1
[11] checkmate_1.5.1 codetools_0.2-10
[13] DBI_0.3.1   digest_0.6.8
[15] fail_1.2foreach_1.4.2
[17] GenomicAlignments_1.2.1 GenomicFeatures_1.18.3
[19] gmapR_1.8.0 grid_3.1.2
[21] iterators_1.0.7 lattice_0.20-29
[23] LungCancerLines_0.3.1   Matrix_1.1-5
[25] Rcpp_0.11.4 RCurl_1.95-4.5
[27] RSQLite_1.0.0   rtracklayer_1.26.2
[29] sendmailR_1.2-1 stringr_0.6.2
[31] tools_3.1.2 XML_3.98-1.1
[33] zlibbioc_1.12.0

[[alternative HTML version deleted]]

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] BamTallyParam argument 'which'

2015-02-20 Thread Michael Lawrence
Hmm. I guess we could do at least two things:

1) Return the ID of the which range for each variant, like readVcf does
with its paramRangeID column in the rowData.

2) Do as you suggest, and reduce() the which.

Obviously, these address different use cases. The user can always achieve
#2 by just reduce()ing the which range first. #1 would be more difficult
for the user to implement.

If tallyVariants() did #2, then #1 would be painful to achieve, so if #1
sounds useful at all, then we might want to hold off on #2.

Thoughts?

Michael




On Fri, Feb 20, 2015 at 2:00 PM, Thomas Sandmann 
wrote:

> Hi Michael,
>
> I noticed that when the tallyVariants function receives a 'which'
> arguments (via BamTallyParam), that contains overlapping or duplicated
> regions, duplicated rows are returned.
>
> (See below for an example.)
>
> It took me a little while to understand where I was picking duplicates.
>
> Would it be useful to 'reduce' the 'which' GRanges/RangesList object by
> default, e.g. before tallying variants, to make sure each base is only
> tallied once ?
>
> Best,
> Thomas
>
> library(VariantTools)
>
> ## 'which' is a set of non-overlapping regions
> tally.param <- TallyVariantsParam(gmapR::TP53Genome(),
>   high_base_quality = 23L,
>   which = gmapR::TP53Which())
> bams <- LungCancerLines::LungCancerBamFiles()
> raw.variants <- tallyVariants(bams$H1993, tally.param)
> any(duplicated( raw.variants ))  ## FALSE
>
> ## 'which' is a set of duplicated regions
> tally.param <- TallyVariantsParam(
>   gmapR::TP53Genome(),
>   high_base_quality = 23L,
>   which = c(
> gmapR::TP53Which(),
> gmapR::TP53Which()
> )
> )
> raw.variants <- tallyVariants(bams$H1993, tally.param)
> any(duplicated( raw.variants )) ## TRUE
>
> sort(raw.variants)[1:4]
>
>
> ### SessionInfo()
>
> R version 3.1.2 (2014-10-31)
> Platform: x86_64-apple-darwin13.4.0 (64-bit)
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] parallel  stats4stats graphics  grDevices
> [6] utils datasets  methods   base
>
> other attached packages:
>  [1] VariantTools_1.8.0   VariantAnnotation_1.12.9
>  [3] Rsamtools_1.18.2 Biostrings_2.34.1
>  [5] XVector_0.6.0GenomicRanges_1.18.4
>  [7] GenomeInfoDb_1.2.4   IRanges_2.0.1
>  [9] S4Vectors_0.4.0  BiocGenerics_0.12.1
> [11] BiocInstaller_1.16.1 roxygen2_4.1.0
> [13] devtools_1.7.0
>
> loaded via a namespace (and not attached):
>  [1] AnnotationDbi_1.28.1base64enc_0.1-2
>  [3] BatchJobs_1.5   BBmisc_1.9
>  [5] Biobase_2.26.0  BiocParallel_1.0.3
>  [7] biomaRt_2.22.0  bitops_1.0-6
>  [9] brew_1.0-6  BSgenome_1.34.1
> [11] checkmate_1.5.1 codetools_0.2-10
> [13] DBI_0.3.1   digest_0.6.8
> [15] fail_1.2foreach_1.4.2
> [17] GenomicAlignments_1.2.1 GenomicFeatures_1.18.3
> [19] gmapR_1.8.0 grid_3.1.2
> [21] iterators_1.0.7 lattice_0.20-29
> [23] LungCancerLines_0.3.1   Matrix_1.1-5
> [25] Rcpp_0.11.4 RCurl_1.95-4.5
> [27] RSQLite_1.0.0   rtracklayer_1.26.2
> [29] sendmailR_1.2-1 stringr_0.6.2
> [31] tools_3.1.2 XML_3.98-1.1
> [33] zlibbioc_1.12.0
>
>

[[alternative HTML version deleted]]

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] BamTallyParam argument 'which'

2015-02-22 Thread Thomas Sandmann
Hi Michael,

ah, I see. I hadn't realized that returning the pileups separately for each
region could be a desired feature, but that makes sense. I agree, as it is
easy for the user to 'reduce' the ranges beforehand your first option (e.g.
returning the ID of the range) would be more flexible.

Perhaps you would consider adding a sentence to the documentation of
'which' on BamTallyParam's help page explaining that users might want to
'reduce' their ranges beforehand if they are only interested in a single
tally for each base ?

Thanks a lot !
Thomas

[[alternative HTML version deleted]]

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] BamTallyParam argument 'which'

2015-02-23 Thread Leonard Goldstein
Hi Michael and Thomas,

I ran into the same problem in the past (i.e. when I started working
with functions like scanBam I expected them not to return the same
alignment multiple times)

One thing to consider might be that returning alignments multiple
times is consistent with the behavior of the samtools view command.
Quoting from the samtools manual:

“Important note: when multiple regions are given, some alignments may
be output multiple times if they overlap more than one of the
specified regions.”

Maybe there is an argument for keeping things consistent with
samtools? As you said, if documented properly, the user can decide
whether to reduce regions specified in which or not.

Leonard


On Mon, Feb 23, 2015 at 10:52 AM, Michael Lawrence
 wrote:
> We should at leaast try to avoid surprising the user. Seems like most
> people expect "which" to be a simple restriction, so I think for now I will
> just reduce the which, and if someone has a use case for separate queries,
> we can address it in the future.
>
> On Mon, Feb 23, 2015 at 10:41 AM, Thomas Sandmann 
> wrote:
>
>> Personally, I don't have a use case with "meaningful loci" worth tracking,
>> so keeping it simple would work for me.
>>
>> Incidentally, would it be good to deal with the 'which' parameter in a
>> consistent way across different methods ? I just saw this recent post on
>> the mailing list in which a used got confused by duplicate counts returned
>> after passing 'which' to scanBamParam:
>>
>> https://stat.ethz.ch/pipermail/bioc-devel/2015-February/006978.html
>>
>>
>> ---
>>
>> Thomas Sandmann, PhD
>> Computational biologist
>>
>> Genentech, Inc.
>> 1 DNA Way
>> South San Francisco, CA 94080
>> USA
>>
>> Phone: +1 650 225 6273
>> Fax: +1 650 225 5389
>> Email: sandmann.tho...@gene.com
>>
>> "If a man will begin with certainties, he shall end in doubts; but if he
>> will be content to begin with doubts he shall end in certainties." -- Sir
>> Francis Bacon
>>
>>
>> On Mon, Feb 23, 2015 at 10:37 AM, Michael Lawrence <
>> lawrence.mich...@gene.com> wrote:
>>
>>> We just have to decide which is the more useful interpretation of which
>>> -- as a simple restriction, or as a vector of meaningful locii, which will
>>> be analyzed individually? I would actually favor the first one (the same as
>>> yours), just because it's simpler. To keep track of the query ranges, we
>>> would need to add a new column to the returned object, which will more
>>> often than not just be clutter. I guess we could introduce a new parameter,
>>> "reduceWhich" which defaults to TRUE and reduces the which. If FALSE, it
>>> instead adds the column mapping back to the original which ranges.
>>>
>>>
>>> On Sun, Feb 22, 2015 at 2:36 PM, Thomas Sandmann <
>>> sandmann.tho...@gene.com> wrote:
>>>
 Hi Michael,

 ah, I see. I hadn't realized that returning the pileups separately for
 each region could be a desired feature, but that makes sense. I agree, as
 it is easy for the user to 'reduce' the ranges beforehand your first option
 (e.g. returning the ID of the range) would be more flexible.

 Perhaps you would consider adding a sentence to the documentation of
 'which' on BamTallyParam's help page explaining that users might want to
 'reduce' their ranges beforehand if they are only interested in a single
 tally for each base ?

 Thanks a lot !
 Thomas


>>>
>>
>
> [[alternative HTML version deleted]]
>
> ___
> Bioc-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] BamTallyParam argument 'which'

2015-02-23 Thread Hervé Pagès

On 02/23/2015 11:05 AM, Leonard Goldstein wrote:

Hi Michael and Thomas,

I ran into the same problem in the past (i.e. when I started working
with functions like scanBam I expected them not to return the same
alignment multiple times)

One thing to consider might be that returning alignments multiple
times is consistent with the behavior of the samtools view command.
Quoting from the samtools manual:

“Important note: when multiple regions are given, some alignments may
be output multiple times if they overlap more than one of the
specified regions.”


Thanks Leonard for pointing this out. This is indeed the reason why all
the functions in Rsamtools and GenomicAlignments that take a 'which'
argument (via a ScanBamParam object) treat it "the samtools way", that
is, as a vector of meaningful loci.

Most of them track the index of the individual loci via a "which_label"
metadata column. See for example Rsamtools::pileup() and all the
readGAlignment*() functions in the GenomicAlignments package.
FWIW the man page for the readGAlignment*() functions contains the
following note:

 Note that a given record is loaded one time for each region it
 belongs to (this is a scanBam() feature, readGAlignments()
 is based on scanBam()).

but maybe this should be emphasized a little bit more.

Cheers,
H.



Maybe there is an argument for keeping things consistent with
samtools? As you said, if documented properly, the user can decide
whether to reduce regions specified in which or not.

Leonard


On Mon, Feb 23, 2015 at 10:52 AM, Michael Lawrence
 wrote:

We should at leaast try to avoid surprising the user. Seems like most
people expect "which" to be a simple restriction, so I think for now I will
just reduce the which, and if someone has a use case for separate queries,
we can address it in the future.

On Mon, Feb 23, 2015 at 10:41 AM, Thomas Sandmann 
wrote:


Personally, I don't have a use case with "meaningful loci" worth tracking,
so keeping it simple would work for me.

Incidentally, would it be good to deal with the 'which' parameter in a
consistent way across different methods ? I just saw this recent post on
the mailing list in which a used got confused by duplicate counts returned
after passing 'which' to scanBamParam:

https://stat.ethz.ch/pipermail/bioc-devel/2015-February/006978.html


---

Thomas Sandmann, PhD
Computational biologist

Genentech, Inc.
1 DNA Way
South San Francisco, CA 94080
USA

Phone: +1 650 225 6273
Fax: +1 650 225 5389
Email: sandmann.tho...@gene.com

"If a man will begin with certainties, he shall end in doubts; but if he
will be content to begin with doubts he shall end in certainties." -- Sir
Francis Bacon


On Mon, Feb 23, 2015 at 10:37 AM, Michael Lawrence <
lawrence.mich...@gene.com> wrote:


We just have to decide which is the more useful interpretation of which
-- as a simple restriction, or as a vector of meaningful locii, which will
be analyzed individually? I would actually favor the first one (the same as
yours), just because it's simpler. To keep track of the query ranges, we
would need to add a new column to the returned object, which will more
often than not just be clutter. I guess we could introduce a new parameter,
"reduceWhich" which defaults to TRUE and reduces the which. If FALSE, it
instead adds the column mapping back to the original which ranges.


On Sun, Feb 22, 2015 at 2:36 PM, Thomas Sandmann <
sandmann.tho...@gene.com> wrote:


Hi Michael,

ah, I see. I hadn't realized that returning the pileups separately for
each region could be a desired feature, but that makes sense. I agree, as
it is easy for the user to 'reduce' the ranges beforehand your first option
(e.g. returning the ID of the range) would be more flexible.

Perhaps you would consider adding a sentence to the documentation of
'which' on BamTallyParam's help page explaining that users might want to
'reduce' their ranges beforehand if they are only interested in a single
tally for each base ?

Thanks a lot !
Thomas








 [[alternative HTML version deleted]]

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel



--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fredhutch.org
Phone:  (206) 667-5791
Fax:(206) 667-1319

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] BamTallyParam argument 'which'

2015-02-23 Thread Michael Lawrence
Maybe Rsamtools would want to follow this precedent. I think there might be
a difference between fishing out alignments from a SAM/BAM, and deriving a
summary (tallyVariants) from a BAM. It seems like an argument could be made
for a tally set to not contain duplicates.

On Mon, Feb 23, 2015 at 11:05 AM, Leonard Goldstein <
goldstein.leon...@gene.com> wrote:

> Hi Michael and Thomas,
>
> I ran into the same problem in the past (i.e. when I started working
> with functions like scanBam I expected them not to return the same
> alignment multiple times)
>
> One thing to consider might be that returning alignments multiple
> times is consistent with the behavior of the samtools view command.
> Quoting from the samtools manual:
>
> “Important note: when multiple regions are given, some alignments may
> be output multiple times if they overlap more than one of the
> specified regions.”
>
> Maybe there is an argument for keeping things consistent with
> samtools? As you said, if documented properly, the user can decide
> whether to reduce regions specified in which or not.
>
> Leonard
>
>
> On Mon, Feb 23, 2015 at 10:52 AM, Michael Lawrence
>  wrote:
> > We should at leaast try to avoid surprising the user. Seems like most
> > people expect "which" to be a simple restriction, so I think for now I
> will
> > just reduce the which, and if someone has a use case for separate
> queries,
> > we can address it in the future.
> >
> > On Mon, Feb 23, 2015 at 10:41 AM, Thomas Sandmann <
> sandmann.tho...@gene.com>
> > wrote:
> >
> >> Personally, I don't have a use case with "meaningful loci" worth
> tracking,
> >> so keeping it simple would work for me.
> >>
> >> Incidentally, would it be good to deal with the 'which' parameter in a
> >> consistent way across different methods ? I just saw this recent post on
> >> the mailing list in which a used got confused by duplicate counts
> returned
> >> after passing 'which' to scanBamParam:
> >>
> >> https://stat.ethz.ch/pipermail/bioc-devel/2015-February/006978.html
> >>
> >>
> >> ---
> >>
> >> Thomas Sandmann, PhD
> >> Computational biologist
> >>
> >> Genentech, Inc.
> >> 1 DNA Way
> >> South San Francisco, CA 94080
> >> USA
> >>
> >> Phone: +1 650 225 6273
> >> Fax: +1 650 225 5389
> >> Email: sandmann.tho...@gene.com
> >>
> >> "If a man will begin with certainties, he shall end in doubts; but if he
> >> will be content to begin with doubts he shall end in certainties." --
> Sir
> >> Francis Bacon
> >>
> >>
> >> On Mon, Feb 23, 2015 at 10:37 AM, Michael Lawrence <
> >> lawrence.mich...@gene.com> wrote:
> >>
> >>> We just have to decide which is the more useful interpretation of which
> >>> -- as a simple restriction, or as a vector of meaningful locii, which
> will
> >>> be analyzed individually? I would actually favor the first one (the
> same as
> >>> yours), just because it's simpler. To keep track of the query ranges,
> we
> >>> would need to add a new column to the returned object, which will more
> >>> often than not just be clutter. I guess we could introduce a new
> parameter,
> >>> "reduceWhich" which defaults to TRUE and reduces the which. If FALSE,
> it
> >>> instead adds the column mapping back to the original which ranges.
> >>>
> >>>
> >>> On Sun, Feb 22, 2015 at 2:36 PM, Thomas Sandmann <
> >>> sandmann.tho...@gene.com> wrote:
> >>>
>  Hi Michael,
> 
>  ah, I see. I hadn't realized that returning the pileups separately for
>  each region could be a desired feature, but that makes sense. I
> agree, as
>  it is easy for the user to 'reduce' the ranges beforehand your first
> option
>  (e.g. returning the ID of the range) would be more flexible.
> 
>  Perhaps you would consider adding a sentence to the documentation of
>  'which' on BamTallyParam's help page explaining that users might want
> to
>  'reduce' their ranges beforehand if they are only interested in a
> single
>  tally for each base ?
> 
>  Thanks a lot !
>  Thomas
> 
> 
> >>>
> >>
> >
> > [[alternative HTML version deleted]]
> >
> > ___
> > Bioc-devel@r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/bioc-devel
>

[[alternative HTML version deleted]]

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] BamTallyParam argument 'which'

2015-02-23 Thread Leonard Goldstein
Sounds very sensible not to double count in the context of tallying
variants. I was more concerned with reducing which as the default
behavior for scanBam and other functions.

I wanted to bring up the samtools behavior as - for me at least -
inconsistencies between Rsamtools and samtools have been another
source of confusion in the past (e.g. different naming conventions for
fields like isize vs TLEN etc.)

Leonard


On Mon, Feb 23, 2015 at 11:22 AM, Michael Lawrence
 wrote:
> Maybe Rsamtools would want to follow this precedent. I think there might be
> a difference between fishing out alignments from a SAM/BAM, and deriving a
> summary (tallyVariants) from a BAM. It seems like an argument could be made
> for a tally set to not contain duplicates.
>
> On Mon, Feb 23, 2015 at 11:05 AM, Leonard Goldstein
>  wrote:
>>
>> Hi Michael and Thomas,
>>
>> I ran into the same problem in the past (i.e. when I started working
>> with functions like scanBam I expected them not to return the same
>> alignment multiple times)
>>
>> One thing to consider might be that returning alignments multiple
>> times is consistent with the behavior of the samtools view command.
>> Quoting from the samtools manual:
>>
>> “Important note: when multiple regions are given, some alignments may
>> be output multiple times if they overlap more than one of the
>> specified regions.”
>>
>> Maybe there is an argument for keeping things consistent with
>> samtools? As you said, if documented properly, the user can decide
>> whether to reduce regions specified in which or not.
>>
>> Leonard
>>
>>
>> On Mon, Feb 23, 2015 at 10:52 AM, Michael Lawrence
>>  wrote:
>> > We should at leaast try to avoid surprising the user. Seems like most
>> > people expect "which" to be a simple restriction, so I think for now I
>> > will
>> > just reduce the which, and if someone has a use case for separate
>> > queries,
>> > we can address it in the future.
>> >
>> > On Mon, Feb 23, 2015 at 10:41 AM, Thomas Sandmann
>> > 
>> > wrote:
>> >
>> >> Personally, I don't have a use case with "meaningful loci" worth
>> >> tracking,
>> >> so keeping it simple would work for me.
>> >>
>> >> Incidentally, would it be good to deal with the 'which' parameter in a
>> >> consistent way across different methods ? I just saw this recent post
>> >> on
>> >> the mailing list in which a used got confused by duplicate counts
>> >> returned
>> >> after passing 'which' to scanBamParam:
>> >>
>> >> https://stat.ethz.ch/pipermail/bioc-devel/2015-February/006978.html
>> >>
>> >>
>> >> ---
>> >>
>> >> Thomas Sandmann, PhD
>> >> Computational biologist
>> >>
>> >> Genentech, Inc.
>> >> 1 DNA Way
>> >> South San Francisco, CA 94080
>> >> USA
>> >>
>> >> Phone: +1 650 225 6273
>> >> Fax: +1 650 225 5389
>> >> Email: sandmann.tho...@gene.com
>> >>
>> >> "If a man will begin with certainties, he shall end in doubts; but if
>> >> he
>> >> will be content to begin with doubts he shall end in certainties." --
>> >> Sir
>> >> Francis Bacon
>> >>
>> >>
>> >> On Mon, Feb 23, 2015 at 10:37 AM, Michael Lawrence <
>> >> lawrence.mich...@gene.com> wrote:
>> >>
>> >>> We just have to decide which is the more useful interpretation of
>> >>> which
>> >>> -- as a simple restriction, or as a vector of meaningful locii, which
>> >>> will
>> >>> be analyzed individually? I would actually favor the first one (the
>> >>> same as
>> >>> yours), just because it's simpler. To keep track of the query ranges,
>> >>> we
>> >>> would need to add a new column to the returned object, which will more
>> >>> often than not just be clutter. I guess we could introduce a new
>> >>> parameter,
>> >>> "reduceWhich" which defaults to TRUE and reduces the which. If FALSE,
>> >>> it
>> >>> instead adds the column mapping back to the original which ranges.
>> >>>
>> >>>
>> >>> On Sun, Feb 22, 2015 at 2:36 PM, Thomas Sandmann <
>> >>> sandmann.tho...@gene.com> wrote:
>> >>>
>>  Hi Michael,
>> 
>>  ah, I see. I hadn't realized that returning the pileups separately
>>  for
>>  each region could be a desired feature, but that makes sense. I
>>  agree, as
>>  it is easy for the user to 'reduce' the ranges beforehand your first
>>  option
>>  (e.g. returning the ID of the range) would be more flexible.
>> 
>>  Perhaps you would consider adding a sentence to the documentation of
>>  'which' on BamTallyParam's help page explaining that users might want
>>  to
>>  'reduce' their ranges beforehand if they are only interested in a
>>  single
>>  tally for each base ?
>> 
>>  Thanks a lot !
>>  Thomas
>> 
>> 
>> >>>
>> >>
>> >
>> > [[alternative HTML version deleted]]
>> >
>> > ___
>> > Bioc-devel@r-project.org mailing list
>> > https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
>

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-d

Re: [Bioc-devel] BamTallyParam argument 'which'

2015-02-24 Thread Leonardo Collado Torres
Related to my post on a separate thread
(https://stat.ethz.ch/pipermail/bioc-devel/2015-February/006978.html),
I think that if 'which' is not being reduced by default, a simple
example showing the effects of this could be included in the functions
that have such an argument. Also note that 'reducing' could lead to
unintended results.

For example, in the help page for GenomicAlignments::readGAlignments,
after the 'gal4' example it would be nice to add something like this:


## Note that if overlapping ranges are provided in 'which'
## reads could be selected more than once. This would artificually
## increase the coverage or affect other downstream results.
## If you 'reduce' the ranges, reads that originally overlapped
## two disjoint segments will be included.

which_dups <- RangesList(seq1=rep(IRanges(1000, 2000), 2),
seq2=IRanges(c(100, 1000), c(1000, 2000)))
param_dups <- ScanBamParam(which=which_dups)
param_reduced <- ScanBamParam(which=reduce(which_dups))
gal4_dups <- readGAlignments(bamfile, param=param_dups)
gal4_reduced <- readGAlignments(bamfile, param=param_reduced)


length(gal4)

## Duplicates some reads. In this case, all the ones between
## bases 1000 and 2000 on seq1.
length(gal4_dups)

## Includes some reads that mapped around base 1000 in seq2
## that were excluded in gal4.
length(gal4_reduced)








Here's the output:



> library('GenomicAlignments')
>
> ## Code already included in ?readGAlignments
> bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools",
+mustWork=TRUE)
> which <- RangesList(seq1=IRanges(1000, 2000),
+ seq2=IRanges(c(100, 1000), c(1000, 2000)))
> param <- ScanBamParam(which=which)
> gal4 <- readGAlignments(bamfile, param=param)
> gal4
GAlignments object with 2404 alignments and 0 metadata columns:
 seqnames strand   cigarqwidth start   end
width njunc
  
 
 [1] seq1  + 35M35   970  1004
   35 0
 [2] seq1  + 35M35   971  1005
   35 0
 [3] seq1  + 35M35   972  1006
   35 0
 [4] seq1  + 35M35   973  1007
   35 0
 [5] seq1  + 35M35   974  1008
   35 0
 ...  ...... ...   ...   ...   ...
  ...   ...
  [2400] seq2  + 35M35  1524  1558
   35 0
  [2401] seq2  + 35M35  1524  1558
   35 0
  [2402] seq2  - 35M35  1528  1562
   35 0
  [2403] seq2  - 35M35  1532  1566
   35 0
  [2404] seq2  - 35M35  1533  1567
   35 0
  ---
  seqinfo: 2 sequences from an unspecified genome
>
> ## Note that if overlapping ranges are provided in 'which'
> ## reads could be selected more than once. This would artificually
> ## increase the coverage or affect other downstream results.
> ## If you 'reduce' the ranges, reads that originally overlapped
> ## two disjoint segments will be included.
>
> which_dups <- RangesList(seq1=rep(IRanges(1000, 2000), 2),
+ seq2=IRanges(c(100, 1000), c(1000, 2000)))
> param_dups <- ScanBamParam(which=which_dups)
> param_reduced <- ScanBamParam(which=reduce(which_dups))
> gal4_dups <- readGAlignments(bamfile, param=param_dups)
> gal4_reduced <- readGAlignments(bamfile, param=param_reduced)
>
>
> length(gal4)
[1] 2404
>
> ## Duplicates some reads. In this case, all the ones between
> ## bases 1000 and 2000 on seq1.
> length(gal4_dups)
[1] 3014
>
> ## Includes some reads that mapped around base 1000 in seq2
> ## that were excluded in gal4.
> length(gal4_reduced)
[1] 2343
>
>
>
>
>
> options(width = 120)
> devtools::session_info()
Session 
info---
 setting  value
 version  R Under development (unstable) (2014-11-01 r66923)
 system   x86_64, darwin10.8.0
 ui   AQUA
 language (EN)
 collate  en_US.UTF-8
 tz   America/New_York

Packages---
 package   * version date   source
 base64enc   0.1.2   2014-06-26 CRAN (R 3.2.0)
 BatchJobs   1.4 2014-09-24 CRAN (R 3.2.0)
 BBmisc  1.7 2014-06-21 CRAN (R 3.2.0)
 BiocGenerics  * 0.13.4  2014-12-31 Bioconductor
 BiocParallel1.1.13  2015-01-27 Bioconductor
 Biostrings* 2.35.8  2015-02-14 Bioconductor
 bitops  1.0.6   2013-08-17 CRAN (R 3.2.0)
 brew1.0.6   2011-04-13 CRAN (R 3.2.0)
 checkmate   1.5.0   2014-10-19 CRAN (R 3.2.0)
 codetools   0.2.9   2014-08-21 CRAN (R 3.2.0)
 DBI 0.3.1   2014-09-24 CRAN (R 3.2.0)
 devtools1.6.1   20

Re: [Bioc-devel] BamTallyParam argument 'which'

2015-02-25 Thread Gabe Becker
I think we need to be a little careful of trying to know the users
intentions better than they do here.  Reduce is a (very) easy operation to
do on a GRanges, so if the user didn't, are we really safe assuming they
"meant to" when passing the GRanges as a which?

I would argue for "the samtools way", not because samtools does it (though
consistency is good) but because it allows the user to do more things,
while not making it that painful to do the thing that they might want most
often.

I agree with Michael that an additional argument might be a good middle
ground.

~G

On Tue, Feb 24, 2015 at 7:40 PM, Leonardo Collado Torres 
wrote:

> Related to my post on a separate thread
> (https://stat.ethz.ch/pipermail/bioc-devel/2015-February/006978.html),
> I think that if 'which' is not being reduced by default, a simple
> example showing the effects of this could be included in the functions
> that have such an argument. Also note that 'reducing' could lead to
> unintended results.
>
> For example, in the help page for GenomicAlignments::readGAlignments,
> after the 'gal4' example it would be nice to add something like this:
>
>
> ## Note that if overlapping ranges are provided in 'which'
> ## reads could be selected more than once. This would artificually
> ## increase the coverage or affect other downstream results.
> ## If you 'reduce' the ranges, reads that originally overlapped
> ## two disjoint segments will be included.
>
> which_dups <- RangesList(seq1=rep(IRanges(1000, 2000), 2),
> seq2=IRanges(c(100, 1000), c(1000, 2000)))
> param_dups <- ScanBamParam(which=which_dups)
> param_reduced <- ScanBamParam(which=reduce(which_dups))
> gal4_dups <- readGAlignments(bamfile, param=param_dups)
> gal4_reduced <- readGAlignments(bamfile, param=param_reduced)
>
>
> length(gal4)
>
> ## Duplicates some reads. In this case, all the ones between
> ## bases 1000 and 2000 on seq1.
> length(gal4_dups)
>
> ## Includes some reads that mapped around base 1000 in seq2
> ## that were excluded in gal4.
> length(gal4_reduced)
>
>
>
>
>
>
>
>
> Here's the output:
>
>
>
> > library('GenomicAlignments')
> >
> > ## Code already included in ?readGAlignments
> > bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools",
> +mustWork=TRUE)
> > which <- RangesList(seq1=IRanges(1000, 2000),
> + seq2=IRanges(c(100, 1000), c(1000, 2000)))
> > param <- ScanBamParam(which=which)
> > gal4 <- readGAlignments(bamfile, param=param)
> > gal4
> GAlignments object with 2404 alignments and 0 metadata columns:
>  seqnames strand   cigarqwidth start   end
> width njunc
>   
>  
>  [1] seq1  + 35M35   970  1004
>35 0
>  [2] seq1  + 35M35   971  1005
>35 0
>  [3] seq1  + 35M35   972  1006
>35 0
>  [4] seq1  + 35M35   973  1007
>35 0
>  [5] seq1  + 35M35   974  1008
>35 0
>  ...  ...... ...   ...   ...   ...
>   ...   ...
>   [2400] seq2  + 35M35  1524  1558
>35 0
>   [2401] seq2  + 35M35  1524  1558
>35 0
>   [2402] seq2  - 35M35  1528  1562
>35 0
>   [2403] seq2  - 35M35  1532  1566
>35 0
>   [2404] seq2  - 35M35  1533  1567
>35 0
>   ---
>   seqinfo: 2 sequences from an unspecified genome
> >
> > ## Note that if overlapping ranges are provided in 'which'
> > ## reads could be selected more than once. This would artificually
> > ## increase the coverage or affect other downstream results.
> > ## If you 'reduce' the ranges, reads that originally overlapped
> > ## two disjoint segments will be included.
> >
> > which_dups <- RangesList(seq1=rep(IRanges(1000, 2000), 2),
> + seq2=IRanges(c(100, 1000), c(1000, 2000)))
> > param_dups <- ScanBamParam(which=which_dups)
> > param_reduced <- ScanBamParam(which=reduce(which_dups))
> > gal4_dups <- readGAlignments(bamfile, param=param_dups)
> > gal4_reduced <- readGAlignments(bamfile, param=param_reduced)
> >
> >
> > length(gal4)
> [1] 2404
> >
> > ## Duplicates some reads. In this case, all the ones between
> > ## bases 1000 and 2000 on seq1.
> > length(gal4_dups)
> [1] 3014
> >
> > ## Includes some reads that mapped around base 1000 in seq2
> > ## that were excluded in gal4.
> > length(gal4_reduced)
> [1] 2343
> >
> >
> >
> >
> >
> > options(width = 120)
> > devtools::session_info()
> Session
> info---
>  setting  value
>  version  R Under development (unstable) (2014-11-01 r66923)
>  system   x86_64, darwin10.8.0
>  ui

Re: [Bioc-devel] BamTallyParam argument 'which'

2015-02-25 Thread Michael Lawrence
I'm not convinced the samtools way lets the user do anything useful with
tallyVariants. At least, I can't think of one. The arguments for
consistency with samtools do not really hold at the summary level.
Ultimately, you want a summary of each position. Having the summaries
duplicated in the output does not really help, unless there is an easy way
to map back to the original locii. But I'm not going to add that complexity
until there is an actual use case. Right now, users are universally
surprised by the repetition of the results. A parameter could be added when
the need arises.

On Wed, Feb 25, 2015 at 6:37 AM, Gabe Becker  wrote:

> I think we need to be a little careful of trying to know the users
> intentions better than they do here.  Reduce is a (very) easy operation to
> do on a GRanges, so if the user didn't, are we really safe assuming they
> "meant to" when passing the GRanges as a which?
>
> I would argue for "the samtools way", not because samtools does it (though
> consistency is good) but because it allows the user to do more things,
> while not making it that painful to do the thing that they might want most
> often.
>
> I agree with Michael that an additional argument might be a good middle
> ground.
>
> ~G
>
> On Tue, Feb 24, 2015 at 7:40 PM, Leonardo Collado Torres  > wrote:
>
>> Related to my post on a separate thread
>> (https://stat.ethz.ch/pipermail/bioc-devel/2015-February/006978.html),
>> I think that if 'which' is not being reduced by default, a simple
>> example showing the effects of this could be included in the functions
>> that have such an argument. Also note that 'reducing' could lead to
>> unintended results.
>>
>> For example, in the help page for GenomicAlignments::readGAlignments,
>> after the 'gal4' example it would be nice to add something like this:
>>
>>
>> ## Note that if overlapping ranges are provided in 'which'
>> ## reads could be selected more than once. This would artificually
>> ## increase the coverage or affect other downstream results.
>> ## If you 'reduce' the ranges, reads that originally overlapped
>> ## two disjoint segments will be included.
>>
>> which_dups <- RangesList(seq1=rep(IRanges(1000, 2000), 2),
>> seq2=IRanges(c(100, 1000), c(1000, 2000)))
>> param_dups <- ScanBamParam(which=which_dups)
>> param_reduced <- ScanBamParam(which=reduce(which_dups))
>> gal4_dups <- readGAlignments(bamfile, param=param_dups)
>> gal4_reduced <- readGAlignments(bamfile, param=param_reduced)
>>
>>
>> length(gal4)
>>
>> ## Duplicates some reads. In this case, all the ones between
>> ## bases 1000 and 2000 on seq1.
>> length(gal4_dups)
>>
>> ## Includes some reads that mapped around base 1000 in seq2
>> ## that were excluded in gal4.
>> length(gal4_reduced)
>>
>>
>>
>>
>>
>>
>>
>>
>> Here's the output:
>>
>>
>>
>> > library('GenomicAlignments')
>> >
>> > ## Code already included in ?readGAlignments
>> > bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools",
>> +mustWork=TRUE)
>> > which <- RangesList(seq1=IRanges(1000, 2000),
>> + seq2=IRanges(c(100, 1000), c(1000, 2000)))
>> > param <- ScanBamParam(which=which)
>> > gal4 <- readGAlignments(bamfile, param=param)
>> > gal4
>> GAlignments object with 2404 alignments and 0 metadata columns:
>>  seqnames strand   cigarqwidth start   end
>> width njunc
>>   
>>  
>>  [1] seq1  + 35M35   970  1004
>>35 0
>>  [2] seq1  + 35M35   971  1005
>>35 0
>>  [3] seq1  + 35M35   972  1006
>>35 0
>>  [4] seq1  + 35M35   973  1007
>>35 0
>>  [5] seq1  + 35M35   974  1008
>>35 0
>>  ...  ...... ...   ...   ...   ...
>>   ...   ...
>>   [2400] seq2  + 35M35  1524  1558
>>35 0
>>   [2401] seq2  + 35M35  1524  1558
>>35 0
>>   [2402] seq2  - 35M35  1528  1562
>>35 0
>>   [2403] seq2  - 35M35  1532  1566
>>35 0
>>   [2404] seq2  - 35M35  1533  1567
>>35 0
>>   ---
>>   seqinfo: 2 sequences from an unspecified genome
>> >
>> > ## Note that if overlapping ranges are provided in 'which'
>> > ## reads could be selected more than once. This would artificually
>> > ## increase the coverage or affect other downstream results.
>> > ## If you 'reduce' the ranges, reads that originally overlapped
>> > ## two disjoint segments will be included.
>> >
>> > which_dups <- RangesList(seq1=rep(IRanges(1000, 2000), 2),
>> + seq2=IRanges(c(100, 1000), c(1000, 2000)))
>> > param_dups <- ScanBamParam(which=which_dups)
>> > param_reduced <- ScanBamParam(which=reduce(

Re: [Bioc-devel] BamTallyParam argument 'which'

2015-02-25 Thread Hervé Pagès

Thanks Leonardo. I'll add this to the man page. This will definitely
help clarify the "duplicated selection" issue.

H.

On 02/24/2015 07:40 PM, Leonardo Collado Torres wrote:

Related to my post on a separate thread
(https://stat.ethz.ch/pipermail/bioc-devel/2015-February/006978.html),
I think that if 'which' is not being reduced by default, a simple
example showing the effects of this could be included in the functions
that have such an argument. Also note that 'reducing' could lead to
unintended results.

For example, in the help page for GenomicAlignments::readGAlignments,
after the 'gal4' example it would be nice to add something like this:


## Note that if overlapping ranges are provided in 'which'
## reads could be selected more than once. This would artificually
## increase the coverage or affect other downstream results.
## If you 'reduce' the ranges, reads that originally overlapped
## two disjoint segments will be included.

which_dups <- RangesList(seq1=rep(IRanges(1000, 2000), 2),
 seq2=IRanges(c(100, 1000), c(1000, 2000)))
param_dups <- ScanBamParam(which=which_dups)
param_reduced <- ScanBamParam(which=reduce(which_dups))
gal4_dups <- readGAlignments(bamfile, param=param_dups)
gal4_reduced <- readGAlignments(bamfile, param=param_reduced)


length(gal4)

## Duplicates some reads. In this case, all the ones between
## bases 1000 and 2000 on seq1.
length(gal4_dups)

## Includes some reads that mapped around base 1000 in seq2
## that were excluded in gal4.
length(gal4_reduced)








Here's the output:




library('GenomicAlignments')

## Code already included in ?readGAlignments
bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools",

+mustWork=TRUE)

which <- RangesList(seq1=IRanges(1000, 2000),

+ seq2=IRanges(c(100, 1000), c(1000, 2000)))

param <- ScanBamParam(which=which)
gal4 <- readGAlignments(bamfile, param=param)
gal4

GAlignments object with 2404 alignments and 0 metadata columns:
  seqnames strand   cigarqwidth start   end
width njunc
   
 
  [1] seq1  + 35M35   970  1004
35 0
  [2] seq1  + 35M35   971  1005
35 0
  [3] seq1  + 35M35   972  1006
35 0
  [4] seq1  + 35M35   973  1007
35 0
  [5] seq1  + 35M35   974  1008
35 0
  ...  ...... ...   ...   ...   ...
   ...   ...
   [2400] seq2  + 35M35  1524  1558
35 0
   [2401] seq2  + 35M35  1524  1558
35 0
   [2402] seq2  - 35M35  1528  1562
35 0
   [2403] seq2  - 35M35  1532  1566
35 0
   [2404] seq2  - 35M35  1533  1567
35 0
   ---
   seqinfo: 2 sequences from an unspecified genome


## Note that if overlapping ranges are provided in 'which'
## reads could be selected more than once. This would artificually
## increase the coverage or affect other downstream results.
## If you 'reduce' the ranges, reads that originally overlapped
## two disjoint segments will be included.

which_dups <- RangesList(seq1=rep(IRanges(1000, 2000), 2),

+ seq2=IRanges(c(100, 1000), c(1000, 2000)))

param_dups <- ScanBamParam(which=which_dups)
param_reduced <- ScanBamParam(which=reduce(which_dups))
gal4_dups <- readGAlignments(bamfile, param=param_dups)
gal4_reduced <- readGAlignments(bamfile, param=param_reduced)


length(gal4)

[1] 2404


## Duplicates some reads. In this case, all the ones between
## bases 1000 and 2000 on seq1.
length(gal4_dups)

[1] 3014


## Includes some reads that mapped around base 1000 in seq2
## that were excluded in gal4.
length(gal4_reduced)

[1] 2343






options(width = 120)
devtools::session_info()

Session 
info---
  setting  value
  version  R Under development (unstable) (2014-11-01 r66923)
  system   x86_64, darwin10.8.0
  ui   AQUA
  language (EN)
  collate  en_US.UTF-8
  tz   America/New_York

Packages---
  package   * version date   source
  base64enc   0.1.2   2014-06-26 CRAN (R 3.2.0)
  BatchJobs   1.4 2014-09-24 CRAN (R 3.2.0)
  BBmisc  1.7 2014-06-21 CRAN (R 3.2.0)
  BiocGenerics  * 0.13.4  2014-12-31 Bioconductor
  BiocParallel1.1.13  2015-01-27 Bioconductor
  Biostrings* 2.35.8  2015-02-14 Bioconductor
  bitops  1.0.6   2013-08-17 CRAN (R 3.2.0)
  brew1.0.6   2011-04-13 CRAN (R 3.2.0)
  checkmate   1

Re: [Bioc-devel] BamTallyParam argument 'which'

2015-02-25 Thread Hervé Pagès

FWIW, and after checking how pileup() handles this, I thought I should
report:

  library(GenomicAlignments)
  bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools")

Then:

  > stackStringsFromBam(bamfile, param="seq1:1-6")
A DNAStringSet instance of length 4
  width seq
  [1] 6 CACTAG
  [2] 6 ++CTAG
  [3] 6 AG
  [4] 6 +G

  > which <- GRanges("seq1", IRanges(c(1, 5, 1), c(2, 6, 3)))
  > pileup(bamfile, scanBamParam=ScanBamParam(which=which))
seqnames pos strand nucleotide count which_label
  1 seq1   1  +  C 1seq1:1-2
  2 seq1   2  +  A 1seq1:1-2
  3 seq1   5  +  A 3seq1:5-6
  4 seq1   6  +  G 4seq1:5-6
  5 seq1   1  +  C 1seq1:1-3
  6 seq1   2  +  A 1seq1:1-3
  7 seq1   3  +  C 2seq1:1-3

This output looks clean to me: (a) Nucleotide positions that belong
to more than 1 range in 'which' are treated as distinct positions,
and (b) reads that overlap with 2 non-overlapping ranges (e.g.
read #1 and ranges #1 and #2) only contribute at most once at each
position covered by these ranges.

H.


On 02/25/2015 01:21 PM, Hervé Pagès wrote:

Hi,

On 02/25/2015 06:37 AM, Gabe Becker wrote:

I think we need to be a little careful of trying to know the users
intentions better than they do here.  Reduce is a (very) easy operation
to do on a GRanges, so if the user didn't, are we really safe assuming
they "meant to" when passing the GRanges as a which?

I would argue for "the samtools way", not because samtools does it
(though consistency is good) but because it allows the user to do more
things, while not making it that painful to do the thing that they might
want most often.

I agree with Michael that an additional argument might be a good middle
ground.


Maybe another good middle ground is to issue a warning when 'which'
contains overlapping ranges. The warning could suggest to the user
to reduce() the ranges first. Maybe the warning should also point
out that reducing doesn't completely eliminate the risk of selecting
the same record more than once (at least that's the case for the
readGAlignment* functions, but I don't know if it holds for
BamTallyParam). The risk is actually higher with paired-end reads
where the same pair is selected once for each range in 'which' that
overlaps with at least one of the 2 mates in the pair.

But as already discussed here (or maybe it was on the old bioconductor
list, don't remember), better solutions to the "duplicated selection"
problem could be achieved via one of the following:

   (1) Expose some sort of unique id for the records in a BAM file.
   I agree with Michael that "duplicated selection" is incompatible
   with summarization tools like BamTallyParam or pileup(). Having
   access to a unique id would completely solve the problem.

   (2) Introduce an argument similar to 'which' but with a slightly
   different semantic e.g. select records that *start* in one of
   the specified ranges (for single-end reads), or select pairs of
   records for which the *first* mate starts in one of the specified
   ranges.
   Advantages of this semantic:

 (a) If the ranges don't overlap, then no duplicates.

 (b) It can be used in the context of tiling the genome and
 processing the reads by tile. This plays well with
 parallelization (the semantic of 'which' does not).

   Inconvenient: the arbitrary nature of the selection criteria
   and its lack of symmetry is incompatible with summarization
   tools like BamTallyParam or pileup().

Note that (1) and (2) are not exclusive.

Cheers,
H.



~G

On Tue, Feb 24, 2015 at 7:40 PM, Leonardo Collado Torres
mailto:lcoll...@jhu.edu>> wrote:

Related to my post on a separate thread

(https://stat.ethz.ch/pipermail/bioc-devel/2015-February/006978.html),
I think that if 'which' is not being reduced by default, a simple
example showing the effects of this could be included in the
functions
that have such an argument. Also note that 'reducing' could lead to
unintended results.

For example, in the help page for GenomicAlignments::readGAlignments,
after the 'gal4' example it would be nice to add something like this:


## Note that if overlapping ranges are provided in 'which'
## reads could be selected more than once. This would artificually
## increase the coverage or affect other downstream results.
## If you 'reduce' the ranges, reads that originally overlapped
## two disjoint segments will be included.

which_dups <- RangesList(seq1=rep(IRanges(1000, 2000), 2),
 seq2=IRanges(c(100, 1000), c(1000, 2000)))
param_dups <- ScanBamParam(which=which_dups)
param_reduced <- ScanBamParam(which=reduce(which_dups))
gal4_dups <- readGAlignments(bamfile, param=param_dups)
gal4_reduced <- 

Re: [Bioc-devel] BamTallyParam argument 'which'

2015-02-25 Thread Hervé Pagès

Hi,

On 02/25/2015 06:37 AM, Gabe Becker wrote:

I think we need to be a little careful of trying to know the users
intentions better than they do here.  Reduce is a (very) easy operation
to do on a GRanges, so if the user didn't, are we really safe assuming
they "meant to" when passing the GRanges as a which?

I would argue for "the samtools way", not because samtools does it
(though consistency is good) but because it allows the user to do more
things, while not making it that painful to do the thing that they might
want most often.

I agree with Michael that an additional argument might be a good middle
ground.


Maybe another good middle ground is to issue a warning when 'which'
contains overlapping ranges. The warning could suggest to the user
to reduce() the ranges first. Maybe the warning should also point
out that reducing doesn't completely eliminate the risk of selecting
the same record more than once (at least that's the case for the
readGAlignment* functions, but I don't know if it holds for
BamTallyParam). The risk is actually higher with paired-end reads
where the same pair is selected once for each range in 'which' that
overlaps with at least one of the 2 mates in the pair.

But as already discussed here (or maybe it was on the old bioconductor
list, don't remember), better solutions to the "duplicated selection"
problem could be achieved via one of the following:

  (1) Expose some sort of unique id for the records in a BAM file.
  I agree with Michael that "duplicated selection" is incompatible
  with summarization tools like BamTallyParam or pileup(). Having
  access to a unique id would completely solve the problem.

  (2) Introduce an argument similar to 'which' but with a slightly
  different semantic e.g. select records that *start* in one of
  the specified ranges (for single-end reads), or select pairs of
  records for which the *first* mate starts in one of the specified
  ranges.
  Advantages of this semantic:

(a) If the ranges don't overlap, then no duplicates.

(b) It can be used in the context of tiling the genome and
processing the reads by tile. This plays well with
parallelization (the semantic of 'which' does not).

  Inconvenient: the arbitrary nature of the selection criteria
  and its lack of symmetry is incompatible with summarization
  tools like BamTallyParam or pileup().

Note that (1) and (2) are not exclusive.

Cheers,
H.



~G

On Tue, Feb 24, 2015 at 7:40 PM, Leonardo Collado Torres
mailto:lcoll...@jhu.edu>> wrote:

Related to my post on a separate thread
(https://stat.ethz.ch/pipermail/bioc-devel/2015-February/006978.html),
I think that if 'which' is not being reduced by default, a simple
example showing the effects of this could be included in the functions
that have such an argument. Also note that 'reducing' could lead to
unintended results.

For example, in the help page for GenomicAlignments::readGAlignments,
after the 'gal4' example it would be nice to add something like this:


## Note that if overlapping ranges are provided in 'which'
## reads could be selected more than once. This would artificually
## increase the coverage or affect other downstream results.
## If you 'reduce' the ranges, reads that originally overlapped
## two disjoint segments will be included.

which_dups <- RangesList(seq1=rep(IRanges(1000, 2000), 2),
 seq2=IRanges(c(100, 1000), c(1000, 2000)))
param_dups <- ScanBamParam(which=which_dups)
param_reduced <- ScanBamParam(which=reduce(which_dups))
gal4_dups <- readGAlignments(bamfile, param=param_dups)
gal4_reduced <- readGAlignments(bamfile, param=param_reduced)


length(gal4)

## Duplicates some reads. In this case, all the ones between
## bases 1000 and 2000 on seq1.
length(gal4_dups)

## Includes some reads that mapped around base 1000 in seq2
## that were excluded in gal4.
length(gal4_reduced)








Here's the output:



 > library('GenomicAlignments')
 >
 > ## Code already included in ?readGAlignments
 > bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools",
+mustWork=TRUE)
 > which <- RangesList(seq1=IRanges(1000, 2000),
+ seq2=IRanges(c(100, 1000), c(1000, 2000)))
 > param <- ScanBamParam(which=which)
 > gal4 <- readGAlignments(bamfile, param=param)
 > gal4
GAlignments object with 2404 alignments and 0 metadata columns:
  seqnames strand   cigarqwidth start   end
width njunc
   
 
  [1] seq1  + 35M35   970  1004
35 0
  [2] seq1  + 35M35   971  1005
35 0
  [3] seq1  + 35M35   972  1006
35 0
  [4] seq