Re: [Bioc-devel] Please bump version number when committing changes

2014-09-08 Thread Gabe Becker
Michael,

Tags could work. Another approach  would be to update the repository and
then look in the log to see if the version number was changed in the most
recent commit. In a sense this is the converse of what our GRANBase package
does when locating historical package versions within the Bioc SVN.

This has the benefit of not requiring the authors do anything they didn't
already need to do in order to "flag" a release (i.e., bump the version
number).

I should note also to Dan's later point that while the Bioc repository
always builds whatever is in the latest commit currently, that need not be
the case. Our internal package repository only builds on  version bumps (of
a package or any of it's recursive dependencies).

~
G

On Fri, Sep 5, 2014 at 6:48 PM, Michael Lawrence 
wrote:

> As Pete and Ryan have pointed out, it seems that the version control system
> should somehow ease the burden of the developer here.
>
> Let's look at this from the github perspective, since it is likely to be
> the primary hosting mechanism for the foreseeable future. Just thinking out
> loud, if R could somehow dynamically ascertain the version of a package at
> build time, it could query the git checkout for a version. A simple
> algorithm that I have found effective in non-R projects is to consider git
> tags, which on github equate to releases. If the repository state is *at*
> the tag, then use the tag as the version. If the state is ahead of the most
> recent tag, then use the tag + latest commit hash. I wonder if R could
> support this by allowing a path to an R script in the version field?
>
>
>
> On Fri, Sep 5, 2014 at 6:27 PM, Vincent Carey 
> wrote:
>
> > On Fri, Sep 5, 2014 at 7:50 PM, Peter Haverty 
> > wrote:
> >
> > > Hi all,
> > >
> > > I respectfully disagree.  One should certainly check in each discrete
> > unit
> > > of work.  These will often not result in something that is ready to be
> > used
> > > by someone else.  Bumping the version number constitutes a new release
> > and
> > > carries the implicit promise that the package works again.  This is why
> > >
> >
> > Here I would respectfully disagree.  Code in the devel branch carries no
> > guarantees.
> > I think we have been pretty loose with respect to package version number
> > bumping in devel
> > branch; the svn tracking can be used to deal with isolation of code for
> > rollbacks.
> >
> > In this informal regime the package version number is a simple marker of
> > package state.
> > I think it has served us pretty well in past years but the developer
> > community was smaller
> > and had fairly homogeneous habits.
> >
> > Clearly there is room for more regimentation in this area but at the
> moment
> > I agree with
> > Dan that version numbers are cheap and should be bumped when new code is
> > committed.
> > And the recognition by all that a devel image may not work and may change
> > fairly dramatically
> > while in devel should be general; whether we need to alter that is open
> to
> > question but I would
> > think not.
> >
> >
> > > continuous integration systems do a build when the version number
> > changes.
> > >
> > > One should expect working software when installing a pre-build package
> > (the
> > > tests passed, right?).  Checking out from SVN is for developers of that
> > > package and nothing should be assumed about the current state of the
> > code.
> > >
> > > To keep everyone happy, one could add a commit hook to our SVN setup
> that
> > > would add the SVN revision number to the version string.  This would be
> > for
> > > dev only and hopefully not sufficient to trigger a build.
> > >
> > > That's my two cents.  Happy weekend all.
> > >
> > > Regards,
> > >
> > >
> > >
> > > Pete
> > >
> > > 
> > > Peter M. Haverty, Ph.D.
> > > Genentech, Inc.
> > > phave...@gene.com
> > >
> > >
> > > On Fri, Sep 5, 2014 at 4:30 PM, Dan Tenenbaum 
> > wrote:
> > >
> > > >
> > > >
> > > > - Original Message -
> > > > > From: "Stephanie M. Gogarten" 
> > > > > To: "Dan Tenenbaum" , "bioc-devel" <
> > > > bioc-devel@r-project.org>
> > > > > Sent: Friday, September 5, 2014 4:27:13 PM
> > > > > Subject: Re: [Bioc-devel] Please bump version number when
> committing
> > > > changes
> > > > >
> > > > > I am guilty of doing this today, but I have (I think) a good
> reason.
> > > > > I'm making a bunch of changes that are all related to each other,
> but
> > > > > are being implemented and tested in stages.  I'd like to use svn to
> > > > > commit when I've made a set of changes that works, so I can roll
> back
> > > > > if
> > > > > I break something in the next step, but I'd like the users to see
> > > > > them
> > > > > all at once as a single version update.  Perhaps others are doing
> > > > > something similar?
> > > > >
> > > >
> > > > I understand the motivation but this still results in an ambiguous
> > state
> > > > if two different people check out your package from svn at different
> > > times
> > > > today (before

[Bioc-devel] should genome() be so complicated?/add genome report to GRanges show method

2014-09-08 Thread Vincent Carey
For GRanges x, my naive expectation is that genome(x) returns a length-

one tag identifying the genome to which chromosomal coordinates

correspond.  The genome() method seems to have sequence-specific

semantics, which makes sense, but when we identify sequence

with chromosome, it seems too complicated.  Is there a use case for

a GRanges with sequences from several different genomes?


One reason I am inquiring is that I feel it would be nice to have the
GRanges show() method report, prominently, the genome in use (or NA

if unspecified).  This could be accomplished by reporting
unique(genome(x)), and perhaps that would be satisfactory.

after example(genome) :

> seqinfo(txdb)

Seqinfo of length 15

seqnames seqlengths isCircular genome

CH2L   23011544  FALSEdm3

CH2R   21146708  FALSEdm3

CH3L   24543557  FALSEdm3

CH3R   27905053  FALSEdm3

CH4 1351857  FALSEdm3

... .........

CH3LHet 2555491  FALSEdm3

CH3RHet 2517507  FALSEdm3

CHXHet   204112  FALSEdm3

CHYHet   347038  FALSEdm3

CHUextra   29004656  FALSEdm3

> genome(seqinfo(txdb))

CH2L CH2R CH3L CH3R  CH4  CHX  CHUM

   "dm3""dm3""dm3""dm3""dm3""dm3""dm3""dm3"

 CH2LHet  CH2RHet  CH3LHet  CH3RHet   CHXHet   CHYHet CHUextra

   "dm3""dm3""dm3""dm3""dm3""dm3""dm3"

[[alternative HTML version deleted]]

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] writeVcf performance

2014-09-08 Thread Valerie Obenchain

The new writeVcf code is in 1.11.28.

Using the illumina file you suggested, geno fields only, writing now 
takes about 17 minutes.


> hdr
class: VCFHeader
samples(1): NA12877
meta(6): fileformat ApplyRecalibration ... reference source
fixed(1): FILTER
info(22): AC AF ... culprit set
geno(8): GT GQX ... PL VF

> param = ScanVcfParam(info=NA)
> vcf = readVcf(fl, "", param=param)
> dim(vcf)
[1] 516127621

> system.time(writeVcf(vcf, "out.vcf"))
user   system  elapsed
 971.0326.568 1004.593

In 1.11.28, parsing of geno data was moved to C. If this didn't speed 
things up enough we were planning to implement 'chunking' through the 
VCF and/or move the parsing of info to C, however, it looks like geno 
was the bottleneck.


I've tested a number of samples/fields combinations in files with >= .5 
million rows and the improvement over writeVcf() in release is ~ 90%.


Valerie



On 09/04/14 15:28, Valerie Obenchain wrote:

Thanks Gabe. I should have something for you on Monday.

Val


On 09/04/2014 01:56 PM, Gabe Becker wrote:

Val and Martin,

Apologies for the delay.

We realized that the Illumina platinum genome vcf files make a good test
case, assuming you strip out all the info (info=NA when reading it into
R) stuff.

ftp://platgene:g3n3s...@ussd-ftp.illumina.com/NA12877_S1.genome.vcf.gz
took about ~4.2 hrs to write out, and is about 1.5x the size of the
files we are actually dealing with (~50M ranges vs our ~30M).

Looking forward a new vastly improved writeVcf :).

~G


On Tue, Sep 2, 2014 at 1:53 PM, Michael Lawrence
mailto:lawrence.mich...@gene.com>> wrote:

Yes, it's very clear that the scaling is non-linear, and Gabe has
been experimenting with a chunk-wise + parallel algorithm.
Unfortunately there is some frustrating overhead with the
parallelism. But I'm glad Val is arriving at something quicker.

Michael


On Tue, Sep 2, 2014 at 1:33 PM, Martin Morgan mailto:mtmor...@fhcrc.org>> wrote:

On 08/27/2014 11:56 AM, Gabe Becker wrote:

The profiling I attached in my previous email is for 24 geno
fields, as I said,
but our typical usecase involves only ~4-6 fields, and is
faster but still on
the order of dozens of minutes.


I think Val is arriving at a (much) more efficient
implementation, but...

I wanted to share my guess that the poor _scaling_ is because
the garbage collector runs multiple times as the different
strings are pasted together, and has to traverse, in linear
time, increasing numbers of allocated SEXPs. So times scale
approximately quadratically with the number of rows in the VCF

An efficiency is to reduce the number of SEXPs in play by
writing out in chunks -- as each chunk is written, the SEXPs
become available for collection and are re-used. Here's my toy
example

time.R
==
splitIndices <- function (nx, ncl)
{
 i <- seq_len(nx)
 if (ncl == 0L)
 list()
 else if (ncl == 1L || nx == 1L)
 list(i)
 else {
 fuzz <- min((nx - 1L)/1000, 0.4 * nx/ncl)
 breaks <- seq(1 - fuzz, nx + fuzz, length = ncl + 1L)
 structure(split(i, cut(i, breaks, labels=FALSE)), names
= NULL)
 }
}

x = as.character(seq_len(1e7)); y = sample(x)
if (!is.na (Sys.getenv("SPLIT", NA))) {
 idx <- splitIndices(length(x), 20)
 system.time(for (i in idx) paste(x[i], y[i], sep=":"))
} else {
 system.time(paste(x, y, sep=":"))
}


running under R-devel with $ SPLIT=TRUE R --no-save --quiet -f
time.R the relevant time is

user  system elapsed
  15.320   0.064  15.381

versus with $ R --no-save --quiet -f time.R it is

user  system elapsed
  95.360   0.164  95.511

I think this is likely an overall strategy when dealing with
character data -- processing in independent chunks of moderate
(1M?) size (enabling as a consequence parallel evaluation in
modest memory) that are sufficient to benefit from
vectorization, but that do not entail allocation of large
numbers of in-use SEXPs.

Martin


Sorry for the confusion.
~G


On Wed, Aug 27, 2014 at 11:45 AM, Gabe Becker
mailto:becke...@gene.com>
>> wrote:

 Martin and Val.

 I re-ran writeVcf on our (G)VCF data (34790518 ranges,
24 geno fields) with
 profiling enabled. The results of summaryRprof for that
run are attached,
 though for a variety of reasons they are pretty
misleading.

 It took ove

Re: [Bioc-devel] writeVcf performance

2014-09-08 Thread Gabe Becker
Val,

That is great. I'll check this out and test it on our end.

~G

On Mon, Sep 8, 2014 at 8:38 AM, Valerie Obenchain 
wrote:

> The new writeVcf code is in 1.11.28.
>
> Using the illumina file you suggested, geno fields only, writing now takes
> about 17 minutes.
>
> > hdr
> class: VCFHeader
> samples(1): NA12877
> meta(6): fileformat ApplyRecalibration ... reference source
> fixed(1): FILTER
> info(22): AC AF ... culprit set
> geno(8): GT GQX ... PL VF
>
> > param = ScanVcfParam(info=NA)
> > vcf = readVcf(fl, "", param=param)
> > dim(vcf)
> [1] 516127621
>
> > system.time(writeVcf(vcf, "out.vcf"))
> user   system  elapsed
>  971.0326.568 1004.593
>
> In 1.11.28, parsing of geno data was moved to C. If this didn't speed
> things up enough we were planning to implement 'chunking' through the VCF
> and/or move the parsing of info to C, however, it looks like geno was the
> bottleneck.
>
> I've tested a number of samples/fields combinations in files with >= .5
> million rows and the improvement over writeVcf() in release is ~ 90%.
>
> Valerie
>
>
>
>
> On 09/04/14 15:28, Valerie Obenchain wrote:
>
>> Thanks Gabe. I should have something for you on Monday.
>>
>> Val
>>
>>
>> On 09/04/2014 01:56 PM, Gabe Becker wrote:
>>
>>> Val and Martin,
>>>
>>> Apologies for the delay.
>>>
>>> We realized that the Illumina platinum genome vcf files make a good test
>>> case, assuming you strip out all the info (info=NA when reading it into
>>> R) stuff.
>>>
>>> ftp://platgene:g3n3s...@ussd-ftp.illumina.com/NA12877_S1.genome.vcf.gz
>>> took about ~4.2 hrs to write out, and is about 1.5x the size of the
>>> files we are actually dealing with (~50M ranges vs our ~30M).
>>>
>>> Looking forward a new vastly improved writeVcf :).
>>>
>>> ~G
>>>
>>>
>>> On Tue, Sep 2, 2014 at 1:53 PM, Michael Lawrence
>>> mailto:lawrence.mich...@gene.com>> wrote:
>>>
>>> Yes, it's very clear that the scaling is non-linear, and Gabe has
>>> been experimenting with a chunk-wise + parallel algorithm.
>>> Unfortunately there is some frustrating overhead with the
>>> parallelism. But I'm glad Val is arriving at something quicker.
>>>
>>> Michael
>>>
>>>
>>> On Tue, Sep 2, 2014 at 1:33 PM, Martin Morgan >> > wrote:
>>>
>>> On 08/27/2014 11:56 AM, Gabe Becker wrote:
>>>
>>> The profiling I attached in my previous email is for 24 geno
>>> fields, as I said,
>>> but our typical usecase involves only ~4-6 fields, and is
>>> faster but still on
>>> the order of dozens of minutes.
>>>
>>>
>>> I think Val is arriving at a (much) more efficient
>>> implementation, but...
>>>
>>> I wanted to share my guess that the poor _scaling_ is because
>>> the garbage collector runs multiple times as the different
>>> strings are pasted together, and has to traverse, in linear
>>> time, increasing numbers of allocated SEXPs. So times scale
>>> approximately quadratically with the number of rows in the VCF
>>>
>>> An efficiency is to reduce the number of SEXPs in play by
>>> writing out in chunks -- as each chunk is written, the SEXPs
>>> become available for collection and are re-used. Here's my toy
>>> example
>>>
>>> time.R
>>> ==
>>> splitIndices <- function (nx, ncl)
>>> {
>>>  i <- seq_len(nx)
>>>  if (ncl == 0L)
>>>  list()
>>>  else if (ncl == 1L || nx == 1L)
>>>  list(i)
>>>  else {
>>>  fuzz <- min((nx - 1L)/1000, 0.4 * nx/ncl)
>>>  breaks <- seq(1 - fuzz, nx + fuzz, length = ncl + 1L)
>>>  structure(split(i, cut(i, breaks, labels=FALSE)), names
>>> = NULL)
>>>  }
>>> }
>>>
>>> x = as.character(seq_len(1e7)); y = sample(x)
>>> if (!is.na (Sys.getenv("SPLIT", NA))) {
>>>  idx <- splitIndices(length(x), 20)
>>>  system.time(for (i in idx) paste(x[i], y[i], sep=":"))
>>> } else {
>>>  system.time(paste(x, y, sep=":"))
>>> }
>>>
>>>
>>> running under R-devel with $ SPLIT=TRUE R --no-save --quiet -f
>>> time.R the relevant time is
>>>
>>> user  system elapsed
>>>   15.320   0.064  15.381
>>>
>>> versus with $ R --no-save --quiet -f time.R it is
>>>
>>> user  system elapsed
>>>   95.360   0.164  95.511
>>>
>>> I think this is likely an overall strategy when dealing with
>>> character data -- processing in independent chunks of moderate
>>> (1M?) size (enabling as a consequence parallel evaluation in
>>> modest memory) that are sufficient to benefit from
>>> vectorization, but that do not entail allocation of large
>>> numbers of in-use SEXPs.
>>>
>>> Martin
>

Re: [Bioc-devel] should genome() be so complicated?/add genome report to GRanges show method

2014-09-08 Thread Hervé Pagès

Hi Vince,

Yes it would make sense to have the "show" method report the genome
when genome(x) contains a unique non-NA value. I think the main
use case for having the genome defined at the sequence level instead
of the whole object level is metagenomics. Maybe Michael has some other
good use cases to share since IIRC he requested the addition of the
genome field a couple of years ago and made the case for having it
defined at the sequence level.

Cheers,
H.

On 09/08/2014 07:21 AM, Vincent Carey wrote:

For GRanges x, my naive expectation is that genome(x) returns a length-

one tag identifying the genome to which chromosomal coordinates

correspond.  The genome() method seems to have sequence-specific

semantics, which makes sense, but when we identify sequence

with chromosome, it seems too complicated.  Is there a use case for

a GRanges with sequences from several different genomes?


One reason I am inquiring is that I feel it would be nice to have the
GRanges show() method report, prominently, the genome in use (or NA

if unspecified).  This could be accomplished by reporting
unique(genome(x)), and perhaps that would be satisfactory.

after example(genome) :


seqinfo(txdb)


Seqinfo of length 15

seqnames seqlengths isCircular genome

CH2L   23011544  FALSEdm3

CH2R   21146708  FALSEdm3

CH3L   24543557  FALSEdm3

CH3R   27905053  FALSEdm3

CH4 1351857  FALSEdm3

... .........

CH3LHet 2555491  FALSEdm3

CH3RHet 2517507  FALSEdm3

CHXHet   204112  FALSEdm3

CHYHet   347038  FALSEdm3

CHUextra   29004656  FALSEdm3


genome(seqinfo(txdb))


 CH2L CH2R CH3L CH3R  CH4  CHX  CHUM

"dm3""dm3""dm3""dm3""dm3""dm3""dm3""dm3"

  CH2LHet  CH2RHet  CH3LHet  CH3RHet   CHXHet   CHYHet CHUextra

"dm3""dm3""dm3""dm3""dm3""dm3""dm3"

[[alternative HTML version deleted]]

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel



--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fhcrc.org
Phone:  (206) 667-5791
Fax:(206) 667-1319

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] should genome() be so complicated?/add genome report to GRanges show method

2014-09-08 Thread Vincent Carey
On Mon, Sep 8, 2014 at 1:50 PM, Hervé Pagès  wrote:

> Hi Vince,
>
> Yes it would make sense to have the "show" method report the genome
> when genome(x) contains a unique non-NA value. I think the main
>

i would propose that it show selectSome(unique(genome(x))) -- seems
consistent with the multiple genomes in use when relevant

if NA, IHMO that is an important lack of metadata and people should
see that then as well



> use case for having the genome defined at the sequence level instead
> of the whole object level is metagenomics. Maybe Michael has some other
> good use cases to share since IIRC he requested the addition of the
> genome field a couple of years ago and made the case for having it
> defined at the sequence level.
>
> Cheers,
> H.
>
>
> On 09/08/2014 07:21 AM, Vincent Carey wrote:
>
>> For GRanges x, my naive expectation is that genome(x) returns a length-
>>
>> one tag identifying the genome to which chromosomal coordinates
>>
>> correspond.  The genome() method seems to have sequence-specific
>>
>> semantics, which makes sense, but when we identify sequence
>>
>> with chromosome, it seems too complicated.  Is there a use case for
>>
>> a GRanges with sequences from several different genomes?
>>
>>
>> One reason I am inquiring is that I feel it would be nice to have the
>> GRanges show() method report, prominently, the genome in use (or NA
>>
>> if unspecified).  This could be accomplished by reporting
>> unique(genome(x)), and perhaps that would be satisfactory.
>>
>> after example(genome) :
>>
>>  seqinfo(txdb)
>>>
>>
>> Seqinfo of length 15
>>
>> seqnames seqlengths isCircular genome
>>
>> CH2L   23011544  FALSEdm3
>>
>> CH2R   21146708  FALSEdm3
>>
>> CH3L   24543557  FALSEdm3
>>
>> CH3R   27905053  FALSEdm3
>>
>> CH4 1351857  FALSEdm3
>>
>> ... .........
>>
>> CH3LHet 2555491  FALSEdm3
>>
>> CH3RHet 2517507  FALSEdm3
>>
>> CHXHet   204112  FALSEdm3
>>
>> CHYHet   347038  FALSEdm3
>>
>> CHUextra   29004656  FALSEdm3
>>
>>  genome(seqinfo(txdb))
>>>
>>
>>  CH2L CH2R CH3L CH3R  CH4  CHX  CHUM
>>
>> "dm3""dm3""dm3""dm3""dm3""dm3""dm3""dm3"
>>
>>   CH2LHet  CH2RHet  CH3LHet  CH3RHet   CHXHet   CHYHet CHUextra
>>
>> "dm3""dm3""dm3""dm3""dm3""dm3""dm3"
>>
>> [[alternative HTML version deleted]]
>>
>> ___
>> Bioc-devel@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpa...@fhcrc.org
> Phone:  (206) 667-5791
> Fax:(206) 667-1319
>

[[alternative HTML version deleted]]

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] should genome() be so complicated?/add genome report to GRanges show method

2014-09-08 Thread Michael Lawrence
I might have requested the genome annotation, but I'm pretty sure it wasn't
me who decide on tracking it on a per-sequence basis. I could imagine use
cases for that though, e.g., when diagnosing sequencing contamination (like
human vs. mouse). But most other tools and file formats expect a single
genome per "track", so, for example, rtracklayer has an internal function
singleGenome() to take care of this.

On Mon, Sep 8, 2014 at 10:50 AM, Hervé Pagès  wrote:

> Hi Vince,
>
> Yes it would make sense to have the "show" method report the genome
> when genome(x) contains a unique non-NA value. I think the main
> use case for having the genome defined at the sequence level instead
> of the whole object level is metagenomics. Maybe Michael has some other
> good use cases to share since IIRC he requested the addition of the
> genome field a couple of years ago and made the case for having it
> defined at the sequence level.
>
> Cheers,
> H.
>
>
> On 09/08/2014 07:21 AM, Vincent Carey wrote:
>
>> For GRanges x, my naive expectation is that genome(x) returns a length-
>>
>> one tag identifying the genome to which chromosomal coordinates
>>
>> correspond.  The genome() method seems to have sequence-specific
>>
>> semantics, which makes sense, but when we identify sequence
>>
>> with chromosome, it seems too complicated.  Is there a use case for
>>
>> a GRanges with sequences from several different genomes?
>>
>>
>> One reason I am inquiring is that I feel it would be nice to have the
>> GRanges show() method report, prominently, the genome in use (or NA
>>
>> if unspecified).  This could be accomplished by reporting
>> unique(genome(x)), and perhaps that would be satisfactory.
>>
>> after example(genome) :
>>
>>  seqinfo(txdb)
>>>
>>
>> Seqinfo of length 15
>>
>> seqnames seqlengths isCircular genome
>>
>> CH2L   23011544  FALSEdm3
>>
>> CH2R   21146708  FALSEdm3
>>
>> CH3L   24543557  FALSEdm3
>>
>> CH3R   27905053  FALSEdm3
>>
>> CH4 1351857  FALSEdm3
>>
>> ... .........
>>
>> CH3LHet 2555491  FALSEdm3
>>
>> CH3RHet 2517507  FALSEdm3
>>
>> CHXHet   204112  FALSEdm3
>>
>> CHYHet   347038  FALSEdm3
>>
>> CHUextra   29004656  FALSEdm3
>>
>>  genome(seqinfo(txdb))
>>>
>>
>>  CH2L CH2R CH3L CH3R  CH4  CHX  CHUM
>>
>> "dm3""dm3""dm3""dm3""dm3""dm3""dm3""dm3"
>>
>>   CH2LHet  CH2RHet  CH3LHet  CH3RHet   CHXHet   CHYHet CHUextra
>>
>> "dm3""dm3""dm3""dm3""dm3""dm3""dm3"
>>
>> [[alternative HTML version deleted]]
>>
>> ___
>> Bioc-devel@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpa...@fhcrc.org
> Phone:  (206) 667-5791
> Fax:(206) 667-1319
>
>
> ___
> Bioc-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>

[[alternative HTML version deleted]]

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] writeVcf performance

2014-09-08 Thread Valerie Obenchain
fyi Martin found a bug with the treatment of list data (ie, Number = 
'.') in the header. Working on a fix ...


Val


On 09/08/2014 08:43 AM, Gabe Becker wrote:

Val,

That is great. I'll check this out and test it on our end.

~G

On Mon, Sep 8, 2014 at 8:38 AM, Valerie Obenchain mailto:voben...@fhcrc.org>> wrote:

The new writeVcf code is in 1.11.28.

Using the illumina file you suggested, geno fields only, writing now
takes about 17 minutes.

 > hdr
class: VCFHeader
samples(1): NA12877
meta(6): fileformat ApplyRecalibration ... reference source
fixed(1): FILTER
info(22): AC AF ... culprit set
geno(8): GT GQX ... PL VF

 > param = ScanVcfParam(info=NA)
 > vcf = readVcf(fl, "", param=param)
 > dim(vcf)
[1] 516127621

 > system.time(writeVcf(vcf, "out.vcf"))
 user   system  elapsed
  971.0326.568 1004.593

In 1.11.28, parsing of geno data was moved to C. If this didn't
speed things up enough we were planning to implement 'chunking'
through the VCF and/or move the parsing of info to C, however, it
looks like geno was the bottleneck.

I've tested a number of samples/fields combinations in files with >=
.5 million rows and the improvement over writeVcf() in release is ~ 90%.

Valerie




On 09/04/14 15:28, Valerie Obenchain wrote:

Thanks Gabe. I should have something for you on Monday.

Val


On 09/04/2014 01:56 PM, Gabe Becker wrote:

Val and Martin,

Apologies for the delay.

We realized that the Illumina platinum genome vcf files make
a good test
case, assuming you strip out all the info (info=NA when
reading it into
R) stuff.


ftp://platgene:G3n3s4me@ussd-__ftp.illumina.com/NA12877_S1.__genome.vcf.gz


took about ~4.2 hrs to write out, and is about 1.5x the size
of the
files we are actually dealing with (~50M ranges vs our ~30M).

Looking forward a new vastly improved writeVcf :).

~G


On Tue, Sep 2, 2014 at 1:53 PM, Michael Lawrence
mailto:lawrence.mich...@gene.com>
>> wrote:

 Yes, it's very clear that the scaling is non-linear,
and Gabe has
 been experimenting with a chunk-wise + parallel algorithm.
 Unfortunately there is some frustrating overhead with the
 parallelism. But I'm glad Val is arriving at something
quicker.

 Michael


 On Tue, Sep 2, 2014 at 1:33 PM, Martin Morgan
mailto:mtmor...@fhcrc.org>
 >> wrote:

 On 08/27/2014 11:56 AM, Gabe Becker wrote:

 The profiling I attached in my previous email
is for 24 geno
 fields, as I said,
 but our typical usecase involves only ~4-6
fields, and is
 faster but still on
 the order of dozens of minutes.


 I think Val is arriving at a (much) more efficient
 implementation, but...

 I wanted to share my guess that the poor _scaling_
is because
 the garbage collector runs multiple times as the
different
 strings are pasted together, and has to traverse,
in linear
 time, increasing numbers of allocated SEXPs. So
times scale
 approximately quadratically with the number of rows
in the VCF

 An efficiency is to reduce the number of SEXPs in
play by
 writing out in chunks -- as each chunk is written,
the SEXPs
 become available for collection and are re-used.
Here's my toy
 example

 time.R
 ==
 splitIndices <- function (nx, ncl)
 {
  i <- seq_len(nx)
  if (ncl == 0L)
  list()
  else if (ncl == 1L || nx == 1L)
  list(i)
  else {
  fuzz <- min((nx - 1L)/1000, 0.4 * nx/ncl)
  breaks <- seq(1 - fuzz, nx + fuzz, length
= ncl + 1L)
  structure(split(i, cut(i, breaks,
labels=FALSE)), names
 = NULL)
  }
  

Re: [Bioc-devel] should genome() be so complicated?/add genome report to GRanges show method

2014-09-08 Thread Hervé Pagès

On 09/08/2014 11:41 AM, Michael Lawrence wrote:

I might have requested the genome annotation, but I'm pretty sure it
wasn't me who decide on tracking it on a per-sequence basis.


OK, maybe. Don't trust my memory too much on this. No regrets though.
I think it was the right thing to do ;-) Just because the SAM/BAM
format does it is a good enough reason for us to do it too. According
to the SAM Spec, the header can look like:

  @HD   VN:1.3  SO:coordinate
  @SQ   SN:chr1_hg19LN:45   AS:hg19
  @SQ   SN:chr1_mm10LN:42   AS:mm10

The only problem is that seqinfo(BamFile("test.bam")) seems to ignore
the AS tag (genome assembly identifier) at the moment:

  > seqinfo(BamFile("test.bam"))
  Seqinfo of length 2
  seqnames seqlengths isCircular genome
  chr1_hg1945 NA   
  chr1_mm1042 NA   

Hopefully that can be addressed. But that's a separate issue...

Cheers,
H.



I could
imagine use cases for that though, e.g., when diagnosing sequencing
contamination (like human vs. mouse). But most other tools and file
formats expect a single genome per "track", so, for example, rtracklayer
has an internal function singleGenome() to take care of this.

On Mon, Sep 8, 2014 at 10:50 AM, Hervé Pagès mailto:hpa...@fhcrc.org>> wrote:

Hi Vince,

Yes it would make sense to have the "show" method report the genome
when genome(x) contains a unique non-NA value. I think the main
use case for having the genome defined at the sequence level instead
of the whole object level is metagenomics. Maybe Michael has some other
good use cases to share since IIRC he requested the addition of the
genome field a couple of years ago and made the case for having it
defined at the sequence level.

Cheers,
H.


On 09/08/2014 07:21 AM, Vincent Carey wrote:

For GRanges x, my naive expectation is that genome(x) returns a
length-

one tag identifying the genome to which chromosomal coordinates

correspond.  The genome() method seems to have sequence-specific

semantics, which makes sense, but when we identify sequence

with chromosome, it seems too complicated.  Is there a use case for

a GRanges with sequences from several different genomes?


One reason I am inquiring is that I feel it would be nice to
have the
GRanges show() method report, prominently, the genome in use (or NA

if unspecified).  This could be accomplished by reporting
unique(genome(x)), and perhaps that would be satisfactory.

after example(genome) :

seqinfo(txdb)


Seqinfo of length 15

seqnames seqlengths isCircular genome

CH2L   23011544  FALSEdm3

CH2R   21146708  FALSEdm3

CH3L   24543557  FALSEdm3

CH3R   27905053  FALSEdm3

CH4 1351857  FALSEdm3

... .........

CH3LHet 2555491  FALSEdm3

CH3RHet 2517507  FALSEdm3

CHXHet   204112  FALSEdm3

CHYHet   347038  FALSEdm3

CHUextra   29004656  FALSEdm3

genome(seqinfo(txdb))


  CH2L CH2R CH3L CH3R  CH4  CHX
CHUM

 "dm3""dm3""dm3""dm3""dm3""dm3"
"dm3""dm3"

   CH2LHet  CH2RHet  CH3LHet  CH3RHet   CHXHet   CHYHet CHUextra

 "dm3""dm3""dm3""dm3""dm3""dm3""dm3"

 [[alternative HTML version deleted]]

_
Bioc-devel@r-project.org 
mailing list
https://stat.ethz.ch/mailman/__listinfo/bioc-devel



--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fhcrc.org 
Phone: (206) 667-5791 
Fax: (206) 667-1319 


_
Bioc-devel@r-project.org  mailing list
https://stat.ethz.ch/mailman/__listinfo/bioc-devel





--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fhcrc.org
Phone:  (206) 667-5791
Fax:(206) 667-1319

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


[Bioc-devel] should genome() be so complicated?/add genome report to GRanges show method

2014-09-08 Thread Peter Hickey
Just a vote for still allowing for multiple genomes in a Seqinfo object (in a 
GRanges object). My use case is in bisulfite-sequencing experiments where there 
is often a spike-in of a lambda phage genome along with the genome of interest 
(human or mouse). It's often useful to keep all data from a single library 
together in the same objet but process according to genome(x) for each seqlevel.

FWIW, I like Vincent's proposal of selectSome(unique(genome(x))) in the show 
method.

Cheers,
Pete


> I might have requested the genome annotation, but I'm pretty sure it wasn't
> me who decide on tracking it on a per-sequence basis. I could imagine use
> cases for that though, e.g., when diagnosing sequencing contamination (like
> human vs. mouse). But most other tools and file formats expect a single
> genome per "track", so, for example, rtracklayer has an internal function
> singleGenome() to take care of this.
> 
> On Mon, Sep 8, 2014 at 10:50 AM, Herv? Pag?s  wrote:
> 
>> Hi Vince,
>> 
>> Yes it would make sense to have the "show" method report the genome
>> when genome(x) contains a unique non-NA value. I think the main
>> use case for having the genome defined at the sequence level instead
>> of the whole object level is metagenomics. Maybe Michael has some other
>> good use cases to share since IIRC he requested the addition of the
>> genome field a couple of years ago and made the case for having it
>> defined at the sequence level.
>> 
>> Cheers,
>> H.
>> 
>> 
>> On 09/08/2014 07:21 AM, Vincent Carey wrote:
>> 
>>> For GRanges x, my naive expectation is that genome(x) returns a length-
>>> 
>>> one tag identifying the genome to which chromosomal coordinates
>>> 
>>> correspond.  The genome() method seems to have sequence-specific
>>> 
>>> semantics, which makes sense, but when we identify sequence
>>> 
>>> with chromosome, it seems too complicated.  Is there a use case for
>>> 
>>> a GRanges with sequences from several different genomes?
>>> 
>>> 
>>> One reason I am inquiring is that I feel it would be nice to have the
>>> GRanges show() method report, prominently, the genome in use (or NA
>>> 
>>> if unspecified).  This could be accomplished by reporting
>>> unique(genome(x)), and perhaps that would be satisfactory.
>>> 
>>> after example(genome) :
>>> 
>>> seqinfo(txdb)
 
>>> 
>>> Seqinfo of length 15
>>> 
>>> seqnames seqlengths isCircular genome
>>> 
>>> CH2L   23011544  FALSEdm3
>>> 
>>> CH2R   21146708  FALSEdm3
>>> 
>>> CH3L   24543557  FALSEdm3
>>> 
>>> CH3R   27905053  FALSEdm3
>>> 
>>> CH4 1351857  FALSEdm3
>>> 
>>> ... .........
>>> 
>>> CH3LHet 2555491  FALSEdm3
>>> 
>>> CH3RHet 2517507  FALSEdm3
>>> 
>>> CHXHet   204112  FALSEdm3
>>> 
>>> CHYHet   347038  FALSEdm3
>>> 
>>> CHUextra   29004656  FALSEdm3
>>> 
>>> genome(seqinfo(txdb))
 
>>> 
>>> CH2L CH2R CH3L CH3R  CH4  CHX  CHUM
>>> 
>>>"dm3""dm3""dm3""dm3""dm3""dm3""dm3""dm3"
>>> 
>>>  CH2LHet  CH2RHet  CH3LHet  CH3RHet   CHXHet   CHYHet CHUextra
>>> 
>>>"dm3""dm3""dm3""dm3""dm3""dm3""dm3"
>>> 
>>>[[alternative HTML version deleted]]
>>> 
>>> ___
>>> Bioc-devel@r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>> 
>>> 
>> --
>> Herv? Pag?s
>> 
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>> 
>> E-mail: hpa...@fhcrc.org
>> Phone:  (206) 667-5791
>> Fax:(206) 667-1319
>> 
>> 
>> ___
>> Bioc-devel@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>> 


Peter Hickey,
PhD Student/Research Assistant,
Bioinformatics Division,
Walter and Eliza Hall Institute of Medical Research,
1G Royal Parade, Parkville, Vic 3052, Australia.
Ph: +613 9345 2324

hic...@wehi.edu.au
http://www.wehi.edu.au

__
The information in this email is confidential and intend...{{dropped:6}}

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] should genome() be so complicated?/add genome report to GRanges show method

2014-09-08 Thread Hervé Pagès

On 09/08/2014 02:28 PM, Peter Hickey wrote:

Just a vote for still allowing for multiple genomes in a Seqinfo object (in a 
GRanges object). My use case is in bisulfite-sequencing experiments where there 
is often a spike-in of a lambda phage genome along with the genome of interest 
(human or mouse). It's often useful to keep all data from a single library 
together in the same objet but process according to genome(x) for each seqlevel.


Note taken. Thanks Pete! It's always great to know about concrete use
cases.



FWIW, I like Vincent's proposal of selectSome(unique(genome(x))) in the show 
method.


Or what about displaying the genome next to the seqlevel it's
associated with? Like e.g.:

  > gr
  GRanges with 3 ranges and 0 metadata columns:
seqnames   ranges strand
 
[1]chr14 [19069583, 19069654]  +
[2]chr14 [19363738, 19363809]  +
[3]chr14 [19363755, 19363826]  -
[4]chr14 [19369799, 19369870]  +
---
seqinfo:
  seqlevels seqlengths isCircular genome
  chr1   249250621  hg19
  chr10  135534747  hg19
  chr11  135006516  hg19
  ...  .........
  chrUn_gl000249 38502  hg19
  chrX   155270560  hg19
  chrY59373566  hg19

That way, we also raise awareness about the isCircular field.
The current choice to only display the seqlengths pre-dates the
existence of the seqinfo slot but might be a little bit misleading
those days since it only exposes some arbitrary seqinfo fields.

H.



Cheers,
Pete



I might have requested the genome annotation, but I'm pretty sure it wasn't
me who decide on tracking it on a per-sequence basis. I could imagine use
cases for that though, e.g., when diagnosing sequencing contamination (like
human vs. mouse). But most other tools and file formats expect a single
genome per "track", so, for example, rtracklayer has an internal function
singleGenome() to take care of this.

On Mon, Sep 8, 2014 at 10:50 AM, Herv? Pag?s  wrote:


Hi Vince,

Yes it would make sense to have the "show" method report the genome
when genome(x) contains a unique non-NA value. I think the main
use case for having the genome defined at the sequence level instead
of the whole object level is metagenomics. Maybe Michael has some other
good use cases to share since IIRC he requested the addition of the
genome field a couple of years ago and made the case for having it
defined at the sequence level.

Cheers,
H.


On 09/08/2014 07:21 AM, Vincent Carey wrote:


For GRanges x, my naive expectation is that genome(x) returns a length-

one tag identifying the genome to which chromosomal coordinates

correspond.  The genome() method seems to have sequence-specific

semantics, which makes sense, but when we identify sequence

with chromosome, it seems too complicated.  Is there a use case for

a GRanges with sequences from several different genomes?


One reason I am inquiring is that I feel it would be nice to have the
GRanges show() method report, prominently, the genome in use (or NA

if unspecified).  This could be accomplished by reporting
unique(genome(x)), and perhaps that would be satisfactory.

after example(genome) :

seqinfo(txdb)




Seqinfo of length 15

seqnames seqlengths isCircular genome

CH2L   23011544  FALSEdm3

CH2R   21146708  FALSEdm3

CH3L   24543557  FALSEdm3

CH3R   27905053  FALSEdm3

CH4 1351857  FALSEdm3

... .........

CH3LHet 2555491  FALSEdm3

CH3RHet 2517507  FALSEdm3

CHXHet   204112  FALSEdm3

CHYHet   347038  FALSEdm3

CHUextra   29004656  FALSEdm3

genome(seqinfo(txdb))




 CH2L CH2R CH3L CH3R  CH4  CHX  CHUM

"dm3""dm3""dm3""dm3""dm3""dm3""dm3""dm3"

  CH2LHet  CH2RHet  CH3LHet  CH3RHet   CHXHet   CHYHet CHUextra

"dm3""dm3""dm3""dm3""dm3""dm3""dm3"

[[alternative HTML version deleted]]

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel



--
Herv? Pag?s

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fhcrc.org
Phone:  (206) 667-5791
Fax:(206) 667-1319


___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel




Peter Hickey,
PhD Student/Research Assistant,
Bioinformatics Division,
Walter and Eliza Hall Institute of Medical Research,
1G Royal Parade, Parkville, Vic 3052, Australia.

[Bioc-devel] Changing the name of a package

2014-09-08 Thread Thomas J Hardcastle
I have a package on the development build called riboSeq; my biological 
collaborators have since informed me that this name might cause 
confusion as this is also the standard name for particular sequencing 
techniques. Is it possible to easily change the name of this package?


Best regards,

Tom Hardcastle



--
Dr. Thomas J. Hardcastle

Department of Plant Sciences
University of Cambridge
Downing Street
Cambridge, CB2 3EA
United Kingdom

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] should genome() be so complicated?/add genome report to GRanges show method

2014-09-08 Thread Peter Hickey
Perhaps it might be useful to have some way of highlighting if any of the 
chromosomes are circular or highlighting if there are multiple genomes present? 
Otherwise this information might be hidden in the "�"

Cheers,
Pete


On 09/09/2014, at 9:44 AM, Herv� Pag�s  wrote:

> On 09/08/2014 02:28 PM, Peter Hickey wrote:
>> Just a vote for still allowing for multiple genomes in a Seqinfo object (in 
>> a GRanges object). My use case is in bisulfite-sequencing experiments where 
>> there is often a spike-in of a lambda phage genome along with the genome of 
>> interest (human or mouse). It's often useful to keep all data from a single 
>> library together in the same objet but process according to genome(x) for 
>> each seqlevel.
> 
> Note taken. Thanks Pete! It's always great to know about concrete use
> cases.
> 
>> 
>> FWIW, I like Vincent's proposal of selectSome(unique(genome(x))) in the show 
>> method.
> 
> Or what about displaying the genome next to the seqlevel it's
> associated with? Like e.g.:
> 
>  > gr
>  GRanges with 3 ranges and 0 metadata columns:
>seqnames   ranges strand
> 
>[1]chr14 [19069583, 19069654]  +
>[2]chr14 [19363738, 19363809]  +
>[3]chr14 [19363755, 19363826]  -
>[4]chr14 [19369799, 19369870]  +
>---
>seqinfo:
>  seqlevels seqlengths isCircular genome
>  chr1   249250621  hg19
>  chr10  135534747  hg19
>  chr11  135006516  hg19
>  ...  .........
>  chrUn_gl000249 38502  hg19
>  chrX   155270560  hg19
>  chrY59373566  hg19
> 
> That way, we also raise awareness about the isCircular field.
> The current choice to only display the seqlengths pre-dates the
> existence of the seqinfo slot but might be a little bit misleading
> those days since it only exposes some arbitrary seqinfo fields.
> 
> H.
> 
>> 
>> Cheers,
>> Pete
>> 
>> 
>>> I might have requested the genome annotation, but I'm pretty sure it wasn't
>>> me who decide on tracking it on a per-sequence basis. I could imagine use
>>> cases for that though, e.g., when diagnosing sequencing contamination (like
>>> human vs. mouse). But most other tools and file formats expect a single
>>> genome per "track", so, for example, rtracklayer has an internal function
>>> singleGenome() to take care of this.
>>> 
>>> On Mon, Sep 8, 2014 at 10:50 AM, Herv? Pag?s  wrote:
>>> 
 Hi Vince,
 
 Yes it would make sense to have the "show" method report the genome
 when genome(x) contains a unique non-NA value. I think the main
 use case for having the genome defined at the sequence level instead
 of the whole object level is metagenomics. Maybe Michael has some other
 good use cases to share since IIRC he requested the addition of the
 genome field a couple of years ago and made the case for having it
 defined at the sequence level.
 
 Cheers,
 H.
 
 
 On 09/08/2014 07:21 AM, Vincent Carey wrote:
 
> For GRanges x, my naive expectation is that genome(x) returns a length-
> 
> one tag identifying the genome to which chromosomal coordinates
> 
> correspond.  The genome() method seems to have sequence-specific
> 
> semantics, which makes sense, but when we identify sequence
> 
> with chromosome, it seems too complicated.  Is there a use case for
> 
> a GRanges with sequences from several different genomes?
> 
> 
> One reason I am inquiring is that I feel it would be nice to have the
> GRanges show() method report, prominently, the genome in use (or NA
> 
> if unspecified).  This could be accomplished by reporting
> unique(genome(x)), and perhaps that would be satisfactory.
> 
> after example(genome) :
> 
> seqinfo(txdb)
>> 
> 
> Seqinfo of length 15
> 
> seqnames seqlengths isCircular genome
> 
> CH2L   23011544  FALSEdm3
> 
> CH2R   21146708  FALSEdm3
> 
> CH3L   24543557  FALSEdm3
> 
> CH3R   27905053  FALSEdm3
> 
> CH4 1351857  FALSEdm3
> 
> ... .........
> 
> CH3LHet 2555491  FALSEdm3
> 
> CH3RHet 2517507  FALSEdm3
> 
> CHXHet   204112  FALSEdm3
> 
> CHYHet   347038  FALSEdm3
> 
> CHUextra   29004656  FALSEdm3
> 
> genome(seqinfo(txdb))
>> 
> 
> CH2L CH2R CH3L CH3R  CH4  CHX  CHUM
> 
>"dm3""dm3""dm3""dm3""dm3""dm3""dm3""dm3"
> 
>  CH2LHet  CH2RHet  CH3LHet  CH3RHet   CHXHet   CHYHet CHUextra
> 
>"dm3""dm3""dm3"

Re: [Bioc-devel] should genome() be so complicated?/add genome report to GRanges show method

2014-09-08 Thread Michael Lawrence
Instead of printing out multiple lines of a table that is rarely of
interest, could we develop Peter's idea toward something like:

hg19:chr1  hg19:chr2 ...
[lengths ...]

Not sure what condensed notation would be useful for circularity.


On Mon, Sep 8, 2014 at 5:21 PM, Peter Hickey  wrote:

> Perhaps it might be useful to have some way of highlighting if any of the
> chromosomes are circular or highlighting if there are multiple genomes
> present? Otherwise this information might be hidden in the "…"
>
> Cheers,
> Pete
>
>
> On 09/09/2014, at 9:44 AM, Hervé Pagès  wrote:
>
> > On 09/08/2014 02:28 PM, Peter Hickey wrote:
> >> Just a vote for still allowing for multiple genomes in a Seqinfo object
> (in a GRanges object). My use case is in bisulfite-sequencing experiments
> where there is often a spike-in of a lambda phage genome along with the
> genome of interest (human or mouse). It's often useful to keep all data
> from a single library together in the same objet but process according to
> genome(x) for each seqlevel.
> >
> > Note taken. Thanks Pete! It's always great to know about concrete use
> > cases.
> >
> >>
> >> FWIW, I like Vincent's proposal of selectSome(unique(genome(x))) in the
> show method.
> >
> > Or what about displaying the genome next to the seqlevel it's
> > associated with? Like e.g.:
> >
> >  > gr
> >  GRanges with 3 ranges and 0 metadata columns:
> >seqnames   ranges strand
> > 
> >[1]chr14 [19069583, 19069654]  +
> >[2]chr14 [19363738, 19363809]  +
> >[3]chr14 [19363755, 19363826]  -
> >[4]chr14 [19369799, 19369870]  +
> >---
> >seqinfo:
> >  seqlevels seqlengths isCircular genome
> >  chr1   249250621  hg19
> >  chr10  135534747  hg19
> >  chr11  135006516  hg19
> >  ...  .........
> >  chrUn_gl000249 38502  hg19
> >  chrX   155270560  hg19
> >  chrY59373566  hg19
> >
> > That way, we also raise awareness about the isCircular field.
> > The current choice to only display the seqlengths pre-dates the
> > existence of the seqinfo slot but might be a little bit misleading
> > those days since it only exposes some arbitrary seqinfo fields.
> >
> > H.
> >
> >>
> >> Cheers,
> >> Pete
> >>
> >>
> >>> I might have requested the genome annotation, but I'm pretty sure it
> wasn't
> >>> me who decide on tracking it on a per-sequence basis. I could imagine
> use
> >>> cases for that though, e.g., when diagnosing sequencing contamination
> (like
> >>> human vs. mouse). But most other tools and file formats expect a single
> >>> genome per "track", so, for example, rtracklayer has an internal
> function
> >>> singleGenome() to take care of this.
> >>>
> >>> On Mon, Sep 8, 2014 at 10:50 AM, Herv? Pag?s  wrote:
> >>>
>  Hi Vince,
> 
>  Yes it would make sense to have the "show" method report the genome
>  when genome(x) contains a unique non-NA value. I think the main
>  use case for having the genome defined at the sequence level instead
>  of the whole object level is metagenomics. Maybe Michael has some
> other
>  good use cases to share since IIRC he requested the addition of the
>  genome field a couple of years ago and made the case for having it
>  defined at the sequence level.
> 
>  Cheers,
>  H.
> 
> 
>  On 09/08/2014 07:21 AM, Vincent Carey wrote:
> 
> > For GRanges x, my naive expectation is that genome(x) returns a
> length-
> >
> > one tag identifying the genome to which chromosomal coordinates
> >
> > correspond.  The genome() method seems to have sequence-specific
> >
> > semantics, which makes sense, but when we identify sequence
> >
> > with chromosome, it seems too complicated.  Is there a use case for
> >
> > a GRanges with sequences from several different genomes?
> >
> >
> > One reason I am inquiring is that I feel it would be nice to have the
> > GRanges show() method report, prominently, the genome in use (or NA
> >
> > if unspecified).  This could be accomplished by reporting
> > unique(genome(x)), and perhaps that would be satisfactory.
> >
> > after example(genome) :
> >
> > seqinfo(txdb)
> >>
> >
> > Seqinfo of length 15
> >
> > seqnames seqlengths isCircular genome
> >
> > CH2L   23011544  FALSEdm3
> >
> > CH2R   21146708  FALSEdm3
> >
> > CH3L   24543557  FALSEdm3
> >
> > CH3R   27905053  FALSEdm3
> >
> > CH4 1351857  FALSEdm3
> >
> > ... .........
> >
> > CH3LHet 2555491  FALSEdm3
> >
> > CH3RHet 2517507  FAL

Re: [Bioc-devel] should genome() be so complicated?/add genome report to GRanges show method

2014-09-08 Thread Hervé Pagès

On 09/08/2014 06:42 PM, Michael Lawrence wrote:

Instead of printing out multiple lines of a table that is rarely of
interest, could we develop Peter's idea toward something like:

hg19:chr1  hg19:chr2 ...
[lengths ...]

Not sure what condensed notation would be useful for circularity.


I don't know either. I'm worried that this would make the seqinfo
stuff look like a named vector and that the user would expect
hg19:chr1, hg19:chr2, etc... to be valid names.

With the table-like layout, some screen real estate can always be
saved by printing less lines:

  > gr
  GRanges with 3 ranges and 0 metadata columns:
seqnames   ranges strand
 
[1]chr14 [19069583, 19069654]  +
[2]chr14 [19363738, 19363809]  +
[3]chr14 [19363755, 19363826]  -
[4]chr14 [19369799, 19369870]  +
--- seqinfo: 60 seqlevels (2 circulars) on 2 genomes (hg19, mm10) ---
seqlevelsseqlengths isCircular genome
chr1  249250621  hg19
chr10 135534747  hg19
... .........
chrX  155270560  hg19
chrY   59373566  hg19

I agree that the exact content of the seqinfo table itself is rarely
of interest so printing only 3 or 4 lines is OK. IMO it's important
to make the user aware of the existence of this hidden table and to
display it like what it really is (i.e. a table). Also displaying the
column names is a well established tradition and serves the purpose
of providing a quick summary of the accessors that are available to
access those fields.

H.




On Mon, Sep 8, 2014 at 5:21 PM, Peter Hickey mailto:hic...@wehi.edu.au>> wrote:

Perhaps it might be useful to have some way of highlighting if any
of the chromosomes are circular or highlighting if there are
multiple genomes present? Otherwise this information might be hidden
in the "…"

Cheers,
Pete


On 09/09/2014, at 9:44 AM, Hervé Pagès mailto:hpa...@fhcrc.org>> wrote:

 > On 09/08/2014 02:28 PM, Peter Hickey wrote:
 >> Just a vote for still allowing for multiple genomes in a Seqinfo
object (in a GRanges object). My use case is in bisulfite-sequencing
experiments where there is often a spike-in of a lambda phage genome
along with the genome of interest (human or mouse). It's often
useful to keep all data from a single library together in the same
objet but process according to genome(x) for each seqlevel.
 >
 > Note taken. Thanks Pete! It's always great to know about concrete use
 > cases.
 >
 >>
 >> FWIW, I like Vincent's proposal of selectSome(unique(genome(x)))
in the show method.
 >
 > Or what about displaying the genome next to the seqlevel it's
 > associated with? Like e.g.:
 >
 >  > gr
 >  GRanges with 3 ranges and 0 metadata columns:
 >seqnames   ranges strand
 > 
 >[1]chr14 [19069583, 19069654]  +
 >[2]chr14 [19363738, 19363809]  +
 >[3]chr14 [19363755, 19363826]  -
 >[4]chr14 [19369799, 19369870]  +
 >---
 >seqinfo:
 >  seqlevels seqlengths isCircular genome
 >  chr1   249250621  hg19
 >  chr10  135534747  hg19
 >  chr11  135006516  hg19
 >  ...  .........
 >  chrUn_gl000249 38502  hg19
 >  chrX   155270560  hg19
 >  chrY59373566  hg19
 >
 > That way, we also raise awareness about the isCircular field.
 > The current choice to only display the seqlengths pre-dates the
 > existence of the seqinfo slot but might be a little bit misleading
 > those days since it only exposes some arbitrary seqinfo fields.
 >
 > H.
 >
 >>
 >> Cheers,
 >> Pete
 >>
 >>
 >>> I might have requested the genome annotation, but I'm pretty
sure it wasn't
 >>> me who decide on tracking it on a per-sequence basis. I could
imagine use
 >>> cases for that though, e.g., when diagnosing sequencing
contamination (like
 >>> human vs. mouse). But most other tools and file formats expect
a single
 >>> genome per "track", so, for example, rtracklayer has an
internal function
 >>> singleGenome() to take care of this.
 >>>
 >>> On Mon, Sep 8, 2014 at 10:50 AM, Herv? Pag?s mailto:hpa...@fhcrc.org>> wrote:
 >>>
  Hi Vince,
 
  Yes it would make sense to have the "show" method report the
genome
  when genome(x) contains a unique n

Re: [Bioc-devel] should genome() be so complicated?/add genome report to GRanges show method

2014-09-08 Thread Vincent Carey
On Tue, Sep 9, 2014 at 12:30 AM, Hervé Pagès  wrote:

> On 09/08/2014 06:42 PM, Michael Lawrence wrote:
>
>> Instead of printing out multiple lines of a table that is rarely of
>> interest, could we develop Peter's idea toward something like:
>>
>> hg19:chr1  hg19:chr2 ...
>> [lengths ...]
>>
>> Not sure what condensed notation would be useful for circularity.
>>
>
> I don't know either. I'm worried that this would make the seqinfo
> stuff look like a named vector and that the user would expect
> hg19:chr1, hg19:chr2, etc... to be valid names.
>
> With the table-like layout, some screen real estate can always be
> saved by printing less lines:
>
>
What I had in mind was


>   > gr
>   GRanges with 3 ranges and 0 metadata columns:
>
  genome: hg19

> seqnames   ranges strand
>  
> [1]chr14 [19069583, 19069654]  +
> [2]chr14 [19363738, 19363809]  +
> [3]chr14 [19363755, 19363826]  -
> [4]chr14 [19369799, 19369870]  +
>


you could then probably dispense with the seqlengths.  i have
never found them too useful except as a key to the  genome.

if there are multiple genomes, we have something like

genomes: hg19, mm9

the point is to make it prominent, particularly at a time of transition.



> --- seqinfo: 60 seqlevels (2 circulars) on 2 genomes (hg19, mm10) ---
> seqlevelsseqlengths isCircular genome
> chr1  249250621  hg19
> chr10 135534747  hg19
> ... .........
> chrX  155270560  hg19
> chrY   59373566  hg19
>
> I agree that the exact content of the seqinfo table itself is rarely
> of interest so printing only 3 or 4 lines is OK. IMO it's important
> to make the user aware of the existence of this hidden table and to
> display it like what it really is (i.e. a table). Also displaying the
> column names is a well established tradition and serves the purpose
> of providing a quick summary of the accessors that are available to
> access those fields.
>
> H.
>
>
>>
>> On Mon, Sep 8, 2014 at 5:21 PM, Peter Hickey > > wrote:
>>
>> Perhaps it might be useful to have some way of highlighting if any
>> of the chromosomes are circular or highlighting if there are
>> multiple genomes present? Otherwise this information might be hidden
>> in the "…"
>>
>> Cheers,
>> Pete
>>
>>
>> On 09/09/2014, at 9:44 AM, Hervé Pagès > > wrote:
>>
>>  > On 09/08/2014 02:28 PM, Peter Hickey wrote:
>>  >> Just a vote for still allowing for multiple genomes in a Seqinfo
>> object (in a GRanges object). My use case is in bisulfite-sequencing
>> experiments where there is often a spike-in of a lambda phage genome
>> along with the genome of interest (human or mouse). It's often
>> useful to keep all data from a single library together in the same
>> objet but process according to genome(x) for each seqlevel.
>>  >
>>  > Note taken. Thanks Pete! It's always great to know about concrete
>> use
>>  > cases.
>>  >
>>  >>
>>  >> FWIW, I like Vincent's proposal of selectSome(unique(genome(x)))
>> in the show method.
>>  >
>>  > Or what about displaying the genome next to the seqlevel it's
>>  > associated with? Like e.g.:
>>  >
>>  >  > gr
>>  >  GRanges with 3 ranges and 0 metadata columns:
>>  >seqnames   ranges strand
>>  > 
>>  >[1]chr14 [19069583, 19069654]  +
>>  >[2]chr14 [19363738, 19363809]  +
>>  >[3]chr14 [19363755, 19363826]  -
>>  >[4]chr14 [19369799, 19369870]  +
>>  >---
>>  >seqinfo:
>>  >  seqlevels seqlengths isCircular genome
>>  >  chr1   249250621  hg19
>>  >  chr10  135534747  hg19
>>  >  chr11  135006516  hg19
>>  >  ...  .........
>>  >  chrUn_gl000249 38502  hg19
>>  >  chrX   155270560  hg19
>>  >  chrY59373566  hg19
>>  >
>>  > That way, we also raise awareness about the isCircular field.
>>  > The current choice to only display the seqlengths pre-dates the
>>  > existence of the seqinfo slot but might be a little bit misleading
>>  > those days since it only exposes some arbitrary seqinfo fields.
>>  >
>>  > H.
>>  >
>>  >>
>>  >> Cheers,
>>  >> Pete
>>  >>
>>  >>
>>  >>> I might have requested the genome annotation, but I'm pretty
>> sure i