[Bioc-devel] transfer maintainer status of gCrisprTools

2019-11-22 Thread Peter Haverty
I'd like to transfer gCrisprTools to Russ Bainer (russ.bai...@gmail.com,
rbai...@mazetx.com). How do I do that?

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


[Bioc-devel] Please un-deprecate gCrisprTools

2019-04-09 Thread Peter Haverty via Bioc-devel
Dear Lori Shephard,

Yesterday it came to my attention that gCrisprTools is slated for
deprecation. Apparently, I've missed some related emails. Sorry about that.
I'm not sure why I didn't get them. I have pushed a fix for the bug to
master. Would it be possible to remove gCrisprTools from the deprecation
list?

Regards,
Pete


Peter M. Haverty, Ph.D.
Genentech, Inc.
phave...@gene.com


[Bioc-devel] switching to git

2017-11-07 Thread Peter Haverty
I'm working on using the new git setup. I've provided my GitHub user name
to BioC's form, so that my public key can be used with BioC's git. I
believe I was supposed to receive an email after that was set up, but I
have not. Is there something else I need to do?

Regards,
Pete


Peter M. Haverty, Ph.D.
Genentech, Inc.
phave...@gene.com


Re: [Bioc-devel] is.unsorted method for GRanges objects

2015-11-02 Thread Peter Haverty
genoset has an is.unsorted for GenomicRanges. I profiled a bit and found a
pretty quick way to do it.  My version ignores strand, so it is only proper
in some cases.  But, maybe it's a head start on a fully general function.
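A strand-ignoring check of this kind can be sketched in base R as below. This is not genoset's actual code: the helper name ranges_unsorted and its arguments are hypothetical, plain vectors stand in for a GRanges, and chromosome order is assumed to be given explicitly by chr_levels.

```r
# Hypothetical helper (not genoset's implementation): TRUE when ranges are
# out of order by (chromosome rank, start, end). Strand is ignored.
# Assumes every value of chr appears in chr_levels.
ranges_unsorted <- function(chr, start, end, chr_levels) {
  n <- length(chr)
  if (n < 2L) return(FALSE)
  rank <- match(chr, chr_levels)   # chromosome order comes from chr_levels
  a <- seq_len(n - 1L)             # left element of each adjacent pair
  b <- a + 1L                      # right element of each adjacent pair
  # A pair is out of order when the right element sorts strictly before the left.
  bad <- (rank[b] < rank[a]) |
    ((rank[b] == rank[a]) & ((start[b] < start[a]) |
      ((start[b] == start[a]) & (end[b] < end[a]))))
  any(bad)
}

lv    <- c("chr1", "chr2")
chr   <- c("chr1", "chr1", "chr2")
start <- c(10L, 20L, 5L)
end   <- c(15L, 25L, 9L)
sorted_result   <- ranges_unsorted(chr, start, end, lv)
reversed_result <- ranges_unsorted(rev(chr), rev(start), rev(end), lv)
```

Comparing adjacent pairs in one vectorized pass avoids the sort implied by is.unsorted(order(x)), which is the cost raised in the quoted message below.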

Pete


Peter M. Haverty, Ph.D.
Genentech, Inc.
phave...@gene.com

On Mon, Nov 2, 2015 at 5:35 PM, Michael Lawrence 
wrote:

> The notion of sortedness is already formally defined, which is why we have
> an order method, etc.
>
> The base is.unsorted implementation for "objects" ends up calling
> base::.gt() for each adjacent pair of elements, which is likely too slow to
> be practical, so we probably should add a custom method.
>
> This does bring up the tangential question of whether GenomicRanges should
> have an anyNA method that returns FALSE (and similarly an is.na() method),
> although we have never defined the concept of a "missing range".
>
> Michael
>
> On Mon, Nov 2, 2015 at 4:55 PM, Gabe Becker  wrote:
>
> > Pete,
> >
> > What does sorted mean for granges? If the starts are sorted but the ends
> > aren't does that count? What if only the ends are but the ranges are on
> > the negative strand?
> >
> > Do we consider seqlevels to be ordinal in the order the levels are
> > returned from seqlevels()? That usually makes sense, but does it always?
> >
> > In essence I'm asking if sortedness is a well enough defined term for an
> > is.sorted method to make sense.
> >
> > Best,
> > ~G
> > On Nov 2, 2015 4:27 PM, "Peter Hickey"  wrote:
> >
> > > Hi all,
> > >
> > > I sometimes want to test whether a GRanges object (or some object with
> > > a GRanges slot, e.g., a SummarizedExperiment object) is (un)sorted.
> > > There is no is.unsorted,GRanges-method or, rather, it defers to
> > > is.unsorted,ANY-method. I'm unsure that deferring to the
> > > is.unsorted,ANY-method is what is really desired when a user calls
> > > is.unsorted on a GRanges object, and it will certainly return a
> > > (possibly unrelated) warning - "In is.na(x) : is.na() applied to
> > > non-(list or vector) of type 'S4'".
> > >
> > >
> > > For this reason, I tend to use is.unsorted(order(x)) when x is a
> > > GRanges object. This workaround is also used, for example, by minfi
> > > (https://github.com/kasperdanielhansen/minfi/blob/master/R/blocks.R#L121).
> > > However, this is slow because it essentially sorts the object to test
> > > whether it is already sorted.
> > >
> > >
> > > So, to my questions:
> > >
> > > 1. Have I overlooked a fast way to test whether a GRanges object is
> > > sorted?
> > > 2a. Could an is.unsorted,GenomicRanges-method be added to the
> > > GenomicRanges package? Side note, I'm unsure at which level to define
> > > this method, e.g., GRanges vs. GenomicRanges.
> > > 2b. Is it possible to have a sensible definition and implementation
> > > for is.unsorted,GRangesList-method?
> > > 2c. Could an is.unsorted,RangedSummarizedExperiment-method be added to
> > > the SummarizedExperiment package?
> > >
> > > I started working on a patch for 2a/2c, but wanted to ensure I hadn't
> > > overlooked something obvious. Also, I'm sure 2a/2b/2c could be written
> > > much more efficiently at the C-level but I'm afraid this might be a
> > > bit beyond my abilities to integrate nicely with the existing code.
> > >
> > > Thanks,
> > > Pete


Re: [Bioc-devel] SummarizedExperiment with alternate back end

2015-09-18 Thread Peter Haverty
While we are on the topic, my GenoSet class will become a subclass of
RangedSummarizedExperiment, rather than eSet, after this upcoming release.
For this release both APIs work (colnames and sampleNames, etc.)

I think the range-free SummarizedExperiment will be great. I've seen a lot
of ExpressionSets with random, non-exprs stuff in the exprs slot for lack
of something more appropriate.

Pete


Peter M. Haverty, Ph.D.
Genentech, Inc.
phave...@gene.com

On Fri, Sep 18, 2015 at 6:09 PM, Ryan  wrote:

> In the dev version, SummarizedExperiment has been split into
> RangedSummarizedExperiment (equivalent to the current
> SummarizedExperiment, with rowRanges) and SummarizedExperiment (kind of
> like eSet, no rowRanges). Given that eSet objects also support multiple
> assayData elements, I believe the new SummarizedExperiment is pretty close
> to being eSet with different method names. In fact, I wonder if eSet
> could/should be reimplemented as a subclass of the new SummarizedExperiment
> class.
>
>
> On 9/18/15 5:36 PM, Kasper Daniel Hansen wrote:
>
>> Interesting, thanks for the pointer.
>>
>> In light of the existing (and future) work on this, may I suggest an
>> eSet-like class, but built using the technologies in SummarizedExperiment,
>> i.e. a SummarizedExperiment without the rowRanges.  I would very much like
>> this for modern work using eSet-like containers.  Not everything has ranges.
>>
>> Vince: I am not claiming that it is easy to work with; we have pains as
>> well.  But am I missing something or is the assay matrix only 2.3Gb?
>>
>> Best,
>> Kasper
>>
>> On Fri, Sep 18, 2015 at 6:28 PM, Peter Haverty 
>> wrote:
>>
>>> Yes, bigmemoryExtras::BigMatrix and genoset::RleDataFrame() are good
>>> tricks for reducing the size of your eSets and SummarizedExperiments.
>>> Both object types can go into assayData or assays. In fact, that's what
>>> they were designed for.
>>>
>>> At Genentech, we use these for our 2.5e6 x 1e3 rectangular data from
>>> Illumina SNP arrays.  We typically have ~6 such rectangular objects in
>>> one eSet.  With a mix of BigMatrix object for point estimates and
>>> RleDataFrames for segmented data, readRDS times are quite reasonable.
>>>
>>>
>>> Pete
>>>
>>> Peter M. Haverty, Ph.D.
>>> Genentech, Inc.
>>> phave...@gene.com
>>>
>>> On Fri, Sep 18, 2015 at 1:56 PM, Tim Triche, Jr. wrote:
>>>
>>>> bigmemoryExtras (Peter Haverty's extensions to bigMemory/bigMatrix) can
>>>> be handy for this, as it works well as a backend, especially if you go
>>>> about splitting by chromosome as for CNV segmentation, DMR finding, etc.
>>>> It's not as seamless as one might like, but it's the closest thing I've
>>>> found.
>>>>
>>>> SciDb tries to implement a similar API, but for a distributed version of
>>>> this where the data itself is in a columnar database and served on
>>>> demand. I tried getting that up and running as a SummarizedExperiment
>>>> backend, but did not succeed.  I have previously shoveled all of the
>>>> TCGA 450k data into one 7,000+ column bigMatrix which serializes to
>>>> about 14GB on disk.
>>>>
>>>> If you have any replicates in your 700+ samples, it's a good idea to
>>>> keep their SNP calls in metadata(yourSE), although if you change names
>>>> it needs to propagate into the dependent metadata.  This is why I
>>>> started monkeying around with linkedExperiments where those mappings are
>>>> enforced; it's becoming more of an issue with the TARGET pediatric AML
>>>> study, where there are numerous diagnosis-remission-relapse trios whose
>>>> identity I wish to verify periodically.  The SNPs on the 450k array are
>>>> great for this purpose, but minfi doesn't really have a slot for them
>>>> per se, so live in metadata().
>>>>
>>>>
>>>> --t
>>>>
>>>> On Fri, Sep 18, 2015 at 1:29 PM, Vincent Carey <st...@

Re: [Bioc-devel] SummarizedExperiment with alternate back end

2015-09-18 Thread Peter Haverty
Yes, bigmemoryExtras::BigMatrix and genoset::RleDataFrame() are good tricks
for reducing the size of your eSets and SummarizedExperiments.  Both object
types can go into assayData or assays. In fact, that's what they were
designed for.

At Genentech, we use these for our 2.5e6 x 1e3 rectangular data from
Illumina SNP arrays.  We typically have ~6 such rectangular objects in one
eSet.  With a mix of BigMatrix object for point estimates and RleDataFrames
for segmented data, readRDS times are quite reasonable.
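To illustrate why run-length encoding suits segmented data, here is a minimal base R sketch. It uses base::rle() as a stand-in for the Rle-backed columns mentioned above; the segment values and lengths are made up, and the sizes will differ from what RleDataFrame actually stores.

```r
# Illustrative only: three made-up segments covering one million probes.
seg_values  <- c(0.0, 0.8, -0.5)
seg_lengths <- c(400000L, 250000L, 350000L)
dense <- rep(seg_values, times = seg_lengths)   # one value per probe

compressed <- rle(dense)                 # stores only run values and run lengths
n_runs     <- length(compressed$values)
dense_size <- object.size(dense)
rle_size   <- object.size(compressed)
restored   <- inverse.rle(compressed)    # expands back to the dense vector
```

Point estimates that differ at every probe do not compress this way, which is presumably why a file-backed matrix is used for those instead.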


Pete


Peter M. Haverty, Ph.D.
Genentech, Inc.
phave...@gene.com

On Fri, Sep 18, 2015 at 1:56 PM, Tim Triche, Jr. 
wrote:

> bigmemoryExtras (Peter Haverty's extensions to bigMemory/bigMatrix) can be
> handy for this, as it works well as a backend, especially if you go about
> splitting by chromosome as for CNV segmentation, DMR finding, etc.   It's
> not as seamless as one might like, but it's the closest thing I've found.
>
> SciDb tries to implement a similar API, but for a distributed version of
> this where the data itself is in a columnar database and served on demand.
> I tried getting that up and running as a SummarizedExperiment backend, but
> did not succeed.  I have previously shoveled all of the TCGA 450k data into
> one 7,000+ column bigMatrix which serializes to about 14GB on disk.
>
> If you have any replicates in your 700+ samples, it's a good idea to keep
> their SNP calls in metadata(yourSE), although if you change names it needs
> to propagate into the dependent metadata.  This is why I started monkeying
> around with linkedExperiments where those mappings are enforced; it's
> becoming more of an issue with the TARGET pediatric AML study, where there
> are numerous diagnosis-remission-relapse trios whose identity I wish to
> verify periodically.  The SNPs on the 450k array are great for this
> purpose, but minfi doesn't really have a slot for them per se, so live in
> metadata().
>
>
> --t
>
> On Fri, Sep 18, 2015 at 1:29 PM, Vincent Carey wrote:
>
> > i am dealing with ~700 450k arrays
> >
> > they are derived from one study, so it makes sense to think of
> > them holistically.
> >
> > both the load time and the memory consumption are not satisfactory.
> >
> > has anyone worked on an object type that implements the rangedSE API but
> > has the assay data out of memory?
> >
> > > unix.time(load("wbmse.rda"))
> >    user  system elapsed
> >  30.131   2.396  61.036
> >
> > > object.size(wbmse)
> > 124031032 bytes
> >
> > > dim(wbmse)
> > [1] 485577    690
> >
> > > object.size(assays(wbmse))
> > 2680430992 bytes


Re: [Bioc-devel] as.character method for GenomicRanges?

2015-04-24 Thread Peter Haverty
Those are all good reasons for keeping the strand by default.  I'm on board.

Pete


Peter M. Haverty, Ph.D.
Genentech, Inc.
phave...@gene.com

On Fri, Apr 24, 2015 at 11:26 AM, Hervé Pagès wrote:

> On 04/24/2015 11:08 AM, Peter Haverty wrote:
>
>> Good catch. We'll want the strand in case we need to go back to a GRanges.
>> I would make the strand addition optional with the default of FALSE.
>> It's nice to have a column of strings you can paste right into a genome
>> browser (sorry Michael :-) ).  I often pass my bench collaborators a
>> spreadsheet with such a column.
>>
>
> as.character(unstrand(gr)) ?
>
> 3 reasons I'm not too keen about 'ignore.strand=TRUE' being the default:
>
> (1) Many functions and methods in GenomicRanges/GenomicAlignments
> have an 'ignore.strand' argument. For consistency, the default
> value has been set to FALSE everywhere. Note that this was done
> even if this default doesn't reflect the most common use case
> (e.g. summarizeOverlaps).
>
> (2) I think it's good to have the default behavior of as.character()
> allow going back and forth between GRanges and character vector
> without losing the strand information.
>
> (3) The "table" method for Vector would break if as.character was
> ignoring the strand by default. Can be worked-around by
> implementing a method for GenomicRanges objects but...
>
> Hope that makes sense.
>
> H.
>
>
>
>> Pete
>>
>> 
>> Peter M. Haverty, Ph.D.
>> Genentech, Inc.
>> phave...@gene.com
>>
>> On Fri, Apr 24, 2015 at 10:50 AM, Hervé Pagès <hpa...@fredhutch.org> wrote:
>>
>> On 04/24/2015 10:21 AM, Michael Lawrence wrote:
>>
>> Sorry, one more concern, if you're thinking of using as a range key, you
>> will need the strand, but many use cases might not want the strand on
>> there. Like for pasting into a genome browser.
>>
>>
>> What about appending the strand only for GRanges objects that
>> have at least one range that is not on *?
>>
>> setMethod("as.character", "GenomicRanges",
>>     function(x)
>>     {
>>         if (length(x) == 0L)
>>             return(character(0))
>>         ans <- paste0(seqnames(x), ":", start(x), "-", end(x))
>>         if (any(strand(x) != "*"))
>>             ans <- paste0(ans, ":", strand(x))
>>         ans
>>     }
>> )
>>
>> > as.character(gr)
>>  [1] "chr1:1-10"  "chr2:2-10"  "chr2:3-10"  "chr2:4-10"  "chr1:5-10"
>>  [6] "chr1:6-10"  "chr3:7-10"  "chr3:8-10"  "chr3:9-10"  "chr3:10-10"
>>
>> > strand(gr)[2:3] <- c("-", "+")
>> > as.character(gr)
>>  [1] "chr1:1-10:*"  "chr2:2-10:-"  "chr2:3-10:+"  "chr2:4-10:*"  "chr1:5-10:*"
>>  [6] "chr1:6-10:*"  "chr3:7-10:*"  "chr3:8-10:*"  "chr3:9-10:*"  "chr3:10-10:*"
>>
>> H.
>>
>>
>> On Fri, Apr 24, 2015 at 10:18 AM, Michael Lawrence <micha...@gene.com> wrote:
>>
>> It is a great idea, but I'm not sure I would use it to implement
>> table(). Allocating those strings will be costly. Don't we already
>> have the 4-way int hash? Of course, my intuition might be completely
>> off here.
>>
>>
>> On Fri, Apr 24, 2015 at 9:59 AM, Hervé Pagès <hpa...@fredhutch.org> wrote:
>>
>> Hi Pete,
>>
>> Excellent idea. That will make things like table() work out-of-the-box
>> on GenomicRanges objects. I'll add that.
>>
>> Thanks,
>> H.
>>
>>
>>
>>  On 04/24/2015 09:43 AM, Peter Haverty wrote:
>&

Re: [Bioc-devel] as.character method for GenomicRanges?

2015-04-24 Thread Peter Haverty
Going the other way can look like this:

##' Parse one or more location strings and return as a GRanges
##'
##' Parse one or more location strings and return as a GRanges. GRanges
##' will get the names from the location.strings.
##' @param location.string character
##' @export
##' @return GRanges
##' @family location strings
locstring2GRanges <- function(location.string) {
  # Take location strings such as "chr11:123-127" or "11:123-127" and
  # return a GRanges.
  # gsub (rather than sub) removes every space and comma, not just the first.
  location.string = gsub("\\s+", "", location.string)
  location.string = gsub(",", "", location.string)
  #location.string = sub("\\.\\.","-",location.string)  # TWU style location strings
  if (any(! grepl("^(chr){0,1}.+:\\d+-\\d+$", location.string))) {
    stop("Some location strings do not look like chr1:123-456.")
  }
  # Named start.pos/end.pos to avoid masking base::stop().
  start.pos = as.integer(sub("^.+:(\\d+)-.+$", "\\1", location.string))
  end.pos = as.integer(sub("^.+-(\\d+)", "\\1", location.string))
  # "(chr)?" makes the whole prefix optional; "chr{0,1}" only made the "r"
  # optional, so strings without "chr" were left unparsed.
  gr = GRanges(
    seqnames=sub("^(chr)?(.*):.*$", "\\2", location.string),
    ranges=IRanges(
      start=pmin(start.pos, end.pos),
      end=pmax(start.pos, end.pos),
      names=names(location.string)) )
  return(gr)
}

Surprisingly the repeated subs are faster than splitting.  Some people,
such as GSNAP author Tom Wu, use the format "chr1:1234..1235", which we
might want to support. The pmin/pmax stuff handles cases where the negative
strand is expressed by flipping start and stop. We might not need that.
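For the "chr1:1234..1235" style mentioned above, a base R sketch might look like the following. The helper parse_locstring is hypothetical and returns a plain data.frame rather than a GRanges, so it runs without Bioconductor; unlike the function above, it keeps any "chr" prefix as-is.

```r
# Hypothetical helper, not part of GenomicRanges or genoset.
parse_locstring <- function(x) {
  x <- gsub("[[:space:],]", "", x)       # drop whitespace and thousands separators
  x <- sub("..", "-", x, fixed = TRUE)   # accept the "start..end" form
  pattern <- "^(.+):([0-9]+)-([0-9]+)$"
  if (any(!grepl(pattern, x)))
    stop("Some location strings do not look like chr1:123-456.")
  a <- as.integer(sub(pattern, "\\2", x))
  b <- as.integer(sub(pattern, "\\3", x))
  data.frame(seqnames = sub(pattern, "\\1", x),
             start = pmin(a, b),         # tolerate flipped coordinates
             end = pmax(a, b),
             stringsAsFactors = FALSE)
}

parsed <- parse_locstring(c("chr1:1,234..1,235", "chr2:500-100"))
```

Using one anchored pattern for validation and for each capture keeps the three sub() calls consistent with one another.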



Pete


Peter M. Haverty, Ph.D.
Genentech, Inc.
phave...@gene.com

On Fri, Apr 24, 2015 at 11:08 AM, Peter Haverty  wrote:

> Good catch. We'll want the strand in case we need to go back to a GRanges.
> I would make the strand addition optional with the default of FALSE. It's
> nice to have a column of strings you can paste right into a genome browser
> (sorry Michael :-) ).  I often pass my bench collaborators a spreadsheet
> with such a column.
>
> Pete
>
> 
> Peter M. Haverty, Ph.D.
> Genentech, Inc.
> phave...@gene.com
>
> On Fri, Apr 24, 2015 at 10:50 AM, Hervé Pagès
> wrote:
>
>> On 04/24/2015 10:21 AM, Michael Lawrence wrote:
>>
>>> Sorry, one more concern, if you're thinking of using as a range key, you
>>> will need the strand, but many use cases might not want the strand on
>>> there. Like for pasting into a genome browser.
>>>
>>
>> What about appending the strand only for GRanges objects that
>> have at least one range that is not on *?
>>
>> setMethod("as.character", "GenomicRanges",
>>     function(x)
>>     {
>>         if (length(x) == 0L)
>>             return(character(0))
>>         ans <- paste0(seqnames(x), ":", start(x), "-", end(x))
>>         if (any(strand(x) != "*"))
>>             ans <- paste0(ans, ":", strand(x))
>>         ans
>>     }
>> )
>>
>> > as.character(gr)
>>  [1] "chr1:1-10"  "chr2:2-10"  "chr2:3-10"  "chr2:4-10"  "chr1:5-10"
>>  [6] "chr1:6-10"  "chr3:7-10"  "chr3:8-10"  "chr3:9-10"  "chr3:10-10"
>>
>> > strand(gr)[2:3] <- c("-", "+")
>> > as.character(gr)
>>  [1] "chr1:1-10:*"  "chr2:2-10:-"  "chr2:3-10:+"  "chr2:4-10:*"  "chr1:5-10:*"
>>  [6] "chr1:6-10:*"  "chr3:7-10:*"  "chr3:8-10:*"  "chr3:9-10:*"  "chr3:10-10:*"
>>
>> H.
>>
>>
>>> On Fri, Apr 24, 2015 at 10:18 AM, Michael Lawrence <micha...@gene.com> wrote:
>>>
>>> It is a great idea, but I'm not sure I would use it to implement
>>> table(). Allocating those strings will be costly. Don't we already
>>> have the 4-way int hash? Of course, my intuition might be completely
>>> off here.
>>>
>>>
>>> On Fri, Apr 24, 2015 at 9:59 AM, Hervé Pagès <hpa...@fredhutch.org> wrote:
>>>
>>> Hi Pete,
>>>
>>> Excellent idea. That will make things like table() work out-of-the-box
>>> on GenomicRanges objects. I'll add that.
>>>
>>> Thanks,
>>> H.
>>>
>>>
>>>
>>> On 04/24/2015 09:43 AM, Peter Haverty wrote:
>>>
>

Re: [Bioc-devel] as.character method for GenomicRanges?

2015-04-24 Thread Peter Haverty
Good catch. We'll want the strand in case we need to go back to a GRanges.
I would make the strand addition optional with the default of FALSE. It's
nice to have a column of strings you can paste right into a genome browser
(sorry Michael :-) ).  I often pass my bench collaborators a spreadsheet
with such a column.

Pete


Peter M. Haverty, Ph.D.
Genentech, Inc.
phave...@gene.com

On Fri, Apr 24, 2015 at 10:50 AM, Hervé Pagès wrote:

> On 04/24/2015 10:21 AM, Michael Lawrence wrote:
>
>> Sorry, one more concern, if you're thinking of using as a range key, you
>> will need the strand, but many use cases might not want the strand on
>> there. Like for pasting into a genome browser.
>>
>
> What about appending the strand only for GRanges objects that
> have at least one range that is not on *?
>
> setMethod("as.character", "GenomicRanges",
>     function(x)
>     {
>         if (length(x) == 0L)
>             return(character(0))
>         ans <- paste0(seqnames(x), ":", start(x), "-", end(x))
>         if (any(strand(x) != "*"))
>             ans <- paste0(ans, ":", strand(x))
>         ans
>     }
> )
>
> > as.character(gr)
>  [1] "chr1:1-10"  "chr2:2-10"  "chr2:3-10"  "chr2:4-10"  "chr1:5-10"
>  [6] "chr1:6-10"  "chr3:7-10"  "chr3:8-10"  "chr3:9-10"  "chr3:10-10"
>
> > strand(gr)[2:3] <- c("-", "+")
> > as.character(gr)
>  [1] "chr1:1-10:*"  "chr2:2-10:-"  "chr2:3-10:+"  "chr2:4-10:*"  "chr1:5-10:*"
>  [6] "chr1:6-10:*"  "chr3:7-10:*"  "chr3:8-10:*"  "chr3:9-10:*"  "chr3:10-10:*"
>
> H.
>
>
>> On Fri, Apr 24, 2015 at 10:18 AM, Michael Lawrence <micha...@gene.com> wrote:
>>
>> It is a great idea, but I'm not sure I would use it to implement
>> table(). Allocating those strings will be costly. Don't we already
>> have the 4-way int hash? Of course, my intuition might be completely
>> off here.
>>
>>
>> On Fri, Apr 24, 2015 at 9:59 AM, Hervé Pagès <hpa...@fredhutch.org> wrote:
>>
>> Hi Pete,
>>
>> Excellent idea. That will make things like table() work out-of-the-box
>> on GenomicRanges objects. I'll add that.
>>
>> Thanks,
>> H.
>>
>>
>>
>> On 04/24/2015 09:43 AM, Peter Haverty wrote:
>>
>> Would people be interested in having this:
>>
>> setMethod("as.character", "GenomicRanges",
>>   function(x) {
>>   paste0(seqnames(x), ":", start(x), "-", end(x))
>>   })
>>
>> ?
>>
>> I find myself doing that a lot to make unique names or for output that
>> goes to collaborators.  I suppose we might want to tack on the strand if
>> it isn't "*".  I have some code for going the other direction too, if
>> there is interest.
>>
>>
>>
>> Pete
>>
>> 
>> Peter M. Haverty, Ph.D.
>> Genentech, Inc.
>> phave...@gene.com
>>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpa...@fredhutch.org
>> Phone:  (206) 667-5791
>> Fax:    (206) 667-1319
>>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpa...@fredhutch.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319
>


[Bioc-devel] as.character method for GenomicRanges?

2015-04-24 Thread Peter Haverty
Would people be interested in having this:

setMethod("as.character", "GenomicRanges",
  function(x) {
  paste0(seqnames(x), ":", start(x), "-", end(x))
  })

?

I find myself doing that a lot to make unique names or for output that
goes to collaborators.  I suppose we might want to tack on the strand if it
isn't "*".  I have some code for going the other direction too, if there is
interest.
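One reading of "tack on the strand if it isn't '*'" is a per-range rule, sketched here in base R. The helper format_locstring is hypothetical, and plain vectors stand in for the seqnames(), start(), end() and strand() accessors so that the example runs without GenomicRanges.

```r
# Hypothetical base-R stand-in for the proposed method.
format_locstring <- function(chr, start, end, strand = rep("*", length(chr))) {
  ans <- paste0(chr, ":", start, "-", end)
  # Append ":<strand>" only for ranges whose strand is not "*".
  ifelse(strand == "*", ans, paste0(ans, ":", strand))
}

unstranded <- format_locstring(c("chr1", "chr2"), c(1L, 2L), c(10L, 10L))
mixed      <- format_locstring(c("chr1", "chr2"), c(1L, 2L), c(10L, 10L),
                               strand = c("*", "-"))
```

A per-range rule yields strings of mixed shape within one vector, which is a trade-off to weigh if the strings are later parsed back into ranges.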



Pete


Peter M. Haverty, Ph.D.
Genentech, Inc.
phave...@gene.com


Re: [Bioc-devel] Error after upgrading to R 3.2.0 with loading of GO.db: 2 arguments passed to .Internal(ls) which requires 3

2015-04-17 Thread Peter Haverty
Apparently a direct call to the ls .Internal is being made somewhere in
GO.db (probably as a speed hack).  The API for .Internal(ls()) has changed
in R 3.2.0. This call should be replaced with "names", which works on
environments as of R 3.2.0 and is also much faster than the direct call to
.Internal(ls()).
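A small base R illustration of the suggested replacement, assuming R >= 3.2.0; the environment and the keys in it are made up:

```r
# names() on an environment lists all bindings, including dot-prefixed ones,
# in no particular order; ls(all.names = TRUE) returns the same set, sorted.
e <- new.env()
assign("GO:0000001", 1, envir = e)
assign(".hidden", 2, envir = e)

via_names <- names(e)
via_ls    <- ls(e, all.names = TRUE)
same_keys <- setequal(via_names, via_ls)
```

Code that relied on the sorted order of ls() would need an explicit sort() after switching to names().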

Regards,

Pete


Peter M. Haverty, Ph.D.
Genentech, Inc.
phave...@gene.com

On Fri, Apr 17, 2015 at 1:53 PM, Sergei Ryazansky 
wrote:

> Hello everyone,
>
> after updating to fresh R 3.2.0 from R 3.1.3, GO.db, org.Dm.eg.db and
> org.Hs.eg.db fail to load (sorry for non-english environment):
>
> > library("GO.db")
> Error : .onLoad failed in loadNamespace() for 'GO.db', details:
>   call: ls(envir, all.names = TRUE)
>   error: 2 arguments passed to .Internal(ls) which requires 3
> Error: package or namespace load failed for 'GO.db'
>
>
> > traceback()
> 2: stop(gettextf("package or namespace load failed for %s",
>        sQuote(package)), call. = FALSE, domain = NA)
> 1: library("GO.db")
>
>
> > sessionInfo()
> R version 3.2.0 (2015-04-16)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Running under: Ubuntu 14.10
>
> locale:
>  [1] LC_CTYPE=ru_RU.UTF-8       LC_NUMERIC=C               LC_TIME=ru_RU.UTF-8
>  [4] LC_COLLATE=ru_RU.UTF-8     LC_MONETARY=ru_RU.UTF-8    LC_MESSAGES=ru_RU.UTF-8
>  [7] LC_PAPER=ru_RU.UTF-8       LC_NAME=C                  LC_ADDRESS=C
> [10] LC_TELEPHONE=C             LC_MEASUREMENT=ru_RU.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods
> [9] base
>
> other attached packages:
>  [1] BiocInstaller_1.16.2 RSQLite_1.0.0        DBI_0.3.1            AnnotationDbi_1.28.2
>  [5] GenomeInfoDb_1.2.5   IRanges_2.0.1        S4Vectors_0.4.0      GEOquery_2.32.0
>  [9] Biobase_2.26.0       BiocGenerics_0.12.1
>
> loaded via a namespace (and not attached):
>  [1] Rcpp_0.11.5      MASS_7.3-40      munsell_0.4.2    colorspace_1.2-6 stringr_0.6.2
>  [6] plyr_1.8.1       tools_3.2.0      grid_3.2.0       gtable_0.1.2     digest_0.6.8
> [11] reshape2_1.4.1   ggplot2_1.0.1    bitops_1.0-6     RCurl_1.95-4.5   scales_0.2.4
> [16] XML_3.98-1.1     proto_0.3-10
>
> Is there any way to fix this?
>
>
> --
> *Sincerely,*
> *Sergei Ryazansky, PhD*
> *IMG RAS, Moscow*


Re: [Bioc-devel] Changes to the SummarizedExperiment Class

2015-03-04 Thread Peter Haverty
Clarification:  the complexity of the full BioC class universe, not the
SE/eSet part. GenomicRanges, GRanges, GRangesList, RangesView,
RangesViewsList, ... I think all of that intimidates new people.  Maybe
that's not generally the case.  Sorry, I've taken this thread way off
topic.  I'll stop now.

Pete


Peter M. Haverty, Ph.D.
Genentech, Inc.
phave...@gene.com

On Wed, Mar 4, 2015 at 10:08 AM, Tim Triche, Jr. 
wrote:

> What complexity?  The Nature Methods paper laid it out: for most people,
> most of the time, use an SE.
>
> That way, the organization of metadata and covariates is enforced for you,
> like an ExpressionSet (another winning data structure) but without its
> baggage.
>
> Maybe the "Summarized" in the name isn't such a bad idea after all.
>  "AfterTheDataMungingIsDone" doesn't have the same ring to it.
>
> What would be equally awesome IMHO is to have a similarly unifying
> structure for integrative work.
>
> But that's just, like, my opinion.  I've taken a whack at it when I knew
> even less than I do now, and it's hard.  However, data management for
> expression arrays was hard, too.  If I'm not mistaken, there were benefits
> to solving that data management problem, too.  Some sort of a software
> project.  I think it was called "MADMAN".  I'll have to go look.  ;-)
>
>
>
> Statistics is the grammar of science.
> Karl Pearson <http://en.wikipedia.org/wiki/The_Grammar_of_Science>
>
> On Wed, Mar 4, 2015 at 10:03 AM, Peter Haverty 
> wrote:
>
>>  Michael has a good point. The complexity of the BioC universe of
>> classes hurts our ability to attract new users. More classes would be a
>> minus there ... but a small set of common, explicit APIs would simplify
>> things.  Rectangular things implement the matrix Interface.  :-)
>> Deprecating old stuff, like eSet, might help more than it hurts, on the
>> simplicity front.
>>
>>  P.S. apropos of understanding this universe of classes, I *love* the
>> methods(class=x) thing Vincent mentioned.
>>
>>  Pete
>>
>> 
>> Peter M. Haverty, Ph.D.
>> Genentech, Inc.
>> phave...@gene.com
>>
>> On Wed, Mar 4, 2015 at 9:38 AM, Michael Lawrence <
>> lawrence.mich...@gene.com> wrote:
>>
>>> I think we need to make sure that there are enough benefits of something
>>> like GRangesFrame before we introduce yet another complicated and
>>> overlapping data structure into the framework. Prior to summarization, the
>>> ranges seem primary, after summarization, it may often make sense for them
>>> to be secondary. But I'm just not sure what we gain from a new data
>>> structure.
>>>
>>> On Wed, Mar 4, 2015 at 12:28 AM, Hervé Pagès
>>> wrote:
>>>
>>>> GRangesFrame is an interesting idea and I gave it some thoughts.
>>>>
>>>> There is this nice symmetry between GRanges and GRangesFrame:
>>>>
>>>> - GRanges = a naked GRanges + a DataFrame accessible via mcols()
>>>>
>>>> - GRangesFrame = a DataFrame + a naked GRanges accessible via
>>>>  some accessor (e.g. rowRanges())
>>>>
>>>> So GRanges and GRangesFrame are equivalent in terms of what they
>>>> can hold, but different in terms of API: the former has the ranges
>>>> API as primary API and the DataFrame API on its mcols() component,
>>>> and the latter has the DataFrame API as primary API and the ranges
>>>> API on its rowRanges() component. Nice switch!
>>>>
>>>> What does this API switch bring us? A GRangesFrame object is now
>>>> an object that fully behaves like a DataFrame and people can also
>>>> perform range-based operations on its rowRanges() component.
>>>> Here is what I'm afraid is going to happen: people will also want
>>>> to be able to perform range-based operations *directly* on
>>>> these objects, i.e. without having to call rowRanges() first.
>>>> So for example when they do subsetByOverlaps(), subsetting
>>>> happens vertically. Also the Hits object returned by findOverlaps()
>>>> would contain row indices. Problem with this is that these objects
>>>> now start to suffer from the "dual personality syndrome". For
>>>> example, it's not clear anymore what their length should be.
>>>> Strictly speaking it should be their number of columns (that's
>>>> what the length of a DataFrame is), but the ranges A

Re: [Bioc-devel] Changes to the SummarizedExperiment Class

2015-03-04 Thread Peter Haverty
Michael has a good point. The complexity of the BioC universe of classes
hurts our ability to attract new users. More classes would be a minus there
... but a small set of common, explicit APIs would simplify things.
Rectangular things implement the matrix Interface.  :-) Deprecating old
stuff, like eSet, might help more than it hurts, on the simplicity front.

P.S. apropos of understanding this universe of classes, I *love* the
methods(class=x) thing Vincent mentioned.
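For readers who have not used it, a minimal base R example of that call follows. It uses "data.frame" only because that class is always available; the thread is about Bioconductor classes such as GRanges, which would need their packages loaded first.

```r
# List the methods registered for a class (S3 and, in R >= 3.2.0, S4).
m <- methods(class = "data.frame")
n_methods   <- length(m)
has_summary <- any(grepl("summary", as.character(m)))
```

Printing m at the console gives the full listing.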

Pete


Peter M. Haverty, Ph.D.
Genentech, Inc.
phave...@gene.com

On Wed, Mar 4, 2015 at 9:38 AM, Michael Lawrence 
wrote:

> I think we need to make sure that there are enough benefits of something
> like GRangesFrame before we introduce yet another complicated and
> overlapping data structure into the framework. Prior to summarization, the
> ranges seem primary, after summarization, it may often make sense for them
> to be secondary. But I'm just not sure what we gain from a new data
> structure.
>
> On Wed, Mar 4, 2015 at 12:28 AM, Hervé Pagès  wrote:
>
>> GRangesFrame is an interesting idea and I gave it some thoughts.
>>
>> There is this nice symmetry between GRanges and GRangesFrame:
>>
>> - GRanges = a naked GRanges + a DataFrame accessible via mcols()
>>
>> - GRangesFrame = a DataFrame + a naked GRanges accessible via
>>  some accessor (e.g. rowRanges())
>>
>> So GRanges and GRangesFrame are equivalent in terms of what they
>> can hold, but different in terms of API: the former has the ranges
>> API as primary API and the DataFrame API on its mcols() component,
>> and the latter has the DataFrame API as primary API and the ranges
>> API on its rowRanges() component. Nice switch!
>>
>> What does this API switch bring us? A GRangesFrame object is now
>> an object that fully behaves like a DataFrame and people can also
>> perform range-based operations on its rowRanges() component.
>> Here is what I'm afraid is going to happen: people will also want
>> to be able to perform range-based operations *directly* on
>> these objects, i.e. without having to call rowRanges() first.
>> So for example when they do subsetByOverlaps(), subsetting
>> happens vertically. Also the Hits object returned by findOverlaps()
>> would contain row indices. Problem with this is that these objects
>> now start to suffer from the "dual personality syndrome". For
>> example, it's not clear anymore what their length should be.
>> Strictly speaking it should be their number of columns (that's
>> what the length of a DataFrame is), but the ranges API that
>> we're trying to put on them also makes them feel like vectors
>> along the vertical dimension so it also feels that their length
>> should be their number of rows. Same thing with 1D subsetting.
>> Why does it subset the columns and not the rows? Most people
>> are now confused.
>>
>> It's interesting to note that the same thing happens with GRanges
>> objects, but in the opposite direction: people wish they could
>> do DataFrame operations directly on them without calling mcols()
>> first. But in order to preserve the good health of GRanges objects,
>> we've not done that (except for $, a shortcut for mcols(x)$,
>> the pressure was just too strong).
>>
>> H.
>>
>>
>>
>> On 03/03/2015 04:35 PM, Michael Lawrence wrote:
>>
>>> Should be possible for the annotations to be of any type, as long as they
>>> satisfy a simple contract of NROW() and 2D "[". Then, you could have a
>>> DataFrame, GRanges, or whatever in there. But it would be nice to have a
>>> special class for the container with range information. The contract for
>>> the range annotation would be to have a granges() method.
>>>
>>> I agree it would be nice if there was a way with the methods package to
>>> easily assert such contracts. For example, one could define an interface
>>> with a set of generics (and optionally the relevant position in the
>>> generic
>>> signature). Then, once all of the methods have been assigned for a
>>> particular class, it is made to inherit from that contract class. There
>>> are
>>> lots of gotchas though. Not sure how useful it would be in practice.
>>>
>>>
>>> On Tue, Mar 3, 2015 at 4:07 PM, Peter Haverty 
>>> wrote:
>>>
>>>> There are some nice similarities in these new imaginary types.  A
>>>> "GRangesFrame" is a list of dimensionally identical things (columns) and
>>>> some row meta-data (the GRanges).  The

Re: [Bioc-devel] Changes to the SummarizedExperiment Class

2015-03-03 Thread Peter Haverty
There are some nice similarities in these new imaginary types.  A
"GRangesFrame" is a list of dimensionally identical things (columns) and
some row meta-data (the GRanges).  The SE-like object is similarly a list
of dimensionally like things (matrices, RleDataFrames, BigMatrix objects,
HDF5-backed things) with some row meta-data (a DataFrame or GRangesFrame).
Elegant?  Maybe they would actually be relatives in the class tree.

I wonder if this kind of thing would be easier if we had Java-style
Interfaces or duck-typing.  The "x" slot of "y" holds something that
implements this set of methods ...
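
A rough sketch of the duck-typing idea in base R, assuming the contract is
"supports NROW() and two-dimensional [" (the contract suggested elsewhere in
this thread). The helper name satisfies_row_contract is made up for
illustration and is not part of any package; it checks the contract simply
by trying the operations.

```r
# Hypothetical helper: TRUE if `x` supports NROW() and two-dimensional "["
# extraction, checked by simply trying both (duck typing).
satisfies_row_contract <- function(x) {
  ok <- tryCatch({
    i <- seq_len(min(1L, NROW(x)))   # first row, or no rows if x is empty
    sub <- x[i, , drop = FALSE]      # errors for objects without 2D "["
    NROW(sub) == length(i)
  }, error = function(e) FALSE)
  isTRUE(ok)
}

satisfies_row_contract(data.frame(a = 1:3, b = letters[1:3]))  # TRUE
satisfies_row_contract(matrix(1:6, nrow = 3))                  # TRUE
satisfies_row_contract(1:5)                                    # FALSE
```

Trying the operations is the cheapest check but it only proves that the
calls succeed, not that they mean the same thing for every class.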

Oh, and kinda apropos, the genoset class will probably go away or become an
extension to this new SE-like thing.  The extra stuff that comes along with
genoset will still be available.

Pete


Peter M. Haverty, Ph.D.
Genentech, Inc.
phave...@gene.com

On Tue, Mar 3, 2015 at 3:42 PM, Tim Triche, Jr. 
wrote:

> This.
>
> It would be damned near perfect as a return value for assays coming out of
> an object that held several such assays at several time points in a
> population, where there are both assay-wise and covariate-wise "holes" that
> could nonetheless be usefully imputed across assays.
>
>
> Statistics is the grammar of science.
> Karl Pearson <http://en.wikipedia.org/wiki/The_Grammar_of_Science>
>
> On Tue, Mar 3, 2015 at 3:25 PM, Peter Haverty 
> wrote:
>
> > >> I still think GRanges should be a subclass of DataFrame,
> > >> which would make this easy, but I don't seem to be winning that
> > >> argument.
> > >
> > > Just impossible. As Michael mentioned back in November, they have
> > > conflicting APIs.
> >
> >
> > Maybe a new "GRangesFrame" that is a DataFrame and holds a GRanges
> > (without mcols) as an index?


Re: [Bioc-devel] Changes to the SummarizedExperiment Class

2015-03-03 Thread Peter Haverty
>> I still think GRanges should be a subclass of DataFrame,
>> which would make this easy, but I don't seem to be winning that argument.
>
> Just impossible. As Michael mentioned back in November, they have
> conflicting APIs.


Maybe a new "GRangesFrame" that is a DataFrame and holds a GRanges
(without mcols) as an index?




Re: [Bioc-devel] Changes to the SummarizedExperiment Class

2015-03-03 Thread Peter Haverty
I'd like to see a basic class that takes a DataFrame and a sub-class that
takes a GRanges.  I still think GRanges should be a subclass of DataFrame,
which would make this easy, but I don't seem to be winning that argument.

While the hood is up, can we try some different names?
SummarizedExperiment never seemed like a great fit to me because it doesn't
necessarily contain experiments or summaries thereof.  It's a collection of
like-sized rectangular things with metadata on the two dimensions.  Maybe
the name could reflect what it holds rather than a common use case?
AnnotatedMatrixList?

Anyway, I'm excited to see a version on the way that takes a DataFrame as
rowData.  I'm glad you guys are working on that.

Regards,

Pete


Peter M. Haverty, Ph.D.
Genentech, Inc.
phave...@gene.com

On Tue, Mar 3, 2015 at 2:57 PM, Michael Lawrence 
wrote:

> Seems like rowData could be made to work universally through coercion.
> rowRanges would not, however, and one would like a convenient mechanism to
> condition on whether range information is available. One way is to
> introduce a new class and rely on dispatch. But that adds complexity.
>
> On Tue, Mar 3, 2015 at 2:44 PM, Gabe Becker  wrote:
>
> > Jim et al.,
> >
> > Why have two accessors (rowRanges, rowData), each of which are less
> > flexible than the underlying structure and thus will fail (return NULL?
> > or GRanges()/DataFrame() ?) in some proportion of valid objects?
> >
> > ~G
> >
> > On Tue, Mar 3, 2015 at 2:37 PM, Jim Hester 
> > wrote:
> >
> > > Motivated by the discussion thread from November
> > > (https://stat.ethz.ch/pipermail/bioc-devel/2014-November/006686.html),
> > > the Bioconductor core team is planning on making changes to the
> > > SummarizedExperiment class.  Our end goal is to allow the @rowData slot
> > > to become more flexible and hold either a DataFrame or GRanges type
> > > object.
> > >
> > > To this end we have currently deprecated the current rowData accessor
> > > in favor of a rowRanges accessor.  This change has resulted in a few
> > > broken builds in devel, which we are in the process of fixing now.  We
> > > will contact any package authors directly if needed for this migration.
> > >
> > > The rowData accessor will be deprecated in this release, however
> > > eventually the plan is to re-purpose this function to serve as an
> > > accessor for DataFrame data on the rows.
> > >
> > > Please let us know if you have any questions with the above and if you
> > > need any assistance with the transition.
> > >
> > >
> >
> >
> >
> > --
> > Gabriel Becker, Ph.D
> > Computational Biologist
> > Genentech Research


Re: [Bioc-devel] unit tests for C code inside a package

2015-01-26 Thread Peter Haverty
My favorite solution to this would be to use Rcpp attributes to add R-level 
functions for each C function. You can use these just for testing and skip the 
exporting and manual pages. That's not quite what you asked for but it would 
work. 
BTW transcription factors are my thing so I'm eager to try out the package. 
Thanks!

Typed with thumbs.

> On Jan 26, 2015, at 1:14 AM, Elena Grassi  wrote:
> 
> Hi,
> 
> I'm writing a package that calculates total affinity (see PMID
> 21335606 and 16873464 if transcription factors are your thing):
> up until now in our lab we've used a pure C tool that needs fasta and
> tabular formatted PFM-PWM files but we are willing to produce
> something more comfortable that stems from some related Bioc packages
> (TFBSTools and JASPAR2014 basically).
> The package has two simple R methods that call a C entry point and
> then all the calculations are performed by the C code.
> I'd like to write extended unit tests but I am not sure how to do it:
> I've some tests for the R portions (that obviously depends
> also on the C calculation in some parts) but I would like to test in a
> more fine grained way the C code therefore RUnit and BiocGenerics
> seems to solve only a portion of my problem. I would like to use a
> C-based library for unit testing and link it to the automatic
> check done for the package thanks to BiocGenerics: is this reasonable?
> I've looked at some other packages without being able to find
> something similar to this.
> I've read 
> http://stackoverflow.com/questions/26322135/unit-testing-rcpp-code-in-a-package,
> but I think
> that it would be nice to test the C code with the whole package.
> 
> Thanks,
> E.
> ps. the source code right now is here:
> https://github.com/vodkatad/MatrixRider (the vignette is on its way,
> the nodevel branch is the one active now and it works with R version
> 3.1.2 and Bioconductor 3.0 for our internal use).
> 
> -- 
> $ pom
> 


Re: [Bioc-devel] Use Imports instead of Depends in the DESCRIPTION files of bioconductor packages

2015-01-03 Thread Peter Haverty
There are a few other changes in there too, but profiling did identify low
hanging fruit like the sapplys. From there I have found a long list of
refactoring opportunities that offer ~2% improvements. These may not be
worth the risk of reversions, however. I'll be putting together a patch
proposal with the easy changes in the next few days.
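
For anyone unfamiliar with the sapply-to-vapply change discussed in this
thread, here is a minimal base R sketch. The toy list x is an assumption for
illustration and has nothing to do with the actual code being patched;
vapply() is documented as safer and sometimes faster because the result type
is declared up front.

```r
# Toy input standing in for a list of per-element values.
x <- list(a = 1:3, b = 4:6, c = integer(0))

# sapply() guesses the shape of the result from whatever FUN returns.
lens_sapply <- sapply(x, length)

# vapply() is told the expected type and length of each result up front
# and raises an error if any element does not match.
lens_vapply <- vapply(x, length, FUN.VALUE = integer(1))

identical(lens_sapply, lens_vapply)

# The difference shows on empty input: sapply() falls back to a list,
# while vapply() still returns the declared type.
empty_sapply <- sapply(list(), length)
empty_vapply <- vapply(list(), length, FUN.VALUE = integer(1))
```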

Regards,

Pete


Peter M. Haverty, Ph.D.
Genentech, Inc.
phave...@gene.com

On Fri, Jan 2, 2015 at 1:58 AM, Michael Lawrence 
wrote:

> Pete Haverty is the one working on this. He has almost cut loading time in
> half by just changing some sapply and lapply calls to vapply calls. Most
> likely because allocating all of those list elements is expensive, and
> Martin's memory parameters also help with that.
>
> On Wed, Dec 31, 2014 at 10:30 PM, Hervé Pagès 
> wrote:
>
> > Hi Gordon,
> >
> > My guess is that it has to do with how many symbols get exported.
> > For example on my machine, doing library(limma) in a fresh session
> > takes 0.261s and triggers export of 292 symbols (as reported by
> > ls(..., all.names=TRUE)). Doing library(GenomicRanges) in a fresh
> > session takes 2.724s and triggers export of 1581 symbols (counting
> > the symbols exported by all the packages that get loaded).
> >
> > Michael it's great to hear that somebody is working on speeding up
> > the code in charge of this.
> >
> > Happy New Year everybody!
> > H.
> >
> >
> >
> > On 12/31/2014 06:07 PM, Gordon K Smyth wrote:
> >
> >> Hi Michael,
> >>
> >> What aspect of the methods package causes the slowness?
> >>
> >> There are many packages (limma for one) that depend on methods but load
> >> quickly.
> >>
> >> Regards
> >> Gordon
> >>
> >>
> >>  Date: Wed, 31 Dec 2014 09:17:01 -0800
> >>> From: Michael Lawrence 
> >>> To: Peng Yu 
> >>> Cc: Bioconductor Package Maintainer ,
> >>> "bioc-devel@r-project.org" 
> >>> Subject: Re: [Bioc-devel] [devteam-bioc] Use Imports instead of
> >>> Depends in the DESCRIPTION files of bioconductor packages.
> >>>
> >>> The slowness is due to the methods package. We're working on it.
> >>>
> >>> Michael
> >>>
> >>> On Wed, Dec 31, 2014 at 8:47 AM, Peng Yu  wrote:
> >>>
> >>>> On Wed, Dec 31, 2014 at 9:41 AM, Martin Morgan <mtmor...@fredhutch.org>
> >>>> wrote:
> >>>>
> >>>>> On 12/24/2014 07:31 PM, Maintainer wrote:
> >>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> Many bioconductor packages Depends on other packages but not Imports
> >>>>>> other packages. (e.g., IRanges Depends on BiocGenerics.) Imports is
> >>>>>> usually preferred to Depends.
> >>>>>>
> >>>>>> http://stackoverflow.com/questions/8637993/better-explanation-of-when-to-use-imports-depends
> >>>>>> http://obeautifulcode.com/R/How-R-Searches-And-Finds-Stuff/
> >>>>>>
> >>>>>> Could the unnecessary Depends be forced to be replaced by Imports?
> >>>>>> This should improve the package load time significantly.
> >>>>>
> >>>>> R package symbols and other objects are collated at build time into a
> >>>>> 'name space'. When used,
> >>>>>
> >>>>> - Import: loads the name space from disk.
> >>>>> - Depends: loads the name space from disk, and attaches it to the
> >>>>>   search() path.
> >>>>>
> >>>>> Attaching is very inexpensive compared to loading, so there is no speed
> >>>>> improvement gained by Import'ing instead of Depend'ing.
> >>>>
> >>>> Yes. For example, changing Depends to Imports does not improve the
> >>>> package load time much.
> >>>>
> >>>> But loading a package in 4 sec seems to be too long.
> >>>>
> >>>>   system.time(suppressPackageStartupMessages(library(MBASED)))
> >>>>      user  system elapsed
> >>>>     4.404   0.100   4.553
> >>>>
> >>>> For example, it only takes 10% of the time to load ggplot2. It seems
> >>>> that many bioconductor packages have similar problems.
> >>>>
> >>>>   system.time(suppressPackageStartupMessages(library(ggplot2)))
> >>>>      user  system elapsed
> >>>>     0.394   0.036   0.460
> >>>>
> >>>>> The main reason to Depend: on a package is because the symbols defined
> >>>>> by the package are needed by the end-user. Import'ing a package is
> >>>>> appropriate when the package provides functionality only relevant to
> >>>>> the package author.
> >>>>
> >>>> What causes the load time to be too long? Is it because exporting too
> >>>> many functions from all dependent packages to the global namespace?
> >>>>
> >>>>> There are likely to be specific packages that mis-use Depends; packages
> >>>>> such as IRanges, GenomicRanges, etc use Depends: as intended, to provide
> >>>>> functions that are useful to the end user.
> >>>>>
> >>>>> Maintainers are certainly encouraged to think carefully about adding
> >>>>> packages providing functionality irrelevant to the end-user to the
> >>>>> Depends: field. The codetoolsBioC package (available from svn,

Re: [Bioc-devel] SummarizedExperiment vs ExpressionSet

2014-11-26 Thread Peter Haverty
OK, GRanges as vector that does overlap stuff makes sense, but I think
putting a DataFrame of metadata on that confuses the purpose of the
object.  How about a "GRangesTable" that inherits from both GenomicRanges
and DataTable?  It would be a DataFrame with a fancy index.  The DataFrame
API would make stuff like colnames work (rather than needing
colnames(mcols(x)) ). If this were used as the rowData for
SummarizedExperiment, then a plain DataFrame could be made to work too.

Pete


Peter M. Haverty, Ph.D.
Genentech, Inc.
phave...@gene.com

On Wed, Nov 26, 2014 at 9:33 AM, Michael Lawrence  wrote:

>
>
> On Wed, Nov 26, 2014 at 9:07 AM, Peter Haverty 
> wrote:
>
>> Hi all,
>>
>> I believe there is a strong need for an object that organizes a collection
>> of rectangular data (matrices, etc.) with metadata on the rows and
>> columns.  Can SummarizedExperiment inherit from something simpler that has
>> a DataFrame as rowData?  (I believe GenomicRanges should inherit from
>> DataTable, rather than Vector, and subset as x[i,j], but maybe that's
>> getting a bit off topic.)
>
>
> Have to disagree on that. A GRanges is a vector of ranges; a table is a
> list of vectors all of the same length. Different things. There was a lot
> of thought invested in that. But it does subset as x[i,j], so in theory
> SummarizedExperiment could be generalized to contain something with the
> contract of 2D extraction.
>
>
>> I often see people stuffing arbitrary data into
>> an ExpressionSet and calling one of the assays "exprs" as a work-around.
>>
>> Regards,
>>
>> Pete
>>
>> 
>> Peter M. Haverty, Ph.D.
>> Genentech, Inc.
>> phave...@gene.com
>>
>> On Wed, Nov 26, 2014 at 7:19 AM, Laurent Gatto  wrote:
>>
>> >
>> > On 26 November 2014 14:59, Wolfgang Huber wrote:
>> >
>> > > A colleague and I are designing a package for quantitative proteomics
>> > > data, and we are debating whether to base it on the
>> > > SummarizedExperiment or the ExpressionSet class.
>> > >
>> > > There is no immediate use for the ranges aspect of
>> > > SummarizedExperiment, so that would have to be carried around with
>> > > NAs, and this is a parsimony argument for using ExpressionSet
>> > > instead. OTOH, the interface of SummarizedExperiment is cleaner, its
>> > > code more modern and more likely to be updated, and users of the
>> > > Bioconductor project are likely to benefit from having to deal with a
>> > > single interface that works the same or similarly across packages,
>> > > rather than a variety of formats; which argues that new packages
>> > > should converge towards SummarizedExperiment('s interface).
>> > >
>> > > Are there any pertinent insights from this group?
>> >
>> > Instead of ExpressionSet, you could use MSnbase::MSnSet, which is
>> > essentially an ExpressionSet for quantitative proteomics (i.e it has a
>> > MIAPE slot, instead of MIAME for example).
>> >
>> > Ideally, a SummarizedExperiment for proteomics would use peptide/protein
>> > ranges, which is in the pipeline, as far as I am concerned. When that
>> > becomes available, there should be infrastructure to coerce an MSnSet
>> > (and/or other relevant data) into a SummarizedExperiment.
>> >
>> > Hope this helps.
>> >
>> > Best wishes,
>> >
>> > Laurent
>> >
>> > > Thanks and best wishes
>> > > Wolfgang
>> > >
>> >
>> > --
>> > Laurent Gatto
>> > http://cpu.sysbiol.cam.ac.uk/


Re: [Bioc-devel] SummarizedExperiment vs ExpressionSet

2014-11-26 Thread Peter Haverty
Hi all,

I believe there is a strong need for an object that organizes a collection
of rectangular data (matrices, etc.) with metadata on the rows and
columns.  Can SummarizedExperiment inherit from something simpler that has
a DataFrame as rowData?  (I believe GenomicRanges should inherit from
DataTable, rather than Vector, and subset as x[i,j], but maybe that's
getting a bit off topic.)  I often see people stuffing arbitrary data into
an ExpressionSet and calling one of the assays "exprs" as a work-around.

Regards,

Pete


Peter M. Haverty, Ph.D.
Genentech, Inc.
phave...@gene.com

On Wed, Nov 26, 2014 at 7:19 AM, Laurent Gatto  wrote:

>
> On 26 November 2014 14:59, Wolfgang Huber wrote:
>
> > A colleague and I are designing a package for quantitative proteomics
> > data, and we are debating whether to base it on the
> > SummarizedExperiment or the ExpressionSet class.
> >
> > There is no immediate use for the ranges aspect of
> > SummarizedExperiment, so that would have to be carried around with
> > NAs, and this is a parsimony argument for using ExpressionSet
> > instead. OTOH, the interface of SummarizedExperiment is cleaner, its
> > code more modern and more likely to be updated, and users of the
> > Bioconductor project are likely to benefit from having to deal with a
> > single interface that works the same or similarly across packages,
> > rather than a variety of formats; which argues that new packages
> > should converge towards SummarizedExperiment('s interface).
> >
> > Are there any pertinent insights from this group?
>
> Instead of ExpressionSet, you could use MSnbase::MSnSet, which is
> essentially an ExpressionSet for quantitative proteomics (i.e it has a
> MIAPE slot, instead of MIAME for example).
>
> Ideally, a SummarizedExperiment for proteomics would use peptide/protein
> ranges, which is in the pipeline, as far as I am concerned. When that
> becomes available, there should be infrastructure to coerce an MSnSet
> (and/or other relevant data) into a SummarizedExperiment.
>
> Hope this helps.
>
> Best wishes,
>
> Laurent
>
> > Thanks and best wishes
> > Wolfgang
> >
>
> --
> Laurent Gatto
> http://cpu.sysbiol.cam.ac.uk/


Re: [Bioc-devel] Please bump version number when committing changes

2014-09-05 Thread Peter Haverty
Hi All,
Git-svn is a nice workaround for the developer. As a user you don't want to be 
installing from version control in any case. Version control is a means for 
tracking changes, not for distributing software.   Let the CI system protect 
you from needless drama.

Typed with thumbs.

> On Sep 5, 2014, at 5:03 PM, "Ryan C. Thompson"  wrote:
> 
> Hi all,
> 
> Just to throw in a suggestion here, I know that many people use a tool like 
> git-svn in this kind of situation. They want the ability to make multiple 
> small commits in order to save their progress, but they don't want those 
> commits visible until they are ready to push all at once. This allows one to 
> make breaking changes in one commit that are fixed by subsequent commits, 
> because the intermediate states will never be exposed.
> 
> For information on git-svn, see here: 
> http://git-scm.com/book/en/Git-and-Other-Systems-Git-and-Subversion
> 
> Note that I don't personally have any experience with svn or with git-svn, 
> but this seems like exactly the use case for it.
> 
> -Ryan
> 
>> On Fri 05 Sep 2014 04:50:49 PM PDT, Peter Haverty wrote:
>> Hi all,
>> 
>> I respectfully disagree.  One should certainly check in each discrete unit
>> of work.  These will often not result in something that is ready to be used
>> by someone else.  Bumping the version number constitutes a new release and
>> carries the implicit promise that the package works again.  This is why
>> continuous integration systems do a build when the version number changes.
>> 
>> One should expect working software when installing a pre-built package (the
>> tests passed, right?).  Checking out from SVN is for developers of that
>> package and nothing should be assumed about the current state of the code.
>> 
>> To keep everyone happy, one could add a commit hook to our SVN setup that
>> would add the SVN revision number to the version string.  This would be for
>> dev only and hopefully not sufficient to trigger a build.
>> 
>> That's my two cents.  Happy weekend all.
>> 
>> Regards,
>> 
>> 
>> 
>> Pete
>> 
>> 
>> Peter M. Haverty, Ph.D.
>> Genentech, Inc.
>> phave...@gene.com
>> 
>> 
>>> On Fri, Sep 5, 2014 at 4:30 PM, Dan Tenenbaum  wrote:
>>> 
>>> 
>>> 
>>> - Original Message -
>>>> From: "Stephanie M. Gogarten" 
>>>> To: "Dan Tenenbaum" , "bioc-devel" <
>>> bioc-devel@r-project.org>
>>>> Sent: Friday, September 5, 2014 4:27:13 PM
>>>> Subject: Re: [Bioc-devel] Please bump version number when committing
>>> changes
>>>> 
>>>> I am guilty of doing this today, but I have (I think) a good reason.
>>>> I'm making a bunch of changes that are all related to each other, but
>>>> are being implemented and tested in stages.  I'd like to use svn to
>>>> commit when I've made a set of changes that works, so I can roll back
>>>> if
>>>> I break something in the next step, but I'd like the users to see
>>>> them
>>>> all at once as a single version update.  Perhaps others are doing
>>>> something similar?
>>> 
>>> I understand the motivation but this still results in an ambiguous state
>>> if two different people check out your package from svn at different times
>>> today (before and after your changes).
>>> 
>>> Version numbers are cheap, so if version 1.2.3 exists for a day before
>>> version 1.2.4 (which contains all the changes you want to push to your
>>> users) then that's ok, IMO.
>>> 
>>> Including a version bump doesn't impact whether or not you can rollback a
>>> commit with svn.
>>> 
>>> Dan
>>> 
>>> 
>>> 
>>>> Stephanie
>>>> 
>>>>> On 9/4/14, 12:04 PM, Dan Tenenbaum wrote:
>>>>> Hello,
>>>>> 
>>>>> Looking through our svn logs, I see that there are many commits
>>>>> that are not accompanied by version bumps.
>>>>> All svn commits (or, if you are using the git-svn bridge, every
>>>>> group of commits included in a push) should include a version bump
>>>>> (that is, incrementing the "z" segment of the x.y.z version
>>>>> number). This practice is documented at
>>>>> http://www.bioconductor.org/developers/how-to/version-numbering/ .
>>>>> 
>

Re: [Bioc-devel] Please bump version number when committing changes

2014-09-05 Thread Peter Haverty
Hi all,

I respectfully disagree.  One should certainly check in each discrete unit
of work.  These will often not result in something that is ready to be used
by someone else.  Bumping the version number constitutes a new release and
carries the implicit promise that the package works again.  This is why
continuous integration systems do a build when the version number changes.

One should expect working software when installing a pre-built package (the
tests passed, right?).  Checking out from SVN is for developers of that
package and nothing should be assumed about the current state of the code.

To keep everyone happy, one could add a commit hook to our SVN setup that
would add the SVN revision number to the version string.  This would be for
dev only and hopefully not sufficient to trigger a build.
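
To make the commit-hook idea concrete, here is a hedged sketch in base R of
what "adding the SVN revision to the version string" could look like. The
helper dev_version and the revision number are invented for illustration;
this is not how the Bioconductor build system or its SVN hooks actually work.

```r
# Hypothetical helper: append an SVN revision number as a fourth component
# of an x.y.z version string, for development builds only.
dev_version <- function(version, svn_revision) {
  stopifnot(length(version) == 1L, length(svn_revision) == 1L)
  as.character(package_version(paste(version, svn_revision, sep = ".")))
}

v_release <- "1.2.3"
v_dev     <- dev_version(v_release, 95012L)   # revision number is made up

v_dev                                                  # "1.2.3.95012"
package_version(v_dev) > package_version(v_release)    # TRUE
```

Because the extra component sorts after the plain x.y.z version, two
checkouts at different revisions would no longer look identical, without
anyone editing the z segment by hand.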

That's my two cents.  Happy weekend all.

Regards,



Pete


Peter M. Haverty, Ph.D.
Genentech, Inc.
phave...@gene.com


On Fri, Sep 5, 2014 at 4:30 PM, Dan Tenenbaum  wrote:

>
>
> - Original Message -
> > From: "Stephanie M. Gogarten" 
> > To: "Dan Tenenbaum" , "bioc-devel" <
> bioc-devel@r-project.org>
> > Sent: Friday, September 5, 2014 4:27:13 PM
> > Subject: Re: [Bioc-devel] Please bump version number when committing
> changes
> >
> > I am guilty of doing this today, but I have (I think) a good reason.
> > I'm making a bunch of changes that are all related to each other, but
> > are being implemented and tested in stages.  I'd like to use svn to
> > commit when I've made a set of changes that works, so I can roll back
> > if
> > I break something in the next step, but I'd like the users to see
> > them
> > all at once as a single version update.  Perhaps others are doing
> > something similar?
> >
>
> I understand the motivation but this still results in an ambiguous state
> if two different people check out your package from svn at different times
> today (before and after your changes).
>
> Version numbers are cheap, so if version 1.2.3 exists for a day before
> version 1.2.4 (which contains all the changes you want to push to your
> users) then that's ok, IMO.
>
> Including a version bump doesn't impact whether or not you can rollback a
> commit with svn.
>
> Dan
>
>
>
> > Stephanie
> >
> > On 9/4/14, 12:04 PM, Dan Tenenbaum wrote:
> > > Hello,
> > >
> > > Looking through our svn logs, I see that there are many commits
> > > that are not accompanied by version bumps.
> > > All svn commits (or, if you are using the git-svn bridge, every
> > > group of commits included in a push) should include a version bump
> > > (that is, incrementing the "z" segment of the x.y.z version
> > > number). This practice is documented at
> > > http://www.bioconductor.org/developers/how-to/version-numbering/ .
> > >
> > > Failure to bump the version has two consequences:
> > >
> > > 1) Your changes will not propagate to our package repository or web
> > > site, so users installing your package via biocLite() will not
> > > receive the latest changes unless you bump the version.
> > >
> > > 2) Users *can* always get the current files of your package using
> > > Subversion, but if you've made changes without bumping the version
> > > number, it can be difficult to troubleshoot problems. If two
> > > people are looking at what appears to be the same version of a
> > > package, but it's behaving differently, it can be really
> > > frustrating to realize that the packages actually differ (but not
> > > by version number).
> > >
> > > So if you're not already, please get in the habit of bumping the
> > > version number with each set of changes you commit.
> > >
> > > Let us know on bioc-devel if you have any questions about this.
> > >
> > > Thanks,
> > > Dan


Re: [Bioc-devel] viewMedians

2014-06-03 Thread Peter Haverty
rangeColMeans (or rangeMeans) takes a vector, but can recycle over columns
in a matrix. I guess we could have rangeMeans for vector-ish things,
rangeColMeans and rangeRowMeans for two-d things.

I have rangeMeans for Rles and RleDataFrame, which does each Rle. I'm
flexible on naming.
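
To make the naming question concrete, here is a minimal Python sketch of the three shapes being discussed. It is not the genoset code: the function names simply mirror the proposed rangeMeans / rangeColMeans / rangeRowMeans, ranges are assumed to be 1-based and closed, and prefix sums are used only because they keep the example short.

```python
from itertools import accumulate


def range_means(x, starts, ends):
    """Mean of a plain vector x over each 1-based closed range [start, end]."""
    # prefix[i] holds the sum of the first i elements of x
    prefix = [0.0] + list(accumulate(x))
    return [(prefix[e] - prefix[s - 1]) / (e - s + 1)
            for s, e in zip(starts, ends)]


def range_col_means(matrix, starts, ends):
    """Ranges index rows; the same ranges are reused for every column.

    Returns one row per range and one column per matrix column.
    """
    columns = zip(*matrix)  # matrix is a list of rows
    per_column = [range_means(col, starts, ends) for col in columns]
    return [list(row) for row in zip(*per_column)]


def range_row_means(matrix, starts, ends):
    """Ranges index columns; returns one list of means per matrix row."""
    return [range_means(row, starts, ends) for row in matrix]
```

With this layout the vector version does the work and the two matrix versions only decide which dimension the ranges run along, which is one way the methods could share code.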

Pete


Peter M. Haverty, Ph.D.
Genentech, Inc.
phave...@gene.com


On Mon, Jun 2, 2014 at 6:50 PM, Michael Lawrence wrote:

> So rangeMeans,matrix implies rangeColMeans? Honestly, I would just have a
> rangeColMeans and rangeRowMeans, which is consistent with the existing
> row/colMeans. Don't see a good reason to prefer columns over rows.
>
>
> On Mon, Jun 2, 2014 at 5:34 PM, Peter Haverty wrote:
>
>> I have rangeColMeans, which is essentially rangeMeans for
>> vector/matrix. I renamed this to make it a method on rangeMeans. I think it
>> would be great to have methods for all the commonly used types.  We should
>> put some thought into how these would all share as much code as is
>> practical.
>>
>> Pete
>>
>> 
>> Peter M. Haverty, Ph.D.
>> Genentech, Inc.
>> phave...@gene.com
>>
>>
>> On Mon, Jun 2, 2014 at 1:24 PM, Michael Lawrence <
>> lawrence.mich...@gene.com> wrote:
>>
>>> While we rework things, what about adding support for atomic vectors, in
>>> addition to Rles? Also, what about functions that are optimized for
>>> partitionings? Those would be easy to write and would let us greatly
>>> accelerate e.g. sum,CompressedIntegerList. Right now we rely on rowsum()
>>> which is fast but could be much faster.
>>>
>>> Michael
>>>
>>>
>>>
>>> On Mon, Jun 2, 2014 at 10:48 AM, Hervé Pagès  wrote:
>>>
>>>> Hi Peter,
>>>>
>>>> Seems like you have a pretty good implementation of the view* functions
>>>> in genoset. Nice work! And great to hear that there is so much room for
>>>> improvements to the implementation currently in IRanges. I'll try to
>>>> give this a shot soon but first I want to move Rle's to the S4Vectors
>>>> package.
>>>>
>>>> Cheers,
>>>> H.
>>>>
>>>>
>>>>
>>>> On 06/01/2014 07:58 PM, Peter Haverty wrote:
>>>>
>>>>> I think viewMedians would be great.  While you have the hood up, there
>>>>> are some opportunities for some speedups and code simplification, I
>>>>> believe.
>>>>>
>>>>> I did some experimentation with view* in the genoset package. I made an
>>>>> alternate version of the C for viewMeans and found about a 10X speedup.
>>>>> I hoisted the branching for the different types and did the NA handling
>>>>> with arithmetic rather than branching. The search for the Rle runs
>>>>> covered by each view is now done with findInterval.  There are quite a
>>>>> few code sections that differ only in the type of the NA value and the
>>>>> pointers to the input/output vectors. I think it would be worth
>>>>> considering C++ templates.
>>>>>
>>>>> On the R side, each view* function is pretty similar too. In
>>>>> genoset/R/RleDataFrame-views.R I tried to factor out all the shared
>>>>> pieces.
>>>>>
>>>>> While we're on the topic, I think the view* functions should have
>>>>> range* equivalents that skip the View object and work on an Rle and an
>>>>> IRanges. If you already have a Views object around, view* are perfect.
>>>>> Otherwise, making the Views objects uses time that could be saved.
>>>>>
>>>>> Overall I found about a 90X speedup over viewMeans(RleViewsList).
>>>>>
>>>>> I hope there is some useful food for thought in these experiments. I
>>>>> have a vignette that shows some of the timings if anyone is interested.
>>>>>
>>>>> Regards,
>>>>> Pete
>>>>>
>>>>> 
>>>>> Peter M. Haverty, Ph.D.
>>>>> Genentech, Inc.
>>>>> phave...@gene.com
>>>>>
>>>>>
>>>> --
>>>> Hervé Pagès
>>>>
>>>> Program in Computational Biology
>>>> Division of Public Health Sciences
>>>> Fred Hutchinson Cancer Research Center
>>>> 1100 Fairview Ave. N, M1-B514
>>>> P.O. Box 19024
>>>> Seattle, WA 98109-1024
>>>>
>>>> E-mail: hpa...@fhcrc.org
>>>> Phone:  (206) 667-5791
>>>> Fax:    (206) 667-1319
>>>>
>>>>
>>>
>>>
>>
>


Re: [Bioc-devel] viewMedians

2014-06-02 Thread Peter Haverty
I have rangeColMeans, which is essentially rangeMeans for
vector/matrix. I renamed this to make it a method on rangeMeans. I think it
would be great to have methods for all the commonly used types.  We should
put some thought into how these would all share as much code as is
practical.

Pete


Peter M. Haverty, Ph.D.
Genentech, Inc.
phave...@gene.com


On Mon, Jun 2, 2014 at 1:24 PM, Michael Lawrence wrote:

> While we rework things, what about adding support for atomic vectors, in
> addition to Rles? Also, what about functions that are optimized for
> partitionings? Those would be easy to write and would let us greatly
> accelerate e.g. sum,CompressedIntegerList. Right now we rely on rowsum()
> which is fast but could be much faster.
>
> Michael
>
>
>
> On Mon, Jun 2, 2014 at 10:48 AM, Hervé Pagès  wrote:
>
>> Hi Peter,
>>
>> Seems like you have a pretty good implementation of the view* functions
>> in genoset. Nice work! And great to hear that there is so much room for
>> improvements to the implementation currently in IRanges. I'll try to
>> give this a shot soon but first I want to move Rle's to the S4Vectors
>> package.
>>
>> Cheers,
>> H.
>>
>>
>>
>> On 06/01/2014 07:58 PM, Peter Haverty wrote:
>>
>>> I think viewMedians would be great.  While you have the hood up, there
>>> are some opportunities for some speedups and code simplification, I
>>> believe.
>>>
>>> I did some experimentation with view* in the genoset package. I made an
>>> alternate version of the C for viewMeans and found about a 10X speedup.
>>> I hoisted the branching for the different types and did the NA handling
>>> with arithmetic rather than branching. The search for the Rle runs
>>> covered by each view is now done with findInterval.  There are quite a
>>> few code sections that differ only in the type of the NA value and the
>>> pointers to the input/output vectors. I think it would be worth
>>> considering C++ templates.
>>>
>>> On the R side, each view* function is pretty similar too. In
>>> genoset/R/RleDataFrame-views.R I tried to factor out all the shared
>>> pieces.
>>>
>>> While we're on the topic, I think the view* functions should have range*
>>> equivalents that skip the View object and work on an Rle and an IRanges.
>>> If you already have a Views object around, view* are perfect. Otherwise,
>>> making the Views objects uses time that could be saved.
>>>
>>> Overall I found about a 90X speedup over viewMeans(RleViewsList).
>>>
>>> I hope there is some useful food for thought in these experiments. I
>>> have a vignette that shows some of the timings if anyone is interested.
>>>
>>> Regards,
>>> Pete
>>>
>>> 
>>> Peter M. Haverty, Ph.D.
>>> Genentech, Inc.
>>> phave...@gene.com
>>>
>>>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpa...@fhcrc.org
>> Phone:  (206) 667-5791
>> Fax:    (206) 667-1319
>>
>>
>
>


Re: [Bioc-devel] viewMedians

2014-06-01 Thread Peter Haverty
I think viewMedians would be great.  While you have the hood up, there are
some opportunities for speedups and code simplification, I believe.

I did some experimentation with view* in the genoset package. I made an
alternate version of the C code for viewMeans and found about a 10X speedup.  I
hoisted the branching for the different types and did the NA handling with
arithmetic rather than branching. The search for the Rle runs covered by
each view is now done with findInterval.  There are quite a few code
sections that differ only in the type of the NA value and the pointers to
the input/output vectors. I think it would be worth considering C++
templates.
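
A rough illustration of the run lookup described above, written in Python rather than the C used in genoset. The sketch assumes a run-length encoded vector given as parallel lists of values and lengths, 1-based closed ranges that lie inside the vector, and NaN standing in for NA. The name rle_range_means and its arguments are made up for this example, and bisect_left plays the role that findInterval plays in the text.

```python
from bisect import bisect_left
from itertools import accumulate


def rle_range_means(run_values, run_lengths, starts, ends):
    """Mean over each range of an RLE vector, ignoring NaN elements."""
    # run_ends[i] is the 1-based position of the last element of run i
    run_ends = list(accumulate(run_lengths))
    run_starts = [e - n + 1 for e, n in zip(run_ends, run_lengths)]
    means = []
    for start, end in zip(starts, ends):
        # Binary search for the runs containing the first and last position
        first = bisect_left(run_ends, start)
        last = bisect_left(run_ends, end)
        total = 0.0
        count = 0
        for i in range(first, last + 1):
            # Number of elements of run i that fall inside [start, end]
            overlap = min(end, run_ends[i]) - max(start, run_starts[i]) + 1
            value = run_values[i]
            keep = int(value == value)  # 0 for NaN, 1 otherwise
            # NaN runs get weight zero in both the sum and the count
            total += (value if keep else 0.0) * overlap
            count += keep * overlap
        means.append(total / count if count else float("nan"))
    return means
```

Each range costs two binary searches plus a pass over only the runs it touches, so the expanded vector is never materialized. This says nothing about where the reported 10X actually came from; it only shows the shape of the lookup.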

On the R side, each view* function is pretty similar too. In
genoset/R/RleDataFrame-views.R I tried to factor out all the shared pieces.

While we're on the topic, I think the view* functions should have range*
equivalents that skip the Views object and work directly on an Rle and an
IRanges. If you already have a Views object around, view* are perfect.
Otherwise, constructing the Views object costs time that could be saved.

Overall I found about a 90X speedup over viewMeans(RleViewsList).

I hope there is some useful food for thought in these experiments. I have a
vignette that shows some of the timings if anyone is interested.

Regards,
Pete


Peter M. Haverty, Ph.D.
Genentech, Inc.
phave...@gene.com


Re: [Bioc-devel] SummarizedExperiment

2014-03-25 Thread Peter Haverty
Also, keeping just the colnames would be sufficient to put a DataFrame in
SE's assays.  DataFrames need to have colnames, but can have NULL rownames
(right?).
BTW, BigMatrix also has the argument "withDimnames" for subsetting. Adding
dimnames to (and copying) a huge vector takes as much time as pulling
that huge vector from an mmapped file, so I made it optional there too.

Pete


Peter M. Haverty, Ph.D.
Genentech, Inc.
phave...@gene.com


On Tue, Mar 25, 2014 at 9:42 AM, Tim Triche, Jr. wrote:

> If it makes genosets coercible into SEs then I'm all for the change and
> its permanence
>
> --t
>
> > On Mar 25, 2014, at 9:31 AM, Peter Haverty wrote:
> >
> > One benefit of having dimnames on assays would be that one could use
> > DataFrames as assays, like in eSet.  My genoset class is becoming more
> > and more like SummarizedExperiment. The dimname issues prevent me from
> > switching entirely from eSet to SummarizedExperiment.
> >
> > I think that keeping only one copy of dimnames is a great feature, if a
> > bit dangerous.  My typical object has ~6 BigMatrix and/or DataFrame of
> > Rle objects as assays, so the rownames actually make up a considerable
> > portion of the object size.  (My typical dataset is 2.5M rows by 1k
> > samples). I've been moving towards keeping a single dimnames copy just
> > to improve RData load times.
> >
> > I think that assays should be required to have dimnames when they are
> > added to a SummarizedExperiment. These dimnames should be checked for
> > equality with the dimnames of the SE in the setter function.
> >
> > Perhaps with the recent (R 3.1) improvements in shallow/lazy copying and
> > reference counting, adding dimnames to outgoing assays will be less of a
> > performance hit.
> >
> > I also like the compromise I have seen elsewhere, where the colnames are
> > always retained on assays, but only one rownames copy is kept.  Colnames
> > are typically small and getting them wrong often makes for silent, but
> > catastrophic errors.
> >
> > Pete
> >
> > 
> > Peter M. Haverty, Ph.D.
> > Genentech, Inc.
> > phave...@gene.com
> >
>


Re: [Bioc-devel] SummarizedExperiment

2014-03-25 Thread Peter Haverty
One benefit of having dimnames on assays would be that one could use
DataFrames as assays, like in eSet.  My genoset class is becoming more and
more like SummarizedExperiment. The dimname issues prevent me from
switching entirely from eSet to SummarizedExperiment.

I think that keeping only one copy of dimnames is a great feature, if a bit
dangerous.  My typical object has ~6 BigMatrix and/or DataFrame of Rle
objects as assays, so the rownames actually make up a considerable portion
of the object size.  (My typical dataset is 2.5M rows by 1k samples). I've
been moving towards keeping a single dimnames copy just to improve RData
load times.

I think that assays should be required to have dimnames when they are added
to a SummarizedExperiment. These dimnames should be checked for equality
with the dimnames of the SE in the setter function.

Perhaps with the recent (R 3.1) improvements in shallow/lazy copying and
reference counting, adding dimnames to outgoing assays will be less of a
performance hit.

I also like the compromise I have seen elsewhere, where the colnames are
always retained on assays, but only one rownames copy is kept.  Colnames
are typically small, and getting them wrong often makes for silent but
catastrophic errors.
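
The setter check and single-copy storage proposed above can be sketched in a few lines of Python. This is only an illustration of the idea, not SummarizedExperiment's implementation: the class SingleNamesExperiment, its methods, and the with_dimnames flag (modelled on the "withDimnames" argument mentioned elsewhere in this thread) are all hypothetical.

```python
class SingleNamesExperiment:
    """Container that keeps one copy of the dimnames for all assays."""

    def __init__(self, rownames, colnames):
        self.rownames = list(rownames)
        self.colnames = list(colnames)
        self._assays = {}  # assay name -> bare list of rows, no names stored

    def set_assay(self, name, values, rownames, colnames):
        # Incoming assays must carry dimnames that match the container's
        if list(rownames) != self.rownames:
            raise ValueError("assay rownames do not match the experiment")
        if list(colnames) != self.colnames:
            raise ValueError("assay colnames do not match the experiment")
        rows = [list(row) for row in values]
        if len(rows) != len(self.rownames) or any(
                len(row) != len(self.colnames) for row in rows):
            raise ValueError("assay shape does not match its dimnames")
        # Names have been checked, so only the bare values are kept
        self._assays[name] = rows

    def assay(self, name, with_dimnames=True):
        rows = self._assays[name]
        if not with_dimnames:
            return rows  # cheap path: no names attached, no copy made
        # Attach the single shared copy of the names on the way out
        return {r: dict(zip(self.colnames, row))
                for r, row in zip(self.rownames, rows)}
```

Swapped column names are rejected at insertion time instead of surfacing later as a silent mix-up, and callers who only need the values can skip the cost of re-attaching names.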

Pete


Peter M. Haverty, Ph.D.
Genentech, Inc.
phave...@gene.com
