Re: [Bioc-devel] Compatibility of Bioconductor with tidyverse S3 classes/methods

2020-02-08 Thread stefano
Thanks again.

*To Martin.*

Got the point and I agree. I will do my best.

*To Vincent. *

I see. At the moment ttBulk is an API for users, not yet for developers.
But I can already imagine an API framework where you feed a new custom
function (say, a UMAP dimensionality reduction) to a wrapper that validates
it and integrates its output back into the original input, ensuring
endomorphic properties: the output has the same properties as the input.
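A minimal sketch of such a wrapper in base R (all names here are illustrative, not part of the ttBulk API):

```r
## Hypothetical sketch: a wrapper that turns any per-sample computation
## into an "endomorphic" operator -- the output keeps the class and the
## columns of the input.  None of these names are the real ttBulk API.

validate_tt <- function(x) {
  stopifnot(is.data.frame(x), "sample" %in% names(x))
  invisible(x)
}

make_operator <- function(fun, validate = validate_tt) {
  function(.data, ...) {
    validate(.data)                       # check requirements up front
    new_info <- fun(.data, ...)           # e.g. a dimensionality reduction
    out <- merge(.data, new_info, by = "sample", all.x = TRUE)
    class(out) <- class(.data)            # preserve the input class
    validate(out)                         # output satisfies the same contract
    out
  }
}

## toy "reduction": one summary value per sample
toy_reduce <- function(x) {
  data.frame(sample = sort(unique(x$sample)),
             dim1 = as.numeric(tapply(x$count, x$sample, mean)))
}

counts <- data.frame(sample = c("a", "a", "b"), count = c(1, 3, 5))
res <- make_operator(toy_reduce)(counts)
```

The key property is that `res` carries all the columns of `counts` plus the new information, so further pipeline steps see the same kind of object they started from.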

In order to do this, the definition of a ttBulk tibble and its
requirements/validation will have to be a little better established, based
on community feedback. The transition will then be pretty easy. When that
happens, I would be interested in having some feedback from you!

Best wishes.

*Stefano *



Stefano Mangiola | Postdoctoral fellow

Papenfuss Laboratory

The Walter Eliza Hall Institute of Medical Research

+61 (0)466452544


On Sun, 9 Feb 2020 at 03:08, Martin Morgan <
mtmorgan.b...@gmail.com> wrote:

> The first thing is that most contributed packages end up being accepted,
> so the discussion here should be considered as (strong) advice, rather than
> requirement. The advice is partly offered to maximize the success of
> contributed packages in the Bioconductor ecosystem, but at the end of the
> day the success of your package depends on the value it adds to the users
> who find it. Vince offered some pretty high enthusiasm, which is a good
> sign!
>
>
>
> I used ‘primarily’ mostly to encourage a more careful implementation of
> support for SE – it’s easy to say ‘yes, my package interoperates with SE’,
> but much more challenging to demonstrate through evaluated code that it
> actually does!
>
>
>
> Cynically but with empirical experience and not a reflection of your own
> commitment, I’ve learned that the promise of ‘future’ integration is seldom
> realized – package submission is often the last time that the community can
> directly influence package implementation and development. It would be
> interesting to develop review processes that continuously assessed package
> quality and utility.
>
>
>
> Martin
>
>
>
>
>
> *From: *stefano 
> *Date: *Friday, February 7, 2020 at 6:39 PM
> *To: *Vincent Carey 
> *Cc: *Martin Morgan , Michael Lawrence <
> lawrence.mich...@gene.com>, "bioc-devel@r-project.org" <
> bioc-devel@r-project.org>
> *Subject: *Re: [Bioc-devel] Compatibility of Bioconductor with tidyverse
> S3 classes/methods
>
>
>
> Thanks Guys for the discussion (I am learning a lot),
>
>
>
> *To Martin:*
>
>
>
> Thanks for the tips. I will start to implement those S4 style methods
> https://github.com/stemangiola/ttBulk/issues/7
>
>
>
> I would *really* like to be part of the Bioconductor community with this
> package, if only this
>
>
>
> > " One would expect the vignette and examples to primarily emphasize the
> use of the interoperable (SummarizedExperiment) version. "
>
>
>
> Could become this
>
>
>
> > One would expect the vignette and examples to emphasize the use of the
> interoperable (SummarizedExperiment) version.
>
>
>
> I agree with the integration priority of Bioconductor, but this repository
> (and this philosophy) is more than its data structures. There should be
> space for more than one approach to doing things, given that the principles
> are respected.
>
>
>
> If this is true, I could really invest energy in using methods as you
> suggested and implementing the SummarizedExperiment stream. And with the
> tips of the community the link will become stronger and stronger over time
> and versions.
>
>
>
>
>
> *To Vincent*
>
>
>
> Thanks a lot for the interest.
>
>
>
> *> One thing I feel is missing is an approach to the following question:
> [..] How do I make one that works the way ttBulk's operators work?*
>
>
>
> I'm afraid I don't really understand the question. Are you wondering about
> extension of the framework? Or creating a similar framework for other
> applications? Could you please reformulate, maybe giving a concrete
> example?
>
>
>
> *> Are there patterns there that are preserved across different operators?
> *
>
>
>
> A commonality is the use of code for integrating the new calculated
> information (dplyr), validation functions, etc.
>
>
>
> *> Can they be factored out to improve maintainability?*
>
>
>
> Almost surely yes. This is the first version; I hope to see enough
> interest, improve the API based on feedback, and attract (intellectual and
> practical) contributions from experts in software engineering.
>
>
>
> *> validObject *
>
>
>
> It seems a good method, and as far as I have tested it works for S3
> objects as well. I will try to implement it; in fact I have already added
> it as an issue on GitHub: https://github.com/stemangiola/ttBulk/issues/6
>
>
>
> At the moment I have a custom validation function
>
>
>
> Best wishes.
>
> *Stefano *
>
>
>
> Stefano Mangiola | Postdoctoral fellow
>
> Papenfuss Laboratory
>
> The Walter Eliza Hall Institute of Medical Research
>
> +61 (0)466452544
>
>
>
>
>
> Il gi

Re: [Bioc-devel] how to trace 'Matrix' as package dependency for 'GenomicScores'

2020-02-08 Thread Martin Morgan
I find it quite interesting to identify formal strategies for removing 
dependencies, but also a little outside my domain of expertise. This code

library(tools)
library(dplyr)

## 'db' is the package database built earlier in the thread, e.g.
## db <- available.packages(repos = BiocManager::repositories())

## non-base packages the user requires for GenomicScores
deps <- package_dependencies("GenomicScores", db, recursive=TRUE)[[1]]
deps <- intersect(deps, rownames(db))

## only need the 'universe' of GenomicScores dependencies
db1 <- db[c("GenomicScores", deps),]

## sub-graph of packages between each dependency and GenomicScores
revdeps <- package_dependencies(deps, db1, recursive = TRUE, reverse = TRUE)

tibble(
    package = names(revdeps),
    n_remove = lengths(revdeps)
) %>%
    arrange(n_remove)

produces a tibble

# A tibble: 106 x 2
   package           n_remove
   <chr>                <int>
 1 BSgenome                 1
 2 AnnotationHub            1
 3 shinyjs                  1
 4 DT                       1
 5 shinycustomloader        1
 6 data.table               1
 7 shinythemes              1
 8 rtracklayer              2
 9 BiocFileCache            2
10 BiocManager              2
# … with 96 more rows

The n_remove column shows me that I can remove the dependency on AnnotationHub 
by removing just that one package (AnnotationHub itself!), but that to remove 
BiocFileCache I'd also have to remove another package (AnnotationHub, I'd 
guess). So this provides some measure of the ease with which a dependency can 
be removed.

I'd like a 'benefit' column, too -- if I were to remove AnnotationHub, how many 
additional packages would I also be able to remove, because they are present 
only to satisfy the dependency on AnnotationHub? More generally, perhaps there 
is a dependency of AnnotationHub that is only used by AnnotationHub and 
BSgenome. So removing AnnotationHub as a dependency would make it easier to 
remove BSgenome, etc. I guess this is a graph optimization problem.
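A rough sketch of that 'benefit' computation, using a toy edge list in place of the real tools::package_dependencies() output (the package names below are made up for illustration):

```r
## For each dependency d, the 'benefit' of removing d is the number of
## packages that drop out of the recursive dependency closure when all
## edges into d are deleted.  Toy edge list; not real package data.

deps <- list(
  root = c("A", "B"),
  A    = "C",
  B    = c("C", "D"),
  C    = character(),
  D    = character()
)

## breadth-first traversal of the dependency graph
reachable <- function(from, edges) {
  seen <- character()
  todo <- from
  while (length(todo)) {
    cur <- todo[[1]]; todo <- todo[-1]
    if (cur %in% seen) next
    seen <- c(seen, cur)
    todo <- c(todo, edges[[cur]])
  }
  setdiff(seen, from)
}

closure <- reachable("root", deps)        # full dependency closure
benefit <- vapply(closure, function(d) {
  pruned <- lapply(deps, setdiff, y = d)  # delete all edges into d
  still  <- reachable("root", pruned)
  length(setdiff(closure, c(still, d)))   # packages freed besides d itself
}, integer(1))
```

In this toy graph, removing "B" frees "D" as well (benefit 1), while removing "A" frees nothing extra because "C" is still required via "B".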

Probably also worth mentioning the itdepends package 
(https://github.com/r-lib/itdepends), which I think tries primarily to 
determine the relationship between package dependencies and lines of code, 
which seems like complementary information.

Martin

On 2/6/20, 12:29 PM, "Robert Castelo"  wrote:

true, i was just searching for the shortest path. we can search for all 
simple paths (i.e., without repeating vertices), and there are five routes 
from "GenomicScores" to "Matrix":

igraph::all_simple_paths(igraph::igraph.from.graphNEL(g),
from="GenomicScores", to="Matrix", mode="out")
[[1]]
+ 7/117 vertices, named, from 04133ec:
[1] GenomicScores        BSgenome             rtracklayer
[4] GenomicAlignments    SummarizedExperiment DelayedArray
[7] Matrix

[[2]]
+ 6/117 vertices, named, from 04133ec:
[1] GenomicScores        BSgenome             rtracklayer
[4] GenomicAlignments    SummarizedExperiment Matrix

[[3]]
+ 6/117 vertices, named, from 04133ec:
[1] GenomicScores DT            crosstalk     ggplot2       mgcv
[6] Matrix

[[4]]
+ 6/117 vertices, named, from 04133ec:
[1] GenomicScores        rtracklayer          GenomicAlignments
[4] SummarizedExperiment DelayedArray         Matrix

[[5]]
+ 5/117 vertices, named, from 04133ec:
[1] GenomicScores        rtracklayer          GenomicAlignments
[4] SummarizedExperiment Matrix

this is interesting, because it means that if i wanted to get rid of the 
"Matrix" dependence i'd need to get rid not only of the "rtracklayer" 
dependence but also of "BSgenome" and "DT".
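That conclusion can also be read off the path list programmatically: the second vertex of every simple path is a direct dependency that would have to be cut. A base-R sketch, transcribing the paths printed above:

```r
## simple paths from GenomicScores to Matrix, transcribed from the
## igraph output above
paths <- list(
  c("GenomicScores", "BSgenome", "rtracklayer", "GenomicAlignments",
    "SummarizedExperiment", "DelayedArray", "Matrix"),
  c("GenomicScores", "BSgenome", "rtracklayer", "GenomicAlignments",
    "SummarizedExperiment", "Matrix"),
  c("GenomicScores", "DT", "crosstalk", "ggplot2", "mgcv", "Matrix"),
  c("GenomicScores", "rtracklayer", "GenomicAlignments",
    "SummarizedExperiment", "DelayedArray", "Matrix"),
  c("GenomicScores", "rtracklayer", "GenomicAlignments",
    "SummarizedExperiment", "Matrix")
)

## direct dependencies that must all be cut to sever every route to Matrix
cut_set <- sort(unique(vapply(paths, `[[`, character(1), 2)))
```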

robert.


On 2/6/20 5:41 PM, Martin Morgan wrote:
> Excellent! I think there are other, independent, paths between your 
immediate dependents...
> 
> RBGL::sp.between(g, start="DT", finish="Matrix",
>                  detail=TRUE)[[1]]$path_detail
> [1] "DT"        "crosstalk" "ggplot2"   "mgcv"      "Matrix"
> 
> ??
> 
> Martin
> 
> On 2/6/20, 10:47 AM, "Robert Castelo"  wrote:
> 
>  hi Martin,
>  
>  thanks for hint!! i wasn't aware of 'tools::package_dependencies()',
>  adding a bit of graph sorcery i get the result i was looking for:
>  
>  repos <- BiocManager::repositories()[c(1,5)]
>  repos
>                                       BioCsoft
>  "https://bioconductor.org/packages/3.11/bioc"
>                                           CRAN
>                     "https://cran.rstudio.com"
>  
>  db <- available.packages(repos=repos)
>  
>  deps <- tools::package_dependencies("GenomicScores", db,
>  recursive=TRUE)[[1]]
>  
>  deps <- tools::package_dependencies(c("GenomicScores", deps), db)
>  
>  g <- graph::graphNEL(nodes=names(deps), edgeL=deps,
>                       edgemode="directed")
>  
>  RBGL::sp.between(g, start="GenomicScores", finish="Matrix",
>  detail=TRUE)[[1]]$path_detail
>  [1] "

Re: [Bioc-devel] Compatibility of Bioconductor with tidyverse S3 classes/methods

2020-02-08 Thread Martin Morgan
The first thing is that most contributed packages end up being accepted, so the 
discussion here should be considered as (strong) advice, rather than 
requirement. The advice is partly offered to maximize the success of 
contributed packages in the Bioconductor ecosystem, but at the end of the day 
the success of your package depends on the value it adds to the users who find 
it. Vince offered some pretty high enthusiasm, which is a good sign!

I used ‘primarily’ mostly to encourage a more careful implementation of support 
for SE – it’s easy to say ‘yes, my package interoperates with SE’, but much 
more challenging to demonstrate through evaluated code that it actually does!

Cynically but with empirical experience and not a reflection of your own 
commitment, I’ve learned that the promise of ‘future’ integration is seldom 
realized – package submission is often the last time that the community can 
directly influence package implementation and development. It would be 
interesting to develop review processes that continuously assessed package 
quality and utility.

Martin


From: stefano 
Date: Friday, February 7, 2020 at 6:39 PM
To: Vincent Carey 
Cc: Martin Morgan , Michael Lawrence 
, "bioc-devel@r-project.org" 

Subject: Re: [Bioc-devel] Compatibility of Bioconductor with tidyverse S3 
classes/methods

Thanks Guys for the discussion (I am learning a lot),

To Martin:

Thanks for the tips. I will start to implement those S4 style methods 
https://github.com/stemangiola/ttBulk/issues/7

I would really like to be part of the Bioconductor community with this package, 
if only this

> " One would expect the vignette and examples to primarily emphasize the use 
> of the interoperable (SummarizedExperiment) version. "

Could become this

> One would expect the vignette and examples to emphasize the use of the 
> interoperable (SummarizedExperiment) version.

I agree with the integration priority of Bioconductor, but this repository (and 
this philosophy) is more than its data structures. There should be space for 
more than one approach to doing things, given that the principles are respected.

If this is true, I could really invest energy in using methods as you suggested 
and implementing the SummarizedExperiment stream. And with the tips of the 
community the link will become stronger and stronger over time and versions.


To Vincent

Thanks a lot for the interest.

> One thing I feel is missing is an approach to the following question: [..] 
> How do I make one that works the way ttBulk's operators work?

I'm afraid I don't really understand the question. Are you wondering about 
extension of the framework? Or creating a similar framework for other 
applications? Could you please reformulate, maybe giving a concrete example?

> Are there patterns there that are preserved across different operators?

A commonality is the use of code for integrating the new calculated information 
(dplyr), validation functions, etc.

> Can they be factored out to improve maintainability?

Almost surely yes. This is the first version; I hope to see enough interest, 
improve the API based on feedback, and attract (intellectual and practical) 
contributions from experts in software engineering.

> validObject

It seems a good method, and as far as I have tested it works for S3 objects as 
well. I will try to implement it; in fact I have already added it as an issue 
on GitHub: https://github.com/stemangiola/ttBulk/issues/6

At the moment I have a custom validation function
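For concreteness, a sketch of both routes (class and column names here are illustrative only, not the ttBulk definitions): a plain S3 validator function, and the S4 setClass()/validity route where validObject() genuinely applies.

```r
library(methods)

## S3 route: a plain validator for a tibble-like object
valid_tt <- function(x) {
  required <- c("sample", "transcript", "count")
  missing  <- setdiff(required, names(x))
  if (length(missing))
    stop("missing required columns: ", paste(missing, collapse = ", "))
  invisible(x)
}

## S4 route: validObject() calls the validity method automatically
setClass("TTData", slots = c(data = "data.frame"),
  validity = function(object) {
    if (!all(c("sample", "count") %in% names(object@data)))
      "@data must contain 'sample' and 'count' columns"
    else TRUE
  })

d   <- data.frame(sample = "a", transcript = "tx1", count = 10)
obj <- new("TTData", data = d)   # new() triggers the validity check
```

With the S4 route, both new() and validObject() enforce the contract; for a bare S3 tibble the custom function has to be called explicitly at the boundaries of each operator.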

Best wishes.
Stefano

Stefano Mangiola | Postdoctoral fellow
Papenfuss Laboratory
The Walter Eliza Hall Institute of Medical Research
+61 (0)466452544


On Sat, 8 Feb 2020 at 01:54, Vincent Carey 
<st...@channing.harvard.edu> wrote:
This is an interesting discussion and I hope it is ok to continue it a bit.  I 
found the
readme for the ttBulk repo extremely enticing and I am sure many people will 
want to
explore this way of working with genomic data.  I have only a few moments to 
explore
it and did not read the vignette, but it looks to me as if it is mostly 
recapitulated in the
README, which is an excellent overview.

One thing I feel is missing is an approach to the following question: I like the
idea of a pipe-oriented operator for programming steps in genomic workflows.
How do I make one that works the way ttBulk's operators work?  Well, I can
have a look at ttBulk:::reduce_dimensions.ttBulk ...


It's involved.  Are there patterns there that
are preserved across different operators?  Can
they be factored out to improve maintainability?


One other point before I run


It seems to me the operators "require" that certain
fields be defined in their tibble operands.



> names(attributes(counts))

[1] "names"  "class"  "row.names"  "parameters"

> attributes(counts)$names

[1] "sample" "transcript" "Cell type"

[4] "count"  "time"   "condition"

[7] "batch"  "factor_of_interest"

> validObjec

Re: [Bioc-devel] Compatibility of Bioconductor with tidyverse S3 classes/methods

2020-02-08 Thread Vincent Carey
On Fri, Feb 7, 2020 at 6:39 PM stefano  wrote:

> Thanks Guys for the discussion (I am learning a lot),
>
> *To Martin:*
>
> Thanks for the tips. I will start to implement those S4 style methods
> https://github.com/stemangiola/ttBulk/issues/7
>
> I would *really* like to be part of the Bioconductor community with this
> package, if only this
>
> > " One would expect the vignette and examples to primarily emphasize the
> use of the interoperable (SummarizedExperiment) version. "
>
> Could become this
>
> > One would expect the vignette and examples to emphasize the use of the
> interoperable (SummarizedExperiment) version.
>
> I agree with the integration priority of Bioconductor, but this repository
> (and this philosophy) is more than its data structures. There should be
> space for more than one approach to doing things, given that the principles
> are respected.
>
> If this is true, I could really invest energy in using methods as you
> suggested and implementing the SummarizedExperiment stream. And with the
> tips of the community the link will become stronger and stronger over time
> and versions.
>
>
> *To Vincent*
>
> Thanks a lot for the interest.
>
> *> One thing I feel is missing is an approach to the following question:
> [..] How do I make one that works the way ttBulk's operators work?*
>
> I'm afraid I don't really understand the question. Are you wondering about
> extension of the framework? Or creating a similar framework for other
> applications? Could you please reformulate, maybe giving a concrete
> example?
>

We can take further discussion to the issues on the github repo but I will
briefly respond here.  Consider reduce_dimensions.
You give a small number of method options here -- PCA, MDS, tSNE.  The MDS
option makes its way to stats::cmdscale via limma::plotMDS;
the PCA option uses prcomp.  For any number of reasons, users may want to
select alternate dimension reduction procedures or
tune them in ways not passed up through your interface.  This might involve
modifications to your code to introduce changes, or
one could imagine a protocol for "dropping in" a new operator for ttBulk
pipelines.  My question is to understand how this level
of flexibility might be achieved.
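One possible shape for such a drop-in protocol, sketched with made-up names (this is not the ttBulk API): a registry of reduction methods sharing a common signature, so a new method can be contributed without modifying package code.

```r
## registry of dimensionality-reduction methods; all names illustrative
.reducers <- new.env(parent = emptyenv())

register_reducer <- function(name, fun)
  assign(name, fun, envir = .reducers)

reduce_dimensions <- function(.data, method, ...) {
  if (!exists(method, envir = .reducers, inherits = FALSE))
    stop("unknown reduction method: ", method)
  get(method, envir = .reducers)(.data, ...)
}

## built-in option: base-R PCA via prcomp()
register_reducer("pca", function(.data, n = 2, ...)
  prcomp(.data, rank. = n)$x)

## a user could later call register_reducer("umap", ...) to plug in
## their own procedure without touching the package internals

scores <- reduce_dimensions(scale(USArrests), "pca")
```

The registered functions would of course also need to satisfy the wrapper/validator contract discussed earlier in the thread so that their output integrates back into the input object.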

An example of an R package that pursues this is mlr3, see
https://github.com/mlr-org/mlr3learners.template ... a link there is broken
but the full details of contributing new pipeline elements are at
https://mlr3book.mlr-org.com/pipelines.html


> *> Are there patterns there that are preserved across different operators?
> *
>
> A commonality is the use of code for integrating the new calculated
> information (dplyr), validation functions, etc.
>
> *> Can they be factored out to improve maintainability?*
>
> Almost surely yes. This is the first version; I hope to see enough
> interest, improve the API based on feedback, and attract (intellectual and
> practical) contributions from experts in software engineering.
>
> *> validObject *
>
> It seems a good method, and as far as I have tested it works for S3
> objects as well. I will try to implement it; in fact I have already added
> it as an issue on GitHub: https://github.com/stemangiola/ttBulk/issues/6
>
> At the moment I have a custom validation function
>
> Best wishes.
>
> *Stefano *
>
>
>
> Stefano Mangiola | Postdoctoral fellow
>
> Papenfuss Laboratory
>
> The Walter Eliza Hall Institute of Medical Research
>
> +61 (0)466452544
>
>
> On Sat, 8 Feb 2020 at 01:54, Vincent Carey <
> st...@channing.harvard.edu> wrote:
>
>> This is an interesting discussion and I hope it is ok to continue it a
>> bit.  I found the
>> readme for the ttBulk repo extremely enticing and I am sure many people
>> will want to
>> explore this way of working with genomic data.  I have only a few moments
>> to explore
>> it and did not read the vignette, but it looks to me as if it is mostly
>> recapitulated in the
>> README, which is an excellent overview.
>>
>> One thing I feel is missing is an approach to the following question: I
>> like the
>> idea of a pipe-oriented operator for programming steps in genomic
>> workflows.
>> How do I make one that works the way ttBulk's operators work?  Well, I can
>> have a look at ttBulk:::reduce_dimensions.ttBulk ...
>>
>> It's involved.  Are there patterns there that
>> are preserved across different operators?  Can
>> they be factored out to improve maintainability?
>>
>> One other point before I run
>>
>> It seems to me the operators "require" that certain
>> fields be defined in their tibble operands.
>>
>> > names(attributes(counts))
>>
>> [1] "names"  "class"  "row.names"  "parameters"
>>
>> > attributes(counts)$names
>>
>> [1] "sample" "transcript" "Cell type"
>>
>> [4] "count"  "time"   "condition"
>>
>> [7] "batch"  "factor_of_interest"
>>
>> > validObject(counts)
>>
>> *Error in .classEnv(classDef) : *
>>
>> *  trying to get slot "package" from an object of a basic class ("NULL")
>> with