I think this is an interesting analysis. I have not viewed the talk, but it
seems to me this could be a nice R Journal paper. Querying Jim H. on the fate
of itdepends seems in order. The topic is central to the robustness of the
ecosystem, so I hope some tools can come out of this.

On Wed, Feb 12, 2020 at 12:13 PM Robert Castelo <robert.cast...@upf.edu> wrote:

> Martin, Vince, Sean,
>
> thank you very much for your comments and suggestions. i've looked at
> the package 'itdepends' from Jim Hester, this was a great suggestion. i
> actually found a talk he gave about it at rstudioconf2019, here:
>
> https://resources.rstudio.com/rstudio-conf-2019/it-depends-a-dialog-about-dependencies
>
> i recommend watching it to anyone interested in this thread, i think it
> pretty much tackles the most important issues we're concerned with as
> developers, regarding dependencies.
>
> ironically, the package 'itdepends' doesn't seem to be actively
> developed: it's not part of CRAN, the GitHub repo hasn't been updated in
> the last 5 months, it has 10 open issues for 5 closed ones and i've
> experienced that some functions break in the current R-devel.
>
> i also didn't know about 'BiocPkgTools' and this seems to be the right
> home for adding the kind of functionality we're talking about, although
> i would think the same for 'itdepends' if it were pushed to CRAN at
> some point.
>
> i've invested some time to develop what at the moment covers my
> own needs on this subject. in case this is useful to anyone i've made a
> GitHub gist available here:
>
> https://gist.github.com/rcastelo/7429d05178ddb57a38bd42093c2ddfe2
>
> i haven't attempted to integrate this into 'BiocPkgTools' and do a pull
> request for two reasons:
>
> 1. if i try to fetch the dependencies from CRAN, as well as from BioC
> (which is the only default), i get an error:
>
>    library(BiocPkgTools)
>
>    df <- buildPkgDependencyDataFrame(repo=c("BioCsoft", "CRAN"))
>    Error in url(viewsFileUrl) : invalid 'description' argument
>
> 2. because some of the calls break 'itdepends' in R-devel, this would
> also break 'BiocPkgTools' in R-devel. i'm also not sure how feasible it
> is for a BioC package to have a package dependency outside CRAN and BioC.
>
> my initial motivation for all this was that the installation of
> 'GenomicScores' was breaking in one of our servers because of
> compilation problems with the package 'Matrix'. this was surprising to
> me because i wasn't expecting to have that dependency. after the first
> exchange of messages in this thread, using the code we wrote, i
> identified that only a few lines in the source of 'GenomicScores' were
> leading to that dependency upstream. i could replace them and get rid of
> that dependency and actually other ones.
>
> i've tried to provide a first attempt for a general approach to this
> situation. first we should source the gist:
>
>    devtools::source_gist("rcastelo/depburden.R")
>
> then build a database of dependencies information:
>
>    repos <- BiocManager::repositories()[c("BioCsoft", "CRAN")]
>    db <- utils::available.packages(repos=repos)
>
> and now the important part consists of the following three steps:
>
> 1. identify the burden of dependencies of a package, e.g., "GenomicScores"
>
>    pkgDepMetrics("GenomicScores", db)
>                  ImportedBy Exported     Usage DepOverlap
>    Biobase                1      128  0.781250     0.0250
>    BSgenome               1       93  1.075269     0.3625
>    XML                    2      175  1.142857     0.0125
>    IRanges                4      254  1.574803     0.0375
>    BiocGenerics           5      139  3.597122     0.0125
>    GenomicRanges          4      104  3.846154     0.1125
>    S4Vectors             11      262  4.198473     0.0250
>    GenomeInfoDb           5       53  9.433962     0.0750
>    AnnotationHub          4       33 12.121212     0.6875
>    Biostrings            NA      240        NA     0.0750
>
> following Jim's recommendations on his talk, concretely those in minute
> 16, this function reports the number of function calls to a dependency
> and the number of exported functions by that dependency. the column
> 'Usage' is the percentage of those imported calls to the exposed
> functionality by the dependency. for instance, if i want to get rid of
> 'AnnotationHub' i'd have to implement in my package about the 12% of the
> functionality exported by 'AnnotationHub'.
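
As a quick check of how the 'Usage' column described above relates to the other two, the values in the quoted table are consistent with Usage = 100 * ImportedBy / Exported. The sketch below assumes that formula (it is inferred from the printed numbers, not taken from the gist) and copies three rows from the table.

```r
# Three rows copied from the pkgDepMetrics() output quoted above.
metrics <- data.frame(
  ImportedBy = c(1, 1, 4),
  Exported   = c(128, 93, 33),
  row.names  = c("Biobase", "BSgenome", "AnnotationHub")
)

# Assumed formula (inferred from the table, not from the gist source):
# percentage of the exported functions that the package imports.
metrics$Usage <- 100 * metrics$ImportedBy / metrics$Exported

print(round(metrics, 6))
```

With these inputs the computed percentages round to the same figures shown in the thread, e.g. about 12% for 'AnnotationHub'.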
>
> the column 'DepOverlap' shows the overlap between the dependency graph
> of the analyzed package and the dependency graph of the dependency in
> that row. this is calculated as a Jaccard index (intersection of
> vertices divided by the union) where 0 would correspond to disjoint
> graphs and 1 to identical ones.
>
> from these numbers i can see that, for instance, i'm importing just one
> function call from 'BSgenome' (about 1% of its functionality), while the
> dependency burden of 'BSgenome' overlaps more than 1/3 of the total
> burden of the package. this is to me a good candidate to explore in the
> following two steps.
>
> 2. let's say we want to investigate what function calls are responsible
> for the dependency on "BSgenome"
>
>    funCalls2Dep("GenomicScores", "BSgenome", db)
>    # A tibble: 1 x 3
>    # Groups:   pkg [1]
>      pkg      fun                 n
>      <chr>    <chr>           <int>
>    1 BSgenome referenceGenome     4
>
> so i'm using a function or method called "referenceGenome" imported from
> "BSgenome"
>
> 3. we want now to see what lines in our code contain those function
> calls (assuming we're in the source path of the package "GenomicScores"):
>
>    lines <- funCalls2Dep("GenomicScores", "BSgenome", db, ".", "R")
>    head(lines, 2)
>    [[1]]
>    R/makeGScoresPackage.R:60:68: warning: BSgenome::referenceGenome
>      organism(gsco),
>      providerVersion(referenceGenome(gsco))),
>                      ^~~~~~~~~~~~~~~
>
>    [[2]]
>    R/makeGScoresPackage.R:69:49: warning: BSgenome::referenceGenome
>      GENOMEVERSION=providerVersion(referenceGenome(gsco)),
>                                    ^~~~~~~~~~~~~~~
>
> here i'm using the release version of R because otherwise, as i said
> before, some of the function calls to the 'itdepends' package break.
>
> i'd be happy to pull-request this code, with the necessary adaptations,
> wherever the community feels is more appropriate, but i'd say that the
> problem with 'itdepends' and R-devel should be fixed first, and then we
> can decide if this is something we want to incorporate into an API and
> from what package.
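
The 'DepOverlap' column described above is a Jaccard index over the vertex sets of two dependency graphs. A minimal sketch in base R follows; the function name `jaccard` and the package sets are hypothetical and only illustrate the three cases mentioned (partial overlap, identical, disjoint).

```r
# Jaccard index between two vertex sets:
# size of the intersection divided by size of the union.
jaccard <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

# Hypothetical dependency sets, for illustration only.
pkg_deps <- c("A", "B", "C", "D")
dep_deps <- c("C", "D", "E")

jaccard(pkg_deps, dep_deps)   # 2 shared vertices out of 5 in the union
jaccard(pkg_deps, pkg_deps)   # identical sets give 1
jaccard(c("A", "B"), c("C"))  # disjoint sets give 0
```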
>
> cheers,
>
> robert.
>
> On 2/9/20 5:01 PM, Sean Davis wrote:
> > There are some good ideas here that would provide enhancement to
> > BiocPkgTools. I don't have the bandwidth to incorporate right now, but
> > filing issues or a pull request with a skeleton would be helpful to keep
> > track.
> >
> > Sean
> >
> > On Sun, Feb 9, 2020 at 7:31 AM Vincent Carey <st...@channing.harvard.edu>
> > wrote:
> >
> >> On Sat, Feb 8, 2020 at 12:02 PM Martin Morgan <mtmorgan.b...@gmail.com>
> >> wrote:
> >>
> >>> I find it quite interesting to identify formal strategies for removing
> >>> dependencies, but also a little outside my domain of expertise. This code
> >>>
> >>
> >> It would be nice to collect the ideas in this thread into some
> >> recommendations. The themes I am thinking of are "how developers can
> >> make their packages robust to loss of external packages" and "how can
> >> the Bioc ecosystem best deal with departures of packages from itself
> >> and from CRAN?" A good and well-adopted solution to the first one
> >> makes the second one moot.
> >>
> >> Two CRAN-related events I know of that required some effort are
> >> (temporary) loss of ashr and (recently) archiving of Seurat.
> >>
> >>
> >>>     library(tools)
> >>>     library(dplyr)
> >>>
> >>>     ## non-base packages the user requires for GenomicScores
> >>>     deps <- package_dependencies("GenomicScores", db, recursive=TRUE)[[1]]
> >>>     deps <- intersect(deps, rownames(db))
> >>>
> >>>     ## only need the 'universe' of GenomicScores dependencies
> >>>     db1 <- db[c("GenomicScores", deps),]
> >>>
> >>>     ## sub-graph of packages between each dependency and GenomicScores
> >>>     revdeps <- package_dependencies(deps, db1, recursive = TRUE,
> >>>                                     reverse = TRUE)
> >>>
> >>>     tibble(
> >>>         package = names(revdeps),
> >>>         n_remove = lengths(revdeps)
> >>>     ) %>%
> >>>         arrange(n_remove)
> >>>
> >>> produces a tibble
> >>>
> >>>     # A tibble: 106 x 2
> >>>        package           n_remove
> >>>        <chr>                <int>
> >>>      1 BSgenome                 1
> >>>      2 AnnotationHub            1
> >>>      3 shinyjs                  1
> >>>      4 DT                       1
> >>>      5 shinycustomloader        1
> >>>      6 data.table               1
> >>>      7 shinythemes              1
> >>>      8 rtracklayer              2
> >>>      9 BiocFileCache            2
> >>>     10 BiocManager              2
> >>>     # … with 96 more rows
> >>>
> >>> shows me, via n_remove, that I can remove the dependency on AnnotationHub
> >>> by removing the dependency on just one package (AnnotationHub!), but to
> >>> remove BiocFileCache I'd also have to remove another package
> >>> (AnnotationHub, I'd guess). So this provides some measure of the ease
> >>> with which a package can be removed.
> >>>
> >>> I'd like a 'benefit' column, too -- if I were to remove AnnotationHub,
> >>> how many additional packages would I also be able to remove, because
> >>> they are present only to satisfy the dependency on AnnotationHub? More
> >>> generally, perhaps there is a dependency of AnnotationHub that is only
> >>> used by AnnotationHub and BSgenome. So removing AnnotationHub as a
> >>> dependency would make it easier to remove BSgenome, etc. I guess this
> >>> is a graph optimization problem.
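
One way to read the 'benefit' column asked for above is: the packages that disappear from the full install set once a single direct dependency is dropped. The sketch below works on a toy graph in base R; the package names, the `closure()` and `benefit()` helpers, and this definition of benefit are all assumptions for illustration, not code from the thread.

```r
# Toy dependency graph (hypothetical packages): each name maps to its
# direct dependencies.
deps <- list(
  pkg = c("a", "b"),
  a   = c("c", "d"),
  b   = c("d"),
  c   = character(0),
  d   = character(0)
)

# All packages reachable from the given starting packages (starts included).
closure <- function(start, graph) {
  seen <- character(0)
  todo <- start
  while (length(todo) > 0) {
    node <- todo[1]
    todo <- todo[-1]
    if (!(node %in% seen)) {
      seen <- c(seen, node)
      todo <- c(todo, graph[[node]])
    }
  }
  seen
}

# Packages that drop out of the install set of `pkg` if one of its direct
# dependencies is removed.
benefit <- function(pkg, direct_dep, graph) {
  before <- closure(graph[[pkg]], graph)
  after  <- closure(setdiff(graph[[pkg]], direct_dep), graph)
  setdiff(before, after)
}

benefit("pkg", "a", deps)  # "a" and "c" go away; "d" stays because "b" needs it
benefit("pkg", "b", deps)  # only "b" goes away
```

On a real package the same idea could be applied to the output of `tools::package_dependencies()`, which is left out here so the sketch runs without network access.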
> >>>
> >>> Probably also worth mentioning the itdepends package
> >>> (https://github.com/r-lib/itdepends), which I think tries primarily to
> >>> determine the relationship between package dependencies and lines of
> >>> code, which seems like complementary information.
> >>>
> >>> Martin
> >>>
> >>> On 2/6/20, 12:29 PM, "Robert Castelo" <robert.cast...@upf.edu> wrote:
> >>>
> >>>     true, i was just searching for the shortest path, we can search for
> >>>     all simple (i.e., without repeating "vertices") paths and there are
> >>>     up to five routes from "GenomicScores" to "Matrix"
> >>>
> >>>     igraph::all_simple_paths(igraph::igraph.from.graphNEL(g),
> >>>                              from="GenomicScores", to="Matrix", mode="out")
> >>>     [[1]]
> >>>     + 7/117 vertices, named, from 04133ec:
> >>>     [1] GenomicScores        BSgenome             rtracklayer
> >>>     [4] GenomicAlignments    SummarizedExperiment DelayedArray
> >>>     [7] Matrix
> >>>
> >>>     [[2]]
> >>>     + 6/117 vertices, named, from 04133ec:
> >>>     [1] GenomicScores        BSgenome             rtracklayer
> >>>     [4] GenomicAlignments    SummarizedExperiment Matrix
> >>>
> >>>     [[3]]
> >>>     + 6/117 vertices, named, from 04133ec:
> >>>     [1] GenomicScores DT            crosstalk     ggplot2       mgcv
> >>>     [6] Matrix
> >>>
> >>>     [[4]]
> >>>     + 6/117 vertices, named, from 04133ec:
> >>>     [1] GenomicScores        rtracklayer          GenomicAlignments
> >>>     [4] SummarizedExperiment DelayedArray         Matrix
> >>>
> >>>     [[5]]
> >>>     + 5/117 vertices, named, from 04133ec:
> >>>     [1] GenomicScores        rtracklayer          GenomicAlignments
> >>>     [4] SummarizedExperiment Matrix
> >>>
> >>>     this is interesting, because it means that if i wanted to get rid of
> >>>     the "Matrix" dependence i'd need to get rid not only of the
> >>>     "rtracklayer" dependence but also of "BSgenome" and "DT".
> >>>
> >>>     robert.
> >>>
> >>>
> >>>     On 2/6/20 5:41 PM, Martin Morgan wrote:
> >>>     > Excellent! I think there are other, independent, paths between your
> >>>     > immediate dependents...
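
The conclusion quoted above, that "Matrix" is reached through "rtracklayer", "BSgenome" and "DT", can be reproduced without igraph on a small graph. The sketch below is a plain depth-first enumeration in base R; the edge list contains only the edges visible in the five printed paths (a subset of the real 117-vertex graph), and `simple_paths()` is a hypothetical helper, not the igraph implementation.

```r
# Edges taken from the five paths printed in the thread (a subset of the
# real dependency graph).
edges <- list(
  GenomicScores        = c("BSgenome", "DT", "rtracklayer"),
  BSgenome             = "rtracklayer",
  rtracklayer          = "GenomicAlignments",
  GenomicAlignments    = "SummarizedExperiment",
  SummarizedExperiment = c("DelayedArray", "Matrix"),
  DelayedArray         = "Matrix",
  DT                   = "crosstalk",
  crosstalk            = "ggplot2",
  ggplot2              = "mgcv",
  mgcv                 = "Matrix"
)

# Depth-first enumeration of all simple paths (no repeated vertex)
# from `from` to `to`.
simple_paths <- function(graph, from, to, visited = character(0)) {
  visited <- c(visited, from)
  if (from == to) return(list(visited))
  paths <- list()
  for (nxt in setdiff(graph[[from]], visited)) {
    paths <- c(paths, simple_paths(graph, nxt, to, visited))
  }
  paths
}

paths <- simple_paths(edges, "GenomicScores", "Matrix")
length(paths)

# Direct dependencies of GenomicScores through which "Matrix" is reached.
first_hops <- unique(vapply(paths, function(p) p[2], character(1)))
first_hops
```

Collecting the second vertex of every path gives the set of direct dependencies that would all have to go before "Matrix" stops being pulled in.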
> >>>     >
> >>>     >     RBGL::sp.between(g, start="DT", finish="Matrix",
> >>>     >                      detail=TRUE)[[1]]$path_detail
> >>>     >     [1] "DT"        "crosstalk" "ggplot2"   "mgcv"      "Matrix"
> >>>     >
> >>>     > ??
> >>>     >
> >>>     > Martin
> >>>     >
> >>>     > On 2/6/20, 10:47 AM, "Robert Castelo" <robert.cast...@upf.edu> wrote:
> >>>     >
> >>>     >     hi Martin,
> >>>     >
> >>>     >     thanks for the hint!! i wasn't aware of
> >>>     >     'tools::package_dependencies()', adding a bit of graph sorcery
> >>>     >     i get the result i was looking for:
> >>>     >
> >>>     >     repos <- BiocManager::repositories()[c(1,5)]
> >>>     >     repos
> >>>     >                                          BioCsoft
> >>>     >     "https://bioconductor.org/packages/3.11/bioc"
> >>>     >                                              CRAN
> >>>     >                        "https://cran.rstudio.com"
> >>>     >
> >>>     >     db <- available.packages(repos=repos)
> >>>     >
> >>>     >     deps <- tools::package_dependencies("GenomicScores", db,
> >>>     >                                         recursive=TRUE)[[1]]
> >>>     >
> >>>     >     deps <- tools::package_dependencies(c("GenomicScores", deps), db)
> >>>     >
> >>>     >     g <- graph::graphNEL(nodes=names(deps), edgeL=deps,
> >>>     >                          edgemode="directed")
> >>>     >
> >>>     >     RBGL::sp.between(g, start="GenomicScores", finish="Matrix",
> >>>     >                      detail=TRUE)[[1]]$path_detail
> >>>     >     [1] "GenomicScores"        "rtracklayer"    "GenomicAlignments"
> >>>     >     [4] "SummarizedExperiment" "Matrix"
> >>>     >
> >>>     >     so, it was the rtracklayer dependency that leads to Matrix
> >>>     >     through GenomicAlignments and SummarizedExperiment.
> >>>     >
> >>>     >     maybe the BioC package 'pkgDepTools' should be deprecated if
> >>>     >     its functionality is part of 'tools' and it does not even work
> >>>     >     as fast and correct as 'tools'.
> >>>     >
> >>>     >     cheers,
> >>>     >
> >>>     >     robert.
> >>>     >
> >>>     >
> >>>     >     On 2/6/20 2:51 PM, Martin Morgan wrote:
> >>>     >     > The first thing is to get the correct repositories
> >>>     >     >
> >>>     >     >     repos = BiocManager::repositories()
> >>>     >     >
> >>>     >     > (maybe trim the experiment and annotation repos from this).
> >>>     >     > I also tried pkgDepTools::makeDepGraph() but it took so long
> >>>     >     > that I moved on... it has an option 'keep.builtin' which
> >>>     >     > might include Matrix.
> >>>     >     >
> >>>     >     > There is also BiocPkgTools::buildPkgDependencyDataFrame() &
> >>>     >     > friends, but this seems to build dependencies within a single
> >>>     >     > repository...
> >>>     >     >
> >>>     >     > The building block for a solution is
> >>>     >     > `tools::package_dependencies()`, and I can confirm that
> >>>     >     > "Matrix" _is_ a dependency
> >>>     >     >
> >>>     >     >     db = available.packages(repos = BiocManager::repositories())
> >>>     >     >     revdeps <- tools::package_dependencies("GenomicScores", db,
> >>>     >     >                                            recursive = TRUE)
> >>>     >     >     "Matrix" %in% revdeps[[1]]
> >>>     >     >     ## [1] TRUE
> >>>     >     >
> >>>     >     > so I'll leave the clever recursive or graph-based algorithm
> >>>     >     > up to you, to report back to the mailing list?
> >>>     >     >
> >>>     >     > For what it's worth I think the last time this came up Martin
> >>>     >     > Maechler pointed to a function in base R (probably the tools
> >>>     >     > package) that implements this, too...?
> >>>     >     >
> >>>     >     > Martin Morgan
> >>>     >     >
> >>>     >     > On 2/6/20, 6:40 AM, "Bioc-devel on behalf of Robert Castelo"
> >>>     >     > <bioc-devel-boun...@r-project.org on behalf of
> >>>     >     > robert.cast...@upf.edu> wrote:
> >>>     >     >
> >>>     >     >     hi,
> >>>     >     >
> >>>     >     >     when i load the package 'GenomicScores' in a clean session
> >>>     >     >     i see through the 'sessionInfo()' that the package 'Matrix'
> >>>     >     >     is listed under "loaded via a namespace (and not attached)".
> >>>     >     >
> >>>     >     >     i'd like to know what is the dependency that
> >>>     >     >     'GenomicScores' has that ends up requiring the package
> >>>     >     >     'Matrix'.
> >>>     >     >
> >>>     >     >     i've tried using the package 'pkgDepTools' without
> >>>     >     >     success, because the dependency graph does not list any
> >>>     >     >     path from 'GenomicScores' to 'Matrix'.
> >>>     >     >
> >>>     >     >     i've been manually browsing the Bioc website and, unless
> >>>     >     >     i've overlooked something, the only association with
> >>>     >     >     'Matrix' i could find is that 'S4Vectors' and
> >>>     >     >     'GenomicRanges', which are required by 'GenomicScores',
> >>>     >     >     list 'Matrix' in the 'Suggests' field, but my
> >>>     >     >     understanding is that those packages are not required and
> >>>     >     >     should not be loaded.
> >>>     >     >
> >>>     >     >     so, is there any way in which i can figure out what of the
> >>>     >     >     'GenomicScores' dependencies leads to loading the package
> >>>     >     >     'Matrix'?
> >>>     >     >
> >>>     >     >     here are the depends, import and suggests fields from
> >>>     >     >     'GenomicScores':
> >>>     >     >
> >>>     >     >     Depends: R (>= 3.5), S4Vectors (>= 0.7.21), GenomicRanges,
> >>>     >     >              methods, BiocGenerics (>= 0.13.8)
> >>>     >     >     Imports: utils, XML, Biobase, IRanges (>= 2.3.23),
> >>>     >     >              Biostrings, BSgenome, GenomeInfoDb, AnnotationHub,
> >>>     >     >              shiny, shinyjs, DT, shinycustomloader,
> >>>     >     >              rtracklayer, data.table, shinythemes
> >>>     >     >     Suggests: BiocStyle, knitr, rmarkdown,
> >>>     >     >               BSgenome.Hsapiens.UCSC.hg19,
> >>>     >     >               phastCons100way.UCSC.hg19,
> >>>     >     >               MafDb.1Kgenomes.phase1.hs37d5,
> >>>     >     >               SNPlocs.Hsapiens.dbSNP144.GRCh37,
> >>>     >     >               VariantAnnotation,
> >>>     >     >               TxDb.Hsapiens.UCSC.hg19.knownGene, gwascat,
> >>>     >     >               RColorBrewer
> >>>     >     >
> >>>     >     >     and here a session information in a fresh R-devel session
> >>>     >     >     after loading the package 'GenomicScores':
> >>>     >     >
> >>>     >     >     R Under development (unstable) (2020-01-29 r77745)
> >>>     >     >     Platform: x86_64-pc-linux-gnu (64-bit)
> >>>     >     >     Running under: CentOS Linux 7 (Core)
> >>>     >     >
> >>>     >     >     Matrix products: default
> >>>     >     >     BLAS:   /opt/R/R-devel/lib64/R/lib/libRblas.so
> >>>     >     >     LAPACK: /opt/R/R-devel/lib64/R/lib/libRlapack.so
> >>>     >     >
> >>>     >     >     locale:
> >>>     >     >      [1] LC_CTYPE=en_US.UTF8        LC_NUMERIC=C
> >>>     >     >      [3] LC_TIME=en_US.UTF8         LC_COLLATE=en_US.UTF8
> >>>     >     >      [5] LC_MONETARY=en_US.UTF8     LC_MESSAGES=en_US.UTF8
> >>>     >     >      [7] LC_PAPER=en_US.UTF8        LC_NAME=C
> >>>     >     >      [9] LC_ADDRESS=C               LC_TELEPHONE=C
> >>>     >     >     [11] LC_MEASUREMENT=en_US.UTF8  LC_IDENTIFICATION=C
> >>>     >     >
> >>>     >     >     attached base packages:
> >>>     >     >     [1] parallel  stats4    stats     graphics  grDevices
> >>>     >     >     [6] utils     datasets  methods   base
> >>>     >     >
> >>>     >     >     other attached packages:
> >>>     >     >     [1] GenomicScores_1.11.4 GenomicRanges_1.39.2
> >>>     >     >     [3] GenomeInfoDb_1.23.10 IRanges_2.21.3
> >>>     >     >     [5] S4Vectors_0.25.12    BiocGenerics_0.33.0
> >>>     >     >     [7] colorout_1.2-2
> >>>     >     >
> >>>     >     >     loaded via a namespace (and not attached):
> >>>     >     >      [1] Rcpp_1.0.3                    lattice_0.20-38
> >>>     >     >      [3] shinycustomloader_0.9.0       Rsamtools_2.3.3
> >>>     >     >      [5] Biostrings_2.55.4             assertthat_0.2.1
> >>>     >     >      [7] digest_0.6.23                 mime_0.9
> >>>     >     >      [9] BiocFileCache_1.11.4          R6_2.4.1
> >>>     >     >     [11] RSQLite_2.2.0                 httr_1.4.1
> >>>     >     >     [13] pillar_1.4.3                  zlibbioc_1.33.1
> >>>     >     >     [15] rlang_0.4.4                   curl_4.3
> >>>     >     >     [17] data.table_1.12.8             blob_1.2.1
> >>>     >     >     [19] DT_0.12                       Matrix_1.2-18
> >>>     >     >     [21] shinythemes_1.1.2             shinyjs_1.1
> >>>     >     >     [23] BiocParallel_1.21.2           AnnotationHub_2.19.7
> >>>     >     >     [25] htmlwidgets_1.5.1             RCurl_1.98-1.1
> >>>     >     >     [27] bit_1.1-15.1                  shiny_1.4.0
> >>>     >     >     [29] DelayedArray_0.13.3           compiler_4.0.0
> >>>     >     >     [31] httpuv_1.5.2                  rtracklayer_1.47.0
> >>>     >     >     [33] pkgconfig_2.0.3               htmltools_0.4.0
> >>>     >     >     [35] tidyselect_1.0.0              SummarizedExperiment_1.17.1
> >>>     >     >     [37] tibble_2.1.3                  GenomeInfoDbData_1.2.2
> >>>     >     >     [39] interactiveDisplayBase_1.25.0 matrixStats_0.55.0
> >>>     >     >     [41] XML_3.99-0.3                  crayon_1.3.4
> >>>     >     >     [43] dplyr_0.8.4                   dbplyr_1.4.2
> >>>     >     >     [45] later_1.0.0                   GenomicAlignments_1.23.1
> >>>     >     >     [47] bitops_1.0-6                  rappdirs_0.3.1
> >>>     >     >     [49] grid_4.0.0                    xtable_1.8-4
> >>>     >     >     [51] DBI_1.1.0                     magrittr_1.5
> >>>     >     >     [53] XVector_0.27.0                promises_1.1.0
> >>>     >     >     [55] vctrs_0.2.2                   tools_4.0.0
> >>>     >     >     [57] bit64_0.9-7                   BSgenome_1.55.3
> >>>     >     >     [59] Biobase_2.47.2                glue_1.3.1
> >>>     >     >     [61] purrr_0.3.3                   BiocVersion_3.11.1
> >>>     >     >     [63] fastmap_1.0.1                 yaml_2.2.1
> >>>     >     >     [65] AnnotationDbi_1.49.1          BiocManager_1.30.10
> >>>     >     >     [67] memoise_1.1.0
> >>>     >     >
> >>>     >     >     thanks!!
> >>>     >     >
> >>>     >     >     robert.
> >>>     >     >
> >>>     >     >     _______________________________________________
> >>>     >     >     Bioc-devel@r-project.org mailing list
> >>>     >     >     https://stat.ethz.ch/mailman/listinfo/bioc-devel
> >>>     >
> >>>     >     --
> >>>     >     Robert Castelo, PhD
> >>>     >     Associate Professor
> >>>     >     Dept. of Experimental and Health Sciences
> >>>     >     Universitat Pompeu Fabra (UPF)
> >>>     >     Barcelona Biomedical Research Park (PRBB)
> >>>     >     Dr Aiguader 88
> >>>     >     E-08003 Barcelona, Spain
> >>>     >     telf: +34.933.160.514
> >>>     >     fax: +34.933.160.550