[Bioc-devel] VariantAnnotation: DNAStringSets for ref/alt alleles in 'VRanges' class
Hi,

Would it be reasonable to (optionally) allow storing the reference and alternative alleles in the 'VRanges' class as a 'DNAStringSet'? Currently, 'character' and 'Rle' are possible. Having a 'DNAStringSet' would make it more consistent with the rest of the 'VariantAnnotation' framework and make use of the efficient 'Biostrings' string handling infrastructure.

Best wishes
Julian

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
Re: [Bioc-devel] VariantAnnotation: DNAStringSets for ref/alt alleles in 'VRanges' class
This was a consideration. I guess I've never got much use out of them being DNAStringSets, so I just went with the simple character vectors. It makes sense to support DNAStringSet. I could imagine someone e.g. wanting to represent mutations at the protein-level, and structural variants will require more complexity, but DNA is by far the most common use case.

Are you willing to submit this as a patch? Just out of curiosity, how are you using Biostrings in this case?
Re: [Bioc-devel] BiocParallel: Best standards for passing locally assigned variables/functions, e.g. a bpExport()?
Actually, the check that I proposed is only supposed to check for usage of user-defined variables, not variables from packages. Truthfully, though, I guess I'm not the right person to work on this, since in practice I use forked processes for the vast majority of my inside-R parallelization, so I never have to worry about things being undefined in the forked subprocess. Therefore I can't really dogfood any of the stuff that might be implemented as a result of this thread.

-Ryan

On Mon Nov 4 03:48:23 2013, Michael Lawrence wrote:

So what is the best practice for ensuring that something is actually visible to the worker? If the worker needs functionality from a package, should the namespace be explicitly referenced via ::? Lazy users might want to include library() calls in the worker function. This proposed check will then throw an exception. Probably a good thing, but is there a way for a user to declare imported namespaces?

I know that BatchJobs allows for passing a list of packages to be loaded via library() on the worker. That is leveraging the search path to make sure everything is visible and is a reasonable compromise (:: is always an option). We could essentially reimplement the search path if we wanted isolation, but the worker is already isolated. Anyway, somehow those types of declarations should be taken into account.

Moving back to the general discussion, for complex operations, it's easiest to have the worker in a package. In that case, the worker will likely rely on other functions, and the cleanest way to get those functions to the worker is to have them installed as a package. At least with BatchJobs, when the worker is inside a package namespace, that namespace is automatically loaded (but not attached), so all functions are automatically visible, without any extra work by me.

Michael

On Sun, Nov 3, 2013 at 10:46 PM, Ryan r...@thompsonclan.org wrote:

Ok, here is my attempt at a function to get the list of user-defined free variables that a function refers to:

https://gist.github.com/DarwinAwardWinner/7298557

It uses codetools, so it is subject to the limitations of that package, but for simple examples, it successfully detects when a function refers to something in the global env.

On Sun Nov 3 21:14:29 2013, Gabriel Becker wrote:

Ryan (et al),

FYI:

  > f
  function() {
    x = rnorm(x)
    x
  }
  > findGlobals(f)
  [1] "="     "{"     "rnorm"

x should be in the list of globals but it isn't.

~G

  > sessionInfo()
  R version 3.0.2 (2013-09-25)
  Platform: x86_64-pc-linux-gnu (64-bit)

  locale:
   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
   [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
   [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
   [9] LC_ADDRESS=C               LC_TELEPHONE=C
  [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

  attached base packages:
  [1] stats graphics grDevices utils datasets methods base

  other attached packages:
  [1] codetools_0.2-8

On Sun, Nov 3, 2013 at 5:37 PM, Ryan r...@thompsonclan.org wrote:

Looking at the codetools package, I think findGlobals is basically exactly what we want here, right? As you say, there are necessarily limitations due to R being a dynamic language, but the goal is to catch common errors, not stop people from tricking the check. I think I'll try to code something up soon.

-Ryan

On 11/3/13, 5:10 PM, Gabriel Becker wrote:

Henrik,

See https://github.com/duncantl/CodeDepends (as used by https://github.com/gmbecker/RCacheSuite). It will identify necessarily defined symbols (input variables) for code that is not doing certain tricks (e.g. get(), mixing data.frame columns and global variables in formulas, etc.).

Tierney's codetools package also does things along these lines, but there are some situations where it has trouble. I can give more detail if desired.

~G

On Sun, Nov 3, 2013 at 3:04 PM, Ryan r...@thompsonclan.org wrote:

Another potential easy
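The findGlobals() limitation Gabriel reports in the quoted exchange can be reproduced with a few lines. This is only a sketch: it uses codetools::findGlobals() and base R, and user_globals() is a hypothetical helper written for this illustration (it is not the code in Ryan's gist). It assumes that "user-defined" means "bound directly in a given environment".

```r
library(codetools)

# Gabriel's example: 'x' is read before it is assigned, so it is really a
# free variable, but findGlobals() treats it as local because the function
# body also assigns to it.
f <- function() {
  x = rnorm(x)
  x
}
globals_f <- findGlobals(f)

# Hypothetical helper (an assumption for illustration, not Ryan's gist):
# keep only the globals that are bound directly in 'env', which skips
# anything provided by packages or base R.
user_globals <- function(fun, env = globalenv()) {
  syms <- findGlobals(fun, merge = TRUE)
  defined_here <- vapply(syms, exists, logical(1),
                         envir = env, inherits = FALSE)
  unname(syms[defined_here])
}

# A stand-in for the user's workspace, so the example does not touch the
# real global environment.
workspace <- new.env()
assign("offset", 10, envir = workspace)
g <- function(v) v + offset

user_g <- user_globals(g, env = workspace)
```

Here globals_f contains "rnorm" but not "x", which is the miss described above, while user_g picks up only the workspace variable that g() depends on.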
Re: [Bioc-devel] VariantAnnotation: DNAStringSets for ref/alt alleles in 'VRanges' class
Hi Michael,

Sure, I'll try to dig into it and construct a patch that adds this feature. I stumbled upon this after converting data between the 'VCF' and 'VRanges' classes. The primary use case I had in mind is more efficient storage and processing of short indels, or defining variants by their ref/alt alleles while also taking the sequence context into account.

Best wishes
Julian
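To make the indel use case concrete, here is a minimal sketch of what holding ref/alt alleles as 'DNAStringSet' objects offers. It assumes the Biostrings package is installed and uses made-up alleles; it does not touch 'VRanges' itself, whose support for this type is exactly what the thread is proposing.

```r
suppressPackageStartupMessages(library(Biostrings))

# Made-up alleles for three variants: a SNV, a 1 bp deletion, a 2 bp insertion
ref <- DNAStringSet(c("A", "AT", "G"))
alt <- DNAStringSet(c("T", "A", "GCC"))

# Vectorised length arithmetic distinguishes indels from substitutions
size_change <- width(alt) - width(ref)
is_indel <- size_change != 0L

# Sequence-aware operations come for free, e.g. for minus-strand variants
alt_rc <- as.character(reverseComplement(alt))
```

With plain character vectors the same operations need nchar() and a hand-written reverse complement, which is the kind of convenience the request is about.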
Re: [Bioc-devel] BiocParallel: Best standards for passing locally assigned variables/functions, e.g. a bpExport()?
Weird, I guess it needs to be logged in or something. I don't know if the issue is that it's in a non-master branch or what. The repo is fully public and the forCRAN_0.3.5 branch definitely exists on GitHub.

I started Chrome (where I'm not logged into GitHub) and got the same 404 error, but after navigating to the file by going to the repo, changing the branch and browsing to the file, it now works even when I quit Chrome and restart it. I don't know if it needed me to do that or if there was an intermittent problem that is now fixed.

Anyway, here is the raw code, the link for which seems to work (in a browser where I'm not logged into GitHub). If it still doesn't, I can just attach the file here if you want. It doesn't rely on any of the rest of the CodeDepends machinery.

https://raw.github.com/duncantl/CodeDepends/forCRAN_0.3.5/R/librarySymbols.R

~G
Re: [Bioc-devel] BiocParallel: Best standards for passing locally assigned variables/functions, e.g. a bpExport()?
The code that I wrote intentionally avoids checking for package variables, since I consider that a separate problem. Package variables can be provided to the child by loading the package, whereas user-defined variables must be serialized in the parent and sent to the child.

I think I could fairly easily adapt the same code to return a list of all packages that a function depends on.

-Ryan

On Nov 4, 2013 11:35 AM, Michael Lawrence lawrence.mich...@gene.com wrote:

The dynamic nature of R limits the extent of these checks. But as Ryan has noted, a simple sanity check goes a long way. If what he has done could be extended to the rest of the search path (people always forget to attach packages), I think we've hit the 80% with 20%.

Got a 404 on that URL btw.

Michael

On Mon, Nov 4, 2013 at 11:05 AM, Gabriel Becker gmbec...@ucdavis.edu wrote:

Hey guys,

Here is code that I have written which resolves library names into a full list of symbols:

https://github.com/duncantl/CodeDepends/blob/forCRAN_0.3.5/R/librarySymbols.R

Note this does not require that the packages actually be loaded at the time of the check, and does not load them (or rather, it loads them but does not attach them, so no search path muddying occurs). You do need a list of packages to check though (it adds the base ones automatically). It handles dependencies and could easily be extended to handle suggests as well, I think.

When CodeDepends gets pushed to CRAN (not my call and not high on my priority list to push for currently) it will actually do exactly what you want. (The forCRAN_0.3.5 branch already does and I believe it is documented, so you could use devtools to install it now.)

As a side note, I'm not sure that existence of a symbol is sufficient (it certainly is necessary). What about situations where the symbol exists but is stale compared to the value in the parent? Are we sure that can never happen?

~G

On Mon, Nov 4, 2013 at 7:29 AM, Michel Lang michell...@gmail.com wrote:

You might want to consider using Recall() for recursion, which should solve this. Determining the required variables using heuristics as codetools does will probably lead to some confusion when using functions which include calls to, e.g., with():

  f = function() {
    with(iris, Sepal.Length + Sepal.Width)
  }
  codetools:::findGlobals(f)

I would suggest writing up some documentation on what the function's environment contains and how to define variables accordingly, or why it can generally be considered a good idea to pass everything essential as an argument. Nevertheless, a bpExport function would be a good addition for some rare corner cases in my opinion.

Michel

2013/11/3 Henrik Bengtsson h...@biostat.ucsf.edu

Hi,

in BiocParallel, is there a suggested (or planned) best standard for making *locally* assigned variables (e.g. functions) available to the applied function when it runs in a separate R process (which will be the most common use case)? I understand that local variables should be avoided and that it's preferred to put as much as possible in packages, but that's not always possible or very convenient.

EXAMPLE:

  library('BiocParallel')
  library('BatchJobs')

  # Here I pick a recursive function to make the problem a bit harder, i.e.
  # the function needs to call itself (itself = see below)
  fib <- function(n=0) {
    if (n < 0) stop("Invalid 'n': ", n)
    if (n == 0 || n == 1) return(1)
    fib(n-2) + fib(n-1)
  }

  # Executing in the current R session
  cluster.functions <- makeClusterFunctionsInteractive()
  bpParams <- BatchJobsParam(cluster.functions=cluster.functions)
  register(bpParams)
  values <- bplapply(0:9, FUN=fib)
  ## SubmitJobs |++| 100% (00:00:00)
  ## Waiting [S:0 R:0 D:10 E:0] |+++| 100% (00:00:00)

  # Executing in a separate R process, where fib() is not defined
  # (not specific to BiocParallel)
  cluster.functions <- makeClusterFunctionsLocal()
  bpParams <- BatchJobsParam(cluster.functions=cluster.functions)
  register(bpParams)
  values <- bplapply(0:9, FUN=fib)
  ## SubmitJobs |++| 100% (00:00:00)
  ## Waiting [S:0 R:0 D:10 E:0] |+++| 100% (00:00:00)
  Error in LastError$store(results = results, is.error = !ok, throw.error = TRUE) :
    Errors occurred during execution. First error message:
  Error in FUN(...): could not find function "fib"
  [...]

  # The following illustrates that the solution is not always straightforward.
  # (not specific to BiocParallel; must have been discussed previously)
  values <- bplapply(0:9, FUN=function(n, fib) { fib(n) }, fib=fib)
  Error in LastError$store(results = results, is.error = !ok, throw.error = TRUE) :
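Michel Lang's caveat about with() in the quoted thread is easy to check. The sketch below only uses codetools::findGlobals() and the built-in iris data set; it shows the static analysis reporting data-frame columns as if they were global variables, even though the function runs without any such globals being defined.

```r
library(codetools)

# Sepal.Length and Sepal.Width are columns of 'iris', resolved by with()
# at run time. A static code walker cannot know that.
f <- function() {
  with(iris, Sepal.Length + Sepal.Width)
}

# merge = FALSE separates symbols used as functions from those used as
# variables.
reported <- findGlobals(f, merge = FALSE)
false_positives <- intersect(reported$variables,
                             c("Sepal.Length", "Sepal.Width"))

# The function itself works fine: one sum per row of 'iris'.
result <- f()
```

Any automatic export check built on this heuristic would therefore need a way for the user to silence or override such false positives.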
Re: [Bioc-devel] BiocParallel: Best standards for passing locally assigned variables/functions, e.g. a bpExport()?
Ryan,

I agree that in some sense it is a different problem, but my point is that with a different approach we can easily answer both. The code I posted returns a named character vector of symbol names, with the package name being the name. This makes it a trivial lookup to determine both a) which symbols aren't available in any of the packages and b) which packages provide the remaining required symbols. No extra work required.

You do have to give it a list of packages to check, but it is easy to write a wrapper that automatically passes it all currently attached packages if desired (a combination of search() and gsub() would be a quick and dirty way to do this).

All that said, I'm simply trying to help. If you guys don't want to use my code/approach, that is your prerogative, as I'm not currently working on BiocParallel myself.

~G
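The lookup Gabriel describes can be sketched with base R alone. This is not the CodeDepends code: package_symbols() and attached_packages() are hypothetical names, and the sketch assumes that "symbols provided by a package" means its namespace exports. getNamespaceExports() loads a namespace if needed but does not attach it, which matches the behaviour described above.

```r
# Hypothetical helper: map exported symbols to the package providing them.
# Returns a character vector of symbol names, named by package.
package_symbols <- function(pkgs) {
  exports <- lapply(pkgs, getNamespaceExports)
  stats::setNames(unlist(exports, use.names = FALSE),
                  rep(pkgs, lengths(exports)))
}

# Quick and dirty wrapper input: all currently attached packages,
# taken from the search path.
attached_packages <- function() {
  entries <- grep("^package:", search(), value = TRUE)
  sub("^package:", "", entries)
}

syms <- package_symbols(c("stats", "utils"))

# Symbols a worker function needs (made up for this example)
needed <- c("rnorm", "head", "my_helper")

# a) symbols that no listed package provides
unresolved <- setdiff(needed, syms)

# b) packages that provide the remaining symbols
providers <- unique(names(syms)[syms %in% needed])
```

Calling package_symbols(attached_packages()) would check against everything on the search path, which is the wrapper suggested in the message.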
Re: [Bioc-devel] disappearing .tex file when running R CMD Sweave on a new vignette
On Mon, Nov 4, 2013 at 12:46 PM, Dan Tenenbaum dtene...@fhcrc.org wrote:

- Original Message -
From: Tim Triche, Jr. tim.tri...@gmail.com
To: bioc-devel@r-project.org
Sent: Monday, November 4, 2013 12:25:19 PM
Subject: [Bioc-devel] disappearing .tex file when running R CMD Sweave on a new vignette

I get a bizarre error when compiling a newly-added Methylumi vignette:

  10 : echo keep.source term verbatim (label = sessioninfo, methylumi450k.Rnw:136)
  Error in driver$finish(drobj) :
    the output file 'methylumi450k.tex' has disappeared
  Calls: <Anonymous> -> do.call -> <Anonymous> -> <Anonymous>
  Execution halted

This is bizarre because 1) the file is still there, and 2) all the heavy lifting is done. sessionInfo(), etc. is included properly and the vignette concludes with \end{document}, but nothing I do seems to resolve this driver error. Any suggestions would be most appreciated.

Probably has to do with calling setwd() in the vignette? Maybe you need an on.exit() that restores the original directory. My guess is that you changed directory and then R can't see the tex file because it's in a different directory. See http://stackoverflow.com/questions/12162092/r-sweave-output-error

tools::buildVignettes(), which is used by 'R CMD build', tries to protect against this by always resetting the working directory after weaving and tangling a vignette, cf. http://svn.r-project.org/R/trunk/src/library/tools/R/Vignettes.R.

tools::buildVignette() [no plural 's'], which is used by 'R CMD Sweave' (which is what Tim uses), should also do this, but looking at the code, this may only work properly if argument 'dir' is an absolute path, which may not be the case (not sure). I've just submitted a bug report PR#15530 [https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=15530] with a patch on this.

There may also be related issues in the utils::Sweave drivers (looks like code running during garbage collection). I'll let someone else look into that.

But undoing the setwd() calls in the vignette should solve this, iff that's what is behind this in the first place.

/Henrik

Dan

Thanks,

--t

He that would live in peace and at ease,
Must not speak all he knows, nor judge all he sees.
Benjamin Franklin, Poor Richard's Almanack
http://archive.org/details/poorrichardsalma00franrich
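The on.exit() advice above can be sketched as a small wrapper. run_in_dir() is a hypothetical helper, not part of Sweave or tools. Note that on.exit() only takes effect inside a function, so in a vignette chunk the directory change has to live in a helper like this (or be undone explicitly with a second setwd()).

```r
# Hypothetical helper: evaluate 'code' with 'dir' as the working directory
# and restore the previous directory afterwards, even if 'code' fails.
run_in_dir <- function(dir, code) {
  old <- setwd(dir)                # setwd() returns the previous directory
  on.exit(setwd(old), add = TRUE)  # restored when the function exits
  force(code)                      # 'code' is lazy: evaluated only here
}

before <- getwd()

# Inside the call the working directory is tempdir() ...
inside <- run_in_dir(tempdir(), getwd())

# ... and afterwards the session is back where it started, so a weave
# driver looking for its output file relative to 'before' still finds it.
after <- getwd()
```

The same pattern protects against errors: if the code inside the helper stops, the directory is still restored.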
Re: [Bioc-devel] BiocParallel: Best standards for passing locally assigned variables/functions, e.g. a bpExport()?
On 11/4/13, 11:05 AM, Gabriel Becker wrote:

As a side note, I'm not sure that existence of a symbol is sufficient (it certainly is necessary). What about situations where the symbol exists but is stale compared to the value in the parent? Are we sure that can never happen?

I think this is a different issue. We want to detect when a function depends on variables outside that function in the user's workspace, or variables defined in a package that the user has loaded. I think we can assume that R child processes will be of the same version with the same set of installed packages, so package-defined variables will not have different values in child processes.

For user variables, I think the goal should be to prevent (or at least highly discourage) dependencies on them entirely, so I don't think it matters what their value may be in the child. I realize this is somewhat counter to the question that started this thread, which was about exporting variables to the children, but I think it is the most straightforward approach.

As I believe someone noted earlier in the thread, Henrik's original problem of a recursive function is properly solved by using the Recall function.

-Ryan
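For reference, here is a minimal sketch of the Recall() variant of the recursive example from the start of the thread. It runs in a single R session and does not use BiocParallel; removing the name 'fib' stands in for a worker process where that name was never defined.

```r
# Same recursion as the original example, but the function refers to
# itself through Recall() instead of through the global name 'fib'.
fib <- function(n = 0) {
  if (n < 0) stop("Invalid 'n': ", n)
  if (n == 0 || n == 1) return(1)
  Recall(n - 2) + Recall(n - 1)
}

# Simulate a worker that received the function object but has no binding
# called 'fib': keep the object under another name and delete the original.
worker_fun <- fib
rm(fib)

values <- vapply(0:9, worker_fun, numeric(1))
```

Because the function no longer has a free variable pointing at itself, it can be serialized and sent to a separate process without exporting anything else.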