So what is the best practice for ensuring that something is actually
visible to the worker? If the worker needs functionality from a
package, should the namespace be explicitly referenced via ::? Lazy
users might want to include library() calls in the worker function.
This proposed check will then throw an exception. Probably a good
thing, but is there a way for a user to declare imported namespaces?
I know that BatchJobs allows for passing a list of packages to be
loaded via library() on the worker. That is leveraging the search path
to make sure everything is visible and is a reasonable compromise (::
is always an option). We could essentially reimplement the search path
if we wanted isolation, but the worker is already isolated. Anyway,
somehow those types of declarations should be taken into account.
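A minimal sketch of the two options above, using only base R. The helper with_packages() is hypothetical (it is not part of BiocParallel or BatchJobs); it only illustrates how a declared package list could be honoured by attaching those packages before the worker body runs.

```r
# Option 1: qualify every non-base function with `::`, so the worker does
# not depend on what happens to be attached in the worker process.
worker_qualified <- function(x) {
  stats::median(utils::head(x, 5))
}

# Option 2 (hypothetical helper): declare the packages once and attach
# them on the worker before calling the real function.
with_packages <- function(FUN, packages = character()) {
  force(FUN)
  force(packages)
  function(...) {
    for (pkg in packages) {
      library(pkg, character.only = TRUE)
    }
    FUN(...)
  }
}

worker_declared <- with_packages(
  function(x) median(head(x, 5)),
  packages = c("stats", "utils")
)

x <- c(5, 1, 4, 2, 3, 100)
worker_qualified(x)
worker_declared(x)
```

Option 2 relies on the search path, as BatchJobs does; option 1 needs no declaration at all but is more verbose.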
Moving back to the general discussion, for complex operations, it's
easiest to have the worker in a package. In that case, the worker will
likely rely on other functions, and the cleanest way to get those
functions to the worker is to have them installed as a package. At
least with BatchJobs, when the worker is inside a package namespace,
that namespace is automatically loaded (but not attached), so all
functions are automatically visible, without any extra work by me.
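As a rough illustration of how a framework could tell whether a worker lives in a package namespace, here is a sketch built on base R's isNamespace() and getNamespaceName(). The helper name worker_namespace() is made up for this example, and this is not a description of what BatchJobs does internally.

```r
# Hypothetical helper: name of the package namespace a function lives in,
# or NULL when the function was defined elsewhere (e.g. in the workspace).
worker_namespace <- function(FUN) {
  env <- environment(FUN)  # NULL for primitives
  if (!is.null(env) && isNamespace(env)) {
    unname(getNamespaceName(env))
  } else {
    NULL
  }
}

worker_namespace(stats::median)       # a function from an installed package
worker_namespace(function(x) x + 1)   # an ad hoc function
```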
Michael
On Sun, Nov 3, 2013 at 10:46 PM, Ryan <r...@thompsonclan.org> wrote:
Ok, here is my attempt at a function to get the list of
user-defined free variables that a function refers to:
https://gist.github.com/DarwinAwardWinner/7298557
It uses codetools, so it is subject to the limitations of that
package, but for simple examples, it successfully detects when a
function refers to something in the global env.
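The general idea can be sketched as follows. This is not the code from the gist; find_user_globals() is a hypothetical name, and the approach (take the output of codetools::findGlobals() and keep only names defined directly in a given environment) is one possible reading of "user-defined free variables".

```r
library(codetools)

# Hypothetical helper: names that FUN refers to and that are defined
# directly in `env` (by default the user's workspace), ignoring anything
# that would be found in attached packages.
find_user_globals <- function(FUN, env = globalenv()) {
  candidates <- findGlobals(FUN, merge = TRUE)
  Filter(function(name) exists(name, envir = env, inherits = FALSE),
         candidates)
}

# Example with an explicit environment standing in for the workspace.
workspace <- new.env()
assign("offset", 10, envir = workspace)
f <- function(x) x + offset
find_user_globals(f, env = workspace)
```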
On Sun Nov 3 21:14:29 2013, Gabriel Becker wrote:
Ryan (et al),
FYI:
> f
function() {
x = rnorm(x)
x
}
> findGlobals(f)
[1] "=" "{" "rnorm"
"x" should be in the list of globals but it isn't.
~G
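As far as I understand codetools, the reason is that any name assigned anywhere in the function body is treated as local for the whole body, even if it is read before the assignment. The sketch below reproduces the case above and contrasts it with a variant g() (added here for illustration) that reads the outer value under a different name.

```r
library(codetools)

f <- function() {
  x = rnorm(x)   # the right-hand `x` is read before the local `x` exists
  x
}

# `x` is assigned in the body, so it is classified as local and not reported.
findGlobals(f)

# Reading the outer value under a different name makes the dependency visible.
g <- function() {
  y <- rnorm(x)
  y
}
findGlobals(g)
```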
> sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods
base
other attached packages:
[1] codetools_0.2-8
On Sun, Nov 3, 2013 at 5:37 PM, Ryan <r...@thompsonclan.org> wrote:
Looking at the codetools package, I think "findGlobals" is
basically exactly what we want here, right? As you say, there are
necessarily limitations due to R being a dynamic language, but the
goal is to catch common errors, not stop people from tricking the
check.
I think I'll try to code something up soon.
-Ryan
On 11/3/13, 5:10 PM, Gabriel Becker wrote:
Henrik,
See https://github.com/duncantl/CodeDepends (as used by
https://github.com/gmbecker/RCacheSuite). It will identify
necessarily defined symbols (input variables) for code that is
not doing certain tricks (e.g. get(), mixing data.frame columns and
global variables in formulas, etc.).
Tierney's codetools package also does things along these lines,
but there are some situations where it has trouble. I can give
more detail if desired.
~G
On Sun, Nov 3, 2013 at 3:04 PM, Ryan <r...@thompsonclan.org> wrote:
Another potential easy step we can take: if FUN is a function
in the user's workspace, we automatically export that
function under the same name in the children. This would make
recursive functions just work, but it might be a bit too
magical.
On 11/3/13, 2:38 PM, Ryan wrote:
Here's an easy thing we can add to BiocParallel in the
short term. The following code defines a wrapper function
"withBPExtraErrorText" that simply appends an additional
message to the end of any error that looks like it is
about a missing variable. We could wrap every evaluation
in a similar tryCatch to at least provide a more
informative error message when a subprocess has a missing
variable.
-Ryan
withBPExtraErrorText <- function(expr) {
  tryCatch({
    expr
  }, simpleError = function(err) {
    if (grepl("^object '(.*)' not found$", err$message, perl=TRUE)) {
      ## It is an error due to a variable not found.
      err$message <- paste0(err$message,
                            ". Maybe you forgot to export this variable ",
                            "from the main R session using \"bpexport\"?")
    }
    stop(err)
  })
}
x <- 5
## Succeeds
withBPExtraErrorText(x)
## Fails with more informative error message
withBPExtraErrorText(y)
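The pattern above only matches missing variables, whereas the failure reported at the start of this thread was 'could not find function "fib"'. A possible variant that covers both messages is sketched below. The function name and hint text are made up for this example, and the regular expression assumes R's English error messages.

```r
# Hypothetical variant: add a hint to "object not found" and
# "could not find function" errors; re-signal everything else unchanged.
with_missing_symbol_hint <- function(expr) {
  tryCatch(expr, error = function(err) {
    msg <- conditionMessage(err)
    pattern <- "^(object '.*' not found|could not find function \".*\")$"
    if (grepl(pattern, msg)) {
      stop(paste0(msg, ". Was this symbol exported to the worker?"),
           call. = FALSE)
    }
    stop(err)
  })
}

## Fails with the extra hint appended (the function does not exist)
try(with_missing_symbol_hint(no_such_function_xyz()))
```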
On Sun Nov 3 14:01:48 2013, Henrik Bengtsson wrote:
On Sun, Nov 3, 2013 at 1:29 PM, Michael Lawrence
<lawrence.mich...@gene.com> wrote:
An analog to clusterExport is a good idea. To make it even easier,
we could have a dynamic environment based on object tables that
would catch missing symbols and download them from the parent
thread. But maybe there's some benefit to being explicit?
A first step to fully automate this would be to provide some (opt
in/out) mechanism for code inspection and warn about non-defined
objects (cf. 'R CMD check'). That is of course major work, but will
certainly spare the community/users thousands of hours in
troubleshooting and the mailing lists from "why doesn't my parallel
code work" messages. Such protection may be better suited for the
'parallel' package though. Unfortunately, it's beyond my skills/time
to pull such a thing together.
/Henrik
Michael
On Sun, Nov 3, 2013 at 12:39 PM, Henrik Bengtsson
<h...@biostat.ucsf.edu> wrote:
Hi,
in BiocParallel, is there a suggested (or planned) best practice for
making *locally* assigned variables (e.g. functions) available to the
applied function when it runs in a separate R process (which will be
the most common use case)? I understand that local variables
should be avoided and that it's preferred to put as much as possible in
packages, but that's not always possible or very convenient.
EXAMPLE:
library('BiocParallel')
library('BatchJobs')
# Here I pick a recursive function to make the problem a bit harder, i.e.
# the function needs to call itself ("itself" = see below)
fib <- function(n=0) {
if (n < 0) stop("Invalid 'n': ", n)
if (n == 0 || n == 1) return(1)
fib(n-2) + fib(n-1)
}
# Executing in the current R session
cluster.functions <- makeClusterFunctionsInteractive()
bpParams <- BatchJobsParam(cluster.functions=cluster.functions)
register(bpParams)
values <- bplapply(0:9, FUN=fib)
## SubmitJobs |++++++++++++++++++++++++++++++++++| 100% (00:00:00)
## Waiting [S:0 R:0 D:10 E:0] |+++++++++++++++++++| 100% (00:00:00)
# Executing in a separate R process, where fib() is not defined
# (not specific to BiocParallel)
cluster.functions <- makeClusterFunctionsLocal()
bpParams <- BatchJobsParam(cluster.functions=cluster.functions)
register(bpParams)
values <- bplapply(0:9, FUN=fib)
## SubmitJobs |++++++++++++++++++++++++++++++++++| 100% (00:00:00)
## Waiting [S:0 R:0 D:10 E:0] |+++++++++++++++++++| 100% (00:00:00)
Error in LastError$store(results = results, is.error = !ok, throw.error = TRUE) :
Errors occurred during execution. First error message:
Error in FUN(...): could not find function "fib"
[...]
# The following illustrates that the solution is not always
# straightforward.
# (not specific to BiocParallel; must have been discussed previously)
values <- bplapply(0:9, FUN=function(n, fib) {
  fib(n)
}, fib=fib)
Error in LastError$store(results = results, is.error = !ok, throw.error = TRUE) :
Errors occurred during execution. First error message:
Error in fib(n): could not find function "fib"
[...]
# Workaround; make fib() aware of itself
# (this is something the user needs to do, and would be very
# hard for BiocParallel et al. to automate. BTW, should all
# recursive functions be implemented this way?).
fib <- function(n=0) {
  if (n < 0) stop("Invalid 'n': ", n)
  if (n == 0 || n == 1) return(1)
  fib <- sys.function() # Make function aware of itself
  fib(n-2) + fib(n-1)
}
values <- bplapply(0:9, FUN=function(n, fib) {
  fib(n)
}, fib=fib)
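On the side question of how to write recursive functions that do not depend on their own name: base R also provides Recall(), which calls the currently executing function whatever name it is bound to. A minimal sketch, independent of BiocParallel, where the copy-and-remove step only stands in for "the name fib does not exist on the worker":

```r
fib <- function(n = 0) {
  if (n < 0) stop("Invalid 'n': ", n)
  if (n == 0 || n == 1) return(1)
  Recall(n - 2) + Recall(n - 1)   # no reference to the name `fib`
}

# Bind the function to another name and remove the original binding.
fib_copy <- fib
rm(fib)
fib_copy(9)
```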
WISHLIST:
Considering the above recursive issue solved, a slightly more explicit
and standardized solution is then:
values <- bplapply(0:9, FUN=function(n, BPGLOBALS=NULL) {
  for (name in names(BPGLOBALS)) assign(name, BPGLOBALS[[name]])
  fib(n)
}, BPGLOBALS=list(fib=fib))
Could the above be generalized into something as neat as:
bpExport("fib")
values <- bplapply(0:9, FUN=function(n) {
  BiocParallel::bpImport("fib")
  fib(n)
})
or ideally just (analogously to parallel::clusterExport()):
bpExport("fib")
values <- bplapply(0:9, FUN=fib)
/Henrik
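For comparison, the parallel::clusterExport() pattern that the wish list refers to looks roughly like this. All calls are from the base 'parallel' package; the two-worker cluster is an arbitrary choice for this sketch, and bpExport() remains a proposal in this thread rather than an existing function.

```r
library(parallel)

fib <- function(n = 0) {
  if (n < 0) stop("Invalid 'n': ", n)
  if (n == 0 || n == 1) return(1)
  fib(n - 2) + fib(n - 1)
}

cl <- makeCluster(2)                              # two local worker processes
clusterExport(cl, "fib", envir = environment())   # copy fib to each worker
values <- parLapply(cl, 0:9, fib)                 # recursion now resolves
stopCluster(cl)

unlist(values)
```

The export step is explicit: without it, the workers would have no binding for the name fib.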
_________________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
--
Gabriel Becker
Graduate Student
Statistics Department
University of California, Davis