Actually, the check that I proposed is only supposed to check for usage of user-defined variables, not variables from packages. Truthfully, though, I guess I'm not the right person to work on this, since in practice I use forked processes for the vast majority of my inside-R parallelization, so I never have to worry about things being undefined in the forked subprocess. Therefore I can't really dogfood any of the stuff that might be implemented as a result of this thread.
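
To make the distinction concrete, here is a minimal sketch of such a check. It is not the code from the gist: the helper name findUserGlobals and the use of a stand-in "workspace" environment are assumptions made for illustration. It relies only on codetools::findGlobals, so it inherits that function's limitations.

```r
library(codetools)

# Hypothetical helper (not the gist's code): return the free variables of
# `fun` that are defined directly in `env`, by default the user's workspace.
# Names that only resolve through packages are ignored.
findUserGlobals <- function(fun, env = globalenv()) {
    globals <- findGlobals(fun, merge = TRUE)
    Filter(function(name) exists(name, envir = env, inherits = FALSE),
           globals)
}

# Example: `offset` lives in the stand-in workspace, `rnorm` comes from stats.
workspace <- new.env()
assign("offset", 10, envir = workspace)
worker <- function(n) rnorm(n) + offset
userGlobals <- findUserGlobals(worker, env = workspace)
```

Only the names in userGlobals would need to be exported to a non-forked worker; everything else is expected to come from a package.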

-Ryan

On Mon Nov  4 03:48:23 2013, Michael Lawrence wrote:
So what is the best practice for ensuring that something is actually
visible to the worker? If the worker needs functionality from a
package, should the namespace be explicitly referenced via ::?  Lazy
users might want to include library() calls in the worker function.
This proposed check will then throw an exception. Probably a good
thing, but is there a way for a user to declare imported namespaces?
I know that BatchJobs allows for passing a list of packages to be
loaded via library() on the worker. That is leveraging the search path
to make sure everything is visible and is a reasonable compromise (::
is always an option). We could essentially reimplement the search path
if we wanted isolation, but the worker is already isolated. Anyway,
somehow those types of declarations should be taken into account.
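
As a small illustration of the two styles discussed above (a sketch only; the function names are made up, and 'tools' is used just because it ships with R), an explicit :: reference and a library() call inside the worker give the same result, but only the second depends on the worker's search path:

```r
# Explicit namespace reference: works even if 'tools' is not attached.
ext_explicit <- function(path) tools::file_ext(path)

# Lazy variant: attaches the package on the worker, then relies on the
# search path to resolve file_ext().
ext_attached <- function(path) {
    library(tools)
    file_ext(path)
}

paths <- c("reads.bam", "counts.csv")
res_explicit <- vapply(paths, ext_explicit, character(1), USE.NAMES = FALSE)
res_attached <- vapply(paths, ext_attached, character(1), USE.NAMES = FALSE)
```

A static check would see file_ext as an unresolved global in the second variant unless it is told that 'tools' will be attached, which is why such declarations need to be taken into account.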

Moving back to the general discussion, for complex operations, it's
easiest to have the worker in a package. In that case, the worker will
likely rely on other functions, and the cleanest way to get those
functions to the worker is to have them installed as a package. At
least with BatchJobs, when the worker is inside a package namespace,
that namespace is automatically loaded (but not attached), so all
functions are automatically visible, without any extra work by me.
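
The namespace behaviour can be seen with base R alone. In this sketch (variable names are illustrative), stats::sd stands in for a worker that lives in a package: its enclosing environment is the package namespace, so the helper it calls is found there regardless of what is attached.

```r
# sd() is defined inside the 'stats' namespace and calls var(), which it
# finds through that namespace rather than through the search path.
ns <- environment(stats::sd)
worker_in_namespace <- isNamespace(ns)
helper_visible <- exists("var", envir = ns, inherits = FALSE)

# Loading (not attaching) a namespace is enough to call into it with `::`.
loadNamespace("tools")
tools_loaded <- "tools" %in% loadedNamespaces()
```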

Michael


On Sun, Nov 3, 2013 at 10:46 PM, Ryan <r...@thompsonclan.org> wrote:

    Ok, here is my attempt at a function to get the list of
    user-defined free variables that a function refers to:

    https://gist.github.com/DarwinAwardWinner/7298557

    It uses codetools, so it is subject to the limitations of that
    package, but for simple examples, it successfully detects when a
    function refers to something in the global env.


    On Sun Nov  3 21:14:29 2013, Gabriel Becker wrote:

        Ryan (et al),

        FYI:

        > f
        function() {
        x = rnorm(x)
        x
        }
        > findGlobals(f)
        [1] "="     "{"     "rnorm"

        "x" should be in the list of globals but it isn't.

        ~G

        > sessionInfo()
        R version 3.0.2 (2013-09-25)
        Platform: x86_64-pc-linux-gnu (64-bit)

        locale:
         [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
         [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
         [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
         [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
         [9] LC_ADDRESS=C               LC_TELEPHONE=C
        [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

        attached base packages:
        [1] stats     graphics  grDevices utils     datasets  methods   base

        other attached packages:
        [1] codetools_0.2-8



        On Sun, Nov 3, 2013 at 5:37 PM, Ryan <r...@thompsonclan.org> wrote:

            Looking at the codetools package, I think "findGlobals" is
            basically exactly what we want here, right? As you say, there
            are necessarily limitations due to R being a dynamic language,
            but the goal is to catch common errors, not stop people from
            tricking the check.

            I think I'll try to code something up soon.

            -Ryan


            On 11/3/13, 5:10 PM, Gabriel Becker wrote:

                Henrik,

                See https://github.com/duncantl/CodeDepends (as used by
                https://github.com/gmbecker/RCacheSuite). It will identify
                necessarily defined symbols (input variables) for code that
                is not doing certain tricks (e.g. get(), mixing data.frame
                columns and global variables in formulas, etc.).

                Tierney's codetools package also does things along these
                lines, but there are some situations where it has trouble.
                I can give more detail if desired.

                ~G


                On Sun, Nov 3, 2013 at 3:04 PM, Ryan <r...@thompsonclan.org> wrote:

                    Another potential easy step we can do is that if FUN is
                    a function in the user's workspace, we automatically
                    export that function under the same name in the children.
                    This would make recursive functions just work, but it
                    might be a bit too magical.


                    On 11/3/13, 2:38 PM, Ryan wrote:

                        Here's an easy thing we can add to BiocParallel in
                        the short term. The following code defines a wrapper
                        function "withBPExtraErrorText" that simply appends
                        an additional message to the end of any error that
                        looks like it is about a missing variable. We could
                        wrap every evaluation in a similar tryCatch to at
                        least provide a more informative error message when
                        a subprocess has a missing variable.

                        -Ryan

                        withBPExtraErrorText <- function(expr) {
                           tryCatch({
                               expr
                           }, simpleError = function(err) {
                               if (grepl("^object '(.*)' not found$",
                                         err$message, perl=TRUE)) {
                                   ## It is an error due to a variable not found.
                                   err$message <- paste0(err$message,
                                       ". Maybe you forgot to export this variable ",
                                       "from the main R session using \"bpexport\"?")
                               }
                               stop(err)
                           })
                        }

                        x <- 5

                        ## Succeeds
                        withBPExtraErrorText(x)

                        ## Fails with more informative error message
                        withBPExtraErrorText(y)



                        On Sun Nov  3 14:01:48 2013, Henrik Bengtsson wrote:

                            On Sun, Nov 3, 2013 at 1:29 PM, Michael Lawrence
                            <lawrence.mich...@gene.com> wrote:

                                An analog to clusterExport is a good idea. To
                                make it even easier, we could have a dynamic
                                environment based on object tables that would
                                catch missing symbols and download them from
                                the parent thread. But maybe there's some
                                benefit to being explicit?


                            A first step to fully automate this would be to
                            provide some (opt in/out) mechanism for code
                            inspection and warn about non-defined objects
                            (cf. 'R CMD check').  That is of course major
                            work, but will certainly spare the community/users
                            1000's of hours in troubleshooting and the mailing
                            lists from "why doesn't my parallel code work"
                            messages.  Such protection may be better suited
                            for the 'parallel' package though.  Unfortunately,
                            it's beyond my skills/time to pull such a thing
                            together.

                            /Henrik


                                Michael


                                On Sun, Nov 3, 2013 at 12:39 PM, Henrik Bengtsson
                                <h...@biostat.ucsf.edu> wrote:


                                    Hi,

                                    in BiocParallel, are there suggested (or planned)
                                    best practices for making *locally* assigned
                                    variables (e.g. functions) available to the
                                    applied function when it runs in a separate R
                                    process (which will be the most common use
                                    case)?  I understand that local variables should
                                    be avoided and that it's preferred to put as much
                                    as possible in packages, but that's not always
                                    possible or very convenient.

                                    EXAMPLE:

                                    library('BiocParallel')
                                    library('BatchJobs')

                                    # Here I pick a recursive function to make the
                                    # problem a bit harder, i.e. the function needs
                                    # to call itself ("itself" = see below)
                                    fib <- function(n=0) {
                                       if (n < 0) stop("Invalid 'n': ", n)
                                       if (n == 0 || n == 1) return(1)
                                       fib(n-2) + fib(n-1)
                                    }

                                    # Executing in the current R session
                                    cluster.functions <- makeClusterFunctionsInteractive()
                                    bpParams <- BatchJobsParam(cluster.functions=cluster.functions)
                                    register(bpParams)
                                    values <- bplapply(0:9, FUN=fib)
                                    ## SubmitJobs |++++++++++++++++++++++++++++++++++| 100% (00:00:00)
                                    ## Waiting [S:0 R:0 D:10 E:0] |+++++++++++++++++++| 100% (00:00:00)


                                    # Executing in a separate R process, where
                                    # fib() is not defined
                                    # (not specific to BiocParallel)
                                    cluster.functions <- makeClusterFunctionsLocal()
                                    bpParams <- BatchJobsParam(cluster.functions=cluster.functions)
                                    register(bpParams)
                                    values <- bplapply(0:9, FUN=fib)
                                    ## SubmitJobs |++++++++++++++++++++++++++++++++++| 100% (00:00:00)
                                    ## Waiting [S:0 R:0 D:10 E:0] |+++++++++++++++++++| 100% (00:00:00)
                                    Error in LastError$store(results = results, is.error = !ok,
                                    throw.error = TRUE) :
                                       Errors occurred during execution. First error message:
                                    Error in FUN(...): could not find function "fib"
                                    [...]


                                    # The following illustrates that the solution
                                    # is not always straightforward.
                                    # (not specific to BiocParallel; must have
                                    # been discussed previously)
                                    values <- bplapply(0:9, FUN=function(n, fib) {
                                       fib(n)
                                    }, fib=fib)
                                    Error in LastError$store(results = results, is.error = !ok,
                                    throw.error = TRUE) :
                                       Errors occurred during execution. First error message:
                                    Error in fib(n): could not find function "fib"
                                    [...]

                                    # Workaround; make fib() aware of itself
                                    # (this is something the user needs to do, and
                                    #  would be very hard for BiocParallel et al. to
                                    #  automate.  BTW, should all recursive functions
                                    #  be implemented this way?).
                                    fib <- function(n=0) {
                                       if (n < 0) stop("Invalid 'n': ", n)
                                       if (n == 0 || n == 1) return(1)
                                       fib <- sys.function() # Make function aware of itself
                                       fib(n-2) + fib(n-1)
                                    }
                                    values <- bplapply(0:9, FUN=function(n, fib) {
                                       fib(n)
                                    }, fib=fib)


                                    WISHLIST:
                                    Considering the above recursive issue solved,
                                    a slightly more explicit and standardized
                                    solution is then:

                                    values <- bplapply(0:9, FUN=function(n, BPGLOBALS=NULL) {
                                       for (name in names(BPGLOBALS))
                                          assign(name, BPGLOBALS[[name]])
                                       fib(n)
                                    }, BPGLOBALS=list(fib=fib))

                                    Could the above be generalized into something
                                    as neat as:

                                    bpExport("fib")
                                    values <- bplapply(0:9, FUN=function(n) {
                                       BiocParallel::bpImport("fib")
                                       fib(n)
                                    })

                                    or ideally just (analogously to
                                    parallel::clusterExport()):

                                    bpExport("fib")
                                    values <- bplapply(0:9, FUN=fib)

                                    /Henrik


                                    _______________________________________________
                                    Bioc-devel@r-project.org mailing list
                                    https://stat.ethz.ch/mailman/listinfo/bioc-devel









                --
                Gabriel Becker
                Graduate Student
                Statistics Department
                University of California, Davis





        --
        Gabriel Becker
        Graduate Student
        Statistics Department
        University of California, Davis



_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
