On Sun, Mar 4, 2018 at 10:18 AM, Paul Gilbert <pgilbert...@gmail.com> wrote: > On Mon, Feb 26, 2018 at 3:25 PM, Gary Black <gwblack...@sbcglobal.net> > wrote: > > (Sorry to be a bit slow responding.) > > You have not supplied a complete example, which would be good in this case > because what you are suggesting could be a serious bug in R or a package. > Serious journals require reproducibility these days. For example, JSS is > very clear on this point. > > To your question >> My question simply is: should the location of the set.seed command >> matter, >> provided that it is applied before any commands which involve randomness >> (such as partitioning)? > > the answer is no, it should not matter. But the proviso is important. > > You can determine where things are messing up using something like > > set.seed(654321) > zk <- RNGkind() # [1] "Mersenne-Twister" "Inversion" > zk > z <- runif(2) > z > set.seed(654321) > > # install.packages(c('caret', 'ggplot2', 'e1071')) > library(caret) > all(runif(2) == z) # should be true but it is not always > > set.seed(654321) > library(ggplot2) > all(runif(2) == z) # should be true > > set.seed(654321) > library(e1071) > all(runif(2) == z) # should be true > > all(RNGkind() == zk) # should be true > > On my computer package caret seems to sometimes, but not always, do > something that advances or changes the RNG. So you will need to set the seed > after that package is loaded if you want reproducibility. > > As Bill Dunlap points out, parallel can introduce much more complicated > issues. If you are in fact using parallel then we really need a new thread > with a better subject line, and the discussion will get much messier. > > The short answer is that, yes you should be able to get reproducible results > with parallel computing. If you cannot then you are almost certainly doing > something wrong. To publish you really must have reproducible results. > > In the example that Bill gave, I think the problem is that set.seed() only > resets the seed in the main thread, the nodes continue to operate with > unreset RNG. To demonstrate this to yourself you can do > > library(parallel) > cl <- parallel::makeCluster(3) > parallel::clusterCall(cl, function()set.seed(100)) > parallel::clusterCall(cl, function()RNGkind()) > parallel::clusterCall(cl, function()runif(2)) # similar result from all > nodes > # [1] 0.3077661 0.2576725 > > However, do *NOT* do that in real work. You will be getting the same RNG > stream from each node. If you are using random numbers and parallel you need > to read a lot more, and probably consider a variant of the "L'Ecuyer" > generator or something designed for parallel computing. > > One special point I will mention because it does not seem to be widely > appreciated: the number of nodes affects the random stream, so recording the > number of compute nodes along with the RNG and seed information is important > for reproducible results. This has the unfortunate consequence that an > experiment cannot be reproduced on a larger cluster. (If anyone knows > differently I would very much like to hear.)
[Disclaimer: I'm the author] future.apply::future_lapply(X, ..., future.seed) etc. can produce identical RNG results regardless of how 'X' is chunked up. For example, library(future.apply) task <- function(i) { c(i = i, random = sample.int(10, size = 1), pid = Sys.getpid()) } y <- list() plan(multiprocess, workers = 1L) y[[1]] <- future_sapply(1:10, FUN = task, future.seed = 42) plan(multiprocess, workers = 2L) y[[2]] <- future_sapply(1:10, FUN = task, future.seed = 42) plan(multiprocess, workers = 3L) y[[3]] <- future_sapply(1:10, FUN = task, future.seed = 42) gives the exact same random output: > y [[1]] [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] i 1 2 3 4 5 6 7 8 9 10 random 5 10 1 8 7 9 3 5 10 4 pid 31933 31933 31933 31933 31933 31933 31933 31933 31933 31933 [[2]] [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] i 1 2 3 4 5 6 7 8 9 10 random 5 10 1 8 7 9 3 5 10 4 pid 32141 32141 32141 32141 32141 32142 32142 32142 32142 32142 [[3]] [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] i 1 2 3 4 5 6 7 8 9 10 random 5 10 1 8 7 9 3 5 10 4 pid 32199 32199 32199 32200 32200 32200 32200 32201 32201 32201 To base the RNG on the current RNG seed (== .GlobalEnv$.Random.seed), one can use 'future.seed = TRUE'. For performance reasons, I choose the default to be 'future.seed = FALSE', because there can be a substantial overhead in setting up reproducible L'Ecuyer subRNG-streams for all elements in 'X'. I think the snowFT package by Sevcikova & Rossini also provides this mechanism; Hana Sevcikova is also behind the rlecuyer package. Hope this helps /Henrik > > Paul Gilbert > > > >> Hi all, >> >> For some odd reason when running naïve bayes, k-NN, etc., I get slightly >> different results (e.g., error rates, classification probabilities) from >> run >> to run even though I am using the same random seed. >> >> Nothing else (input-wise) is changing, but my results are somewhat >> different >> from run to run. The only randomness should be in the partitioning, and I >> have set the seed before this point. >> >> My question simply is: should the location of the set.seed command >> matter, >> provided that it is applied before any commands which involve randomness >> (such as partitioning)? >> >> If you need to see the code, it is below: >> >> Thank you, >> Gary >> >> >> A. Separate the original (in-sample) data from the new >> (out-of-sample) >> data. Set a random seed. >> >>> InvestTech <- as.data.frame(InvestTechRevised) >>> outOfSample <- InvestTech[5001:nrow(InvestTech), ] >>> InvestTech <- InvestTech[1:5000, ] >>> set.seed(654321) >> >> B. Install and load the caret, ggplot2 and e1071 packages. >> >>> install.packages(“caret”) >>> install.packages(“ggplot2”) >>> install.packages(“e1071”) >>> library(caret) >>> library(ggplot2) >>> library(e1071) >> >> C. Bin the predictor variables with approximately equal counts using >> the cut_number function from the ggplot2 package. We will use 20 bins. >> >>> InvestTech[, 1] <- cut_number(InvestTech[, 1], n = 20) >>> InvestTech[, 2] <- cut_number(InvestTech[, 2], n = 20) >>> outOfSample[, 1] <- cut_number(outOfSample[, 1], n = 20) >>> outOfSample[, 2] <- cut_number(outOfSample[, 2], n = 20) >> >> D. Partition the original (in-sample) data into 60% training and 40% >> validation sets. >> >>> n <- nrow(InvestTech) >>> train <- sample(1:n, size = 0.6 * n, replace = FALSE) >>> InvestTechTrain <- InvestTech[train, ] >>> InvestTechVal <- InvestTech[-train, ] >> >> E. Use the naiveBayes function in the e1071 package to fit the model. >> >>> model <- naiveBayes(`Purchase (1=yes, 0=no)` ~ ., data = InvestTechTrain) >>> prob <- predict(model, newdata = InvestTechVal, type = “raw”) >>> pred <- ifelse(prob[, 2] >= 0.3, 1, 0) >> >> F. Use the confusionMatrix function in the caret package to output >> the >> confusion matrix. >> >>> confMtr <- confusionMatrix(pred,unlist(InvestTechVal[, 3]),mode = >> “everything”, positive = “1”) >>> accuracy <- confMtr$overall[1] >>> valError <- 1 – accuracy >>> confMtr >> >> G. Classify the 18 new (out-of-sample) readers using the following >> code. >>> prob <- predict(model, newdata = outOfSample, type = “raw”) >>> pred <- ifelse(prob[, 2] >= 0.3, 1, 0) >>> cbind(pred, prob, outOfSample[, -3]) >> >> >> > > > If your computations involve the parallel package then set.seed(seed) > may not produce repeatable results. E.g., > >> cl <- parallel::makeCluster(3) # Create cluster with 3 nodes on local > host >> set.seed(100); runif(2) > [1] 0.3077661 0.2576725 >> parallel::parSapply(cl, 101:103, function(i)runif(2, i, i+1)) > [,1] [,2] [,3] > [1,] 101.7779 102.5308 103.3459 > [2,] 101.8128 102.6114 103.9102 >> >> set.seed(100); runif(2) > [1] 0.3077661 0.2576725 >> parallel::parSapply(cl, 101:103, function(i)runif(2, i, i+1)) > [,1] [,2] [,3] > [1,] 101.1628 102.9643 103.2684 > [2,] 101.9205 102.6937 103.7907 > > > Bill Dunlap > TIBCO Software > wdunlap tibco.com > > ______________________________________________ > R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. ______________________________________________ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.