Re: [R] Help with big data and parallel computing: 500,000 x 4 linear models
Don't run 500K separate models. Use the limma package to fit one model that learns the variance parameters jointly; it will run on your laptop. And don't use percent methylation as your Y variable; use logit(percent), i.e. the M-value rather than the Beta value.

-Aaron

On Mon, Aug 8, 2016 at 2:49 PM, Ellis, Alicia M wrote:
> I have a large dataset with ~500,000 columns and 1264 rows. Each column
> represents the percent methylation at a given location in the genome. I
> need to run 500,000 linear models, one for each of 4 predictors of
> interest, in the form of:
>
> Methylation.site1 ~ predictor1 + covariate1 + covariate2 + ... + covariate9
>
> ...and save only the p-value for the predictor.
>
> The original methylation data file had methylation sites as row labels and
> the individuals as columns, so I read the data in chunks and transposed it;
> I now have 5 csv files (chunks) with columns representing methylation sites
> and rows as individuals.
>
> I was able to get results for all of the regressions by running each chunk
> of methylation data separately on our supercomputer using the code below.
> However, I'm going to have to do this again for another project, and I
> would really like to accomplish two things to make the whole process more
> computationally efficient:
>
> 1) Work with data.tables instead of data.frames (reading and manipulating
> will be much easier and faster).
>
> 2) Do the work in parallel, using say 12 cores at once, and have the
> program divide the work up among the cores rather than me having to split
> the data and run 5 separate jobs on the supercomputer.
>
> I have some basic knowledge of the data.table package, but I wasn't able
> to modify the foreach code below to get it to work, and the code using
> data.frames didn't seem to be using all 12 cores that I created in the
> cluster.
>
> Can anyone suggest some modifications to the foreach code below that will
> allow me to do this in parallel with data.tables and not have to do it in
> chunks?
> # Set up cluster
> clus <- makeCluster(12, type = "SOCK")
> registerDoSNOW(clus)
> getDoParWorkers()
> getDoParName()
>
> ### Following code needs to be modified to run the full
> ### dataset (batch1-batch5) in parallel.
> ### Currently I read in the following chunks, and run each predictor
> ### separately for each chunk of data.
>
> ### Methylation data in batches
> ### Each batch has about 100,000 columns and 1264 rows;
> ### want to alter this to: batch1 <- fread(file = )
> batch1 <- read.csv("/home/alicia.m.ellis/batch1.csv")
> batch2 <- read.csv(file = "/home/alicia.m.ellis/batch2.csv")
> batch3 <- read.csv(file = "/home/alicia.m.ellis/batch3.csv")
> batch4 <- read.csv(file = "/home/alicia.m.ellis/batch4.csv")
> batch5 <- read.csv(file = "/home/alicia.m.ellis/batch5.csv")
>
> predictors  ## this is a data.frame with 4 columns and 1264 rows
> covariates  ## this is a data.frame with 9 columns and 1264 rows
>
> fits <- as.data.table(batch1)[, list(MyFits = lapply(1:ncol(batch1),
>   function(x) summary(lm(batch1[, x] ~ predictors[, 1] +
>                            covariates[, 1] + covariates[, 2] +
>                            covariates[, 3] + covariates[, 4] +
>                            covariates[, 5] + covariates[, 6] +
>                            covariates[, 7] + covariates[, 8] +
>                            covariates[, 9]))$coefficients[2, 4]))]
>
> ## This is what I was trying, but wasn't having much luck.
> ## I'm having trouble getting the data merged as a single data.frame, and
> ## the code below doesn't seem to be dividing the work among the 12 cores
> ## in the cluster.
>
> all.fits <- foreach(j = 1:ncol(predictors), i = 1:ncol(meth1),
>                     .combine = 'rbind', .inorder = TRUE) %dopar% {
>   model <- lm(meth[, i] ~ predictors[, j] +
>                 covariates[, 1] + covariates[, 2] + covariates[, 3] +
>                 covariates[, 4] + covariates[, 5] + covariates[, 6] +
>                 covariates[, 7] + covariates[, 8] + covariates[, 9])
>   summary(model)$coefficients[2, 4]
> }
>
> Alicia Ellis, Ph.D.
> Biostatistician
> Pathology & Laboratory Medicine
> Colchester Research Facility
> 360 South Park Drive, Room 209C
> Colchester, VT 05446
> 802-656-9840

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
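For what it's worth, base R can already avoid the per-column loop: lm() accepts a matrix response on the left-hand side and fits all columns with a single QR decomposition. A minimal sketch on simulated data (sizes shrunk for illustration; the limma route Aaron suggests, lmFit()/eBayes() on M-values, has the same shape but adds moderated variance estimates):

```r
set.seed(1)
n <- 100                             # individuals (1264 in the real data)
m <- 50                              # methylation sites (~500,000 in reality)
meth <- matrix(rnorm(n * m), n, m)   # stand-in for logit(percent) M-values
predictor <- rnorm(n)
covars <- matrix(rnorm(n * 9), n, 9)

# One call fits all m regressions via a single QR decomposition:
fit <- lm(meth ~ predictor + covars)

# summary() on an "mlm" fit returns one summary per response column;
# row 2, column 4 is the p-value for `predictor`, as in the original code.
pvals <- vapply(summary(fit), function(s) coef(s)[2, 4], numeric(1))
```

For 500,000 columns this typically takes minutes on one machine, with no cluster needed; limma's lmFit() would additionally shrink the per-site variance estimates.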
Re: [R] combine glmnet and coxph (and survfit) with strata()
I'm also curious how to use glmnet with survfit -- specifically, for use with interval regression (which, under the hood, is implemented using survfit). Can you show how you converted your Surv object formula to a design matrix for use with glmnet?

Thanks,
-Aaron

On Sun, Dec 8, 2013 at 12:45 AM, Jieyue Li jieyuel...@gmail.com wrote:
> Dear All,
>
> I want to generate a survival curve with a Cox model, but I want to
> estimate the coefficients using glmnet. However, I also want to include a
> strata() term in the model. Could anyone please tell me how to have this
> strata() effect in the model in glmnet? I tried converting a formula with
> strata() to a design matrix and feeding it to glmnet, but glmnet just
> treats the strata() term as one independent variable...
>
> I know that if there is no such strata(), I can estimate coefficients from
> glmnet and use ...init=selectedBeta, iter=0) in the coxph. Please advise
> me, or correct me if I'm wrong. Thank you very much!
>
> Best,
> Jieyue
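One way to read the strata problem: strata() is not a covariate, so it must not appear as a penalized column in the design matrix at all. A sketch on made-up data (`x1`, `x2`, `grp` are illustrative names, not from the post; the glmnet call itself is left as a comment since it needs the glmnet package):

```r
library(survival)

set.seed(2)
d <- data.frame(time   = rexp(40),
                status = rbinom(40, 1, 0.7),
                x1  = rnorm(40), x2 = rnorm(40),
                grp = factor(rep(c("a", "b"), each = 20)))

# Build the penalized design matrix WITHOUT the strata term, dropping the
# intercept column that model.matrix() adds:
X <- model.matrix(~ x1 + x2, data = d)[, -1]
y <- with(d, Surv(time, status))

# With the glmnet package one would then fit (sketch, not run here):
# fit <- glmnet::glmnet(X, y, family = "cox")

# Stratification is then handled on the coxph side, feeding the glmnet
# coefficients in as fixed values (c(0.1, -0.1) stands in for them here)
# with zero iterations, as Jieyue describes:
cfit <- coxph(Surv(time, status) ~ x1 + x2 + strata(grp), data = d,
              init = c(0.1, -0.1), control = coxph.control(iter.max = 0))
```

With iter.max = 0 the returned coefficients are exactly the supplied init values, so survfit() on cfit then gives stratified curves at the glmnet solution.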
Re: [R] Renaming variables
On Fri, Sep 20, 2013 at 10:10 AM, Preetam Pal lordpree...@gmail.com wrote:
> I have 25 variables in the data file (name: score), i.e. X1, X2, ..., X25.
> I don't want to use score$X1, score$X2 every time I use these variables.

attach(score)
plot(X1, X2) # etc. etc.

-Aaron
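A caveat on attach(): a later `X1 <- ...` in the workspace silently masks the attached column. with() scopes the shortcut to a single expression instead. A small sketch with made-up data standing in for the real file:

```r
score <- data.frame(X1 = rnorm(25), X2 = rnorm(25))  # stand-in for `score`

# with() evaluates its second argument inside the data frame, so columns
# are visible by name without attach()'s search-path side effects:
r <- with(score, cor(X1, X2))

# the same idiom works for plots:
# with(score, plot(X1, X2))
```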
Re: [R] XYZ data
For plotting purposes, I typically jitter() the x's and y's to see the otherwise overlapping data points.

-Aaron

On Wed, Jun 26, 2013 at 12:29 PM, Shane Carey careys...@gmail.com wrote:
> Nope, neither work. :-(

On Wed, Jun 26, 2013 at 5:16 PM, Clint Bowman cl...@ecy.wa.gov wrote:
> John,
>
> That still leaves a string of identical numbers in the vector.
>
> Shane,
>
> ?jitter
>
> perhaps jitter(X, 1, 0.0001)
>
> Clint
>
> Clint Bowman            INTERNET: cl...@ecy.wa.gov
> Air Quality Modeler     INTERNET: cl...@math.utah.edu
> Department of Ecology   VOICE: (360) 407-6815
> PO Box 47600            FAX: (360) 407-7534
> Olympia, WA 98504-7600
> USPS: PO Box 47600, Olympia, WA 98504-7600
> Parcels: 300 Desmond Drive, Lacey, WA 98503-1274

On Wed, 26 Jun 2013, John Kane wrote:
> mm <- 1:10
> nn <- mm + .001
>
> John Kane
> Kingston ON Canada

> -----Original Message-----
> From: careys...@gmail.com
> Sent: Wed, 26 Jun 2013 16:48:34 +0100
> To: r-help@r-project.org
> Subject: [R] XYZ data
>
> I have x, y, z data. The x, y fields don't change but Z does. How do I add
> a very small number onto the end of each x, y data point? For example:
>
> Original (X)  Original (Y)  Original (Z)
> 15            20            30
> 15            20            40
>
> New (X)  New (Y)  New (Z)
> 15.1     20.01    30
> 15.2     20.02    40
>
> Thanks
--
Shane
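Clint's jitter(X, 1, 0.0001) suggestion can be checked quickly: the third argument (`amount`) bounds the uniform noise, so tied values separate while moving by at most 1e-4. A sketch:

```r
set.seed(3)
x <- c(15, 15, 15, 20, 20)                   # identical values that overplot
xj <- jitter(x, factor = 1, amount = 1e-4)   # adds runif(n, -1e-4, 1e-4)

max(abs(xj - x))   # bounded by the `amount` argument
anyDuplicated(xj)  # 0 means the ties are broken
```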
Re: [R] Dependency-aware scripting tools for R
Shameless self-plug: we break out of R to do this. After many painful years developing and maintaining idiosyncratic Makefiles, we are now using Taverna to (visually) glue together UNIX commands (including R scripts) -- the benefit of which (over make and its brethren) is that you can actually *see* the dependencies and overall workflow (nesting workflows also makes it easier to manage complexity). See TavernaPBS:

http://cphg.virginia.edu/mackey/projects/sequencing-pipelines/tavernapbs/

While designed to automate job submission to a PBS queuing system, you can also use it to simply execute non-PBS jobs.

--
Aaron J. Mackey, PhD
Assistant Professor
Center for Public Health Genomics
University of Virginia
amac...@virginia.edu
http://www.cphg.virginia.edu/mackey

On Thu, Apr 19, 2012 at 3:27 PM, Sean Davis sdav...@mail.nih.gov wrote:
> There are numerous tools like scons, make, ruffus, ant, rake, etc. that
> can be used to build complex pipelines based on task dependencies. These
> tools are written in a variety of languages, but I have not seen such a
> thing for R. Is anyone aware of a package available? The goal is to be
> able to develop robust bioinformatic pipelines driven by scripts written
> in R.
>
> Thanks,
> Sean
Re: [R] HWEBayes, swapping the homozygotes genotype frequencies
Without really knowing this code, I can guess that it may be the triangular prior at work. Bayes factors are notorious for being sensitive to the prior. Presumably, the prior somehow prefers to see the rarer allele as BB, and not the AA homozygous genotype (this is a common assumption: that A is the reference, and thus the major, more frequent, allele).

-Aaron

On Sat, Oct 8, 2011 at 7:52 PM, stat999 yumik091...@gmail.com wrote:
> I evaluated the Bayes factor in the k=2 allele case with a triangular
> prior under the null, as in the example in the help file:
>
> HWETriangBF2(nvec = c(88, 10, 2))
> [1] 0.4580336
>
> When I swap the n11 entry and n22 entry of nvec, I receive a totally
> different Bayes factor:
>
> HWETriangBF2(nvec = c(2, 10, 88))
> [1] 5.710153
>
> In my understanding, defining the genotype frequency as n11 or n22 is
> arbitrary, so I was expecting the same value of the Bayes factor. This is
> the case for the conjugate Dirichlet prior:
>
> DirichNormHWE(nvec = c(88, 10, 2), c(1, 1)) /
>   DirichNormSat(nvec = c(88, 10, 2), c(1, 1, 1))
> [1] 1.542047
> DirichNormHWE(nvec = c(2, 10, 88), c(1, 1)) /
>   DirichNormSat(nvec = c(2, 10, 88), c(1, 1, 1))
> [1] 1.542047
>
> Could you explain why HWETriangBF2 is returning completely different
> values of the Bayes factor?
Re: [R] Hardy Weinberg
H-W only gives you the expected frequencies of the AA, AB, and BB genotypes (i.e. a 1x3 table):

minor <- runif(1, 0.05, 0.25)
major <- 1 - minor
AA <- minor^2
AB <- 2 * minor * major
BB <- major^2
df <- cbind(AA, AB, BB)

-Aaron

On Tue, Jun 21, 2011 at 9:30 PM, Jim Silverton jim.silver...@gmail.com wrote:
> Hello all,
> I am interested in simulating 10,000 2 x 3 tables for SNP data with the
> Hardy-Weinberg formulation. Is there a quick way to do this? I am assuming
> that the minor allele frequency is uniform on (0.05, 0.25).
>
> --
> Thanks,
> Jim.
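Extending Aaron's sketch, the expected frequencies can feed rmultinom() to draw actual genotype counts; 10,000 draws gives the simulated tables Jim asked about. The 2x3 shape here is an assumption (two rows of n individuals each, e.g. cases and controls, both drawn under H-W):

```r
set.seed(4)
sim_table <- function(n = 100) {
  minor <- runif(1, 0.05, 0.25)           # minor allele frequency
  p <- c(AA = minor^2,
         AB = 2 * minor * (1 - minor),
         BB = (1 - minor)^2)              # H-W genotype frequencies
  # one 2 x 3 table: two independent multinomial samples of n individuals
  t(rmultinom(2, size = n, prob = p))
}

tables <- replicate(10000, sim_table())   # a 2 x 3 x 10000 array
```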
Re: [R] Shrink file size of pdf graphics
You can try something like this, at the command line:

gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.5 -dPDFSETTINGS=/screen -dNOPAUSE -dQUIET -dBATCH -sOutputFile=output.pdf input.pdf

Evidently, the new compactPDF() function in R 2.13 does something very similar.

-Aaron

On Thu, May 19, 2011 at 11:30 AM, Duncan Murdoch murdoch.dun...@gmail.com wrote:
> On 19/05/2011 11:14 AM, Layman123 wrote:
>> Hi everyone,
>>
>> My data consists of a system of nearly 75000 roads, available as a
>> shapefile. When I plot the road system, by adding the individual roads
>> with 'lines', and store it as a pdf file with 'pdf', I get a file of size
>> 13 MB. This is way too large to add to my LaTeX document, because there
>> will be some more graphics of this type. Now I'm curious to learn whether
>> there is a possibility in R to shrink the file size of this graphic? I
>> merely need it in a resolution so that it looks smooth when printed out.
>> I don't know much about the storage of R graphics, but maybe there is a
>> way to change the way the file is stored, perhaps as a pixel image?
>
> There are several possibilities. You can use a bitmapped device (e.g.
> png()) to save the image; pdflatex can include those. You can compress the
> .pdf file using an external tool like pdftk (or do it internally in R
> 2.14.x, coming soon). There are probably others...
>
> Duncan Murdoch
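The compactPDF() route can stay entirely inside R. A sketch (note: tools::compactPDF() shells out to qpdf or Ghostscript, so on systems without either installed it is a no-op):

```r
# make a deliberately heavy PDF: many line segments, like a road network
pdf("roads.pdf")
plot(0, type = "n", xlim = c(0, 1), ylim = c(0, 1), xlab = "", ylab = "")
for (i in 1:2000) lines(runif(2), runif(2), col = grey(0.4))
dev.off()

before <- file.size("roads.pdf")

# Re-compress in place; gs_quality = "screen" mirrors -dPDFSETTINGS=/screen.
# Requires Ghostscript (or qpdf) on the PATH; otherwise nothing happens.
tools::compactPDF("roads.pdf", gs_quality = "screen")
after <- file.size("roads.pdf")
```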
Re: [R] Reporting odds ratios or risk ratios from GLM
OR <- exp(coef(GLM.2)[-1])
OR.ci <- exp(confint(GLM.2)[-1, ])

-Aaron

On Tue, Mar 15, 2011 at 1:25 PM, lafadnes ubu...@fadnes.net wrote:
> I am a new R user (am using it through the Rcmdr package) and have
> struggled to find out how to report OR and RR directly when running GLM
> models (not only reporting coefficients). Example of the syntax that I
> have used:
>
> GLM.2 <- glm(diarsev ~ treatmentarm + childage + breastfed,
>              family = binomial(logit), data = fieldtrials2)
> summary(GLM.2)
>
> This works well except that I manually have to calculate the OR based on
> the coefficients. Can I get these directly (with confidence intervals) by
> just amending the syntax?
>
> Will be grateful for advice!
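Aaron's two lines in runnable form, on simulated data (variable names below are made up to mimic the post; note that confint() on a glm profiles the likelihood via MASS, while confint.default(), used here to stay in base R, gives the plain Wald interval):

```r
set.seed(5)
n <- 200
treatmentarm <- rbinom(n, 1, 0.5)
childage <- runif(n, 0, 5)
p <- plogis(-0.5 + 0.8 * treatmentarm - 0.1 * childage)
diarsev <- rbinom(n, 1, p)

GLM.2 <- glm(diarsev ~ treatmentarm + childage, family = binomial(logit))

OR    <- exp(coef(GLM.2)[-1])               # odds ratios, intercept dropped
OR.ci <- exp(confint.default(GLM.2)[-1, ])  # Wald 95% CI on the OR scale
```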
Re: [R] Complex sampling?
What I think you need is something along the lines of:

matrix(c(sample(3:7), sample(3:7), sample(3:7), sample(3:7), ...), nrow = 2)

Now each column is one of your random pairs.

-Aaron

On Wed, Mar 9, 2011 at 1:01 PM, Hosack, Michael mhos...@state.pa.us wrote:

> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
> On Behalf Of Hosack, Michael
> Sent: Wednesday, March 09, 2011 7:34 AM
> To: r-help at R-project.org
> Subject: [R] Complex sampling?
>
> R users,
>
> I am trying to generate a randomized weekday survey schedule that ensures
> even coverage of weekdays in the sample, where the distribution of
> variable DOW is random with respect to WEEK. To accomplish this I need to
> randomly sample without replacement two weekdays per week for each of 27
> weeks (only 5 are shown). This seems simple enough, sampling without
> replacement. However, I need to sample from a sequence (3:7) that needs to
> be completely depleted and replenished until the final selection is made.
> Here is an example of what I want to do, beginning at WEEK 1. I would
> prefer to do this without using a loop, if possible.
>
> sample frame: [3,4,5,6,7] -- [4,5,6] -- [4],[1,2,3,(4),5,6] -- [1,2,4,5,6]
> -- for each WEEK in dataframe

OK, now you have me completely lost. Sorry, but I have no clue as to what you just did here. It looks like you are trying to describe some transformation/algorithm, but I don't follow it.

> I could not reply to this email because it had not been delivered to my
> inbox, so I had to copy it from the forum. I apologize for the confusion;
> this would take less than a minute to explain in conversation but an hour
> to explain well in print. Two DOW_NUMs will be selected randomly without
> replacement from the vector 3:7 for each WEEK. When this vector is reduced
> to a single integer, that integer will be selected, the vector will be
> restored, and a single integer will then be selected that differs from the
> prior selected integer (i.e. we cannot sample the same day twice in the
> same week).
> This process will be repeated until two DOW_NUMs have been assigned for
> each WEEK. That process is what I attempted to illustrate in my original
> message. This is beyond my current coding capabilities.
>
> Randomly sample 2 DOW_NUM without replacement from each WEEK (no two
> identical DOW_NUM can be sampled in the same WEEK):
>
> sample = {3,7}, {5,6}, {4,3}, {1,5}, -- for each WEEK in dataframe

So, are you sampling from [3,4,5,6,7], or [1,2,4,5,6], or ...? Can you show an 'example' of what you would like to end up with, given your data below?

> Thank you, Mike
>
>          DATE DOW DOW_NUM WEEK
> 2  2011-05-02 Mon       3    1
> 3  2011-05-03 Tue       4    1
> 4  2011-05-04 Wed       5    1
> 5  2011-05-05 Thu       6    1
> 6  2011-05-06 Fri       7    1
> 9  2011-05-09 Mon       3    2
> 10 2011-05-10 Tue       4    2
> 11 2011-05-11 Wed       5    2
> 12 2011-05-12 Thu       6    2
> 13 2011-05-13 Fri       7    2
> 16 2011-05-16 Mon       3    3
> 17 2011-05-17 Tue       4    3
> 18 2011-05-18 Wed       5    3
> 19 2011-05-19 Thu       6    3
> 20 2011-05-20 Fri       7    3
> 23 2011-05-23 Mon       3    4
> 24 2011-05-24 Tue       4    4
> 25 2011-05-25 Wed       5    4
> 26 2011-05-26 Thu       6    4
> 27 2011-05-27 Fri       7    4
> 30 2011-05-30 Mon       3    5
> 31 2011-05-31 Tue       4    5
> 32 2011-06-01 Wed       5    5
> 33 2011-06-02 Thu       6    5
> 34 2011-06-03 Fri       7    5
>
> DF <- structure(list(
>   DATE = structure(c(15096, 15097, 15098, 15099, 15100, 15103, 15104,
>                      15105, 15106, 15107, 15110, 15111, 15112, 15113,
>                      15114, 15117, 15118, 15119, 15120, 15121, 15124,
>                      15125, 15126, 15127, 15128), class = "Date"),
>   DOW = c("Mon", "Tue", "Wed", "Thu", "Fri", "Mon", "Tue", "Wed", "Thu",
>           "Fri", "Mon", "Tue", "Wed", "Thu", "Fri", "Mon", "Tue", "Wed",
>           "Thu", "Fri", "Mon", "Tue", "Wed", "Thu", "Fri"),
>   DOW_NUM = c(3, 4, 5, 6, 7, 3, 4, 5, 6, 7, 3, 4, 5, 6, 7, 3, 4, 5, 6, 7,
>               3, 4, 5, 6, 7),
>   WEEK = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4,
>            5, 5, 5, 5, 5)),
>   .Names = c("DATE", "DOW", "DOW_NUM", "WEEK"),
>   row.names = c(2L, 3L, 4L, 5L, 6L, 9L, 10L, 11L, 12L, 13L, 16L, 17L,
>                 18L, 19L, 20L, 23L, 24L, 25L, 26L, 27L, 30L, 31L, 32L,
>                 33L, 34L),
>   class = "data.frame")
>
> Dan
>
> Daniel Nordlund
> Bothell, WA USA
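A loop is hard to avoid if the pool really must be depleted and replenished across week boundaries. Here is one reading of Mike's scheme as a sketch (my own interpretation, not a tested reconstruction of his intended output): a shuffled pool of weekdays is drawn down two per week, refilled whenever it empties, and the second pick of a week is constrained to differ from the first.

```r
set.seed(6)
days <- 3:7    # Mon=3 ... Fri=7, matching DOW_NUM in the DF
n_weeks <- 27
pool <- sample(days)                      # shuffled pool, depleted as we draw
sched <- matrix(NA_integer_, n_weeks, 2,
                dimnames = list(NULL, c("day1", "day2")))

for (w in seq_len(n_weeks)) {
  for (k in 1:2) {
    if (length(pool) == 0) pool <- sample(days)   # replenish when depleted
    # the second pick must differ from the first pick of the same week;
    # the pool never holds duplicates, so at least one element qualifies
    elig <- if (k == 2) which(pool != sched[w, 1]) else seq_along(pool)
    sched[w, k] <- pool[elig[1]]          # pool is shuffled: first eligible is random
    pool <- pool[-elig[1]]
  }
}
```

Each row of `sched` is one week's pair of survey days, and every weekday is used at the same long-run rate because the pool is always exhausted before refilling.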
Re: [R] optimization challenge
FYI, in bioinformatics we use dynamic programming algorithms in similar ways to solve similar problems of finding guaranteed-optimal partitions in streams of data (usually DNA or protein sequence, but sometimes numerical data from chip arrays). These path optimization algorithms are often called Viterbi algorithms, a web search for which should provide multiple references. The solutions are not necessarily unique (there may be multiple paths/partitions with identical integer maxima in some systems), and there is much research on whether the optimal solution is actually the one you want to work with (for example, there may be a fair amount of probability mass within an area/ensemble of suboptimal solutions that overall have greater posterior probability than does the optimal solution singleton). See Chip Lawrence's PNAS paper for more erudite discussion, and references therein: www.pnas.org/content/105/9/3209.abstract

-Aaron

P.S. Good to see you here Albyn -- I enjoyed your stat. methods course at Reed back in 1993, which started me down a somewhat windy road to statistical genomics!

--
Aaron J. Mackey, PhD
Assistant Professor
Center for Public Health Genomics
University of Virginia
amac...@virginia.edu

On Wed, Jan 13, 2010 at 5:23 PM, Ravi Varadhan rvarad...@jhmi.edu wrote:
> Greg - thanks for posting this interesting problem. Albyn - thanks for
> posting a solution.
>
> Now, I have some questions: (1) is the algorithm guaranteed to find a best
> solution? (2) can there be multiple solutions (it seems like there can be
> more than 1 solution, depending on the data)? and (3) is there a good
> reference for this and similar algorithms?
>
> Thanks
> Best,
> Ravi.
>
> ---
> Ravi Varadhan, Ph.D.
> Assistant Professor, The Center on Aging and Health
> Division of Geriatric Medicine and Gerontology
> Johns Hopkins University
> Ph: (410) 502-2619
> Fax: (410) 614-9625
> Email: rvarad...@jhmi.edu
> Webpage: http://www.jhsph.edu/agingandhealth/People/Faculty_personal_pages/Varadhan.html

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Albyn Jones
Sent: Wednesday, January 13, 2010 1:19 PM
To: Greg Snow
Cc: r-help@r-project.org
Subject: Re: [R] optimization challenge

The key idea is that you are building a matrix that contains the solutions to smaller problems which are sub-problems of the big problem. The first row of the matrix SSQ contains the solution for no splits: SSQ[1,j] is just the sum of squares about the overall mean for reading chapters 1 through j in one day. The iteration then uses row m-1 to construct row m, since if SSQ[m-1,j] (optimal reading of j chapters in m-1 days) is part of the overall optimal solution, you have already computed it, and so don't ever need to recompute it.

TS = SSQ[m-1,j] + (SSQ1[j+1]) computes the vector of possible solutions for SSQ[m,n] (n chapters in m days), breaking it into two pieces: chapters 1 to j in m-1 days, and chapters j+1 to n in 1 day. j is a vector in the function, and min(TS) is the minimum over choices of j, i.e. SSQ[m,n]. At the end, SSQ[128,239] is the optimal value for reading all 239 chapters in 128 days.

That's just the objective function, so the rest involves constructing the list of optimal cuts, i.e. which chapters are grouped together for each day's reading. That code uses the same idea... constructing a list of lists of cutpoints. Statisticians should study a bit of data structures and algorithms!

albyn

On Wed, Jan 13, 2010 at 10:45:11AM -0700, Greg Snow wrote:
> WOW, your results give about half the variance of my best optim run
> (possibly due to my suboptimal use of optim).
> Can you describe a little what the algorithm is doing?
>
> --
> Gregory (Greg) L. Snow Ph.D.
> Statistical Data Center
> Intermountain Healthcare
> greg.s...@imail.org
> 801.408.8111

-----Original Message-----
From: Albyn Jones [mailto:jo...@reed.edu]
Sent: Tuesday, January 12, 2010 5:31 PM
To: Greg Snow
Cc: r-help@r-project.org
Subject: Re: [R] optimization challenge

Greg

Nice problem: I wasted my whole day on it :-)

I was explaining my plan for a solution to a colleague who is a computer scientist; he pointed out that I was trying to re-invent the wheel known as dynamic programming. Here is my code; apparently it is called bottom-up dynamic programming. It runs pretty quickly, and returns (what I hope is :-) the optimal sum of squares and the cut-points.

function(X = bom3$Verses, days = 128) {
  # find optimal BOM reading schedule for Greg Snow
  # minimize variance of quantity to read per day over 128 days
  N = length(X)
  Nm1 = N - 1
  SSQ <-
[... truncated]
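Since Albyn's function is truncated in the archive, here is my own miniature implementation of the same bottom-up scheme, not his exact code: SSQ[m, j] holds the best sum of squares for splitting the first j chapters across m days, cumulative sums make each segment cost O(1), and a traceback over the recorded split points recovers the cut-points.

```r
best_partition <- function(x, k) {   # assumes 2 <= k <= length(x)
  n <- length(x)
  cs <- cumsum(x); cs2 <- cumsum(x^2)
  seg <- function(i, j) {            # within-group SS for x[i..j]
    s  <- cs[j]  - if (i > 1) cs[i - 1]  else 0
    s2 <- cs2[j] - if (i > 1) cs2[i - 1] else 0
    s2 - s^2 / (j - i + 1)
  }
  SSQ   <- matrix(Inf, k, n)          # SSQ[m, j]: best SS, x[1..j] in m groups
  cutpt <- matrix(NA_integer_, k, n)  # last group of the optimum is cutpt+1 .. j
  SSQ[1, ] <- vapply(seq_len(n), function(j) seg(1, j), numeric(1))
  for (m in 2:k) {
    for (j in m:n) {
      cand <- (m - 1):(j - 1)
      ts <- vapply(cand, function(i) SSQ[m - 1, i] + seg(i + 1, j), numeric(1))
      SSQ[m, j]   <- min(ts)
      cutpt[m, j] <- cand[which.min(ts)]
    }
  }
  cuts <- integer(0); j <- n          # trace back the cut points
  for (m in k:2) { j <- cutpt[m, j]; cuts <- c(j, cuts) }
  list(ss = SSQ[k, n], cuts = cuts)   # group i is (cuts[i-1]+1) .. cuts[i]
}

# e.g. verses per chapter for 8 chapters read over 3 days:
res <- best_partition(c(5, 5, 5, 40, 41, 39, 10, 10), k = 3)
res$cuts   # last chapter of day 1 and of day 2
```

This answers Ravi's question (1): the minimum over all completions of each sub-problem is exact, so the final value is guaranteed optimal, though as with Viterbi paths the argmin need not be unique.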
[R] hmm.discnp or other?
(I think) I'd like to use the hmm.discnp package for a simple discrete, two-state HMM, but my training data is irregularly shaped (i.e. the observation chains are of varying length). Additionally, I do not see how to label the states of the observations given to the hmm() function. Ultimately, I'd like to 1) train the HMM on labeled data, and 2) use viterbi() to calculate the optimal labeling of unlabeled observations.

More concretely, I have labeled data that looks something like:

11212321221223121221112233222122112 ABA
21221223121221112233222122112 ABAAA
3121221112233222122112 BB

from which I'd like to build the two-hidden-state (A and B) HMM that emits an observed 1, 2, or 3 at probabilities dictated by the hidden state, with transition probabilities between the two states. Given the trained HMM, I then wish to label new sequences via viterbi().

Am I missing the purpose of this package? I also read through the msm package docs, but my data doesn't really have a time coordinate on which the data should be aligned.

Thanks for any pointers,

-Aaron
amac...@virginia.edu
[R] an S idiom for ordering matrix by columns?
There's got to be a better way to use order() on a matrix than this:

> y
    2L-035-3 2L-081-23 2L-143-18 2L-189-1 2R-008-5 2R-068-15 3L-113-4 3L-173-2
398        1         1         2        2        1         1        2        2
857        1         1         2        2        1         2        2        2
911        1         1         2        2        1         2        2        2
383        1         1         2        2        1         1        2        2
639        1         2         2        1        2         2        1        2
756        1         2         2        1        2         2        1        2
    3L-186-1 3R-013-7 3R-032-1 3R-169-10 X-002 X-087
398        1        2        2         2     1     2
857        1        2        2         2     1     2
911        1        2        2         2     1     2
383        1        2        2         2     1     2
639        2        2        1         2     1     2
756        2        2        1         2     1     2

> y[order(y[,1], y[,2], y[,3], y[,4], y[,5], y[,6], y[,7], y[,8], y[,9],
+         y[,10], y[,11], y[,12], y[,13], y[,14]), ]
    2L-035-3 2L-081-23 2L-143-18 2L-189-1 2R-008-5 2R-068-15 3L-113-4 3L-173-2
398        1         1         2        2        1         1        2        2
383        1         1         2        2        1         1        2        2
857        1         1         2        2        1         2        2        2
911        1         1         2        2        1         2        2        2
639        1         2         2        1        2         2        1        2
756        1         2         2        1        2         2        1        2
    3L-186-1 3R-013-7 3R-032-1 3R-169-10 X-002 X-087
398        1        2        2         2     1     2
383        1        2        2         2     1     2
857        1        2        2         2     1     2
911        1        2        2         2     1     2
639        2        2        1         2     1     2
756        2        2        1         2     1     2

Thanks for any suggestions!

-Aaron
Re: [R] an S idiom for ordering matrix by columns?
Thanks to all, do.call(order, as.data.frame(y)) was the idiom I was missing!

-Aaron

On Thu, Feb 19, 2009 at 11:52 AM, Gustaf Rydevik gustaf.ryde...@gmail.com wrote:
> On Thu, Feb 19, 2009 at 5:40 PM, Aaron Mackey ajmac...@gmail.com wrote:
>> There's got to be a better way to use order() on a matrix than this:
>> [matrix example snipped]
>
> You mean something like this:
>
> test <- matrix(sample(1:4, 100, replace = TRUE), ncol = 10)
> test[do.call(order, data.frame(test)), ]
>
> ?
>
> Regards,
> Gustaf
>
> --
> Gustaf Rydevik, M.Sci.
> tel: +46(0)703 051 451
> address: Essingetorget 40, 112 66 Stockholm, SE
> skype: gustaf_rydevik
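The idiom works because as.data.frame() turns each matrix column into one element of a list, and do.call() then hands those elements to order() as separate arguments, i.e. exactly order(y[,1], y[,2], ..., y[,14]). A self-contained sketch mimicking the genotype matrix above:

```r
set.seed(7)
y <- matrix(sample(1:2, 6 * 14, replace = TRUE), nrow = 6,
            dimnames = list(c(398, 857, 911, 383, 639, 756), NULL))

# each data-frame column becomes one argument of order():
o <- do.call(order, as.data.frame(y))
ysorted <- y[o, ]
```

order() is stable, so rows with identical genotypes keep their original relative order, as in the output Aaron posted.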
Re: [R] database table merging tips with R
I would load your set of userids into a temporary table in Oracle, then join that table with the rest of your SQL query to get only the matching rows out.

-Aaron

On Thu, Sep 11, 2008 at 2:33 PM, Avram Aelony [EMAIL PROTECTED] wrote:
> Dear R list,
>
> What is the best way to efficiently marry an R dataset with a very large
> (Oracle) database table? The goal is to only return Oracle table rows that
> match IDs present in the R dataset. I have an R data frame with 2000 user
> IDs analogous to:
>
> r <- data.frame(userid = round(runif(2000) * 10, 0))
>
> ...and I need to pull data from an Oracle table only for these 2000 IDs.
> The Oracle table is quite large. Additionally, the sql query may need to
> join to other tables to bring in ancillary fields.
>
> I currently connect to Oracle via odbc:
>
> library(RODBC)
> connection <- odbcConnect(, uid = , pwd = )
> d <- sqlQuery(connection,
>   "select userid, x, y, z from largetable where timestamp > sysdate - 7")
>
> ...allowing me to pull data from the database table into the R object d
> and then use the R merge function. The problem, however, is that if d is
> too large it may fail due to memory limitations, or be inefficient. I
> would like to push the merge portion to the database, and it would be very
> convenient if it were possible to request that the query look to the R
> object for the IDs to which it should restrict the output.
>
> Is there a way to do this? Something like the following fictional code:
>
> d <- sqlQuery(connection,
>   "select t.userid, x, y, z from largetable t where r$userid = t.userid")
>
> Would sqldf (http://code.google.com/p/sqldf/) help me out here? If so,
> how? This would be convenient and help me avoid needing to create a
> temporary table to store the R data, join via sql, then return the data
> back to R.
>
> I am using R version 2.7.2 (2008-08-25) / i386-pc-mingw32.
>
> Thanks for your comments, ideas, recommendations.
-Avram
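One alternative not raised in the thread: for a few thousand IDs, the IDs can be spliced into an IN (...) clause instead of a temp table. A hedged sketch, assuming the RODBC `connection` from the question is open and that Oracle's 1000-item IN-list limit applies:

```r
library(RODBC)
# Assumes `connection` is an open RODBC handle and r$userid holds the IDs.
# Oracle caps IN (...) lists at 1000 items, so query in batches.
ids <- unique(r$userid)
batches <- split(ids, ceiling(seq_along(ids) / 1000))

d <- do.call(rbind, lapply(batches, function(b) {
  sql <- sprintf(
    "select userid, x, y, z from largetable where userid in (%s)",
    paste(b, collapse = ", "))
  sqlQuery(connection, sql)
}))
```

This only works for numeric IDs as written; string IDs would need quoting, at which point a temp table is usually cleaner anyway.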
Re: [R] database table merging tips with R
Sorry, I see now you want to avoid this, but you did ask what was the best way to "efficiently ...", and the temp-table solution certainly matches your description. What's wrong with using a temporary table? -Aaron

On Thu, Sep 11, 2008 at 3:05 PM, Aaron Mackey [EMAIL PROTECTED] wrote: I would load your set of userids into a temporary table in Oracle, then join that table with the rest of your SQL query to get only the matching rows out. -Aaron

On Thu, Sep 11, 2008 at 2:33 PM, Avram Aelony [EMAIL PROTECTED] wrote: [original question quoted in full above; snipped]
Re: [R] database table merging tips with R
I guess I'd do it something like this:

dbGetQuery(con, "CREATE TEMPORARY TABLE foo ( etc etc )")
sapply(userids, function(x) {
  dbGetQuery(con, paste("INSERT INTO foo (userid) VALUES (", x, ")"))
})

then later:

dbGetQuery(con, "DROP TABLE foo")

-Aaron

On Thu, Sep 11, 2008 at 3:21 PM, Avram Aelony [EMAIL PROTECTED] wrote: Perhaps I will need to create a temp table, but I am asking if there is a way to avoid it. It would be great if there were a way to tie the R data frame temporarily to the query in a transparent fashion. If not, I will see if I can create/drop the temp table directly from sqlQuery. -Avram

On Thursday, September 11, 2008, at 12:07 PM, Aaron Mackey [EMAIL PROTECTED] wrote: [earlier messages quoted in full above; snipped]
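Since the original poster is on RODBC rather than DBI, the same temp-table round trip can be done with RODBC's own helpers, which avoids the row-by-row INSERT loop. A hedged sketch, assuming an open `connection`, the data frame `r` from the question, and a `largetable` with a `userid` column (the table name `r_userids` is made up here):

```r
library(RODBC)
# Push the IDs to the server in one call; RODBC builds the table for us.
sqlSave(connection, r["userid"], tablename = "r_userids",
        rownames = FALSE)

# Do the "merge" server-side, returning only matching rows.
d <- sqlQuery(connection,
  "select t.userid, t.x, t.y, t.z
     from largetable t
     join r_userids u on u.userid = t.userid")

# Clean up the scratch table.
sqlDrop(connection, "r_userids")
```

Whether `sqlSave` creates a true Oracle global temporary table depends on the driver; if not, the explicit DDL in Aaron's sketch above is the fallback.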
Re: [R] Gumbell distribution - minimum case
If you mean you want an EVD with a fat left tail (instead of a fat right tail), then can't you just multiply all the values by -1 to reverse the distribution? A new location parameter could then shift the distribution wherever you want along the number line ... -Aaron

On Mon, Sep 8, 2008 at 5:22 PM, Richard Gwozdz [EMAIL PROTECTED] wrote:

Hello, I would like to sample from a Gumbel (minimum) distribution. I have installed package {evd} but the Gumbel functions there appear to refer to the maximum case. Unfortunately, setting the scale parameter negative does not appear to work. Is there a separate package for the Gumbel minimum?

-- _ Rich Gwozdz Fire and Mountain Ecology Lab College of Forest Resources University of Washington cell: 206-769-6808 office: 206-543-9138 [EMAIL PROTECTED]
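Aaron's negation trick can be sketched concretely. If X follows a maximum-case Gumbel with location mu, then -X follows a minimum-case Gumbel with location -mu (same scale), so a minimum sampler is a one-liner around evd's `rgumbel()` (the wrapper name `rgumbel_min` is made up here):

```r
library(evd)  # provides rgumbel() for the maximum case

# Draw from a Gumbel *minimum* distribution with location `loc`
# and scale `scale` by negating maximum-case draws at location -loc.
rgumbel_min <- function(n, loc = 0, scale = 1) {
  -rgumbel(n, loc = -loc, scale = scale)
}

x <- rgumbel_min(10000, loc = 5, scale = 2)
# The minimum case is left-skewed (long left tail), unlike rgumbel().
```

The same reflection works for the density and distribution functions: dgumbel_min(x) = dgumbel(-x, loc = -loc), and pgumbel_min(x) = 1 - pgumbel(-x, loc = -loc).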
[R] hex2RGB back to hex not the same?
Witness this oddity (to me):

> rainbow_hcl(10)[1]
[1] "#E18E9E"
> d <- attributes(hex2RGB(rainbow_hcl(10)))$coords[1, ]
> rgb(d[1], d[2], d[3])
[1] "#C54D5F"

What happened? FYI, this came up as I'm trying to reuse the RGB values I get from rainbow_hcl in a call to rgb() where I can also set alpha transparency levels ... -Aaron
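A likely explanation (an assumption about the colorspace version in use, not stated in the thread): the coordinates extracted from `hex2RGB()` here are linearized RGB, while `rgb()` expects gamma-corrected sRGB values in [0, 1], so mid-range components come back darker (0.88 linearizes to roughly 0.75, i.e. #E1 becomes about #C5). For the stated goal of adding alpha, base grDevices can round-trip the hex string without any colorspace conversion:

```r
library(colorspace)

cols <- rainbow_hcl(10)

# col2rgb() decodes hex strings straight to 0-255 sRGB components,
# so no gamma conversion is involved in the round trip.
d <- col2rgb(cols[1])
rgb(d[1, ], d[2, ], d[3, ], alpha = 128, maxColorValue = 255)

# Or, in one step, scale the alpha of an existing color:
adjustcolor(cols[1], alpha.f = 0.5)
```

Note `adjustcolor()` requires a reasonably recent R; on older versions the `col2rgb()`/`rgb()` pair above does the same job.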
[R] lmer syntax, matrix of (grouped) covariates?
I have a fairly large model:

> length(Y)
[1] 3051
> dim(covariates)
[1] 3051 211

All of these 211 covariates need to be nested hierarchically within a grouping class, of which there are 8. I have an accessory vector, cov2class, that specifies the mapping between covariates and the 8 classes. Now, I understand I can break all this information up into individual vectors (cov1, cov2, ..., cov211, class1, class2, ..., class8), and do something like this:

model <- lmer(Y ~ 1 + cov1 + cov2 + ... + cov211 + (cov1 + cov2 + ... | class1) + (...) + (... + cov210 + cov211 | class8))

But I'd like to keep things syntactically simpler, and use the covariates and cov2class variables directly. I haven't been able to find the right syntactic sugar to get this done. Thanks for any help, -Aaron
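There is no lmer syntax that takes a covariate matrix plus a grouping map directly, but the formula can be assembled programmatically instead of typed out. A hedged sketch, assuming `covariates` has column names, `cov2class` maps each column name to one of the 8 class labels, and `dat` is a data frame holding Y, the covariate columns, and a grouping factor `class` (all object names are illustrative):

```r
library(lme4)

covs <- colnames(covariates)

# One random-effect term per class: (cov_i + cov_j + ... | class)
re_terms <- vapply(split(covs, cov2class), function(v)
  sprintf("(%s | class)", paste(v, collapse = " + ")),
  character(1))

# Paste everything into Y ~ 1 + cov1 + ... + cov211 + (... | class) + ...
f <- as.formula(paste("Y ~",
  paste(c("1", covs, re_terms), collapse = " + ")))

model <- lmer(f, data = dat)
```

Building the formula as a string and converting with `as.formula()` keeps the model specification in one place and scales to any number of covariates, at the cost of a very long printed formula.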