Re: [R] Help with big data and parallel computing: 500,000 x 4 linear models

2016-08-08 Thread Aaron Mackey
Don't run 500K separate models. Use the limma package to fit one model that
can learn the variance parameters jointly. Run it on your laptop. And don't
use %methylation as your Y variable; use logit(percent methylation), i.e. the
M-value rather than the Beta value.

-Aaron
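For reference, a minimal sketch of that limma route (hedged: `meth` as a
sites x samples matrix of percent methylation and the `predictor1`/`covariates`
names are stand-ins, not from the original post; limma is a Bioconductor
package):

```r
library(limma)  # Bioconductor: BiocManager::install("limma")

## logit-transform percentages to M-values; the 0.5 offset avoids logit of 0 or 100
M <- log2((meth + 0.5) / (100 - meth + 0.5))

## one design matrix shared by all sites; eBayes moderates the site-wise variances
design <- model.matrix(~ predictor1 + ., data = covariates)
fit <- eBayes(lmFit(M, design))
pvals <- fit$p.value[, "predictor1"]   # one p-value per site for the predictor
```

The single moderated fit replaces all 500,000 separate lm() calls.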

On Mon, Aug 8, 2016 at 2:49 PM, Ellis, Alicia M 
wrote:

> I have a large dataset with ~500,000 columns and 1264 rows.  Each column
> represents the percent methylation at a given location in the genome.  I
> need to run 500,000 linear models for each of 4 predictors of interest in
> the form of:
> Methylation.site1 ~ predictor1 + covariate1 + covariate2 + ... covariate9
> ...and save only the p-value for the predictor
>
> The original methylation data file had methylation sites as row labels and
> the individuals as columns so I read the data in chunks and transposed it
> so I now have 5 csv files (chunks) with columns representing methylation
> sites and rows as individuals.
>
> I was able to get results for all of the regressions by running each chunk
> of methylation data separately on our supercomputer using the code below.
> However, I'm going to have to do this again for another project and I would
> really like to accomplish two things to make the whole process more
> computationally efficient:
>
>
> 1)  Work with data.tables instead of data.frames (reading and
> manipulating will be much easier and faster)
>
> 2)  Do the work in parallel using say 12 cores at once and having the
> program divide the work up on the cores rather than me having to split the
> data and run 5 separate jobs on the supercomputer.
>
> I have some basic knowledge of the data.table package but I wasn't able to
> modify the foreach code below to get it to work and the code using
> data.frames didn't seem to be using all 12 cores that I created in the
> cluster.
>
> Can anyone suggest some modifications to the foreach code below that will
> allow me to do this in parallel with datatables and not have to do it in
> chunks?
>
>
> # Set up cluster
> clus = makeCluster(12, type = "SOCK")
> registerDoSNOW(clus)
> getDoParWorkers()
> getDoParName()
>
>
> ### Following code needs to be modified to run the full dataset
> ### (batch1-batch5) in parallel
> ### Currently I read in the following chunks, and run each predictor
> ### separately for each chunk of data
>
> ### Methylation data in batches
> batch1=read.csv("/home/alicia.m.ellis/batch1.csv")
> ## each batch has about 100,000 columns and 1264 rows
> ## want to alter this to: batch1=fread(file= )
> batch2=read.csv(file="/home/alicia.m.ellis/batch2.csv")
> batch3=read.csv(file="/home/alicia.m.ellis/batch3.csv")
> batch4=read.csv(file="/home/alicia.m.ellis/batch4.csv")
> batch5=read.csv(file="/home/alicia.m.ellis/batch5.csv")
>
> predictors  ## this is a data.frame with 4 columns and 1264 rows
>
> covariates ## this is a data.frame with 9 columns and 1264 rows
>
> fits <- as.data.table(batch1)[, list(MyFits = lapply(1:ncol(batch1),
>   function(x) summary(lm(batch1[, x] ~ predictors[, 1] + covariates[, 1] +
>     covariates[, 2] + covariates[, 3] + covariates[, 4] + covariates[, 5] +
>     covariates[, 6] + covariates[, 7] + covariates[, 8] +
>     covariates[, 9]))$coefficients[2, 4]))]
>
>
> ##  This is what I was trying but wasn't having much luck.
> I'm having trouble getting the data merged as a single data.frame, and
> the code below doesn't seem to be dividing the work among the 12 cores in
> the cluster.
>
> all.fits = foreach (j=1:ncol(predictors), i=1:ncol(meth),
> .combine='rbind', .inorder=TRUE) %dopar% {
>
>   model = lm(meth[, i] ~ predictors[,j] +
>covariates[,1]+
>covariates[,2]+
>covariates[,3]+
>covariates[,4]+
>covariates[,5]+
>covariates[,6]+
>covariates[,7]+
>covariates[,8]+
>covariates[,9])
>   summary(model)$coefficients[2,4]
> }
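A hedged sketch of the chunked approach with the base parallel package (toy
data stand in for the real 500k-column matrix; all names here are
illustrative, and the worker count would be 12 on the real machine):

```r
library(parallel)

## toy stand-ins for the real objects
set.seed(1)
n <- 100
meth <- as.data.frame(matrix(rnorm(n * 8), nrow = n))  # 8 "methylation sites"
predictor1 <- rnorm(n)
covariate1 <- rnorm(n)

cl <- makeCluster(2)                                   # 12 on the real machine
clusterExport(cl, c("meth", "predictor1", "covariate1"))

## split site columns into one chunk per worker, so each task fits many models
chunks <- split(seq_along(meth), cut(seq_along(meth), 2))
pvals <- unlist(parLapply(cl, chunks, function(idx)
  vapply(idx, function(j)
    summary(lm(meth[[j]] ~ predictor1 + covariate1))$coefficients[2, 4],
    numeric(1))))
stopCluster(cl)
```

The same chunking idea carries over to foreach: iterate over chunks of
columns with .combine = c, rather than over half a million single columns.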
>
>
> Alicia Ellis, Ph.D
> Biostatistician
> Pathology & Laboratory Medicine
> Colchester Research Facility
> 360 South Park Drive, Room 209C
> Colchester, VT  05446
> 802-656-9840
>
>
>
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>


Re: [R] combine glmnet and coxph (and survfit) with strata()

2013-12-09 Thread Aaron Mackey
I'm also curious how to use glmnet with survfit -- specifically, for use
with interval regression (which, under the hood, is implemented using
survfit).  Can you show how you converted your Surv object formula to a
design matrix for use with glmnet?

Thanks,
-Aaron


On Sun, Dec 8, 2013 at 12:45 AM, Jieyue Li jieyuel...@gmail.com wrote:

 Dear All,

 I want to generate survival curve with cox model but I want to estimate the
 coefficients using glmnet. However, I also want to include a strata() term
 in the model. Could anyone please tell me how to have this strata() effect
 in the model in glmnet? I tried converting a formula with strata() to a
 design matrix and feeding to glmnet, but glmnet just treats the strata()
 term with one independent variable...

 I know that if there is no such strata(), I can estimate coefficients from
 glmnet and use ...init=selectedBeta,iter=0) in the coxph. Please advise
 me or also correct me if I'm wrong.
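 A hedged sketch of that init/iter=0 trick (variable names x, y, and center
 are placeholders; the idea is to let glmnet estimate the penalized
 coefficients, then freeze coxph at those values so strata() and survfit()
 work as usual):

```r
library(glmnet)
library(survival)

cvfit <- cv.glmnet(x, y, family = "cox")       # y: a Surv(time, status) object
beta  <- as.numeric(coef(cvfit, s = "lambda.min"))

## iter.max = 0 keeps coxph from refitting, so the glmnet coefficients stand
frozen <- coxph(y ~ x + strata(center), init = beta,
                control = coxph.control(iter.max = 0))
sf <- survfit(frozen)                          # per-stratum baseline curves
```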

 Thank you very much!

 Best,

 Jieyue





Re: [R] Renaming variables

2013-09-20 Thread Aaron Mackey
On Fri, Sep 20, 2013 at 10:10 AM, Preetam Pal lordpree...@gmail.com wrote:

 I have 25 variables in the data file (name: score), i.e. X1,X2,.,X25.

 I don't want to use score$X1, score$X2 every time I use these variables.


attach(score)

plot(X1, X2) # etc. etc.

-Aaron
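attach() works, but it can silently mask objects once several data frames
are attached. A small alternative sketch (a toy `score` stands in for the
real file):

```r
score <- data.frame(X1 = rnorm(25), X2 = rnorm(25))  # stand-in data

with(score, plot(X1, X2))            # X1/X2 are looked up inside score only here
fit <- lm(X2 ~ X1, data = score)     # model functions take data= directly
```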



Re: [R] XYZ data

2013-06-26 Thread Aaron Mackey
for plotting purposes, I typically jitter() the x's and y's to see the
otherwise overlapping data points
-Aaron
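For example (the amount argument bounds the displacement, so a tiny value
separates coincident points without visibly moving them):

```r
x <- rep(15, 10)                       # ten exactly overlapping points
y <- rep(20, 10)
xj <- jitter(x, amount = 1e-4)         # each value shifts by at most 1e-4
yj <- jitter(y, amount = 1e-4)
plot(xj, yj)                           # points no longer coincide
```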

On Wed, Jun 26, 2013 at 12:29 PM, Shane Carey careys...@gmail.com wrote:

 Nope, neither work. :-(


 On Wed, Jun 26, 2013 at 5:16 PM, Clint Bowman cl...@ecy.wa.gov wrote:

  John,
 
  That still leaves a string of identical numbers in the vector.
 
  Shane,
 
  ?jitter
 
  perhaps jitter(X,1,0.0001)
 
  Clint
 
  Clint BowmanINTERNET:   cl...@ecy.wa.gov
  Air Quality Modeler INTERNET:   cl...@math.utah.edu
  Department of Ecology   VOICE:  (360) 407-6815
  PO Box 47600FAX:(360) 407-7534
  Olympia, WA 98504-7600
 
  USPS:   PO Box 47600, Olympia, WA 98504-7600
  Parcels:300 Desmond Drive, Lacey, WA 98503-1274
 
  On Wed, 26 Jun 2013, John Kane wrote:
 
   mm <- 1:10
  nn <- mm + .001
 
  John Kane
  Kingston ON Canada
 
 
   -Original Message-
  From: careys...@gmail.com
  Sent: Wed, 26 Jun 2013 16:48:34 +0100
  To: r-help@r-project.org
  Subject: [R] XYZ data
 
  I have x, y, z data. The x, y fields don't change but Z does. How do I add
  a very small number onto the end of each x, y data point?
 
  For example:
 
  Original (X)  Original (Y) Original (Z)
  15   20  30
  15   20  40
 
 
 
 
  New (X)  New (Y) New (Z)
  15.1 20.01  30
  15.2 20.02  40
 
 
  Thanks
  --
  Shane
 
 
 


 --
 Shane





Re: [R] Dependency-aware scripting tools for R

2012-04-19 Thread Aaron Mackey
shameless self-plug: we break out of R to do this, and after many painful
years developing and maintaining idiosyncratic Makefiles, we are now using
Taverna to (visually) glue together UNIX commands (including R scripts) --
the benefit of which (over make and brethren) is that you can actually
*see* the dependencies and overall workflow (nesting workflows also makes
it easier to manage complexity).

see TavernaPBS:

  http://cphg.virginia.edu/mackey/projects/sequencing-pipelines/tavernapbs/

while designed to automate job submission to a PBS queuing system, you can
also use it to simply execute non-PBS jobs.

--
Aaron J. Mackey, PhD
Assistant Professor
Center for Public Health Genomics
University of Virginia
amac...@virginia.edu
http://www.cphg.virginia.edu/mackey


On Thu, Apr 19, 2012 at 3:27 PM, Sean Davis sdav...@mail.nih.gov wrote:

 There are numerous tools like scons, make, ruffus, ant, rake, etc.
 that can be used to build complex pipelines based on task
 dependencies.  These tools are written in a variety of languages, but
 I have not seen such a thing for R.  Is anyone aware of a package
 available?  The goal is to be able to develop robust bioinformatic
 pipelines driven by scripts written in R.

 Thanks,
 Sean





Re: [R] HWEBayes, swapping the homozygotes genotype frequencies

2011-10-09 Thread Aaron Mackey
Without really knowing this code, I can guess that it may be the
triangular prior at work.  Bayes Factors are notorious for being sensitive
to the prior.  Presumably, the prior somehow prefers to see the rarer allele
as the BB, and not the AA homozygous genotype (this is a common
assumption: that AA is the reference, and thus the major, more frequent,
allele).

-Aaron

On Sat, Oct 8, 2011 at 7:52 PM, stat999 yumik091...@gmail.com wrote:

  I evaluated the Bayes factor in the k=2 allele case with a triangular
 prior under the null as in the example in the help file:


 HWETriangBF2(nvec=c(88,10,2))
 [1] 0.4580336

 When I swap the n11 entry and n22 entry of nvec, I received a totally
 different Bayes factor:

 
  HWETriangBF2(nvec=c(2,10,88))
 [1] 5.710153
 

 In my understanding, defining the genotype frequency as n11 or n22 is
 arbitrary, so I was expecting the same value of the Bayes factor.

 This is the case for conjugate Dirichlet prior:
 DirichNormHWE(nvec=c(88,10,2), c(1,1))/DirichNormSat(nvec=c(88,10,2),
 c(1,1,1))
 [1] 1.542047
 DirichNormHWE(nvec=c(2,10,88), c(1,1))/DirichNormSat(nvec=c(2,10,88),
 c(1,1,1))
 [1] 1.542047

 Could you explain why HWETriangBF2 is returning completely different
 values of the Bayes factor?


 --
 View this message in context:
 http://r.789695.n4.nabble.com/HWEBayes-swapping-the-homozygotes-genotype-frequencies-tp3886313p3886313.html
 Sent from the R help mailing list archive at Nabble.com.





Re: [R] Hardy Weinberg

2011-06-22 Thread Aaron Mackey
H-W only gives you the expected frequency of AA, AB, and BB genotypes (i.e.
a 1x3 table):

minor <- runif(1, 0.05, 0.25)
major <- 1 - minor
AA <- minor^2
AB <- 2*minor*major
BB <- major^2
df <- cbind(AA, AB, BB)


-Aaron
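Extending that to the 10,000 simulated count tables (a hedged sketch; the
sample size of 100 genotypes per table is my assumption, not from the
question):

```r
n.tab <- 10000
n.sub <- 100                                      # genotypes per table (assumed)
maf   <- runif(n.tab, 0.05, 0.25)                 # minor allele frequency
probs <- cbind(AA = maf^2, AB = 2 * maf * (1 - maf), BB = (1 - maf)^2)

## one multinomial draw of (AA, AB, BB) counts per table
counts <- t(vapply(seq_len(n.tab),
                   function(i) rmultinom(1, n.sub, probs[i, ])[, 1],
                   numeric(3)))
```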

On Tue, Jun 21, 2011 at 9:30 PM, Jim Silverton jim.silver...@gmail.comwrote:

 Hello all,
 I am interested in simulating 10,000 2x3 tables for SNP data with the
 Hardy-Weinberg formulation. Is there a quick way to do this? I am assuming
 that the minor allele frequency is uniform in (0.05, 0.25).

 --
 Thanks,
 Jim.





Re: [R] Shrink file size of pdf graphics

2011-05-19 Thread Aaron Mackey
You can try something like this, at the command line:

  gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.5 -dPDFSETTINGS=/screen
-dNOPAUSE -dQUIET -dBATCH -sOutputFile=output.pdf input.pdf

evidently, the new compactPDF() function in R 2.13 does something very similar.

-Aaron

On Thu, May 19, 2011 at 11:30 AM, Duncan Murdoch
murdoch.dun...@gmail.com wrote:

 On 19/05/2011 11:14 AM, Layman123 wrote:

 Hi everyone,

 My data consists of a system of nearly 75000 roads, available as a
 shapefile. When I plot the road system by adding the individual roads with
 'lines' and store it as a pdf-file with 'pdf', I get a file of size 13 MB.
 This is way too large to add to my LaTeX-document, because there will be
 some more graphics of this type.
 Now I'm curious to learn whether there is a possibility in R to shrink the
 file size of this graphic? I merely need it in a resolution so that it looks
 smooth when printed out. I don't know much about the storage of R
 graphics, but maybe there is a way to change the way the file is stored,
 perhaps as a pixel image?


 There are several possibilities.  You can use a bitmapped device (e.g. png()) 
 to save the image; pdflatex can include those.

 You can compress the .pdf file using an external tool like pdftk (or do it 
 internally in R 2.14.x, coming soon).

 There are probably others...

 Duncan Murdoch



Re: [R] Reporting odds ratios or risk ratios from GLM

2011-03-15 Thread Aaron Mackey
OR <- exp(coef(GLM.2)[-1])
OR.ci <- exp(confint(GLM.2)[-1, ])

-Aaron
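A self-contained toy version of the same recipe (Wald intervals via
confint.default; the data here are simulated only to make it runnable):

```r
set.seed(1)
d <- data.frame(y  = rbinom(200, 1, 0.4),
                x1 = rnorm(200),
                x2 = rnorm(200))
m <- glm(y ~ x1 + x2, family = binomial(logit), data = d)

## exponentiate coefficients and Wald CIs, dropping the intercept row
OR.tab <- cbind(OR = exp(coef(m)[-1]), exp(confint.default(m)[-1, ]))
```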

On Tue, Mar 15, 2011 at 1:25 PM, lafadnes ubu...@fadnes.net wrote:

 I am a new R user (using it through the Rcmdr package) and have struggled
 to find out how to report OR and RR directly when running GLM models (not
 only the coefficients).

 Example of the syntax that I have used:

 GLM.2 <- glm(diarsev ~ treatmentarm + childage + breastfed,
 family=binomial(logit), data=fieldtrials2)
 summary(GLM.2)

 This works well except that I manually have to calculate the OR based on
 the
 coefficients. Can I get these directly (with confidence intervals) by just
 amending the syntax?

 Will be grateful for advice!

 --
 View this message in context:
 http://r.789695.n4.nabble.com/Reporting-odds-ratios-or-risk-ratios-from-GLM-tp3357209p3357209.html
 Sent from the R help mailing list archive at Nabble.com.





Re: [R] Complex sampling?

2011-03-09 Thread Aaron Mackey
What I think you need is something along the lines of:

matrix(c(sample(3:7), sample(3:7), sample(3:7), sample(3:7), ...), nrow=2)

now, each column are your random pairs.

-Aaron
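A hedged sketch of the deplete-and-replenish scheme described in the quoted
message (the function and variable names are mine; it does loop over weeks,
which seems hard to avoid given the bag state carried between weeks):

```r
draw_schedule <- function(weeks, pool = 3:7) {
  bag <- sample(pool)                          # the depleting bag of weekdays
  out <- matrix(NA_integer_, nrow = 2, ncol = weeks)
  for (w in seq_len(weeks)) {
    for (k in 1:2) {
      if (length(bag) == 0) bag <- sample(pool)          # replenish when empty
      cand <- if (k == 2) setdiff(bag, out[1, w]) else bag
      if (length(cand) == 0) {                 # bag held only the week's first day
        bag  <- sample(pool)
        cand <- setdiff(bag, out[1, w])
      }
      pick <- if (length(cand) == 1) cand else sample(cand, 1)
      out[k, w] <- pick
      bag <- bag[-match(pick, bag)]
    }
  }
  out                                          # columns = weeks, rows = the 2 days
}

set.seed(1)
sched <- draw_schedule(27)
```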

On Wed, Mar 9, 2011 at 1:01 PM, Hosack, Michael mhos...@state.pa.us wrote:

  -Original Message-
  From: r-help-bounces at r-project.org [mailto:r-help-bounces at
 r-project.org]
  On Behalf Of Hosack, Michael
  Sent: Wednesday, March 09, 2011 7:34 AM
  To: r-help at R-project.org
  Subject: [R] Complex sampling?
 
  R users,
 
  I am trying to generate a randomized weekday survey schedule that ensures
  even coverage of weekdays in
  the sample, where the distribution of variable DOW is random with respect
  to WEEK. To accomplish this I need
  to randomly sample without replacement two weekdays per week for each of
  27 weeks (only 5 are shown).

 This seems simple enough, sampling without replacement.

 However,
  I need to sample from a sequence (3:7) that needs to be completely
  depleted and replenished until the
  final selection is made. Here is an example of what I want to do,
  beginning at WEEK 1. I would prefer to do
  this without using a loop, if possible.
 
  sample frame: [3,4,5,6,7] -- [4,5,6] -- [4],[1,2,3,(4),5,6] --
  [1,2,4,5,6] -- for each WEEK in dataframe

 OK, now you have me completely lost.  Sorry, but I have no clue as to what
 you just did here.  It looks like you are trying to describe some
 transformation/algorithm but I don't follow it.



 I could not reply to this email because it had not been delivered to my
 inbox, so I had to copy it from the forum. I apologize for the confusion;
 this would take less than a minute to explain in conversation but an hour
 to explain well in print. Two DOW_NUMs will be selected randomly without
 to explain well in print. Two DOW_NUMs will be selected randomly without
 replacement from the vector 3:7 for each WEEK. When this vector is reduced
 to a single integer that integer will be selected and the vector will be
 restored and a single integer will then be selected that differs from the
 prior selected integer (i.e. cannot sample the same day twice in the same
 week). This process will be repeated until two DOW_NUM have been assigned
 for each WEEK. That process is what I attempted to illustrate in my original
 message. This is beyond my current coding capabilities.



 
  Randomly sample 2 DOW_NUM without replacement from each WEEK ( () = no
 two
  identical DOW_NUM can be sampled
  in the same WEEK)
 
  sample = {3,7}, {5,6}, {4,3}, {1,5}, -- for each WEEK in dataframe
 

 So, are you sampling from [3,4,5,6,7], or [1,2,4,5,6], or ...?  Can you
 show an 'example' of what you would like to end up given your data below?

 
  Thanks you,
 
  Mike
 
 
         DATE DOW DOW_NUM WEEK
   2  2011-05-02 Mon       3    1
   3  2011-05-03 Tue       4    1
   4  2011-05-04 Wed       5    1
   5  2011-05-05 Thu       6    1
   6  2011-05-06 Fri       7    1
   9  2011-05-09 Mon       3    2
   10 2011-05-10 Tue       4    2
   11 2011-05-11 Wed       5    2
   12 2011-05-12 Thu       6    2
   13 2011-05-13 Fri       7    2
   16 2011-05-16 Mon       3    3
   17 2011-05-17 Tue       4    3
   18 2011-05-18 Wed       5    3
   19 2011-05-19 Thu       6    3
   20 2011-05-20 Fri       7    3
   23 2011-05-23 Mon       3    4
   24 2011-05-24 Tue       4    4
   25 2011-05-25 Wed       5    4
   26 2011-05-26 Thu       6    4
   27 2011-05-27 Fri       7    4
   30 2011-05-30 Mon       3    5
   31 2011-05-31 Tue       4    5
   32 2011-06-01 Wed       5    5
   33 2011-06-02 Thu       6    5
   34 2011-06-03 Fri       7    5
 
  DF <-
  structure(list(DATE = structure(c(15096, 15097, 15098, 15099,
  15100, 15103, 15104, 15105, 15106, 15107, 15110, 15111, 15112,
  15113, 15114, 15117, 15118, 15119, 15120, 15121, 15124, 15125,
  15126, 15127, 15128), class = "Date"), DOW = c("Mon", "Tue",
  "Wed", "Thu", "Fri", "Mon", "Tue", "Wed", "Thu", "Fri", "Mon",
  "Tue", "Wed", "Thu", "Fri", "Mon", "Tue", "Wed", "Thu", "Fri",
  "Mon", "Tue", "Wed", "Thu", "Fri"), DOW_NUM = c(3, 4, 5, 6, 7,
  3, 4, 5, 6, 7, 3, 4, 5, 6, 7, 3, 4, 5, 6, 7, 3, 4, 5, 6, 7),
  WEEK = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4,
  4, 4, 4, 4, 5, 5, 5, 5, 5)), .Names = c("DATE", "DOW", "DOW_NUM",
  "WEEK"), row.names = c(2L, 3L, 4L, 5L, 6L, 9L, 10L, 11L, 12L,
  13L, 16L, 17L, 18L, 19L, 20L, 23L, 24L, 25L, 26L, 27L, 30L, 31L,
  32L, 33L, 34L), class = "data.frame")
 

 Dan

 Daniel Nordlund
 Bothell, WA USA





Re: [R] optimization challenge

2010-01-13 Thread Aaron Mackey
FYI, in bioinformatics, we use dynamic programming algorithms in similar
ways to solve similar problems of finding guaranteed-optimal partitions in
streams of data (usually DNA or protein sequence, but sometimes numerical
data from chip-arrays).  These path optimization algorithms are often
called Viterbi algorithms, a web search for which should provide multiple
references.

The solutions are not necessarily unique (there may be multiple
paths/partitions with identical integer maxima in some systems) and there is
much research on whether the optimal solution is actually the one you want
to work with (for example, there may be a fair amount of probability mass
within an area/ensemble of suboptimal solutions that overall have greater
posterior probabilities than does the optimal solution singleton).  See
Chip Lawrence's PNAS paper for more erudite discussion, and references
therein: www.pnas.org/content/105/9/3209.abstract

-Aaron

P.S. Good to see you here Albyn -- I enjoyed your stat. methods course at
Reed back in 1993, which started me down a somewhat windy road to
statistical genomics!

--
Aaron J. Mackey, PhD
Assistant Professor
Center for Public Health Genomics
University of Virginia
amac...@virginia.edu


On Wed, Jan 13, 2010 at 5:23 PM, Ravi Varadhan rvarad...@jhmi.edu wrote:

 Greg - thanks for posting this interesting problem.

 Albyn - thanks for posting a solution.  Now, I have some questions: (1) is
 the algorithm guaranteed to find a best solution? (2) can there be
 multiple solutions (it seems like there can be more than 1 solution
 depending on the data)?, and (3) is there a good reference for this and
 similar algorithms?

 Thanks & Best,
 Ravi.


 
 ---

 Ravi Varadhan, Ph.D.

 Assistant Professor, The Center on Aging and Health

 Division of Geriatric Medicine and Gerontology

 Johns Hopkins University

 Ph: (410) 502-2619

 Fax: (410) 614-9625

 Email: rvarad...@jhmi.edu

 Webpage:

 http://www.jhsph.edu/agingandhealth/People/Faculty_personal_pages/Varadhan.html




 
 


 -Original Message-
 From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org]
 On
 Behalf Of Albyn Jones
 Sent: Wednesday, January 13, 2010 1:19 PM
 To: Greg Snow
 Cc: r-help@r-project.org
 Subject: Re: [R] optimization challenge

 The key idea is that you are building a matrix that contains the
 solutions to smaller problems which are sub-problems of the big
 problem.  The first row of the matrix SSQ contains the solution for no
 splits, ie SSQ[1,j] is just the sum of squares about the overall mean
 for reading chapters1 through j in one day.  The iteration then uses
 row m-1 to construct row m, since if SSQ[m-1,j] (optimal reading of j
 chapters in m-1 days) is part of the overall optimal solution, you
 have already computed it, and so don't ever need to recompute it.

   TS = SSQ[m-1,j]+(SSQ1[j+1])

 computes the vector of possible solutions for SSQ[m,n] (n chapters in m
 days), breaking it into two pieces: chapters 1 to j in m-1 days, and
 chapters j+1 to n in 1 day.  j is a vector in the function, and min(TS) is
 the minimum over choices of j, ie SSQ[m,n].

 At the end, SSQ[128,239] is the optimal value for reading all 239
 chapters in 128 days.  That's just the objective function, so the rest
 involves constructing the list of optimal cuts, ie which chapters are
 grouped together for each day's reading.  That code uses the same
 idea... constructing a list of lists of cutpoints.

 statisticians should study a bit of data structures and algorithms!

 albyn
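A hedged base-R rendering of that bottom-up recurrence (the names here are
mine, not from Albyn's code; cost() is the squared deviation of a day's
total from the overall per-day mean):

```r
best_partition <- function(x, d) {
  ## split x (chapter lengths) into d contiguous blocks, minimizing
  ## the sum over blocks of (block total - overall mean per block)^2
  n  <- length(x)
  cs <- c(0, cumsum(x))
  target <- cs[n + 1] / d
  cost <- function(i, j) (cs[j + 1] - cs[i] - target)^2   # block x[i..j]

  SSQ <- matrix(Inf, d, n)                # SSQ[m, j]: best for x[1..j] in m days
  CUT <- matrix(0L, d, n)                 # argmin split points for backtracking
  SSQ[1, ] <- vapply(1:n, function(j) cost(1, j), numeric(1))
  for (m in seq_len(d - 1) + 1) {         # build row m from row m - 1
    for (j in m:n) {
      ks <- (m - 1):(j - 1)               # last chapter covered by first m-1 days
      ts <- SSQ[m - 1, ks] + vapply(ks, function(k) cost(k + 1, j), numeric(1))
      SSQ[m, j] <- min(ts)
      CUT[m, j] <- ks[which.min(ts)]
    }
  }
  ends <- integer(d); ends[d] <- n        # ends[m]: last chapter read on day m
  for (m in rev(seq_len(d - 1) + 1)) ends[m - 1] <- CUT[m, ends[m]]
  list(ssq = SSQ[d, n], ends = ends)
}
```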

 On Wed, Jan 13, 2010 at 10:45:11AM -0700, Greg Snow wrote:
  WOW, your results give about half the variance of my best optim run
 (possibly due to my suboptimal use of optim).
 
  Can you describe a little what the algorithm is doing?
 
  --
  Gregory (Greg) L. Snow Ph.D.
  Statistical Data Center
  Intermountain Healthcare
  greg.s...@imail.org
  801.408.8111
 
 
   -Original Message-
   From: Albyn Jones [mailto:jo...@reed.edu]
   Sent: Tuesday, January 12, 2010 5:31 PM
   To: Greg Snow
   Cc: r-help@r-project.org
   Subject: Re: [R] optimization challenge
  
   Greg
  
   Nice problem: I wasted my whole day on it :-)
  
   I was explaining my plan for a solution to a colleague who is a
   computer scientist, he pointed out that I was trying to re-invent the
   wheel known as dynamic programming.  here is my code, apparently it is
   called bottom up dynamic programming.  It runs pretty quickly, and
   returns (what I hope is :-) the optimal sum of squares and the
   cut-points.
  
   function(X=bom3$Verses,days=128){
   # find optimal BOM reading schedule for Greg Snow
   # minimize variance of quantity to read per day over 128 days
   #
   N = length(X)
   Nm1 = N-1
   SSQ- 

[R] hmm.discnp or other?

2009-08-12 Thread Aaron Mackey
(I think) I'd like to use the hmm.discnp package for a simple discrete,
two-state HMM, but my training data is irregularly shaped (i.e. the
observation chains are of varying length).  Additionally, I do not see how
to label the state of the observations given to the hmm() function.
Ultimately, I'd like to 1) train the hmm on labeled data, 2) use viterbi()
to calculate optimal labeling of unlabeled observations.

More concretely, I have labeled data that looks something like:

11212321221223121221112233222122112
ABA

21221223121221112233222122112
 ABAAA

 3121221112233222122112
  BB

from which I'd like to build the two hidden state (A and B) hmm that emits
observed 1, 2, or 3 at probabilities dictated by the hidden state, with
transition probabilities between the two states.  Given the trained HMM, I
then wish to label new sequences via viterbi().

Am I missing the purpose of this package?  I also read through the msm
package docs, but my data doesn't really have a time coordinate on which the
data should be aligned.

Thanks for any pointers,

-Aaron

amac...@virginia.edu



[R] an S idiom for ordering matrix by columns?

2009-02-19 Thread Aaron Mackey
There's got to be a better way to use order() on a matrix than this:

 y
    2L-035-3 2L-081-23 2L-143-18 2L-189-1 2R-008-5 2R-068-15 3L-113-4 3L-173-2
398        1         1         2        2        1         1        2        2
857        1         1         2        2        1         2        2        2
911        1         1         2        2        1         2        2        2
383        1         1         2        2        1         1        2        2
639        1         2         2        1        2         2        1        2
756        1         2         2        1        2         2        1        2
    3L-186-1 3R-013-7 3R-032-1 3R-169-10 X-002 X-087
398        1        2        2         2     1     2
857        1        2        2         2     1     2
911        1        2        2         2     1     2
383        1        2        2         2     1     2
639        2        2        1         2     1     2
756        2        2        1         2     1     2

y[order(y[,1],y[,2],y[,3],y[,4],y[,5],y[,6],y[,7],y[,8],y[,9],y[,10],y[,11],y[,12],y[,13],y[,14]),]
    2L-035-3 2L-081-23 2L-143-18 2L-189-1 2R-008-5 2R-068-15 3L-113-4 3L-173-2
398        1         1         2        2        1         1        2        2
383        1         1         2        2        1         1        2        2
857        1         1         2        2        1         2        2        2
911        1         1         2        2        1         2        2        2
639        1         2         2        1        2         2        1        2
756        1         2         2        1        2         2        1        2
    3L-186-1 3R-013-7 3R-032-1 3R-169-10 X-002 X-087
398        1        2        2         2     1     2
383        1        2        2         2     1     2
857        1        2        2         2     1     2
911        1        2        2         2     1     2
639        2        2        1         2     1     2
756        2        2        1         2     1     2
Thanks for any suggestions!

-Aaron



Re: [R] an S idiom for ordering matrix by columns?

2009-02-19 Thread Aaron Mackey
Thanks to all, do.call(order, as.data.frame(y)) was the idiom I was
missing!

-Aaron
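For the archives, a runnable miniature of the idiom (as.data.frame() turns
each matrix column into one argument of order()):

```r
set.seed(1)
y <- matrix(sample(1:2, 30, replace = TRUE), ncol = 5)
ys <- y[do.call(order, as.data.frame(y)), ]   # rows sorted by col 1, then 2, ...
```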

On Thu, Feb 19, 2009 at 11:52 AM, Gustaf Rydevik
gustaf.ryde...@gmail.comwrote:

 On Thu, Feb 19, 2009 at 5:40 PM, Aaron Mackey ajmac...@gmail.com wrote:
  There's got to be a better way to use order() on a matrix than this:
 
  [quoted matrix printouts and 14-argument order() call snipped; garbled in
  the archive]
  Thanks for any suggestions!
 
  -Aaron
 


 You mean something like this:
  test <- matrix(sample(1:4, 100, replace = TRUE), ncol = 10)
  test[do.call(order, data.frame(test)), ]

 ?

 Regards,

 Gustaf


 --
 Gustaf Rydevik, M.Sci.
 tel: +46(0)703 051 451
 address:Essingetorget 40,112 66 Stockholm, SE
 skype:gustaf_rydevik




Re: [R] database table merging tips with R

2008-09-11 Thread Aaron Mackey
I would load your set of userid's into a temporary table in oracle,
then join that table with the rest of your SQL query to get only the
matching rows out.

-Aaron
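A hedged sketch of that temp-table route with RODBC (the DSN, credentials,
and all table/column names below are placeholders, not from the original
message):

```r
library(RODBC)
connection <- odbcConnect("mydsn", uid = "me", pwd = "secret")   # placeholders

## push the 2000 ids up to the server, then let Oracle do the join
sqlSave(connection, r, tablename = "tmp_userids", rownames = FALSE)
d <- sqlQuery(connection, "
  select t.userid, t.x, t.y, t.z
    from largetable t
    join tmp_userids u on u.userid = t.userid")
```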

On Thu, Sep 11, 2008 at 2:33 PM, Avram Aelony [EMAIL PROTECTED] wrote:

 Dear R list,

 What is the best way to efficiently marry an R dataset with a very large 
 (Oracle) database table?

 The goal is to only return Oracle table rows that match IDs present in the R 
 dataset.
 I have an R data frame with 2000 user IDs analogous to: r = 
 data.frame(userid=round(runif(2000)*10,0))

 ...and I need to pull data from an Oracle table only for these 2000 IDs.  The 
 Oracle table is quite large. Additionally, the sql query may need to join to 
 other tables to bring in ancillary fields.

 I currently connect to Oracle via odbc:

 library(RODBC)
  connection <- odbcConnect(, uid=, pwd=)
  d = sqlQuery(connection, "select userid, x, y, z from largetable where
  timestamp > sysdate - 7")

 ...allowing me to pull data from the database table into the R object d and 
 then use the R merge function.  The problem however is that if d is too 
 large it may fail due to memory limitations or be inefficient.  I would like 
 to push the merge portion to the database and it would be very convenient if 
 it were possible to request that the query look to the R object for the ID's 
 to which it should restrict the output.

 Is there a way to do this?
 Something like the following fictional code:
  d = sqlQuery(connection, "select t.userid, x, y, z from largetable t where
  r$userid = t.userid")

 Would sqldf (http://code.google.com/p/sqldf/) help me out here? If so, how?   
 This would be convenient and help me avoid needing to create a temporary 
 table to store the R data, join via sql, then return the data back to R.

 I am using R version 2.7.2 (2008-08-25) / i386-pc-mingw32 .
 Thanks for your comments, ideas, recommendations.


 -Avram



Re: [R] database table merging tips with R

2008-09-11 Thread Aaron Mackey
Sorry, I see now you want to avoid this, but you did ask what was the
"best way to efficiently ...", and the temp-table solution certainly
matches your description.  What's wrong with using a temporary table?

-Aaron

On Thu, Sep 11, 2008 at 3:05 PM, Aaron Mackey [EMAIL PROTECTED] wrote:
 I would load your set of userid's into a temporary table in oracle,
 then join that table with the rest of your SQL query to get only the
 matching rows out.

 -Aaron






Re: [R] database table merging tips with R

2008-09-11 Thread Aaron Mackey
I guess I'd do it something like this:

dbGetQuery(con, "CREATE TEMPORARY TABLE foo ( etc etc )")
sapply(userids, function (x) { dbGetQuery(con, paste0("INSERT INTO foo
(userid) VALUES (", x, ")")) })

then later:

dbGetQuery(con, "DROP TABLE foo")

-Aaron

On Thu, Sep 11, 2008 at 3:21 PM, Avram Aelony [EMAIL PROTECTED] wrote:

 Perhaps I will need to create a temp table, but I am asking if there is a way 
 to avoid it.  It would be great if there were a way to tie the R data frame 
 temporarily to the query in a transparent fashion. If not, I will see if I 
 can create/drop the temp table directly from sqlQuery.
 -Avram



 On Thursday, September 11, 2008, at 12:07PM, Aaron Mackey [EMAIL 
 PROTECTED] wrote:
Sorry, I see now you want to avoid this, but you did ask what was the
best way to efficiently ..., and the temp. table solution certainly
matches your description.  What's wrong with using a temporary table?

-Aaron








Re: [R] Gumbell distribution - minimum case

2008-09-09 Thread Aaron Mackey
If you mean you want an EVD with a fat left tail (instead of a fat
right tail), then can't you just multiply all the values by -1 to
reverse the distribution?  A new location parameter could then shift
the distribution wherever you want along the number line ...

-Aaron
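A sketch of the negation trick in base R, with no extra package: draw standard Gumbel (maximum) variates by inverse-CDF sampling, then negate and relocate to get the minimum case (the `loc`/`scale` values are arbitrary illustrations):

```r
set.seed(42)
n <- 10000

# Standard Gumbel (maximum) draws via the inverse CDF: F^-1(u) = -log(-log(u)).
gmax <- -log(-log(runif(n)))

# Negate to flip the fat tail to the left, then shift/scale as desired.
loc   <- 3
scale <- 2
gmin  <- loc - scale * gmax   # Gumbel *minimum* with location `loc`, scale `scale`
```

The mean of the minimum-case draws should sit near `loc - scale * 0.5772` (Euler-Mascheroni constant), i.e. below the location parameter, with the long tail pointing left.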

On Mon, Sep 8, 2008 at 5:22 PM, Richard Gwozdz [EMAIL PROTECTED] wrote:
 Hello,

 I would like to sample from a Gumbell (minimum) distribution.  I have
 installed package {evd} but the Gumbell functions there appear to refer to
 the maximum case.  Unfortunately, setting the scale parameter negative does
 not appear to work.

 Is there a separate package for the Gumbell minimum?


 --
 _
 Rich Gwozdz
 Fire and Mountain Ecology Lab
 College of Forest Resources
 University of Washington
 cell: 206-769-6808 office: 206-543-9138
 [EMAIL PROTECTED]



[R] hex2RGB back to hex not the same?

2008-08-28 Thread Aaron Mackey
Witness this oddity (to me):

> rainbow_hcl(10)[1]
[1] "#E18E9E"
> d <- attributes(hex2RGB(rainbow_hcl(10)))$coords[1,]
> rgb(d[1], d[2], d[3])
[1] "#C54D5F"

What happened?  FYI, this came up as I'm trying to reuse the RGB values I
get from rainbow_hcl in a call to rgb() where I can also set alpha
transparency levels ...

-Aaron
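(Archive note, hedged: the round trip likely darkens the colour because colorspace's RGB coordinates are gamma-corrected relative to the sRGB hex string. For the stated goal of adding transparency, `grDevices::adjustcolor` sidesteps the conversion entirely; the colours below are illustrative.)

```r
# adjustcolor() keeps the RGB channels intact and appends an alpha channel,
# so there is no need to decompose the hex string and call rgb() by hand.
cols <- c("#E18E9E", "#C0A1BC")          # e.g. output of colorspace::rainbow_hcl
cols_alpha <- adjustcolor(cols, alpha.f = 0.5)
```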




[R] lmer syntax, matrix of (grouped) covariates?

2008-08-18 Thread Aaron Mackey
I have a fairly large model:

> length(Y)
[1] 3051
> dim(covariates)
[1] 3051  211

All of these 211 covariates need to be nested hierarchically within a
grouping class, of which there are 8.  I have an accessory vector,
cov2class, that specifies the mapping between the covariates and the 8 classes.

Now, I understand I can break all this information up into individual
vectors (cov1, cov2, ..., cov211, class1, class2, ..., class8), and do
something like this:

model <- lmer(Y ~ 1 + cov1 + cov2 + ... + cov211 +
  (cov1 + cov2 + ... | class1) +
  (...) +
  (... + cov210 + cov211 | class8))

But I'd like to keep things syntactically simpler, and use the covariates
and cov2class variables directly.  I haven't been able to find the right
syntactic sugar to get this done.

Thanks for any help,

-Aaron
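One way to avoid typing 211 terms by hand (a sketch, not a change to the model: the data below are tiny made-up stand-ins, and `lmer` would still need each class to exist as a factor column in the model data) is to build the formula string programmatically:

```r
# Tiny stand-ins for the real objects described above.
covariates <- matrix(rnorm(30), ncol = 3,
                     dimnames = list(NULL, c("cov1", "cov2", "cov3")))
cov2class  <- c("classA", "classA", "classB")

covs   <- colnames(covariates)
fixed  <- paste(covs, collapse = " + ")            # cov1 + cov2 + cov3
groups <- split(covs, cov2class)                   # covariates grouped by class
random <- paste(sprintf("(%s | %s)",
                        vapply(groups, paste, character(1), collapse = " + "),
                        names(groups)),
                collapse = " + ")
form <- as.formula(paste("Y ~ 1 +", fixed, "+", random))
# form: Y ~ 1 + cov1 + cov2 + cov3 + (cov1 + cov2 | classA) + (cov3 | classB)
```

The resulting `form` can then be passed straight to `lmer(form, data = ...)`.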

