Here is one possible way (you will need to change the dataset and condition, etc.):
tmp1 <- combn(names(iris)[1:4], 2, function(x) {
  # return NA for a pair that meets the drop condition, else the term "a:b"
  if (any(iris[[x[1]]] * iris[[x[2]]] < .25)) {
    NA
  } else {
    paste(x, collapse = ':')
  }
})
tmp1 <- tmp1[!is.na(tmp1)]     # keep only the retained pairs
paste(tmp1, collapse = ' + ')  # interaction terms joined into one string

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.s...@imail.org
801.408.8111

> -----Original Message-----
> From: Matthew Douglas [mailto:matt.dougla...@gmail.com]
> Sent: Thursday, March 03, 2011 3:43 PM
> To: Greg Snow
> Cc: r-help@r-project.org
> Subject: Re: [R] Regression with many independent variables
>
> Thanks for getting back to me so quickly, Greg. I'm not quite sure how
> to do what you just said; is there an example that you can show?
>
> I understand how to create the string with a formula in it, but I'm not
> sure how to loop through the pairs of variables. How do I first get
> these 2-way interaction variables? I can no longer use the "^", right?
>
> Sorry for so many questions,
>
> Matt
>
> On Thu, Mar 3, 2011 at 4:16 PM, Greg Snow <greg.s...@imail.org> wrote:
> > What you might need to do is create a character string with your
> > formula in it (looping through pairs of variables and using paste or
> > sprintf) then convert that to a formula using the as.formula function.
> >
> > --
> > Gregory (Greg) L. Snow Ph.D.
> > Statistical Data Center
> > Intermountain Healthcare
> > greg.s...@imail.org
> > 801.408.8111
> >
> >
> >> -----Original Message-----
> >> From: Matthew Douglas [mailto:matt.dougla...@gmail.com]
> >> Sent: Thursday, March 03, 2011 2:09 PM
> >> To: Greg Snow
> >> Cc: r-help@r-project.org
> >> Subject: Re: [R] Regression with many independent variables
> >>
> >> Thanks Greg,
> >>
> >> that formula was exactly what I was looking for.
> >> Except now when I run it on my data I get the following error:
> >>
> >> "Error in model.matrix.default(mt, mf, contrasts) : cannot allocate
> >> vector of length 2043479998"
> >>
> >> I know there are probably many 2-way interactions that are zero, so I
> >> thought I could save space by removing these. Is there some way that
> >> I can just delete all the 2-way interactions that are zero and keep
> >> the columns that have non-zero entries? I think that will
> >> significantly cut down the memory needed. Or is there just another
> >> way to get around this?
> >>
> >> thanks,
> >> Matt
> >>
> >> On Tue, Mar 1, 2011 at 3:56 PM, Greg Snow <greg.s...@imail.org> wrote:
> >> > You can use ^2 to get all 2-way interactions and ^3 to get all
> >> > 3-way interactions, e.g.:
> >> >
> >> > lm(Sepal.Width ~ (. - Sepal.Length)^2, data=iris)
> >> >
> >> > The lm.fit function is what actually does the fitting, so you could
> >> > go directly there, but then you lose the benefits of using . and ^.
> >> > The Matrix package has ways of dealing with sparse matrices, but I
> >> > don't know if that would help here or not.
> >> >
> >> > You could also just create x'x and x'y matrices directly since the
> >> > variables are 0/1 then use solve. A lot depends on what you are
> >> > doing and what questions you are trying to answer.
> >> >
> >> > --
> >> > Gregory (Greg) L. Snow Ph.D.
> >> > Statistical Data Center
> >> > Intermountain Healthcare
> >> > greg.s...@imail.org
> >> > 801.408.8111
> >> >
> >> >
> >> >> -----Original Message-----
> >> >> From: Matthew Douglas [mailto:matt.dougla...@gmail.com]
> >> >> Sent: Tuesday, March 01, 2011 1:09 PM
> >> >> To: Greg Snow
> >> >> Cc: r-help@r-project.org
> >> >> Subject: Re: [R] Regression with many independent variables
> >> >>
> >> >> Hi Greg,
> >> >>
> >> >> Thanks for the help, it works perfectly. To answer your question,
> >> >> there are 339 independent variables but only 10 will be used at
> >> >> one time.
> >> >> So at any given line of the data set there will be 10 non-zero
> >> >> entries for the independent variables and the rest will be zeros.
> >> >>
> >> >> One more question:
> >> >>
> >> >> 1. I still want to find a way to look at the interactions of the
> >> >> independent variables.
> >> >>
> >> >> the regression would look like this:
> >> >>
> >> >> y = b12*X1X2 + b23*X2X3 +...+ bk-1k*Xk-1Xk
> >> >>
> >> >> so I think the regression in R would look like this:
> >> >>
> >> >> lm(MARGIN ~ P235:P236 + P236:P237 + ...., weights = Poss, data = adj0708),
> >> >>
> >> >> my problem is that since I have technically 339 independent
> >> >> variables, when I do this regression I would have 339 Choose 2 =
> >> >> approx 57000 independent variables (a vast majority will be 0s
> >> >> though) so I don't want to have to write all of these out. Is
> >> >> there a way to do this quickly in R?
> >> >>
> >> >> Also just a curious question that I can't seem to find online:
> >> >> is there a more efficient model other than lm() that is better for
> >> >> very sparse data sets like mine?
> >> >>
> >> >> Thanks,
> >> >> Matt
> >> >>
> >> >>
> >> >> On Mon, Feb 28, 2011 at 4:30 PM, Greg Snow <greg.s...@imail.org> wrote:
> >> >> > Don't put the name of the dataset in the formula, use the data
> >> >> > argument to lm to provide that. A single period (".") on the
> >> >> > right hand side of the formula will represent all the columns in
> >> >> > the data set that are not on the left hand side (you can then use
> >> >> > "-" to remove any other columns that you don't want included on
> >> >> > the RHS).
> >> >> >
> >> >> > For example:
> >> >> >
> >> >> > > lm(Sepal.Width ~ . - Sepal.Length, data=iris)
> >> >> >
> >> >> > Call:
> >> >> > lm(formula = Sepal.Width ~ . - Sepal.Length, data = iris)
> >> >> >
> >> >> > Coefficients:
> >> >> >       (Intercept)       Petal.Length        Petal.Width  Speciesversicolor
> >> >> >            3.0485             0.1547             0.6234            -1.7641
> >> >> >  Speciesvirginica
> >> >> >           -2.1964
> >> >> >
> >> >> >
> >> >> > But, are you sure that a regression model with 339 predictors
> >> >> > will be meaningful?
> >> >> >
> >> >> > --
> >> >> > Gregory (Greg) L. Snow Ph.D.
> >> >> > Statistical Data Center
> >> >> > Intermountain Healthcare
> >> >> > greg.s...@imail.org
> >> >> > 801.408.8111
> >> >> >
> >> >> >
> >> >> >> -----Original Message-----
> >> >> >> From: r-help-boun...@r-project.org
> >> >> >> [mailto:r-help-bounces@r-project.org] On Behalf Of Matthew Douglas
> >> >> >> Sent: Monday, February 28, 2011 1:32 PM
> >> >> >> To: r-help@r-project.org
> >> >> >> Subject: [R] Regression with many independent variables
> >> >> >>
> >> >> >> Hi,
> >> >> >>
> >> >> >> I am trying to use lm() on some data; the code works fine but I
> >> >> >> would like to use a more efficient way to do this.
> >> >> >>
> >> >> >> The data looks like this (the data is very sparse with a few 1s,
> >> >> >> -1s and the rest 0s):
> >> >> >>
> >> >> >> > head(adj0708)
> >> >> >>       MARGIN Poss P235 P247 P703 P218 P430 P489 P83 P307 P337 ....
> >> >> >> 1   64.28571   29    0    0    0    0    0    0   0    0    0  0  0  0  0
> >> >> >> 2 -100.00000    6    0    0    0    0    0    0   0    1    0  0  0  0  0
> >> >> >> 3  100.00000    4    0    0    0    0    0    0   0    1    0  0  0  0  0
> >> >> >> 4  -33.33333    7    0    0    0    0    0    0   0    0    0  0  0  0  0
> >> >> >> 5  200.00000    2    0    0    0    0    0    0   0    0    0  0 -1  0  0
> >> >> >> 6  -83.33333   12    0   -1    0    0    0    0   0    0    0  0  0  0  0
> >> >> >>
> >> >> >> adj0708 is actually a 35657x341 data set.
> >> >> >> Each column after "Poss" is an independent variable, the
> >> >> >> dependent variable is "MARGIN", and it is weighted by "Poss".
> >> >> >>
> >> >> >>
> >> >> >> The regression is below:
> >> >> >> fit.adj0708 <- lm( adj0708$MARGIN~adj0708$P235 + adj0708$P247 +
> >> >> >> adj0708$P703 + adj0708$P430 + adj0708$P489 + adj0708$P218 +
> >> >> >> adj0708$P605 + adj0708$P337 + .... +
> >> >> >> adj0708$P510,weights=adj0708$Poss)
> >> >> >>
> >> >> >> I have two questions:
> >> >> >>
> >> >> >> 1. Is there a way to condense how I write the independent
> >> >> >> variables in the lm(), instead of having such a long line of code
> >> >> >> (I have 339 independent variables to be exact)?
> >> >> >> 2. I would like to pair the data to look at a regression of the
> >> >> >> interactions between two independent variables. I think it would
> >> >> >> look something like this....
> >> >> >> fit.adj0708 <- lm( adj0708$MARGIN~adj0708$P235:adj0708$P247 +
> >> >> >> adj0708$P703:adj0708$P430 + adj0708$P489:adj0708$P218 +
> >> >> >> adj0708$P605:adj0708$P337 + ....,weights=adj0708$Poss)
> >> >> >> but there will be 339 Choose 2 combinations, so a lot of
> >> >> >> independent variables! Is there a more efficient way of writing
> >> >> >> this code? Is there a way I can do this?
> >> >> >>
> >> >> >> Thanks,
> >> >> >> Matt
> >> >> >>
> >> >> >> ______________________________________________
> >> >> >> R-help@r-project.org mailing list
> >> >> >> https://stat.ethz.ch/mailman/listinfo/r-help
> >> >> >> PLEASE do read the posting guide
> >> >> >> http://www.R-project.org/posting-guide.html
> >> >> >> and provide commented, minimal, self-contained, reproducible
> >> >> >> code.
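As a follow-up sketch to the code at the top of this message: the pasted string still has to be turned into a formula before lm() can use it, which is the as.formula step mentioned earlier in the thread. The example below is only an illustration under some assumptions: iris stands in for the real data, Sepal.Length is picked as the response purely for the example, the drop condition is "the product of the two columns is zero in every row", and the names preds, int_terms, fml and fit are made up here.

```r
# Assumption: iris stands in for the real data set, and Sepal.Length is
# used as the response only for illustration.
preds <- names(iris)[2:4]

# Build "a:b" for each pair of predictors; return NA for a pair whose
# product is zero in every row (an interaction column of all zeros).
int_terms <- combn(preds, 2, function(x) {
  if (all(iris[[x[1]]] * iris[[x[2]]] == 0)) {
    NA_character_
  } else {
    paste(x, collapse = ":")
  }
})
int_terms <- int_terms[!is.na(int_terms)]

# Paste the terms into one string, convert it to a formula, and fit.
fml <- as.formula(paste("Sepal.Length ~", paste(int_terms, collapse = " + ")))
fit <- lm(fml, data = iris)
print(fml)
```

With the data from the thread the last step would presumably become lm(fml, data = adj0708, weights = Poss). Dropping the all-zero pairs only reduces the number of columns, so with many thousands of remaining interactions the dense model matrix may still be too large for lm().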