On Wed, 5 Dec 2007, Tim Calkins wrote:

> Hi all -
>
> I'm trying to find a way to create dummy variables from factors in a
> regression.  I have been using biglm along the lines of
>
> ff <- log(Price) ~ factor(Colour):factor(Store) +
> factor(DummyVar):factor(Colour):factor(Store)
>
> lm1 <- biglm(ff, data=my.dataset)
>
> but because there are lots of colours (>100) and lots of stores
> (>250), I run into memory problems.  Now, not every store sells every
> colour and so it should be possible to create the matrix of factor
> variables myself and greatly reduce the size of the problem.  it seems
> that lm / biglm use all combinations of factor levels when used in
> factor(Colour):factor(Store) so by creating my own matrix of factor
> variables i should be able to reduce the size of the problem
> considerably.
>
> If i have a data frame
>> my.dataset <- data.frame(Price=1:12, Colour= c('red','blue','green'),
> Store=c('a', 'b', 'c', 'a', 'c', 'd', 'e', 'e', 'e', 'e', 'b', 'e'),
> DummyVar = sort(rep(c(0,1),6)) )
>
> i want to create a data frame with the dummy vars that looks like
>
> red:a red:e   blue:b  blue:c  blue:e  green:c green:d green:e
> 1     0       0       0       0       0       0       0
> 0     0       1       0       0       0       0       0
> 0     0       0       0       0       1       0       0
> 1     0       0       0       0       0       0       0
> 0     0       0       1       0       0       0       0
> 0     0       0       0       0       0       1       0
> 0     1       0       0       0       0       0       0
> 0     0       0       0       1       0       0       0
> 0     0       0       0       0       0       0       1
> 0     1       0       0       0       0       0       0
> 0     0       1       0       0       0       0       0
> 0     0       0       0       0       0       0       1
>
> any ideas would be appreciated.


Use

mat <- model.matrix( ~ClrStr-1,
        transform( my.dataset, ClrStr =
                factor( paste(Colour,Store,sep=":") ) ) )

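Because ClrStr is built from the pasted labels, only the Colour:Store
combinations that actually occur in the data get a column. Then tidy up
the colnames() (each one carries a "ClrStr" prefix) and re-order the
columns if the order matters.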
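
A minimal sketch of those steps on the example data from the question. The `sub()` pattern and the `wanted` ordering vector are illustrative choices of mine, not part of the recipe above; the column order is simply copied from the table in the question.

```r
# Example data from the question; Colour is repeated explicitly to length 12.
my.dataset <- data.frame(
  Price    = 1:12,
  Colour   = rep(c("red", "blue", "green"), times = 4),
  Store    = c("a", "b", "c", "a", "c", "d", "e", "e", "e", "e", "b", "e"),
  DummyVar = sort(rep(c(0, 1), 6))
)

# One factor whose levels are only the observed Colour:Store pairs.
mat <- model.matrix(~ ClrStr - 1,
                    transform(my.dataset,
                              ClrStr = factor(paste(Colour, Store, sep = ":"))))

# Strip the "ClrStr" prefix that model.matrix() puts on each column name.
colnames(mat) <- sub("^ClrStr", "", colnames(mat))

# Re-order the columns to match the layout asked for (hypothetical helper vector).
wanted <- c("red:a", "red:e", "blue:b", "blue:c", "blue:e",
            "green:c", "green:d", "green:e")
mat <- mat[, wanted]
```

Each row of `mat` then has a single 1 in the column for its own Colour:Store pair, and pairs that never occur (red:b, say) get no column at all.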

----

However, if DummyVar is a categorical variable, the model just fits a
separate mean to each DummyVar/Colour/Store cell, so you could skip the
design matrix and compute the means on the appropriate subsets by
maintaining a table of sums and counts. Then, in a second pass through
the data, get the residual sum of squares. If the data are already in a
database, it might make sense to do these operations there and import
the results into R for further massaging.
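
A rough in-memory sketch of that two-pass idea, assuming DummyVar is categorical and the response is log(Price) as in the original formula. The names `cell`, `sums`, `counts` and `rss` are mine; with data too big for memory the same sums and counts would have to be accumulated chunk by chunk, or in the database.

```r
# Example data from the question; Colour is repeated explicitly to length 12.
my.dataset <- data.frame(
  Price    = 1:12,
  Colour   = rep(c("red", "blue", "green"), times = 4),
  Store    = c("a", "b", "c", "a", "c", "d", "e", "e", "e", "e", "b", "e"),
  DummyVar = sort(rep(c(0, 1), 6))
)

y <- log(my.dataset$Price)

# One label per observed DummyVar/Colour/Store cell; unused cells are dropped.
cell <- with(my.dataset, interaction(DummyVar, Colour, Store, drop = TRUE))

# Pass 1: table of per-cell sums and counts, from which the cell means follow.
sums   <- tapply(y, cell, sum)
counts <- tapply(y, cell, length)
means  <- sums / counts

# Pass 2: residual sum of squares about each observation's own cell mean.
fitted <- as.vector(means[as.character(cell)])
rss    <- sum((y - fitted)^2)
```

Cells with a single observation contribute nothing to `rss`, so only the cells that are observed more than once matter in the second pass.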


HTH,

Chuck

>
>
> -- 
> Tim Calkins
> 0406 753 997
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

Charles C. Berry                            (858) 534-2098
                                             Dept of Family/Preventive Medicine
E mailto:[EMAIL PROTECTED]                  UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901

