Re: [R] Restructure some data

Doran, Harold Fri, 26 Feb 2010 08:37:35 -0800

Thank you both for your replies; both are very useful. The larger issue at hand 
is that the data will actually be huge, thus the end result will be a very 
large, sparse data frame.


So, I decided to put all three possible solutions to a timing test and see what 
they yield. I simulated 15000 students and created an item pool of 300 
items that could be selected. I fixed the number of items each 
student sees at 3, although this will be on the order of 50 in the real-world 
problem.

So, first the new data for testing all three solutions.

item.pool <- paste("item", 1:300, sep = "")
N <- 15000
set.seed(54321)
dat <- data.frame(id = 1:N,
        first.item  = sample(item.pool, N, replace = TRUE),
        second.item = sample(item.pool, N, replace = TRUE),
        third.item  = sample(item.pool, N, replace = TRUE),
        score1 = sample(c(0, 1), N, replace = TRUE),
        score2 = sample(c(0, 1), N, replace = TRUE),
        score3 = sample(c(0, 1), N, replace = TRUE))
        
My original loop is now wrapped in the function 'harold', and I created two new 
functions, 'bill' and 'phil', for the other two solutions. I modified Bill's 
code only to reflect my original naming conventions. The functions are listed 
after my signature; timing results for each solution are below.

> system.time(result <- harold(dat))
   user  system elapsed 
1347.85  441.92 1799.75

> system.time(result <- bill(dat))
   user  system elapsed 
   0.04    0.04    0.09

> system.time(result <- phil(dat))
   user  system elapsed 
   4.42    0.00    4.42

The loop timing is laughable, so it is out. Phil clearly wins from the "golf" 
viewpoint, but Bill's solution is by far the fastest (0.09 seconds elapsed 
versus 4.42). Phil, it is actually irrelevant that the original ordering of 
the columns is not preserved, since that can be easily remedied by a post-hoc 
reordering of columns.
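
The post-hoc reordering might look something like the sketch below. It assumes 
the result is a matrix whose column names are item names; reorder_items is a 
hypothetical helper name, not something from this thread, and it also pads any 
item that was never administered with an all-NA column.

```r
# Hypothetical helper: put columns in item-pool order and
# add an all-NA column for every pool item missing from 'mat'.
reorder_items <- function(mat, pool) {
  out <- matrix(NA_real_, nrow = nrow(mat), ncol = length(pool),
                dimnames = list(rownames(mat), pool))
  present <- intersect(pool, colnames(mat))
  out[, present] <- mat[, present, drop = FALSE]
  out
}

# Small stand-in for a tapply() result: columns sorted alphabetically
# (item10 before item2) and most of the pool absent.
pool <- paste("item", 1:10, sep = "")
m <- matrix(c(1, NA, NA, 0, 0, 1), nrow = 2,
            dimnames = list(c("1", "2"), c("item1", "item10", "item2")))
reordered <- reorder_items(m, pool)
```

If every item in the pool is known to appear at least once, plain column 
indexing by name, result[, item.pool], should be enough.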

Again, thank you both.
Harold

harold <- function(dat){
        Nstu <- nrow(dat)
        df <- matrix(NA, ncol = length(item.pool), nrow = Nstu)
        colnames(df) <- item.pool
        for(i in 1:Nstu){
                for(j in 2:4){
                        rr <- which(dat[i,j] == colnames(df))
                        df[i,rr] <- dat[i, (j+3)]
                }
        }
        df
}
system.time(result <- harold(dat))

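bill <- function(dat) {
        L <- length(item.pool)
        items <- as.matrix(dat[2:4])
        scores <- as.matrix(dat[5:7])
        retval <- matrix(NA_real_, nrow = nrow(dat), ncol = L,
                dimnames = list(character(), item.pool))
        # match() maps each item name to its column; cbind() pairs it with
        # the row index. dat$id is recycled across the three item columns,
        # which assumes ids run 1..nrow(dat).
        retval[cbind(dat$id, match(items, item.pool))] <- scores
        retval
}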
system.time(result <- bill(dat))

phil <- function(dat){
        df <- tapply(as.vector(as.matrix(dat[5:7])),
                list(rep(dat$id,3),as.vector(as.matrix(dat[2:4]))),I)
        df
}
system.time(result <- phil(dat))
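
Before relying on the timings, it may be worth confirming on a small data set 
that the loop and the matrix-indexing approach fill the same cells. The sketch 
below is not from the thread: loop_fill and index_fill are hypothetical 
stand-ins for 'harold' and 'bill' that take the item pool as an argument, the 
data are hand-made, and the data frame is built with stringsAsFactors = FALSE 
so the item columns are plain character vectors.

```r
pool <- paste("item", 1:6, sep = "")
small <- data.frame(id = 1:3,
                    first.item  = c("item2", "item1", "item6"),
                    second.item = c("item5", "item3", "item2"),
                    third.item  = c("item4", "item6", "item1"),
                    score1 = c(1, 0, 1),
                    score2 = c(0, 0, 1),
                    score3 = c(1, 1, 0),
                    stringsAsFactors = FALSE)

# Loop version: look up each administered item by name, one cell at a time.
loop_fill <- function(dat, pool) {
  out <- matrix(NA_real_, nrow = nrow(dat), ncol = length(pool),
                dimnames = list(NULL, pool))
  for (i in seq_len(nrow(dat))) {
    for (j in 2:4) {
      out[i, dat[i, j]] <- dat[i, j + 3]
    }
  }
  out
}

# Vectorized version: index the matrix with a two-column (row, column) matrix.
index_fill <- function(dat, pool) {
  items  <- as.matrix(dat[2:4])
  scores <- as.matrix(dat[5:7])
  out <- matrix(NA_real_, nrow = nrow(dat), ncol = length(pool),
                dimnames = list(NULL, pool))
  rows <- rep(seq_len(nrow(dat)), times = ncol(items))
  out[cbind(rows, match(items, pool))] <- scores
  out
}

same <- identical(loop_fill(small, pool), index_fill(small, pool))
```

Here the row index is built from seq_len(nrow(dat)) rather than dat$id, so the 
check does not depend on the ids running from 1 to N.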

-----Original Message-----
From: Phil Spector [mailto:spec...@stat.berkeley.edu] 
Sent: Thursday, February 25, 2010 5:38 PM
To: Doran, Harold
Cc: r-help@r-project.org
Subject: Re: [R] Restructure some data

Harold -
    Here's what I came up with:

>  tapply(as.vector(as.matrix(dat[5:7])),
+         list(rep(dat$id,3),as.vector(as.matrix(dat[2:4]))),I)
   item1 item10 item2 item3 item4 item5 item7 item9
1    NA     NA     1    NA    NA     1    NA     0
2     0     NA    NA    NA    NA     1     1    NA
3     1     NA     0     1    NA    NA    NA    NA
4    NA     NA    NA     1     0    NA     0    NA
5    NA      1    NA     0     1    NA    NA    NA

I thought there would be a way to use xtabs, but I had
trouble preserving the NAs.
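
One possible workaround with xtabs, offered only as a sketch: tabulate the 
score sums and the cell counts separately, then blank out the cells that no 
student/item pair falls into. The long-format data frame and the names long, 
sums, counts and wide are made up for illustration, and the approach assumes 
each student sees a given item at most once.

```r
item.pool <- paste("item", 1:4, sep = "")
long <- data.frame(
  id    = c(1, 1, 2, 2),
  item  = factor(c("item1", "item3", "item2", "item3"), levels = item.pool),
  score = c(1, 0, 0, 1)
)

sums   <- xtabs(score ~ id + item, data = long)  # empty cells come back as 0
counts <- xtabs(~ id + item, data = long)        # number of rows per cell

# Copy the sums into a plain matrix, then set never-observed cells to NA.
wide <- matrix(as.vector(sums), nrow = nrow(sums), dimnames = dimnames(sums))
wide[as.vector(counts) == 0] <- NA
```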

The columns aren't in the right order, and the item6 column is
missing, but it's pretty close.
Thanks for the easily reproducible example, and the interesting
puzzle.
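
One way to address both the column order and the missing column might be to 
turn the item names into a factor whose levels are the full item pool before 
calling tapply, since tapply keeps one column per level and leaves empty cells 
as NA. The sketch below uses small made-up vectors (ids, items, scores) rather 
than the 'dat' from this thread.

```r
item.pool <- paste("item", 1:5, sep = "")
ids    <- c(1, 1, 2, 2)
items  <- c("item4", "item2", "item5", "item2")
scores <- c(1, 0, 0, 1)

# Levels fix both the set of columns and their order;
# item1 and item3 are never administered but still get a column.
wide <- tapply(scores, list(ids, factor(items, levels = item.pool)), I)
```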

                                        - Phil Spector
                                         Statistical Computing Facility
                                         Department of Statistics
                                         UC Berkeley
                                         spec...@stat.berkeley.edu


On Thu, 25 Feb 2010, Doran, Harold wrote:

> Suppose I have a data frame like "dat" below. For some context, this is the 
> format that represents students taking a computer adaptive test. first.item 
> is the first item that student was administered, score1 is the 
> student's response to that item, and so forth.
>
> item.pool <- paste("item", 1:10, sep = "")
> set.seed(54321)
> dat <- data.frame(id = c(1,2,3,4,5),
>                first.item = sample(item.pool, 5, replace=TRUE),
>                second.item = sample(item.pool, 5, replace=TRUE),
>                third.item = sample(item.pool, 5, replace=TRUE),
>                score1 = sample(c(0,1), 5, replace=TRUE),
>                score2 = sample(c(0,1), 5, replace=TRUE),
>                score3 = sample(c(0,1), 5, replace=TRUE))
>
> I need to restructure this into a new format. The new matrix df (after the 
> loop) is exactly what I want in the end. But, I'm annoyed at myself for not 
> thinking of a more efficient way to restructure this without using a loop.
>
> df <- matrix(NA, ncol = length(item.pool), nrow = nrow(dat))
> colnames(df) <- unique(item.pool)
>
> for(i in 1:5){
>                for(j in 2:4){
>                                rr <- which(dat[i,j] == colnames(df))
>                                df[i,rr] <- dat[i, (j+3)]
>                }
> }
>
> Any thoughts?
>
> Harold
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

