Thank you very much -- this was very helpful for differentiating among the aggregating methods!
Matt

On Wed, Jan 14, 2009 at 3:42 PM, Marc Schwartz <marc_schwa...@comcast.net> wrote:
> on 01/14/2009 02:51 PM Matthew Pettis wrote:
>> I have a specific question and a general question.
>>
>> Specific Question: I want to do an analysis on a data frame by 2 or more
>> class variables (i.e., use 2 or more columns in a dataframe to do
>> statistical classing). Coming from SAS, I'm used to being able to take a
>> data set and have the output of the analysis in a dataset for further
>> manipulation. I have a data set with vote totals, with one column being the
>> office name being voted on, and the other being the party of the candidate.
>> My votes are in the column "vc.n". I did the analysis I want with:
>>
>>   work <- by(sd62[,"vc.n"], sd62[,c("office.nm","party.abbr")], sum)
>>
>> the str() output of work looks like:
>>
>> > str(work)
>> 'by' int [1:9, 1:11] NA 30 NA NA 0 0 0 NA 33 25678 ...
>>  - attr(*, "dimnames")=List of 2
>>   ..$ office.nm : chr [1:9] "ATTORNEY GENERAL" "GOVERNOR & LT GOVERNOR" "SECRETARY OF STATE" "STATE AUDITOR" ...
>>   ..$ party.abbr: chr [1:11] "CP" "DFL" "DFL2" "GP" ...
>>  - attr(*, "call")= language by.default(data = sd62[, "vc.n"], INDICES = sd62[, c("office.nm", "party.abbr")], FUN = sum)
>>
>> work is now a list. I'd really like to have work be a data frame with 3
>> columns: The rows of the first two columns show the office and party levels
>> being considered, and the third being the sum of the votes for that level
>> combination. How do I cast this list/output into a data frame? Using
>> 'as.data.frame' doesn't work.
>>
>> General Question: I assume the answer to the specific question is dependent
>> on my understanding list objects and accessing their attributes. Can anyone
>> point me to a good, thorough treatment of these R topics? Specifically how
>> to read and interpret the output of the str() and attributes() functions,
>> how to extract the values of the 'by' output object into a data frame, etc.?
>>
>> Thanks,
>> Matt
>
> Matt,
>
> Welcome to R.
>
> The help pages for each function, while they can be intentionally terse,
> are a good first place to look. Many will include links/references to
> related sources.
>
> "An Introduction to R" is a good general place to start. A more thorough
> treatment is in the "R Language Definition" manual. There are also a
> plethora of contributed documents:
>
>   http://cran.r-project.org/other-docs.html
>
> and books on R and using R within specific domains:
>
>   http://www.r-project.org/doc/bib/R-books.html
>
>
> There are (at least) three ways to generate summary statistics based
> upon multi-level groupings. These include by(), tapply() and aggregate().
>
> The key difference between the three is the class/structure of the
> results object and the print (output) method. In the specific case of
> aggregate(), it must also return a scalar. Thus for example, unlike with
> by() and tapply(), you cannot use summary(), which returns multiple values.
>
> Thus the choice for which approach to take, to an extent, is founded on
> what you may subsequently do with the data.
>
> As an example, using the same set of data (warpbreaks):
>
> > str(warpbreaks)
> 'data.frame':   54 obs. of  3 variables:
>  $ breaks : num  26 30 54 25 70 52 51 26 67 18 ...
>  $ wool   : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 1 1 1 1 ...
>  $ tension: Factor w/ 3 levels "L","M","H": 1 1 1 1 1 1 1 1 1 2 ...
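
The point about the class/structure of the results object can be checked directly. What follows is only a rough sketch, assuming the built-in warpbreaks data; the names idx, res_by, res_tap and res_agg are made up for illustration and are not from the thread.

```r
# Grouping variables shared by all three calls (hypothetical helper name)
idx <- list(wool = warpbreaks$wool, tension = warpbreaks$tension)

res_by  <- by(warpbreaks$breaks, idx, sum)         # object of class "by"
res_tap <- tapply(warpbreaks$breaks, idx, sum)     # plain matrix
res_agg <- aggregate(warpbreaks$breaks, idx, sum)  # data frame

# class() shows what print() and other generics will dispatch on
print(class(res_by))
print(class(res_tap))
print(class(res_agg))

# attributes() lists everything attached to the underlying values:
# the dimensions, the dimnames, and (for by()) the recorded call
print(names(attributes(res_by)))
```

Reading the attributes() output next to the str() output is probably the quickest way to see which parts of an object are data and which are metadata.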
>
>
> # Use by()
>
> > by(warpbreaks[, 1],
>      list(wool = warpbreaks$wool, tension = warpbreaks$tension), sum)
> wool: A
> tension: L
> [1] 401
> ------------------------------------------------------
> wool: B
> tension: L
> [1] 254
> ------------------------------------------------------
> wool: A
> tension: M
> [1] 216
> ------------------------------------------------------
> wool: B
> tension: M
> [1] 259
> ------------------------------------------------------
> wool: A
> tension: H
> [1] 221
> ------------------------------------------------------
> wool: B
> tension: H
> [1] 169
>
>
>
> Note, because the result of using by() is at its core, a matrix/table,
> you can also do the following, explicitly using the print method for a
> table:
>
> > print.table(by(warpbreaks[, 1],
>                  list(wool = warpbreaks$wool,
>                       tension = warpbreaks$tension), sum))
>     tension
> wool   L   M   H
>    A 401 216 221
>    B 254 259 169
>
>
> which gives you printed output in the same format as tapply() below,
> without altering the structure of the result itself.
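
Since the cross-tabulated result is at its core a matrix/table, one possible way to get the three-column data frame asked for in the original question is to unroll that matrix with as.table() and as.data.frame(). This is a minimal sketch on warpbreaks, not something proposed in the thread; the names tab, long and "breaks.sum" are invented here.

```r
# Sum of breaks for every wool x tension combination, as a 2 x 3 matrix
tab <- tapply(warpbreaks$breaks,
              list(wool = warpbreaks$wool, tension = warpbreaks$tension),
              sum)

# as.table() marks the matrix as a contingency table; as.data.frame() then
# produces one row per cell, taking the column names from the dimnames and
# storing the cell values in the column named by 'responseName'
long <- as.data.frame(as.table(tab), responseName = "breaks.sum")
print(long)
```

Combinations that never occur in the data (the NA cells in the vote example) would presumably come through as rows with NA in the value column and can be dropped with na.omit() if they are not wanted.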
>
>
> # tapply() directly gives you a tabular output
>
> > tapply(warpbreaks[, 1],
>          list(wool = warpbreaks$wool, tension = warpbreaks$tension),
>          sum)
>     tension
> wool   L   M   H
>    A 401 216 221
>    B 254 259 169
>
>
>
> Note that the structure of the result from by() and the result from
> tapply() are quite similar:
>
> > str(by(warpbreaks[, 1],
>          list(wool = warpbreaks$wool, tension = warpbreaks$tension),
>          sum))
>  by [1:2, 1:3] 401 254 216 259 221 169
>  - attr(*, "dimnames")=List of 2
>   ..$ wool   : chr [1:2] "A" "B"
>   ..$ tension: chr [1:3] "L" "M" "H"
>  - attr(*, "call")= language by.default(data = warpbreaks[, 1], INDICES = list(wool = warpbreaks$wool, tension = warpbreaks$tension), FUN = sum)
>
>
> > str(tapply(warpbreaks[, 1],
>              list(wool = warpbreaks$wool, tension = warpbreaks$tension),
>              sum))
>  num [1:2, 1:3] 401 254 216 259 221 169
>  - attr(*, "dimnames")=List of 2
>   ..$ wool   : chr [1:2] "A" "B"
>   ..$ tension: chr [1:3] "L" "M" "H"
>
>
> Both are at their core, a 2 x 3 matrix.
>
> The key difference is in the 'class' of the result, which affects
> subsequent operations, such as the print method used.
>
>
>
> # aggregate() gives you a data frame, with the summary statistic as the
> # 'x' column
>
> > aggregate(warpbreaks[, 1],
>             list(wool = warpbreaks$wool, tension = warpbreaks$tension),
>             sum)
>   wool tension   x
> 1    A       L 401
> 2    B       L 254
> 3    A       M 216
> 4    B       M 259
> 5    A       H 221
> 6    B       H 169
>
>
> > str(aggregate(warpbreaks[, 1],
>                 list(wool = warpbreaks$wool, tension = warpbreaks$tension),
>                 sum))
> 'data.frame':   6 obs. of  3 variables:
>  $ wool   : Factor w/ 2 levels "A","B": 1 2 1 2 1 2
>  $ tension: Factor w/ 3 levels "L","M","H": 1 1 2 2 3 3
>  $ x      : num  401 254 216 259 221 169
>
>
> Thus, bottom line, given your intended application, I would suggest
> using aggregate() rather than by().
>
> HTH,
>
> Marc Schwartz
>

--
One of the penalties for refusing to participate in politics is that you end up being governed by your inferiors.
   -- Plato

It is from the wellspring of our despair and the places that we are broken that we come to repair the world.
   -- Murray Waas

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.