The empty levels appeared because I did not pass drop=TRUE to split() to remove the unused levels. Here is the revised script:

> set.seed(1)  # start with a known number
> x <- data.frame(cat=sample(LETTERS[1:3],20,TRUE),
+                 a=sample(letters[1:4], 20, TRUE),
+                 b=runif(20))
> x
   cat a          b
1    A d 0.82094629
2    B a 0.64706019
3    B c 0.78293276
4    C a 0.55303631
5    A b 0.52971958
6    C b 0.78935623
7    C a 0.02333120
8    B b 0.47723007
9    B d 0.73231374
10   A b 0.69273156
11   A b 0.47761962
12   A c 0.86120948
13   C b 0.43809711
14   B a 0.24479728
15   C d 0.07067905
16   B c 0.09946616
17   C d 0.31627171
18   C a 0.51863426
19   B c 0.66200508
20   C b 0.40683019
> # drop unused groups from the split
> (z <- split(x, list(x$cat, x$a), drop=TRUE))
$B.a
   cat a         b
2    B a 0.6470602
14   B a 0.2447973

$C.a
   cat a          b
4    C a 0.55303631
7    C a 0.02333120
18   C a 0.51863426

$A.b
   cat a         b
5    A b 0.5297196
10   A b 0.6927316
11   A b 0.4776196

$B.b
  cat a         b
8   B b 0.4772301

$C.b
   cat a         b
6    C b 0.7893562
13   C b 0.4380971
20   C b 0.4068302

$A.c
   cat a         b
12   A c 0.8612095

$B.c
   cat a          b
3    B c 0.78293276
16   B c 0.09946616
19   B c 0.66200508

$A.d
  cat a         b
1   A d 0.8209463

$B.d
  cat a         b
9   B d 0.7323137

$C.d
   cat a          b
15   C d 0.07067905
17   C d 0.31627171

> # access the value ('b' in this instance); two ways- should be the same
> z[[1]]$b
[1] 0.6470602 0.2447973
> z$B.a$b
[1] 0.6470602 0.2447973


On Sun, Jul 13, 2008 at 1:26 AM,  <[EMAIL PROTECTED]> wrote:
> This is almost it. Maybe it is as good as can be expected. The only problem
> that I see is that this seems to form a Category/SubCategory pair where none
> existed in the original data. For example, A might have two sub-categories a
> and b, and B might have two categories c and d. As far as I can tell the
> method that you outlined forms a Category/SubCategory pair like B a or B b
> where none existed. This results in a lot of empty lists and it seems to take
> a long time to generate. But if that is as good as it gets then I can live
> with it.
>
> I know that I said one more question. But I have run into a problem. c <-
> split(x, x$Category) returns a vector of the rows in each of the categories.
> Now I would like to access the "Quantity" column within this split vector. I
> can see it listed. I just can't access it. I have tried c[1]$Quantity and
> c[1,2], both of which give me errors. Any ideas?
>
> Sorry this is so hard for me. I am more used to C type arrays and C type
> arrays of structures. This seems to be somewhat different.
>
> Thank you.
>
> Kevin
> ---- jim holtman <[EMAIL PROTECTED]> wrote:
>> Is this something like what you were asking for?  The output of a
>> 'split' will be a list of the dataframe subsets for the categories you
>> have specified.
>>
>> > x <- data.frame(g1=sample(LETTERS[1:2],30,TRUE),
>> +     g2=sample(letters[1:2], 30, TRUE),
>> +     g3=1:30)
>> > y <- split(x, list(x$g1, x$g2))
>> > str(y)
>> List of 4
>>  $ A.a:'data.frame':   7 obs. of  3 variables:
>>   ..$ g1: Factor w/ 2 levels "A","B": 1 1 1 1 1 1 1
>>   ..$ g2: Factor w/ 2 levels "a","b": 1 1 1 1 1 1 1
>>   ..$ g3: int [1:7] 3 4 6 8 9 13 24
>>  $ B.a:'data.frame':   7 obs. of  3 variables:
>>   ..$ g1: Factor w/ 2 levels "A","B": 2 2 2 2 2 2 2
>>   ..$ g2: Factor w/ 2 levels "a","b": 1 1 1 1 1 1 1
>>   ..$ g3: int [1:7] 10 11 16 17 18 20 25
>>  $ A.b:'data.frame':   6 obs. of  3 variables:
>>   ..$ g1: Factor w/ 2 levels "A","B": 1 1 1 1 1 1
>>   ..$ g2: Factor w/ 2 levels "a","b": 2 2 2 2 2 2
>>   ..$ g3: int [1:6] 2 12 23 26 27 29
>>  $ B.b:'data.frame':   10 obs. of  3 variables:
>>   ..$ g1: Factor w/ 2 levels "A","B": 2 2 2 2 2 2 2 2 2 2
>>   ..$ g2: Factor w/ 2 levels "a","b": 2 2 2 2 2 2 2 2 2 2
>>   ..$ g3: int [1:10] 1 5 7 14 15 19 21 22 28 30
>> > y
>> $A.a
>>    g1 g2 g3
>> 3   A  a  3
>> 4   A  a  4
>> 6   A  a  6
>> 8   A  a  8
>> 9   A  a  9
>> 13  A  a 13
>> 24  A  a 24
>>
>> $B.a
>>    g1 g2 g3
>> 10  B  a 10
>> 11  B  a 11
>> 16  B  a 16
>> 17  B  a 17
>> 18  B  a 18
>> 20  B  a 20
>> 25  B  a 25
>>
>> $A.b
>>    g1 g2 g3
>> 2   A  b  2
>> 12  A  b 12
>> 23  A  b 23
>> 26  A  b 26
>> 27  A  b 27
>> 29  A  b 29
>>
>> $B.b
>>    g1 g2 g3
>> 1   B  b  1
>> 5   B  b  5
>> 7   B  b  7
>> 14  B  b 14
>> 15  B  b 15
>> 19  B  b 19
>> 21  B  b 21
>> 22  B  b 22
>> 28  B  b 28
>> 30  B  b 30
>>
>> > y[[2]]
>>    g1 g2 g3
>> 10  B  a 10
>> 11  B  a 11
>> 16  B  a 16
>> 17  B  a 17
>> 18  B  a 18
>> 20  B  a 20
>> 25  B  a 25
>>
>>
>> On Sat, Jul 12, 2008 at 8:51 PM,  <[EMAIL PROTECTED]> wrote:
>> > OK. Now I know that I am dealing with a data frame. One last question on
>> > this topic. a <- read.csv() gives me a dataframe. If I have 'c <- split(x,
>> > x$Category)', then what is returned by split in this case? c[1] seems to
>> > be OK but c[2] is not right in my mind. If I run ci <- split(nrow(a),
>> > a$Category). And then ci[1] seems to be the rows associated with the first
>> > category, c[2] is the indices/rows associated with the second category,
>> > etc. But this seems different than c[1], c[2], etc.
>> >
>> > Using the techniques below I can get the information on the categories.
>> > Now as an extra level of complexity there are SubCategories within each
>> > Category. Assume that the SubCategory names are not unique within the
>> > dataset so if I want the SubCategory data I need to retrieve the indices
>> > (or data) for the Category and SubCategory pair. In other words if I have
>> > a Category that ranges from 'A' to 'Z', it is possible that I might have a
>> > subcategory A a, A b (where a and b are the sub category names). I also
>> > might have B a, B b. I want all of the sub categories A a.
>> > NOT the subcategories a (because that might include B a which would be
>> > different). I am guessing that this will take more than a simple 'split'.
>> >
>> > Thank you.
>> >
>> > Kevin
>> >
>> > ---- Duncan Murdoch <[EMAIL PROTECTED]> wrote:
>> >> On 12/07/2008 3:59 PM, [EMAIL PROTECTED] wrote:
>> >> > I am sorry but if read.csv returns a dataframe and a dataframe is like
>> >> > a matrix and I have a set of input like below and a[1,] gives me the
>> >> > first row, what is the second index? From what I read and your input I
>> >> > am guessing that it is the column number. So a[1,1] would return the
>> >> > DayOfYear column for the first row, right? What does a$DayOfYear return?
>> >>
>> >> a$DayOfYear would be the same as a[,1] or a[,"DayOfYear"], i.e. it would
>> >> return the entire first column.
>> >>
>> >> Duncan Murdoch
>> >>
>> >> >
>> >> > Thank you for your patience.
>> >> >
>> >> > Kevin
>> >> >
>> >> > ---- Duncan Murdoch <[EMAIL PROTECTED]> wrote:
>> >> >> On 12/07/2008 12:31 PM, [EMAIL PROTECTED] wrote:
>> >> >>> I am using a simple R statement to read in the file:
>> >> >>>
>> >> >>> a <- read.csv("Sample.dat", header=TRUE)
>> >> >>>
>> >> >>> There is a lot of data but the first few lines look like:
>> >> >>>
>> >> >>> DayOfYear,Quantity,Fraction,Category,SubCategory
>> >> >>> 1,82,0.0000390392720794458,(Unknown),(Unknown)
>> >> >>> 2,78,0.0000371349173438631,(Unknown),(Unknown)
>> >> >>> . . .
>> >> >>> 71,2,0.0000009521773677913,WOMEN,Piratesses
>> >> >>> 72,4,0.0000019043547355827,WOMEN,Piratesses
>> >> >>> 73,3,0.0000014282660516870,WOMEN,Piratesses
>> >> >>> 74,14,0.0000066652415745395,WOMEN,Piratesses
>> >> >>> 75,2,0.0000009521773677913,WOMEN,Piratesses
>> >> >>>
>> >> >>> If I read the data in as above, the command
>> >> >>>
>> >> >>> a[1]
>> >> >>>
>> >> >>> results in the output
>> >> >>>
>> >> >>> [ reached getOption("max.print") -- omitted 16193 rows ]]
>> >> >>>
>> >> >>> Shouldn't this be the first row?
>> >> >> No, the first row would be a[1,].  read.csv() returns a dataframe, and
>> >> >> those are indexed with two indices to treat them like a matrix, or with
>> >> >> one index to treat them like a list of their columns.
>> >> >>
>> >> >> Duncan Murdoch
>> >> >>
>> >> >>> a$Category[1]
>> >> >>>
>> >> >>> results in the output
>> >> >>>
>> >> >>> [1] (Unknown)
>> >> >>> 4464 Levels:   Tags ... WOMEN
>> >> >>>
>> >> >>> But
>> >> >>>
>> >> >>> a$Category[365]
>> >> >>>
>> >> >>> gives me:
>> >> >>>
>> >> >>> [1] 7 Plates (Dessert),Western\n120,5,0.0000023804434194784,7 Plates (Dessert)
>> >> >>> 4464 Levels:   Tags ... WOMEN
>> >> >>>
>> >> >>> There is something fundamental about either vectors or the read.csv
>> >> >>> command that I am missing here.
>> >> >>>
>> >> >>> Thank you.
>> >> >>>
>> >> >>> Kevin
>> >> >>>
>> >> >>> ---- jim holtman <[EMAIL PROTECTED]> wrote:
>> >> >>>> Please provide commented, minimal, self-contained, reproducible code,
>> >> >>>> or at least a before/after of what your data would look like.
>> >> >>>> Taking a guess at what you are asking, here is one way of doing it:
>> >> >>>>
>> >> >>>>
>> >> >>>>> x <- data.frame(cat=sample(LETTERS[1:3],20,TRUE),a=1:20, b=runif(20))
>> >> >>>>> x
>> >> >>>>    cat  a          b
>> >> >>>> 1    B  1 0.65472393
>> >> >>>> 2    C  2 0.35319727
>> >> >>>> 3    B  3 0.27026015
>> >> >>>> 4    A  4 0.99268406
>> >> >>>> 5    C  5 0.63349326
>> >> >>>> 6    A  6 0.21320814
>> >> >>>> 7    C  7 0.12937235
>> >> >>>> 8    A  8 0.47811803
>> >> >>>> 9    A  9 0.92407447
>> >> >>>> 10   A 10 0.59876097
>> >> >>>> 11   A 11 0.97617069
>> >> >>>> 12   A 12 0.73179251
>> >> >>>> 13   B 13 0.35672691
>> >> >>>> 14   C 14 0.43147369
>> >> >>>> 15   C 15 0.14821156
>> >> >>>> 16   C 16 0.01307758
>> >> >>>> 17   B 17 0.71556607
>> >> >>>> 18   B 18 0.10318424
>> >> >>>> 19   C 19 0.44628435
>> >> >>>> 20   B 20 0.64010105
>> >> >>>>> # create a list of the indices of the data grouped by 'cat'
>> >> >>>>> split(seq(nrow(x)), x$cat)
>> >> >>>> $A
>> >> >>>> [1]  4  6  8  9 10 11 12
>> >> >>>>
>> >> >>>> $B
>> >> >>>> [1]  1  3 13 17 18 20
>> >> >>>>
>> >> >>>> $C
>> >> >>>> [1]  2  5  7 14 15 16 19
>> >> >>>>
>> >> >>>>> # or do you want the data
>> >> >>>>> split(x, x$cat)
>> >> >>>> $A
>> >> >>>>    cat  a         b
>> >> >>>> 4    A  4 0.9926841
>> >> >>>> 6    A  6 0.2132081
>> >> >>>> 8    A  8 0.4781180
>> >> >>>> 9    A  9 0.9240745
>> >> >>>> 10   A 10 0.5987610
>> >> >>>> 11   A 11 0.9761707
>> >> >>>> 12   A 12 0.7317925
>> >> >>>>
>> >> >>>> $B
>> >> >>>>    cat  a         b
>> >> >>>> 1    B  1 0.6547239
>> >> >>>> 3    B  3 0.2702601
>> >> >>>> 13   B 13 0.3567269
>> >> >>>> 17   B 17 0.7155661
>> >> >>>> 18   B 18 0.1031842
>> >> >>>> 20   B 20 0.6401010
>> >> >>>>
>> >> >>>> $C
>> >> >>>>    cat  a          b
>> >> >>>> 2    C  2 0.35319727
>> >> >>>> 5    C  5 0.63349326
>> >> >>>> 7    C  7 0.12937235
>> >> >>>> 14   C 14 0.43147369
>> >> >>>> 15   C 15 0.14821156
>> >> >>>> 16   C 16 0.01307758
>> >> >>>> 19   C 19 0.44628435
>> >> >>>>
>> >> >>>>
>> >> >>>> On Sat, Jul 12, 2008 at 3:32 AM,  <[EMAIL PROTECTED]> wrote:
>> >> >>>>> I have searched the archive and I
>> >> >>>>> could not find what I need so I will try to ask the question here.
>> >> >>>>>
>> >> >>>>> I read a table in (read.table)
>> >> >>>>>
>> >> >>>>> a <- read.table(.....)
>> >> >>>>>
>> >> >>>>> The table has column names like DayOfYear, Quantity, and Category.
>> >> >>>>>
>> >> >>>>> The values in the row for Category are strings (characters).
>> >> >>>>>
>> >> >>>>> I want to get all of the rows grouped by Category. The number of
>> >> >>>>> unique category names could be around 50. Say for argument sake the
>> >> >>>>> number of categories is exactly 50. Can I somehow get a vector of
>> >> >>>>> length 50 containing the rows corresponding to the category
>> >> >>>>> (another vector)? I realize I can access any row a[i]$Category
>> >> >>>>> (right?). But I want a vector containing the rows corresponding to
>> >> >>>>> each distinct Category name.
>> >> >>>>>
>> >> >>>>> Thank you.
>> >> >>>>>
>> >> >>>>> Kevin
>> >> >>>>>
>> >> >>>>> ______________________________________________
>> >> >>>>> R-help@r-project.org mailing list
>> >> >>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>> >> >>>>> PLEASE do read the posting guide
>> >> >>>>> http://www.R-project.org/posting-guide.html
>> >> >>>>> and provide commented, minimal, self-contained, reproducible code.
>> >> >>>>>
>> >> >>>>
>> >> >>>> --
>> >> >>>> Jim Holtman
>> >> >>>> Cincinnati, OH
>> >> >>>> +1 513 646 9390
>> >> >>>>
>> >> >>>> What is the problem you are trying to solve?
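
On the access question quoted above (c[1]$Quantity and c[1,2] not working), here is a minimal sketch of why single brackets fail on the result of split() and what can be used instead. It assumes made-up data: only the column names Category, SubCategory and Quantity come from the thread, while the values and the names a, by_cat, n_all, n_used and totals are hypothetical. It also counts the groups with and without drop=TRUE for a Category/SubCategory split.

```r
# Hypothetical stand-in for the poster's file; the values are invented.
a <- data.frame(Category    = c("A", "A", "B", "B", "B"),
                SubCategory = c("a", "b", "c", "c", "d"),
                Quantity    = c(82, 78, 2, 4, 3))

by_cat <- split(a, a$Category)   # a named list of data frames

# Single brackets return a sub-list (still a list), so $Quantity finds nothing.
is.null(by_cat[1]$Quantity)      # TRUE

# Double brackets (or $name) return the data frame itself.
by_cat[[1]]$Quantity             # Quantity values for category "A"
by_cat$B$Quantity                # Quantity values for category "B"

# One value per group, here the total Quantity in each category.
totals <- sapply(by_cat, function(d) sum(d$Quantity))

# Splitting on two factors builds every Category x SubCategory combination
# unless drop=TRUE is given; only 4 of the 8 combinations occur in 'a'.
n_all  <- length(split(a, list(a$Category, a$SubCategory)))               # 8
n_used <- length(split(a, list(a$Category, a$SubCategory), drop = TRUE))  # 4
```

The sketch avoids calling the list c, since that name is also the base function c(); any other name works the same way.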

-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem you are trying to solve?