The empty levels appeared because I did not pass drop=TRUE to split()
to remove the unused level combinations.  Here is the revised script:

> set.seed(1)  # start with a known number
> x <- data.frame(cat=sample(LETTERS[1:3],20,TRUE),
+     a=sample(letters[1:4], 20, TRUE), b=runif(20))
> x
   cat a          b
1    A d 0.82094629
2    B a 0.64706019
3    B c 0.78293276
4    C a 0.55303631
5    A b 0.52971958
6    C b 0.78935623
7    C a 0.02333120
8    B b 0.47723007
9    B d 0.73231374
10   A b 0.69273156
11   A b 0.47761962
12   A c 0.86120948
13   C b 0.43809711
14   B a 0.24479728
15   C d 0.07067905
16   B c 0.09946616
17   C d 0.31627171
18   C a 0.51863426
19   B c 0.66200508
20   C b 0.40683019
> # drop unused groups from the split
> (z <- split(x, list(x$cat, x$a), drop=TRUE))
$B.a
   cat a         b
2    B a 0.6470602
14   B a 0.2447973

$C.a
   cat a          b
4    C a 0.55303631
7    C a 0.02333120
18   C a 0.51863426

$A.b
   cat a         b
5    A b 0.5297196
10   A b 0.6927316
11   A b 0.4776196

$B.b
  cat a         b
8   B b 0.4772301

$C.b
   cat a         b
6    C b 0.7893562
13   C b 0.4380971
20   C b 0.4068302

$A.c
   cat a         b
12   A c 0.8612095

$B.c
   cat a          b
3    B c 0.78293276
16   B c 0.09946616
19   B c 0.66200508

$A.d
  cat a         b
1   A d 0.8209463

$B.d
  cat a         b
9   B d 0.7323137

$C.d
   cat a          b
15   C d 0.07067905
17   C d 0.31627171

> # access the values ('b' in this instance) in two ways; the results should match
> z[[1]]$b
[1] 0.6470602 0.2447973
> z$B.a$b
[1] 0.6470602 0.2447973

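On the access question in the quoted message below, here is a minimal sketch of the difference between single and double brackets on the result of split(). The data frame is made up for illustration; only the column names Category and Quantity come from the thread, and the list is called 'parts' (my choice) rather than 'c' so that it does not mask the c() function.

```r
# Hypothetical data; only the column names follow the thread.
a <- data.frame(Category = c("A", "A", "B", "B", "B"),
                Quantity = c(82, 78, 2, 4, 3))
parts <- split(a, a$Category)   # a list of data frames, one per category

class(parts[1])      # "list": single brackets keep the list wrapper
class(parts[[1]])    # "data.frame": double brackets extract the element

parts[1]$Quantity    # NULL: the length-one list has no 'Quantity' element
parts[[1]]$Quantity  # the Quantity column for the first category
parts$B$Quantity     # the same idea, selecting the element by name

# one summary value per category
sapply(parts, function(d) sum(d$Quantity))
```

The split result is a list rather than a C-style array of structures, so two-index forms such as parts[1,2] do not apply to it; each element has to be pulled out with [[ ]] or $ first.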

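To see what drop=TRUE changes, the sketch below counts the groups with and without it. The data frame is a made-up example in which sub-categories are nested within categories, as described in the quoted message; all names and values are assumptions for illustration only.

```r
# Hypothetical nested data: a and b occur only under A, c and d only under B.
x <- data.frame(Category    = c("A", "A", "B", "B"),
                SubCategory = c("a", "b", "c", "d"),
                Quantity    = 1:4)
groups <- list(x$Category, x$SubCategory)

all_pairs  <- split(x, groups)               # every Category x SubCategory combination
used_pairs <- split(x, groups, drop = TRUE)  # only combinations present in the data

length(all_pairs)                    # 8 (2 categories x 4 sub-categories)
sum(sapply(all_pairs, nrow) == 0)    # 4 of those are empty data frames
names(used_pairs)                    # "A.a" "A.b" "B.c" "B.d"
```

Without drop=TRUE the result has one element for every possible pairing, so pairs such as B.a show up as empty data frames even though they never occur in the data.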
On Sun, Jul 13, 2008 at 1:26 AM,  <[EMAIL PROTECTED]> wrote:
> This is almost it. Maybe it is as good as can be expected. The only problem
> that I see is that this seems to form Category/SubCategory pairs where none
> existed in the original data. For example, A might have two sub-categories a
> and b, and B might have two sub-categories c and d. As far as I can tell, the
> method that you outlined forms a Category/SubCategory pair like B a or B b
> where none existed. This results in a lot of empty lists and it seems to take
> a long time to generate. But if that is as good as it gets then I can live
> with it.
>
> I know that I said one more question. But I have run into a problem. c <- 
> split(x, x$Category) returns a vector of the rows in each of the categories. 
> Now I would like to access the "Quantity" column within this split vector. I 
> can see it listed. I just can't access it. I have tried c[1]$Quantity and
> c[1,2], both of which give me errors. Any ideas?
>
> Sorry this is so hard for me. I am more used to C type arrays and C type 
> arrays of structures. This seems to be somewhat different.
>
> Thank you.
>
> Kevin
> ---- jim holtman <[EMAIL PROTECTED]> wrote:
>> Is this something like what you were asking for?  The output of a
>> 'split' will be a list of the dataframe subsets for the categories you
>> have specified.
>>
>> > x <- data.frame(g1=sample(LETTERS[1:2],30,TRUE),
>> +     g2=sample(letters[1:2], 30, TRUE),
>> +     g3=1:30)
>> > y <- split(x, list(x$g1, x$g2))
>> > str(y)
>> List of 4
>>  $ A.a:'data.frame':    7 obs. of  3 variables:
>>   ..$ g1: Factor w/ 2 levels "A","B": 1 1 1 1 1 1 1
>>   ..$ g2: Factor w/ 2 levels "a","b": 1 1 1 1 1 1 1
>>   ..$ g3: int [1:7] 3 4 6 8 9 13 24
>>  $ B.a:'data.frame':    7 obs. of  3 variables:
>>   ..$ g1: Factor w/ 2 levels "A","B": 2 2 2 2 2 2 2
>>   ..$ g2: Factor w/ 2 levels "a","b": 1 1 1 1 1 1 1
>>   ..$ g3: int [1:7] 10 11 16 17 18 20 25
>>  $ A.b:'data.frame':    6 obs. of  3 variables:
>>   ..$ g1: Factor w/ 2 levels "A","B": 1 1 1 1 1 1
>>   ..$ g2: Factor w/ 2 levels "a","b": 2 2 2 2 2 2
>>   ..$ g3: int [1:6] 2 12 23 26 27 29
>>  $ B.b:'data.frame':    10 obs. of  3 variables:
>>   ..$ g1: Factor w/ 2 levels "A","B": 2 2 2 2 2 2 2 2 2 2
>>   ..$ g2: Factor w/ 2 levels "a","b": 2 2 2 2 2 2 2 2 2 2
>>   ..$ g3: int [1:10] 1 5 7 14 15 19 21 22 28 30
>> > y
>> $A.a
>>    g1 g2 g3
>> 3   A  a  3
>> 4   A  a  4
>> 6   A  a  6
>> 8   A  a  8
>> 9   A  a  9
>> 13  A  a 13
>> 24  A  a 24
>>
>> $B.a
>>    g1 g2 g3
>> 10  B  a 10
>> 11  B  a 11
>> 16  B  a 16
>> 17  B  a 17
>> 18  B  a 18
>> 20  B  a 20
>> 25  B  a 25
>>
>> $A.b
>>    g1 g2 g3
>> 2   A  b  2
>> 12  A  b 12
>> 23  A  b 23
>> 26  A  b 26
>> 27  A  b 27
>> 29  A  b 29
>>
>> $B.b
>>    g1 g2 g3
>> 1   B  b  1
>> 5   B  b  5
>> 7   B  b  7
>> 14  B  b 14
>> 15  B  b 15
>> 19  B  b 19
>> 21  B  b 21
>> 22  B  b 22
>> 28  B  b 28
>> 30  B  b 30
>>
>> > y[[2]]
>>    g1 g2 g3
>> 10  B  a 10
>> 11  B  a 11
>> 16  B  a 16
>> 17  B  a 17
>> 18  B  a 18
>> 20  B  a 20
>> 25  B  a 25
>>
>>
>> On Sat, Jul 12, 2008 at 8:51 PM,  <[EMAIL PROTECTED]> wrote:
>> > OK. Now I know that I am dealing with a data frame. One last question on
>> > this topic. a <- read.csv() gives me a dataframe. If I have c <- split(x,
>> > x$Category), then what is returned by split in this case? c[1] seems to
>> > be OK but c[2] is not right in my mind. If I run ci <- split(nrow(a),
>> > a$Category), then ci[1] seems to be the rows associated with the first
>> > category, ci[2] is the indices/rows associated with the second category,
>> > etc. But this seems different than c[1], c[2], etc.
>> >
>> > Using the techniques below I can get the information on the categories.
>> > Now as an extra level of complexity there are SubCategories within each
>> > Category. Assume that the SubCategory names are not unique within the
>> > dataset, so if I want the SubCategory data I need to retrieve the indices
>> > (or data) for the Category and SubCategory pair. In other words, if I have
>> > a Category that ranges from 'A' to 'Z', it is possible that I might have
>> > subcategories A a and A b (where a and b are the subcategory names). I also
>> > might have B a and B b. I want all of the rows for subcategory A a, NOT all
>> > of subcategory a (because that might include B a, which would be different).
>> > I am guessing that this will take more than a simple 'split'.
>> >
>> > Thank you.
>> >
>> > Kevin
>> >
>> > ---- Duncan Murdoch <[EMAIL PROTECTED]> wrote:
>> >> On 12/07/2008 3:59 PM, [EMAIL PROTECTED] wrote:
>> >> > I am sorry but if read.csv returns a dataframe and a dataframe is like 
>> >> > a matrix and I have a set of input like below and a[1,] gives me the 
>> >> > first row, what is the second index? From what I read and your input I 
>> >> > am guessing that it is the column number. So a[1,1] would return the 
>> >> > DayOfYear column for the first row, right? What does a$DayOfYear return?
>> >>
>> >> a$DayOfYear would be the same as a[,1] or a[,"DayOfYear"], i.e. it would
>> >> return the entire first column.
>> >>
>> >> Duncan Murdoch
>> >>
>> >> >
>> >> > Thank you for your patience.
>> >> >
>> >> > Kevin
>> >> >
>> >> > ---- Duncan Murdoch <[EMAIL PROTECTED]> wrote:
>> >> >> On 12/07/2008 12:31 PM, [EMAIL PROTECTED] wrote:
>> >> >>> I am using a simple R statement to read in the file:
>> >> >>>
>> >> >>> a <- read.csv("Sample.dat", header=TRUE)
>> >> >>>
>> >> >>> There is a lot of data but the first few lines look like:
>> >> >>>
>> >> >>> DayOfYear,Quantity,Fraction,Category,SubCategory
>> >> >>> 1,82,0.0000390392720794458,(Unknown),(Unknown)
>> >> >>> 2,78,0.0000371349173438631,(Unknown),(Unknown)
>> >> >>> . . .
>> >> >>> 71,2,0.0000009521773677913,WOMEN,Piratesses
>> >> >>> 72,4,0.0000019043547355827,WOMEN,Piratesses
>> >> >>> 73,3,0.0000014282660516870,WOMEN,Piratesses
>> >> >>> 74,14,0.0000066652415745395,WOMEN,Piratesses
>> >> >>> 75,2,0.0000009521773677913,WOMEN,Piratesses
>> >> >>>
>> >> >>> If I read the data in as above, the command
>> >> >>>
>> >> >>> a[1]
>> >> >>>
>> >> >>> results in the output
>> >> >>>
>> >> >>> [ reached getOption("max.print") -- omitted 16193 rows ]]
>> >> >>>
>> >> >>> Shouldn't this be the first row?
>> >> >> No, the first row would be a[1,].  read.csv() returns a dataframe, and
>> >> >> those are indexed with two indices to treat them like a matrix, or with
>> >> >> one index to treat them like a list of their columns.
>> >> >>
>> >> >> Duncan Murdoch
>> >> >>
>> >> >>> a$Category[1]
>> >> >>>
>> >> >>> results in the output
>> >> >>>
>> >> >>> [1] (Unknown)
>> >> >>> 4464 Levels:   Tags ... WOMEN
>> >> >>>
>> >> >>> But
>> >> >>>
>> >> >>> a$Category[365]
>> >> >>>
>> >> >>> gives me:
>> >> >>>
>> >> >>> [1] 7 Plates   (Dessert),Western\n120,5,0.0000023804434194784,7 
>> >> >>> Plates   (Dessert)
>> >> >>> 4464 Levels:   Tags ... WOMEN
>> >> >>>
>> >> >>> There is something fundamental about either vectors or the read.csv
>> >> >>> command that I am missing here.
>> >> >>>
>> >> >>> Thank you.
>> >> >>>
>> >> >>> Kevin
>> >> >>>
>> >> >>> ---- jim holtman <[EMAIL PROTECTED]> wrote:
>> >> >>>> Please provide commented, minimal, self-contained, reproducible code,
>> >> >>>> or at least a before/after of what your data would look like.  Taking a
>> >> >>>> guess at what you are asking, here is one way of doing it:
>> >> >>>>
>> >> >>>>
>> >> >>>>> x <- data.frame(cat=sample(LETTERS[1:3],20,TRUE),a=1:20, b=runif(20))
>> >> >>>>> x
>> >> >>>>    cat  a          b
>> >> >>>> 1    B  1 0.65472393
>> >> >>>> 2    C  2 0.35319727
>> >> >>>> 3    B  3 0.27026015
>> >> >>>> 4    A  4 0.99268406
>> >> >>>> 5    C  5 0.63349326
>> >> >>>> 6    A  6 0.21320814
>> >> >>>> 7    C  7 0.12937235
>> >> >>>> 8    A  8 0.47811803
>> >> >>>> 9    A  9 0.92407447
>> >> >>>> 10   A 10 0.59876097
>> >> >>>> 11   A 11 0.97617069
>> >> >>>> 12   A 12 0.73179251
>> >> >>>> 13   B 13 0.35672691
>> >> >>>> 14   C 14 0.43147369
>> >> >>>> 15   C 15 0.14821156
>> >> >>>> 16   C 16 0.01307758
>> >> >>>> 17   B 17 0.71556607
>> >> >>>> 18   B 18 0.10318424
>> >> >>>> 19   C 19 0.44628435
>> >> >>>> 20   B 20 0.64010105
>> >> >>>>> # create a list of the indices of the data grouped by 'cat'
>> >> >>>>> split(seq(nrow(x)), x$cat)
>> >> >>>> $A
>> >> >>>> [1]  4  6  8  9 10 11 12
>> >> >>>>
>> >> >>>> $B
>> >> >>>> [1]  1  3 13 17 18 20
>> >> >>>>
>> >> >>>> $C
>> >> >>>> [1]  2  5  7 14 15 16 19
>> >> >>>>
>> >> >>>>> # or do you want the data
>> >> >>>>> split(x, x$cat)
>> >> >>>> $A
>> >> >>>>    cat  a         b
>> >> >>>> 4    A  4 0.9926841
>> >> >>>> 6    A  6 0.2132081
>> >> >>>> 8    A  8 0.4781180
>> >> >>>> 9    A  9 0.9240745
>> >> >>>> 10   A 10 0.5987610
>> >> >>>> 11   A 11 0.9761707
>> >> >>>> 12   A 12 0.7317925
>> >> >>>>
>> >> >>>> $B
>> >> >>>>    cat  a         b
>> >> >>>> 1    B  1 0.6547239
>> >> >>>> 3    B  3 0.2702601
>> >> >>>> 13   B 13 0.3567269
>> >> >>>> 17   B 17 0.7155661
>> >> >>>> 18   B 18 0.1031842
>> >> >>>> 20   B 20 0.6401010
>> >> >>>>
>> >> >>>> $C
>> >> >>>>    cat  a          b
>> >> >>>> 2    C  2 0.35319727
>> >> >>>> 5    C  5 0.63349326
>> >> >>>> 7    C  7 0.12937235
>> >> >>>> 14   C 14 0.43147369
>> >> >>>> 15   C 15 0.14821156
>> >> >>>> 16   C 16 0.01307758
>> >> >>>> 19   C 19 0.44628435
>> >> >>>>
>> >> >>>>
>> >> >>>> On Sat, Jul 12, 2008 at 3:32 AM,  <[EMAIL PROTECTED]> wrote:
>> >> >>>>> I have searched the archive and I could not find what I need so I
>> >> >>>>> will try to ask the question here.
>> >> >>>>>
>> >> >>>>> I read a table in (read.table)
>> >> >>>>>
>> >> >>>>> a <- read.table(.....)
>> >> >>>>>
>> >> >>>>> The table has column names like DayOfYear, Quantity, and Category.
>> >> >>>>>
>> >> >>>>> The values in the row for Category are strings (characters).
>> >> >>>>>
>> >> >>>>> I want to get all of the rows grouped by Category. The number of
>> >> >>>>> unique category names could be around 50. Say for argument's sake the
>> >> >>>>> number of categories is exactly 50. Can I somehow get a vector of
>> >> >>>>> length 50 containing the rows corresponding to the category
>> >> >>>>> (another vector)? I realize I can access any row a[i]$Category
>> >> >>>>> (right?). But I want a vector containing the rows corresponding to
>> >> >>>>> each distinct Category name.
>> >> >>>>>
>> >> >>>>> Thank you.
>> >> >>>>>
>> >> >>>>> Kevin
>> >> >>>>>
>> >> >>>>> ______________________________________________
>> >> >>>>> R-help@r-project.org mailing list
>> >> >>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>> >> >>>>> PLEASE do read the posting guide 
>> >> >>>>> http://www.R-project.org/posting-guide.html
>> >> >>>>> and provide commented, minimal, self-contained, reproducible code.
>> >> >>>>>
>> >> >>>>
>> >> >>>> --
>> >> >>>> Jim Holtman
>> >> >>>> Cincinnati, OH
>> >> >>>> +1 513 646 9390
>> >> >>>>
>> >> >>>> What is the problem you are trying to solve?
>> >>
>> >
>> >
>>
>>
>>
>> --
>> Jim Holtman
>> Cincinnati, OH
>> +1 513 646 9390
>>
>> What is the problem you are trying to solve?
>
>



-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem you are trying to solve?

