Re: [R] Data manipulation problem

David Winsemius Wed, 07 Apr 2010 06:13:11 -0700

That code throws multiple errors. Can you at least test your codebefore posting?

(And, again, please avoid using function names as names for yourobjects.)


-- David.

On Apr 7, 2010, at 8:54 AM, moleps islon wrote:

So.. here we try again.

##generate dataset
age.cat<-seq(0,100,10)
year<-(1953:(1953+55))
data.vec<-sample(10000:10000,(age.cat*year))
data.matrix<-matrix(data.vec,c(length(age.cat),length(year))
rownames(data.matrix)<-age.cat
colnames(data.matrix)<-year

##divide into 5 year periods
age.div<-cut(year,seq(1950,2010,6),include.lowest=T) ##interval is
beyond my datainterval so I doubt the include.lowest matters
Now what I'd like to do is summarise the rows within the 5-yearintervals.
I did read about apply in its different variants and Dahlgaard, but I
do not know understand how it could be applied in this setting.

I tried making an array and summarise by that (used the vector and
applied it into a
length(age.cat)*max(vector(table(age.div)*length(age.div) array. It
worked but required a bit of tweaking (inserting null columns) and I
find myself in this situation quite often whereby I need to add
multiple columns based on another vector so I'd be very interested in
another more general approach.

//M
On Tue, Apr 6, 2010 at 9:41 PM, David Winsemius <dwinsem...@comcast.net> wrote:
On Apr 6, 2010, at 3:30 PM, David Winsemius wrote:
On Apr 6, 2010, at 9:56 AM, moleps islon wrote:
OK... next question.. Which is still a data manipulation problemso I
believe the heading is still OK.

##So now I read my population data from excel.
No, you read it from a text file and providing the first ten linesof thattext file should have been really easy. Read the Posting Guide foradviceabout offering datasets either as structure() objects with dput ordump oras attached files with "*.txt" extension (not .csv). Just changethe file
name with your file browser.
pop<-read.csv("pop.csv")

typeof(pop) ## yields a list
Really? I would have guessed it to yield just "list".
where I have age-specific population rows
and a yearly column population, where the years are suffixed by X
And had you used class(pop) you would have learned it was adataframe and
even more informative would have been str(pop).
c<-(1953:2008)
No, no, no. Do not use variable names that are important functionnames.The R interpreter can (usually) keep things straight but it is ourbrainsthat experience problems. Other function names to avoid: data,df, cut,
mean, sd, list, vector, matrix
names(pop)<-c
c.div<-cut(c,break=seq(1950,2010,by=5)
(You should have gotten an error here.) After fixing the error,did you
you notice that there were only 3 of the first level???
Watch out for cut(). It uses the default convention of ( , ] ,i.e. open
interval at right
er,
      ^left^
which is backwards to what some (most?) of us think natural.Because ofthat the lowest level gets dropped unless you take specialprecautions.That is undoubtedly why Harrell set up his Hmisc::cut2 to have thedefault
be [ , )
Aggregating across columns? Certainly possible, but maybe not asnatural afit to functions like split as would occur with working acrossrows. Isuppose you could use something like this untested (because_still_ no
sample dataset provided) code:

apply(pop, 1,    # this works a row a time
function(x) tapply(x, list(c.div), sum) ) ) # or use aggregatewhich
uses tapply
I'm not sure it will work, since I don't know if the column nameswouldget carried over into "x" by apply(). You might need to create aseparateindex that used the numeric positions of the columns rather thantheirnames. Perhaps use c.div <- seq(0,(2008-1953)) %/% 5 or somesuch inside
tapply.
Now I'd like to sum the agespecific population over the individual
levels of -c.div- and generate a new table for this withagespecific
rows and columns containing the 5-year bins instead of the original
yearly data. Do I have to program this from scratch or is itpossible
to use an already existing function?
I think you ought to read more introductory material (and thePostingGuide regarding how to offer example datasets). In this case thereare manyfunctions that do data aggregation and most of them should beillustrated in
a good introductory text.

--
David.
//M

qta<- table(cut(age,breaks = seq(0, 100, by = 10),include.lowest =
TRUE),cut(year,breaks=seq(1950,2010,by=5),include.lowest=TRUE

On Mon, Apr 5, 2010 at 10:11 PM, moleps <mole...@gmail.com> wrote:
Thx Erik,
I have no idea what went wrong with the other code snippet, butthis one
works.. Appreciate it.

qta<- table(cut(age,breaks = seq(0, 100, by = 10),include.lowest =
TRUE),cut(year,breaks=seq(1950,2010,by=5),include.lowest=TRUE))

M


On 5. apr. 2010, at 21.45, Erik Iverson wrote:
I don't know what your data are like, since you haven't given a
reproducible example. I was imagining something like:

## generate fake data
age <- sample(20:90, 100, replace = TRUE)
year <- sample(1950:2000, 100, replace = TRUE)

##look at big table
table(age, year)

## categorize data
## see include.lowest and right arguments to cut
age.factor <- cut(age, breaks = seq(20, 90, by = 10),
             include.lowest = TRUE)

year.factor <- cut(year, breaks = seq(1950, 2000, by = 10),
              include.lowest = TRUE)

table(age.factor, year.factor)

moleps wrote:
I already did try the regression modeling approach. However the
epidemiologists (referee) turns out to be quite fond ofcomparing theincidence rates to different standard populations, hence theneed for thislabourius approach. And trying the "cutting" approach I endedup with :
table (age5)
age5
(0,5] (5,10] (10,15] (15,20] (20,25] (25,30] (30,35](35,40](40,45] (45,50] (50,55] (55,60] (60,65] (65,70] (70,75](75,80](80,85] (85,100] 35 34 33 4751 109157 231 362 511 745 926 1002866 547
  247       82       18
table (yr5)
yr5
(1950,1955] (1955,1960] (1960,1965] (1965,1970] (1970,1975]
(1975,1980] (1980,1985] (1985,1990] (1990,1995] (1995,2000](2000,2005](2005,2009] 3 5 55 5
     5           5           5         5           5           5
3
table (yr5,age5)
Error in table(yr5, age5) : all arguments must have the samelength
Sincerely,
M
On 5. apr. 2010, at 20.59, Bert Gunter wrote:
You have tempted, and being weak, I yield to temptation:

"Any good ideas?"

Yes. Don't do this.
(what you probably really want to do is fit a model with ageas a
factor,
which can be done statistically e.g. by logistic regression; or
graphically
using conditioning plots, e.g. via trellis graphics (thelattice
package).
This avoids the arbitrariness and discontinuities of binningby age
range.)

Bert Gunter
Genentech Nonclinical Biostatistics

-----Original Message-----
From: r-help-boun...@r-project.org
[mailto:r-help-boun...@r-project.org] On
Behalf Of moleps
Sent: Monday, April 05, 2010 11:46 AM
To: r-help@r-project.org
Subject: [R] Data manipulation problem

Dear R´ers.

I´ve got a dataset with age and year of diagnosis. In order to
age-standardize the incidence I need to transform the datainto a
matrix
with age-groups (divided in 5 or 10 years) along one axis andyear
divided
into 5 years along the other axis. Each cell should contain the
number of
cases for that age group and for that period.
I.e.
My data format now is
ID-age (to one decimal)-year(yearly data).

What I´d like is

age 1960-1965 1966-1970 etc...
0-5 3 8 10 15
6-10 2 5 8 13
etc..


Any good ideas?

Regards,
M
David Winsemius, MD
West Hartford, CT
On Tue, Apr 6, 2010 at 9:41 PM, David Winsemius <dwinsem...@comcast.net> wrote:
On Apr 6, 2010, at 3:30 PM, David Winsemius wrote:
On Apr 6, 2010, at 9:56 AM, moleps islon wrote:
OK... next question.. Which is still a data manipulation problemso I
believe the heading is still OK.

##So now I read my population data from excel.
No, you read it from a text file and providing the first ten linesof thattext file should have been really easy. Read the Posting Guide foradviceabout offering datasets either as structure() objects with dput ordump oras attached files with "*.txt" extension (not .csv). Just changethe file
name with your file browser.
pop<-read.csv("pop.csv")

typeof(pop) ## yields a list
Really? I would have guessed it to yield just "list".
where I have age-specific population rows
and a yearly column population, where the years are suffixed by X
And had you used class(pop) you would have learned it was adataframe and
even more informative would have been str(pop).
c<-(1953:2008)
No, no, no. Do not use variable names that are important functionnames.The R interpreter can (usually) keep things straight but it is ourbrainsthat experience problems. Other function names to avoid: data,df, cut,
mean, sd, list, vector, matrix
names(pop)<-c
c.div<-cut(c,break=seq(1950,2010,by=5)
(You should have gotten an error here.) After fixing the error,did you
you notice that there were only 3 of the first level???
Watch out for cut(). It uses the default convention of ( , ] ,i.e. open
interval at right
er,
       ^left^
which is backwards to what some (most?) of us think natural.Because ofthat the lowest level gets dropped unless you take specialprecautions.That is undoubtedly why Harrell set up his Hmisc::cut2 to havethe default
be [ , )
Aggregating across columns? Certainly possible, but maybe not asnatural afit to functions like split as would occur with working acrossrows. Isuppose you could use something like this untested (because_still_ no
sample dataset provided) code:

apply(pop, 1,    # this works a row a time
function(x) tapply(x, list(c.div), sum) ) ) # or use aggregatewhich
uses tapply
I'm not sure it will work, since I don't know if the column nameswouldget carried over into "x" by apply(). You might need to create aseparateindex that used the numeric positions of the columns rather thantheirnames. Perhaps use c.div <- seq(0,(2008-1953)) %/% 5 or somesuch inside
tapply.
Now I'd like to sum the agespecific population over the individual
levels of -c.div- and generate a new table for this withagespecific
rows and columns containing the 5-year bins instead of the original
yearly data. Do I have to program this from scratch or is itpossible
to use an already existing function?
I think you ought to read more introductory material (and thePostingGuide regarding how to offer example datasets). In this case thereare manyfunctions that do data aggregation and most of them should beillustrated in
a good introductory text.

--
David.
//M

qta<- table(cut(age,breaks = seq(0, 100, by = 10),include.lowest =
TRUE),cut(year,breaks=seq(1950,2010,by=5),include.lowest=TRUE

On Mon, Apr 5, 2010 at 10:11 PM, moleps <mole...@gmail.com> wrote:
Thx Erik,
I have no idea what went wrong with the other code snippet, butthis one
works.. Appreciate it.

qta<- table(cut(age,breaks = seq(0, 100, by = 10),include.lowest =
TRUE),cut(year,breaks=seq(1950,2010,by=5),include.lowest=TRUE))

M


On 5. apr. 2010, at 21.45, Erik Iverson wrote:
I don't know what your data are like, since you haven't given a
reproducible example. I was imagining something like:

## generate fake data
age <- sample(20:90, 100, replace = TRUE)
year <- sample(1950:2000, 100, replace = TRUE)

##look at big table
table(age, year)

## categorize data
## see include.lowest and right arguments to cut
age.factor <- cut(age, breaks = seq(20, 90, by = 10),
              include.lowest = TRUE)

year.factor <- cut(year, breaks = seq(1950, 2000, by = 10),
               include.lowest = TRUE)

table(age.factor, year.factor)

moleps wrote:
I already did try the regression modeling approach. However the
epidemiologists (referee) turns out to be quite fond ofcomparing theincidence rates to different standard populations, hence theneed for thislabourius approach. And trying the "cutting" approach I endedup with :
table (age5)
age5
(0,5] (5,10] (10,15] (15,20] (20,25] (25,30] (30,35](35,40](40,45] (45,50] (50,55] (55,60] (60,65] (65,70] (70,75](75,80](80,85] (85,100] 35 34 33 4751 109157 231 362 511 745 926 1002866 547
   247       82       18
table (yr5)
yr5
(1950,1955] (1955,1960] (1960,1965] (1965,1970] (1970,1975]
(1975,1980] (1980,1985] (1985,1990] (1990,1995] (1995,2000](2000,2005](2005,2009] 3 5 55 55 5 5 55 5
3
table (yr5,age5)
Error in table(yr5, age5) : all arguments must have the samelength
Sincerely,
M
On 5. apr. 2010, at 20.59, Bert Gunter wrote:
You have tempted, and being weak, I yield to temptation:

"Any good ideas?"

Yes. Don't do this.
(what you probably really want to do is fit a model with ageas a
factor,
which can be done statistically e.g. by logistic regression; or
graphically
using conditioning plots, e.g. via trellis graphics (thelattice
package).
This avoids the arbitrariness and discontinuities of binningby age
range.)

Bert Gunter
Genentech Nonclinical Biostatistics

-----Original Message-----
From: r-help-boun...@r-project.org
[mailto:r-help-boun...@r-project.org] On
Behalf Of moleps
Sent: Monday, April 05, 2010 11:46 AM
To: r-help@r-project.org
Subject: [R] Data manipulation problem

Dear R´ers.

I´ve got a dataset with age and year of diagnosis. In order to
age-standardize the incidence I need to transform the datainto a
matrix
with age-groups (divided in 5 or 10 years) along one axis andyear
divided
into 5 years along the other axis. Each cell should contain the
number of
cases for that age group and for that period.
I.e.
My data format now is
ID-age (to one decimal)-year(yearly data).

What I´d like is

age 1960-1965 1966-1970 etc...
0-5 3 8 10 15
6-10 2 5 8 13
etc..


Any good ideas?

Regards,
M
David Winsemius, MD
West Hartford, CT


David Winsemius, MD
West Hartford, CT

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] Data manipulation problem

Reply via email to