[R] Cleaning database: grep()? apply()?

2007-11-13 Thread Jonas Malmros
Dear R users,

I have a huge database and I need to adjust it somewhat.

Here is a very small excerpt from the database:

CODE    NAME                            DATE    DATA1
4813    ADVANCED TELECOM                1987    0.013
3845    ADVANCED THERAPEUTIC SYS LTD    1987    10.1
3845    ADVANCED THERAPEUTIC SYS LTD    1989    2.463
3845    ADVANCED THERAPEUTIC SYS LTD    1988    1.563
2836    ADVANCED TISSUE SCI  -CL A      1987    0.847
2836    ADVANCED TISSUE SCI  -CL A      1989    0.872
2836    ADVANCED TISSUE SCI  -CL A      1988    0.529

What I need is:
1) Delete all cases whose NAME ends in -CL A (and also -OLD, -ADS, etc.)
2) Delete all cases that have fewer than 3 years of data
3) For each remaining case, compute the ratio DATA1(1989) / DATA1(1987)
[and then ratios involving other data variables] and output this into a
new database consisting of CODE, NAME, RATIOs.

Maybe someone can suggest an effective way to do these things? I
imagine the first one would involve grep(), and 2 and 3 would involve
the apply family of functions, but I cannot get my mind around the actual
code to perform these adjustments. I am new to R; I do write code, but
usually it consists of for loops and plotting. I would much
appreciate your help.
Thank you in advance!
-- 
Jonas Malmros
Stockholm University
Stockholm, Sweden

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Cleaning database: grep()? apply()?

2007-11-13 Thread jim holtman
Here is how to whittle it down for the first two parts of your
question.  I am not sure exactly what you are after in the third part.  Is
it that you want specific DATEs or do you want the ratio of the
DATE[max]/DATE[min]?

> x <- read.table(textConnection("CODE    NAME    DATE    DATA1
+ 4813    'ADVANCED TELECOM'    1987    0.013
+ 3845    'ADVANCED THERAPEUTIC SYS LTD'    1987    10.1
+ 3845    'ADVANCED THERAPEUTIC SYS LTD'    1989    2.463
+ 3845    'ADVANCED THERAPEUTIC SYS LTD'    1988    1.563
+ 2836    'ADVANCED TISSUE SCI  -CL A'    1987    0.847
+ 2836    'ADVANCED TISSUE SCI  -CL A'    1989    0.872
+ 2836    'ADVANCED TISSUE SCI  -CL A'    1988    0.529"), header=TRUE)
> # matches on things to delete
> delete_indx <- grep("-CL A$|-OLD$|-ADS$", x$NAME)
> # delete them
> x <- x[-delete_indx,]
> x
  CODE                         NAME DATE  DATA1
1 4813             ADVANCED TELECOM 1987  0.013
2 3845 ADVANCED THERAPEUTIC SYS LTD 1987 10.100
3 3845 ADVANCED THERAPEUTIC SYS LTD 1989  2.463
4 3845 ADVANCED THERAPEUTIC SYS LTD 1988  1.563
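
A side note on the negative-index deletion above, offered as a sketch
rather than a correction of the session: if grep() finds no match it
returns integer(0), and indexing with -integer(0) selects zero rows
rather than all of them. A logical index built with grepl() avoids that.
The data frame `firms`, the `suffixes` vector and the "-XYZ$" pattern
below are made up for illustration and are not part of the thread.

```r
# Hypothetical miniature of the data; column names follow the thread.
firms <- data.frame(
  CODE = c(4813, 3845, 2836),
  NAME = c("ADVANCED TELECOM", "ADVANCED THERAPEUTIC SYS LTD",
           "ADVANCED TISSUE SCI  -CL A"),
  stringsAsFactors = FALSE
)

# Suffixes to drop (an assumed list; extend it as needed).
suffixes <- c("-CL A", "-OLD", "-ADS")
pattern  <- paste0("(", paste(suffixes, collapse = "|"), ")[[:space:]]*$")

# grepl() returns a logical vector, so negating it is safe even when
# nothing matches.
is_suffixed <- grepl(pattern, firms$NAME)
kept <- firms[!is_suffixed, ]

# Contrast: with no matches, grep() returns integer(0), and
# firms[-integer(0), ] selects zero rows instead of all rows.
no_match <- grep("-XYZ$", firms$NAME)
nrow(firms[-no_match, ])                    # 0 rows
nrow(firms[!grepl("-XYZ$", firms$NAME), ])  # 3 rows
```

On the sample rows this keeps the same two firms as the grep() version;
the difference only shows up when the pattern matches nothing.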
> # I assume you want to use NAME to check for ranges of data
> date_range <- tapply(x$DATE, x$NAME, function(dates) diff(range(dates)))
> date_range
            ADVANCED TELECOM ADVANCED THERAPEUTIC SYS LTD
                           0                            2
  ADVANCED TISSUE SCI  -CL A
                          NA
> # delete ones with less than 3 years
> names_to_delete <- names(date_range[date_range < 2])
> # delete those entries
> x <- x[!(x$NAME %in% names_to_delete),]
> x
  CODE                         NAME DATE  DATA1
2 3845 ADVANCED THERAPEUTIC SYS LTD 1987 10.100
3 3845 ADVANCED THERAPEUTIC SYS LTD 1989  2.463
4 3845 ADVANCED THERAPEUTIC SYS LTD 1988  1.563
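
The third part is left open above. As a rough sketch only, here is one
possibility under two assumptions: the fixed-year reading from the
original post (DATA1 in 1989 divided by DATA1 in 1987), and "less than
3 years of data" taken to mean fewer than three distinct DATE values per
NAME. The names `n_years`, `first`, `last`, `ratios` and the output
column RATIO1 are illustrative choices, not anything from the thread.

```r
# Data as in the thread, after the "-CL A" rows have been removed.
x <- data.frame(
  CODE  = c(4813, 3845, 3845, 3845),
  NAME  = c("ADVANCED TELECOM", rep("ADVANCED THERAPEUTIC SYS LTD", 3)),
  DATE  = c(1987, 1987, 1989, 1988),
  DATA1 = c(0.013, 10.1, 2.463, 1.563),
  stringsAsFactors = FALSE
)

# Step 2 (assumed reading): keep firms with at least 3 distinct years.
n_years <- tapply(x$DATE, x$NAME, function(d) length(unique(d)))
x <- x[x$NAME %in% names(n_years)[n_years >= 3], ]

# Step 3: take the 1987 and 1989 rows and join them by firm.
first <- x[x$DATE == 1987, c("CODE", "NAME", "DATA1")]
last  <- x[x$DATE == 1989, c("CODE", "NAME", "DATA1")]
ratios <- merge(first, last, by = c("CODE", "NAME"),
                suffixes = c("_1987", "_1989"))

# Ratio DATA1(1989) / DATA1(1987), one row per firm.
ratios$RATIO1 <- ratios$DATA1_1989 / ratios$DATA1_1987
ratios <- ratios[, c("CODE", "NAME", "RATIO1")]
ratios
```

Because merge() only keeps firms present in both year subsets, a firm
that lacks either 1987 or 1989 drops out of the result. Ratios for other
data variables could be added the same way by carrying more columns
through the two subsets.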




-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem you are trying to solve?
