Hi Alicia, If I understand this, perhaps: df<-data.frame(VALUE=c("<60","Positive","Negative","Less than 0.30", "12%","<0.2","Unknown"), DESCRIPTION = c("A","A", "B","C","D","E","E"), NUMERIC_VALUE=c(9,9,9,9,9,9,9))
df$NUMERIC_VALUE[df$VALUE == "Positive" & df$DESCRIPTION == "A"]<- 999999999 rules<-data.frame(VALUE = c("<60","Negative","Less than 0.30", "<0.2","Unknown"), DESCRIPTION = c("A", "B", "C", "E", "E"), NEW_VALUE=c(60, -999999,0.29,0.1,777777)) for(row in 1:dim(df)[1]) { if(df$VALUE %in% rules$VALUE && df$DESCRIPTION %in% rules$DESCRIPTION) { which_value<-which(as.character(df$VALUE[row]) == as.character(rules$VALUE)) which_desc<-which(as.character(df$DESCRIPTION[row]) == as.character(rules$DESCRIPTION)) cat(which_value,which_desc,"\n") if(length(which_value) && length(which_desc) && which_value == which_desc) df$NUMERIC_VALUE[row]<-rules$NEW_VALUE[which_value] } } Jim On Tue, Mar 21, 2017 at 9:51 AM, Alicia Ellis <alicia.m.el...@gmail.com> wrote: > The solution proposed below does not accomplish my goal. In the column > labeled NEW_VALUE, there are two NAs where the value in NUMERIC_VALUE for > that row should be. > > Can anyone else suggest a solution to my problem? > > Thanks! > > On Sat, Mar 18, 2017 at 1:33 PM, David Winsemius <dwinsem...@comcast.net> > wrote: > >> >> > On Mar 18, 2017, at 9:52 AM, David Winsemius <dwinsem...@comcast.net> >> wrote: >> > >> > >> >> On Mar 17, 2017, at 11:33 AM, Alicia Ellis <alicia.m.el...@gmail.com> >> wrote: >> >> >> >> am cleaning some very messy health record lab data. Several of the rows >> >> in the VALUE column have text entries and they need to be converted to >> >> numeric in the NUMERIC_VALUE column based on the values in VALUE and >> >> DESCRIPTION. For example: >> >> >> >> df <- data.frame(VALUE = c("<60", "Positive", "Negative", "Less than >> 0.30", >> >> "12%", "<0.2", "Unknown"), >> >> DESCRIPTION = c("A","A", "B", "C", "D", "E", "E"), >> >> NUMERIC_VALUE=c(9, 9,9,9,9,9,9)) >> >> df >> >> >> >> df$NUMERIC_VALUE[df$VALUE == "Positive" & df$DESCRIPTION == >> "A"]=999999999 >> >> >> >> >> >> However, I need to do this for ~500 unique pairings of VALUE and >> >> DESCRIPTION entries. I'm trying to find an easy way to do this without >> >> having to have 500 lines of code for each unique pairing. Some of the >> >> pairings will be changed to the same value (e.g., 99999999, or >> -999999999) >> >> but many will be unique numeric values. >> > >> > I'm not convinced that you have the necessary scientific background to >> do this job properly. There are different sorts of lab tests: enzymatic >> activity, solute concentrations, and viral titers come to mind immediately >> for which the handling would be materially different. If you have a serum >> sodium level of less than 100 mEq/ml, then that is a value inconsistent >> with human life and the value should be set to NA. If you have a value of >> alkaline phosphatase that is 0 it is most suggestive of specimen >> mishandling. If you have a hepatitis B antigen level of "Positive", then it >> seems a perfectly informative value that should not be changed. >> > >> > Furthermore the attempt to change vlaues that you think are missing to >> 99999999, or -999999999 is simply wrongheaded. Learn to use the missing >> value indicator NA rather thna setting these to a numeric value. >> > >> > You should seek advice within your organization before you charge ahead >> with this strategy. >> >> I've been informed that my concerns are misplaced and that the scientific >> concerns I raised were inflated. The "answer" then might be: >> >> merge( df, rules, by=1:2, all.x=TRUE) # Or >> >> merge( df, rules, by=c("VALUE", "DESCRIPTION") , all.x=TRUE) >> >> >> Which returns in this example: >> >> >> VALUE DESCRIPTION NUMERIC_VALUE NEW_VALUE >> 1 <0.2 E 9e+00 0.10 >> 2 <60 A 9e+00 60.00 >> 3 12% D 9e+00 NA >> 4 Less than 0.30 C 9e+00 0.29 >> 5 Negative B 9e+00 -999999.00 >> 6 Positive A 1e+09 NA >> 7 Unknown E 9e+00 777777.00 >> >> I'm leaving it in this form because I remain worried. I have no particular >> concerns about setting a value of "<0.2" to 0.1 but worry about setting >> values of "<60" to 60.00 or "Less than 0.30" to 0.29 don't really accord >> with practice that I have seen in analysis of laboratory values. And I >> remain concerned that using flagrantly false numeric values as indicators >> will create serious errors down the line if this data is ever used by >> someone who has not been involved with its manipulation or has even >> forgotten months later how it was massaged. >> >> -- >> David. >> > -- >> > David. >> >> >> >> >> >> I've started by creating a new object called rules where a SUBSET of df >> >> rows are included with the new value they should be changed to. >> >> >> >> >> >> rules <- data.frame(VALUE = c("<60", "Negative", "Less than 0.30", >> "<0.2", >> >> "Unknown"), >> >> DESCRIPTION = c("A", "B", "C", "E", "E"), >> >> NEW_VALUE=c(60, -999999,0.29,0.1,777777)) >> >> rules >> >> >> >> >> >> I tried doing a loop to change the values in df based on the suggested >> >> value in rules: >> >> >> >> for (i in (1 : nrow(rules))) { >> >> df$NUMERIC_VALUE[df$VALUE == rules[i,1] & df$DESCRIPTION == rules >> >> [i,2]]=rules[i,3] >> >> } >> >> df >> >> >> >> This gives the error: >> >> >> >> Error in Ops.factor(df$VALUE, rules[i, 1]) : >> >> level sets of factors are differentwork and I think because when I write >> >> >> >> If I create rules using the exact same levels as df it works: >> >> >> >> rules <- data.frame(VALUE = c("<60", "Positive", "Negative", "Less than >> >> 0.30", "12%", "<0.2", "Unknown"), >> >> DESCRIPTION = c("A","A", "B", "C", "D", "E", "E"), >> >> NEW_VALUE=c(60, 999999,-999999,0.29,12,0.1,777777)) >> >> rules >> >> >> >> for (i in (1 : nrow(rules))) { >> >> df$NUMERIC_VALUE[df$VALUE == rules[i,1] & df$DESCRIPTION == rules >> >> [i,2]]=rules[i,3] >> >> } >> >> df >> >> >> >> >> >> Can anyone suggest a way to modify my for loop so that it works for a >> >> subset of rows in df and accomplish what I want? Or suggest a >> completely >> >> different method that works? >> >> >> >> >> >> Thanks! >> >> >> >> [[alternative HTML version deleted]] >> >> >> >> ______________________________________________ >> >> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see >> >> https://stat.ethz.ch/mailman/listinfo/r-help >> >> PLEASE do read the posting guide http://www.R-project.org/ >> posting-guide.html >> >> and provide commented, minimal, self-contained, reproducible code. >> > >> > David Winsemius >> > Alameda, CA, USA >> > >> > ______________________________________________ >> > R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see >> > https://stat.ethz.ch/mailman/listinfo/r-help >> > PLEASE do read the posting guide http://www.R-project.org/ >> posting-guide.html >> > and provide commented, minimal, self-contained, reproducible code. >> >> David Winsemius >> Alameda, CA, USA >> >> > > [[alternative HTML version deleted]] > > ______________________________________________ > R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. ______________________________________________ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.