The solution proposed below does not accomplish my goal. In the column labeled NEW_VALUE, there are two NAs where the value in NUMERIC_VALUE for that row should be.
Can anyone else suggest a solution to my problem? Thanks! On Sat, Mar 18, 2017 at 1:33 PM, David Winsemius <dwinsem...@comcast.net> wrote: > > > On Mar 18, 2017, at 9:52 AM, David Winsemius <dwinsem...@comcast.net> > wrote: > > > > > >> On Mar 17, 2017, at 11:33 AM, Alicia Ellis <alicia.m.el...@gmail.com> > wrote: > >> > >> am cleaning some very messy health record lab data. Several of the rows > >> in the VALUE column have text entries and they need to be converted to > >> numeric in the NUMERIC_VALUE column based on the values in VALUE and > >> DESCRIPTION. For example: > >> > >> df <- data.frame(VALUE = c("<60", "Positive", "Negative", "Less than > 0.30", > >> "12%", "<0.2", "Unknown"), > >> DESCRIPTION = c("A","A", "B", "C", "D", "E", "E"), > >> NUMERIC_VALUE=c(9, 9,9,9,9,9,9)) > >> df > >> > >> df$NUMERIC_VALUE[df$VALUE == "Positive" & df$DESCRIPTION == > "A"]=999999999 > >> > >> > >> However, I need to do this for ~500 unique pairings of VALUE and > >> DESCRIPTION entries. I'm trying to find an easy way to do this without > >> having to have 500 lines of code for each unique pairing. Some of the > >> pairings will be changed to the same value (e.g., 99999999, or > -999999999) > >> but many will be unique numeric values. > > > > I'm not convinced that you have the necessary scientific background to > do this job properly. There are different sorts of lab tests: enzymatic > activity, solute concentrations, and viral titers come to mind immediately > for which the handling would be materially different. If you have a serum > sodium level of less than 100 mEq/ml, then that is a value inconsistent > with human life and the value should be set to NA. If you have a value of > alkaline phosphatase that is 0 it is most suggestive of specimen > mishandling. If you have a hepatitis B antigen level of "Positive", then it > seems a perfectly informative value that should not be changed. > > > > Furthermore the attempt to change vlaues that you think are missing to > 99999999, or -999999999 is simply wrongheaded. Learn to use the missing > value indicator NA rather thna setting these to a numeric value. > > > > You should seek advice within your organization before you charge ahead > with this strategy. > > I've been informed that my concerns are misplaced and that the scientific > concerns I raised were inflated. The "answer" then might be: > > merge( df, rules, by=1:2, all.x=TRUE) # Or > > merge( df, rules, by=c("VALUE", "DESCRIPTION") , all.x=TRUE) > > > Which returns in this example: > > > VALUE DESCRIPTION NUMERIC_VALUE NEW_VALUE > 1 <0.2 E 9e+00 0.10 > 2 <60 A 9e+00 60.00 > 3 12% D 9e+00 NA > 4 Less than 0.30 C 9e+00 0.29 > 5 Negative B 9e+00 -999999.00 > 6 Positive A 1e+09 NA > 7 Unknown E 9e+00 777777.00 > > I'm leaving it in this form because I remain worried. I have no particular > concerns about setting a value of "<0.2" to 0.1 but worry about setting > values of "<60" to 60.00 or "Less than 0.30" to 0.29 don't really accord > with practice that I have seen in analysis of laboratory values. And I > remain concerned that using flagrantly false numeric values as indicators > will create serious errors down the line if this data is ever used by > someone who has not been involved with its manipulation or has even > forgotten months later how it was massaged. > > -- > David. > > -- > > David. > >> > >> > >> I've started by creating a new object called rules where a SUBSET of df > >> rows are included with the new value they should be changed to. > >> > >> > >> rules <- data.frame(VALUE = c("<60", "Negative", "Less than 0.30", > "<0.2", > >> "Unknown"), > >> DESCRIPTION = c("A", "B", "C", "E", "E"), > >> NEW_VALUE=c(60, -999999,0.29,0.1,777777)) > >> rules > >> > >> > >> I tried doing a loop to change the values in df based on the suggested > >> value in rules: > >> > >> for (i in (1 : nrow(rules))) { > >> df$NUMERIC_VALUE[df$VALUE == rules[i,1] & df$DESCRIPTION == rules > >> [i,2]]=rules[i,3] > >> } > >> df > >> > >> This gives the error: > >> > >> Error in Ops.factor(df$VALUE, rules[i, 1]) : > >> level sets of factors are differentwork and I think because when I write > >> > >> If I create rules using the exact same levels as df it works: > >> > >> rules <- data.frame(VALUE = c("<60", "Positive", "Negative", "Less than > >> 0.30", "12%", "<0.2", "Unknown"), > >> DESCRIPTION = c("A","A", "B", "C", "D", "E", "E"), > >> NEW_VALUE=c(60, 999999,-999999,0.29,12,0.1,777777)) > >> rules > >> > >> for (i in (1 : nrow(rules))) { > >> df$NUMERIC_VALUE[df$VALUE == rules[i,1] & df$DESCRIPTION == rules > >> [i,2]]=rules[i,3] > >> } > >> df > >> > >> > >> Can anyone suggest a way to modify my for loop so that it works for a > >> subset of rows in df and accomplish what I want? Or suggest a > completely > >> different method that works? > >> > >> > >> Thanks! > >> > >> [[alternative HTML version deleted]] > >> > >> ______________________________________________ > >> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see > >> https://stat.ethz.ch/mailman/listinfo/r-help > >> PLEASE do read the posting guide http://www.R-project.org/ > posting-guide.html > >> and provide commented, minimal, self-contained, reproducible code. > > > > David Winsemius > > Alameda, CA, USA > > > > ______________________________________________ > > R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see > > https://stat.ethz.ch/mailman/listinfo/r-help > > PLEASE do read the posting guide http://www.R-project.org/ > posting-guide.html > > and provide commented, minimal, self-contained, reproducible code. > > David Winsemius > Alameda, CA, USA > > [[alternative HTML version deleted]] ______________________________________________ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.