> On Mar 18, 2017, at 9:52 AM, David Winsemius <dwinsem...@comcast.net> wrote:
> 
> 
>> On Mar 17, 2017, at 11:33 AM, Alicia Ellis <alicia.m.el...@gmail.com> wrote:
>> 
>> am cleaning some very messy health record lab data.  Several of the rows
>> in the VALUE column have text entries and they need to be converted to
>> numeric in the NUMERIC_VALUE column based on the values in VALUE and
>> DESCRIPTION.  For example:
>> 
>> df <- data.frame(VALUE = c("<60", "Positive", "Negative", "Less than 0.30",
>> "12%", "<0.2", "Unknown"),
>>                DESCRIPTION = c("A","A", "B", "C", "D", "E", "E"),
>>                NUMERIC_VALUE=c(9, 9,9,9,9,9,9))
>> df
>> 
>> df$NUMERIC_VALUE[df$VALUE == "Positive" & df$DESCRIPTION == "A"]=999999999
>> 
>> 
>> However, I need to do this for ~500 unique pairings of VALUE and
>> DESCRIPTION entries.  I'm trying to find an easy way to do this without
>> having to have 500 lines of code for each unique pairing.  Some of the
>> pairings will be changed to the same value (e.g., 99999999, or -999999999)
>> but many will be unique numeric values.
> 
> I'm not convinced that you have the necessary scientific background to do 
> this job properly. There are different sorts of lab tests: enzymatic 
> activity, solute concentrations, and viral titers come to mind immediately 
> for which the handling would be materially different. If you have a serum 
> sodium level of less than 100 mEq/ml, then that is a value inconsistent with 
> human life and the value should be set to NA. If you have a value of alkaline 
> phosphatase that is 0 it is most suggestive of specimen mishandling. If you 
> have a hepatitis B antigen level of "Positive", then it seems a perfectly 
> informative value that should not be changed.
> 
> Furthermore the attempt to change vlaues that you think are missing to 
> 99999999, or -999999999 is simply wrongheaded. Learn to use the missing value 
> indicator NA rather thna setting these to a numeric value.
> 
> You should seek advice within your organization before you charge ahead with 
> this strategy.

I've been informed that my concerns are misplaced and that the scientific 
concerns I raised were inflated. The "answer" then might be:

merge( df, rules, by=1:2, all.x=TRUE)  # Or

merge( df, rules, by=c("VALUE", "DESCRIPTION") , all.x=TRUE)


Which returns in this example:


           VALUE DESCRIPTION NUMERIC_VALUE  NEW_VALUE
1           <0.2           E         9e+00       0.10
2            <60           A         9e+00      60.00
3            12%           D         9e+00         NA
4 Less than 0.30           C         9e+00       0.29
5       Negative           B         9e+00 -999999.00
6       Positive           A         1e+09         NA
7        Unknown           E         9e+00  777777.00

I'm leaving it in this form because I remain worried. I have no particular 
concerns about setting a value of "<0.2" to 0.1 but worry about setting values 
of  "<60" to 60.00 or "Less than 0.30" to 0.29 don't really accord with 
practice that I have seen in analysis of laboratory values. And I remain 
concerned that using flagrantly false numeric values as indicators will create 
serious errors down the line if this data is ever used by someone who has not 
been involved with its manipulation or has even forgotten months later how it 
was massaged. 

-- 
David.
> -- 
> David.
>> 
>> 
>> I've started by creating a new object called rules where a SUBSET of df
>> rows are included with the new value they should be changed to.
>> 
>> 
>> rules <- data.frame(VALUE = c("<60",  "Negative", "Less than 0.30", "<0.2",
>> "Unknown"),
>>                   DESCRIPTION = c("A", "B", "C", "E", "E"),
>>                   NEW_VALUE=c(60, -999999,0.29,0.1,777777))
>> rules
>> 
>> 
>> I tried doing a loop to change the values in df based on the suggested
>> value in rules:
>> 
>> for (i in (1 : nrow(rules))) {
>> df$NUMERIC_VALUE[df$VALUE == rules[i,1] & df$DESCRIPTION == rules
>> [i,2]]=rules[i,3]
>> }
>> df
>> 
>> This gives the error:
>> 
>> Error in Ops.factor(df$VALUE, rules[i, 1]) :
>> level sets of factors are differentwork and I think because when I write
>> 
>> If I create rules using the exact same levels as df it works:
>> 
>> rules <- data.frame(VALUE = c("<60", "Positive", "Negative", "Less than
>> 0.30", "12%", "<0.2", "Unknown"),
>>                   DESCRIPTION = c("A","A", "B", "C", "D", "E", "E"),
>>                   NEW_VALUE=c(60, 999999,-999999,0.29,12,0.1,777777))
>> rules
>> 
>> for (i in (1 : nrow(rules))) {
>> df$NUMERIC_VALUE[df$VALUE == rules[i,1] & df$DESCRIPTION == rules
>> [i,2]]=rules[i,3]
>> }
>> df
>> 
>> 
>> Can anyone suggest a way to modify my for loop so that it works for a
>> subset of rows in df and accomplish what I want?  Or suggest a completely
>> different method that works?
>> 
>> 
>> Thanks!
>> 
>>      [[alternative HTML version deleted]]
>> 
>> ______________________________________________
>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
> 
> David Winsemius
> Alameda, CA, USA
> 
> ______________________________________________
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

David Winsemius
Alameda, CA, USA

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Reply via email to