[R] Dynamically defining dplyr across statements (was dplyr: summarise across using variable names and a condition)
Hello All,

Thanks, Rui, for your response to my question. I agree that a workaround is possible: get most of what you want and then tidy it up afterwards. I have a workaround of my own, pasted below. I wanted to avoid that initially, because I felt I was only using a workaround since I hadn't yet figured out how to use dplyr properly.

Over the weekend I worked out how to summarise across conditionally. Learning this changed the nature of the problem, so I have renamed my question.

Below are my "have" and "need" data sets from before, with the columns reordered. After that are some vectors of variable names that are already defined in my code and that I thought might be helpful in producing a solution. After that is dplyr code that summarizes across conditionally. If the data submitted to this code were always going to be the same, this would work perfectly. That's not the case though, so the across statements that are needed will be data dependent. Last, I've pasted my version of a workaround, which should work for any dataset.

Ideally, I'd like a solution that builds on the summarise across code below. It seems likely that this would involve dynamically creating the various across statements, though, and that might wind up being a lot more complicated and verbose than my workaround. Another possibility might be to do this in one pass using non-dplyr code. If neither of those options works out, it may be that the workaround is actually the way to go.

Thanks,

Paul

Have and need data

library(magrittr)
library(dplyr)

have <- structure(list(
  ptno = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M",
           "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z"),
  age1 = c(74, 70, 78, 79, 72, 81, 76, 58, 53, 74, 72, 74, 75,
           73, 80, 62, 67, 65, 83, 67, 72, 90, 73, 84, 90, 51),
  age2 = c(71, 67, 72, 74, 65, 79, 70, 49, 45, 68, 70, 71, 74,
           71, 69, 58, 65, 59, 80, 60, 68, 87, 71, 82, 80, 49),
  gender_male = c(1L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 0L,
                  1L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 1L, 0L, 1L, 1L, 0L),
  gender_female = c(0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 1L,
                    0L, 1L, 1L, 0L, 0L, 0L, 1L, 1L, 0L, 1L, 0L, 0L, 1L),
  race_white = c(0L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L,
                 1L, 1L, 0L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L),
  race_black = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
                 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
  race_other = c(1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 1L,
                 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)),
  row.names = c(NA, -26L), class = c("tbl_df", "tbl", "data.frame"))

have <- have %>%
  select(ptno, age1, gender_male, gender_female, age2, everything())

need <- structure(list(
  age1_mean = 72.8076923076923, age1_std = 9.72838827666425,
  age2_mean = 68.2307692307692, age2_std = 10.2227498934785,
  gender_male_prop = 0.576923076923077, gender_female_prop = 0.423076923076923,
  race_white_prop = 0.769230769230769, race_black_prop = 0.0384615384615385,
  race_other_prop = 0.192307692307692),
  row.names = c(NA, -1L), class = c("tbl_df", "tbl", "data.frame"))

need <- need %>%
  select(age1_mean, age1_std, gender_male_prop, gender_female_prop,
         age2_mean, age2_std, everything())

Vectors of variable names

vars_num <- c("age1", "age2")
vars_dmy <- c("gender", "race")
vars_all <- c("age1", "age2", "gender", "race")

dplyr conditional summarize across

have %>%
  summarize(
    across(2:2, list(mean = mean, std = sd)),
    across(3:4, list(prop = mean)),
    across(5:5, list(mean = mean, std = sd)),
    across(6:8, list(prop = mean))
  ) %>%
  all.equal(need)

Workaround

library(stringr)  # needed for str_replace()

have %>%
  summarise(across(
    .cols = !contains("ptno"),  # exclude the ID column (ptno in the example data)
    .fns = list(mean = mean, std = sd),
    .names = "{col}_{fn}"
  )) %>%
  select(starts_with(vars_num) | ends_with("mean")) %>%
  rename_at(vars(!starts_with(vars_num)), list(~ str_replace(., "mean$", "prop"))) %>%
  all.equal(need)

On Friday, March 26, 2021, 1:08:58 p.m. EDT, Rui Barradas wrote:

Hello,

Here is a way of doing what the question asks for. There might be others, simpler, but this one works.

have %>% summarise(across( .cols = !contains("ptno"), .fns = list(mean = mean, s
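Returning to the question of making the across statements data dependent: one possible approach is to select columns by name rather than by position, driving the selection from the `vars_num` and `vars_dmy` vectors. The following is only a sketch. The toy data are made up for illustration, and it assumes that every dummy column is named as one of the `vars_dmy` prefixes followed by an underscore.

```r
library(dplyr)

# Toy stand-in for the "have" data (values invented for illustration)
toy <- tibble(
  ptno          = c("A", "B", "C", "D"),
  age1          = c(74, 70, 78, 79),
  age2          = c(71, 67, 72, 74),
  gender_male   = c(1L, 1L, 0L, 1L),
  gender_female = c(0L, 0L, 1L, 0L),
  race_white    = c(0L, 1L, 0L, 1L),
  race_other    = c(1L, 0L, 1L, 0L)
)

vars_num <- c("age1", "age2")
vars_dmy <- c("gender", "race")

result <- toy %>%
  summarise(
    # numeric variables, matched exactly by name: mean and standard deviation
    across(all_of(vars_num), list(mean = mean, std = sd)),
    # dummy variables, matched by prefix (assumed "<prefix>_<level>" naming):
    # the mean of a 0/1 column is a proportion
    across(starts_with(paste0(vars_dmy, "_")), list(prop = mean))
  )

names(result)
```

The summary columns come out grouped by type (numeric first, then dummies) rather than in the original column order, so a final `select()` would be needed if the interleaved order matters.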
[R] dplyr: summarise across using variable names and a condition
Hello All,

I would like to be able to summarize across in dplyr using variable names and a condition. Below is an example "have" data set followed by an example "need" data set. After that, I've got a vector of numeric variable names. After that, I've got the very humble beginnings of a dplyr-based solution.

What I think I need to be able to do is to submit my variable names to dplyr and then to have a conditional function. If the variable is in my list of names, calculate the mean and the std. If not, then calculate the mean but label it as a proportion.

The question is how to do that. It appears that using variable names might involve !!, or possibly enquo, or possibly quo, but I haven't had much success with these. I imagine I might have been very close but not quite have gotten it. The conditional part seems less difficult but I'm not quite sure how to do that either.

Help with this would be greatly appreciated.

Thanks,

Paul

have <- structure(list(
  ptno = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M",
           "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z"),
  age1 = c(74, 70, 78, 79, 72, 81, 76, 58, 53, 74, 72, 74, 75,
           73, 80, 62, 67, 65, 83, 67, 72, 90, 73, 84, 90, 51),
  age2 = c(71, 67, 72, 74, 65, 79, 70, 49, 45, 68, 70, 71, 74,
           71, 69, 58, 65, 59, 80, 60, 68, 87, 71, 82, 80, 49),
  gender_male = c(1L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 0L,
                  1L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 1L, 0L, 1L, 1L, 0L),
  gender_female = c(0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 1L,
                    0L, 1L, 1L, 0L, 0L, 0L, 1L, 1L, 0L, 1L, 0L, 0L, 1L),
  race_white = c(0L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L,
                 1L, 1L, 0L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L),
  race_black = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
                 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
  race_other = c(1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 1L,
                 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)),
  row.names = c(NA, -26L), class = c("tbl_df", "tbl", "data.frame"))

need <- structure(list(
  age1_mean = 72.8076923076923, age1_std = 9.72838827666425,
  age2_mean = 68.2307692307692, age2_std = 10.2227498934785,
  gender_male_prop = 0.576923076923077, gender_female_prop = 0.423076923076923,
  race_white_prop = 0.769230769230769, race_black_prop = 0.0384615384615385,
  race_other_prop = 0.192307692307692),
  row.names = c(NA, -1L), class = c("tbl_df", "tbl", "data.frame"))

vars_num <- c("age1", "age2")

library(magrittr)
library(dplyr)

have %>%
  summarise(across(
    .cols = !contains("ptno"),
    .fns = list(mean = mean, std = sd),
    .names = "{col}_{fn}"
  ))

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
[R] Running list of drugs taken and dropped (via Reduce and accumulate = TRUE or by other means)
Hello All,

I would like to keep a running total of which drugs cancer patients have taken and which drugs have been dropped. I searched the Internet and found a way to cumulatively paste a series of drug names, but I am having trouble figuring out how to make the paste conditional.

Below is some sample data and code. I'd like the paste in the "taken" column to add a drug only when change = 1, and the paste in the "dropped" column to add a drug only when change = -1.

Thanks,

Paul

library(dplyr)  # for %>% and mutate()

sample_data <-
  structure(
    list(
      PTNO = c(82320L, 82320L, 82320L),
      change = c(1, 1, -1),
      drug = c("cetuximab", "docetaxel", "cetuximab")),
    class = c("tbl_df", "tbl", "data.frame"),
    row.names = c(NA, -3L)
  ) %>%
  mutate(
    taken = Reduce(function(x1, x2) paste(x1, x2, sep = ", "), drug, accumulate = TRUE),
    dropped = Reduce(function(x1, x2) paste(x1, x2, sep = ", "), drug, accumulate = TRUE)
  )
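One way the conditional paste might be done, sketched here with `Reduce()` and `accumulate = TRUE` as in the question: iterate over row positions instead of drug names, and append a drug only when a logical flag is TRUE for that row. The helper `running_paste()` is hypothetical (it is not from any package), and the sketch assumes the running lists should restart for each patient.

```r
library(dplyr)

sample_data <- tibble(
  PTNO   = c(82320L, 82320L, 82320L),
  change = c(1, 1, -1),
  drug   = c("cetuximab", "docetaxel", "cetuximab")
)

# Hypothetical helper: running, comma-separated list of the drugs for which
# `keep` is TRUE, returning one element per input row.
running_paste <- function(drug, keep) {
  step <- function(acc, i) {
    if (!keep[i]) return(acc)  # row does not qualify: carry the list forward
    if (nzchar(acc)) paste(acc, drug[i], sep = ", ") else drug[i]
  }
  # init = "" contributes a leading element, which [-1] drops again
  Reduce(step, seq_along(drug), init = "", accumulate = TRUE)[-1]
}

result <- sample_data %>%
  group_by(PTNO) %>%
  mutate(
    taken   = running_paste(drug, change == 1),
    dropped = running_paste(drug, change == -1)
  ) %>%
  ungroup()

result
```

Note that a dropped drug stays in the "taken" list here, since the stated rule only adds to each list. Rows before the first qualifying drug get an empty string.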
Re: [R] R code for if-then-do code blocks
Hi Gabor, Richard, and Thierry, Thanks very much for your replies. Turns out I had already hit on Gabor's idea of "factor out" in writing an initial draft of the code converting from SAS to R. Below is the link Gabor sent describing this and other approaches. https://stackoverflow.com/questions/34096162/dplyr-mutate-replace-on-a-subset-of-rows/34096575#34096575 At the end of this email are some new test data plus a snippet of my initial R code. The R code I have replicates the result from SAS but is quite verbose. That should be obvious from the snippet. I know I can make the code less verbose with a subsequent draft but wonder if I can simplify to the point where the factor out approach gets a fair test. I'd appreciate it if people could share some ways to make the factor out approach less verbose. I'd also like to see how well some of the other approaches might work with these data. I spent considerable time looking at the link Gabor sent as well as the other responses I received. The mutate_cond function in the link seems promising but it wasn't clear to me how I could avoid having to repeat the various conditions using that approach. Thanks again. 
Paul

library(magrittr)
library(dplyr)

test_data <-
  structure(
    list(
      intPatientId = c("3", "37", "48", "6", "6", "5"),
      intSurveySessionId = c(1L, 10996L, 19264L, 2841L, 28L, 34897L),
      a_CCMA02 = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_),
      a_CCMA69 = c(7, NA, 0, 2, NA, 0),
      a_CCMA70 = c(7, 0, NA, 10, NA, NA),
      a_CCMA72 = c(7, 2, 3, NA, NA, NA),
      CCMA2 = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_),
      a_CCMA05 = c(NA, NA, NA, NA, NA, 0),
      a_CCMA43 = c(5, 0, 6, 5, NA, NA),
      a_CCMA44 = c(5, 0, 0, 5, 0, NA),
      CCMA5 = c(NA, NA, NA, NA, NA, 0)
    ),
    class = "data.frame",
    row.names = c(NA, -6L)
  )

factor_out <- test_data %>%
  mutate(
    CCMA2_cond = case_when(
      (is.na(a_CCMA02) | a_CCMA02 < 0 | a_CCMA02 > 10) &
        (!is.na(a_CCMA69) & between(a_CCMA69, 0, 10) &
           !is.na(a_CCMA70) & between(a_CCMA70, 0, 10) &
           !is.na(a_CCMA72) & between(a_CCMA72, 0, 10)) ~ "A",
      (is.na(a_CCMA02) | a_CCMA02 < 0 | a_CCMA02 > 10) &
        (is.na(a_CCMA69) | a_CCMA69 < 0 | a_CCMA69 >= 10) &
        !is.na(a_CCMA70) & between(a_CCMA70, 0, 10) &
        !is.na(a_CCMA72) & between(a_CCMA72, 0, 10) ~ "B",
      (is.na(a_CCMA02) | a_CCMA02 < 0 | a_CCMA02 > 10) &
        (is.na(a_CCMA70) | a_CCMA70 < 0 | a_CCMA70 >= 10) &
        between(a_CCMA69, 0, 10) & between(a_CCMA72, 0, 10) ~ "C",
      (is.na(a_CCMA02) | a_CCMA02 < 0 | a_CCMA02 > 10) &
        (is.na(a_CCMA72) | a_CCMA72 < 0 | a_CCMA72 >= 10) &
        between(a_CCMA69, 0, 10) & between(a_CCMA70, 0, 10) ~ "D")
  ) %>%
  mutate(
    CCMA2 = case_when(
      CCMA2_cond == "A" & 0.614 + (0.065 * a_CCMA69) + (-0.012 * a_CCMA70) + (0.504 * a_CCMA72) < 0 ~ 0,
      CCMA2_cond == "A" & 0.614 + (0.065 * a_CCMA69) + (-0.012 * a_CCMA70) + (0.504 * a_CCMA72) > 10 ~ 10,
      CCMA2_cond == "A" ~ 0.614 + (0.065 * a_CCMA69) + (-0.012 * a_CCMA70) + (0.504 * a_CCMA72),
      TRUE ~ as.double(CCMA2)
    ),
    CCMA2 = case_when(
      CCMA2_cond == "B" & 0.614 + (0.065 * (a_CCMA70 + a_CCMA72) / 2) + (-0.012 * a_CCMA70) + (0.504 * a_CCMA72) < 0 ~ 0,
      CCMA2_cond == "B" & 0.614 + (0.065 * (a_CCMA70 + a_CCMA72) / 2) + (-0.012 * a_CCMA70) + (0.504 * a_CCMA72) > 10 ~ 10,
      CCMA2_cond == "B" ~ 0.614 + (0.065 * (a_CCMA70 + a_CCMA72) / 2) + (-0.012 * a_CCMA70) + (0.504 * a_CCMA72),
      TRUE ~ as.double(CCMA2)
    ),
    CCMA2 = case_when(
      CCMA2_cond == "C" & 0.614 + (0.065 * a_CCMA69) + (-0.012 * (a_CCMA72 + a_CCMA69) / 2) + (0.504 * a_CCMA72) < 0 ~ 0,
      CCMA2_cond == "C" & 0.614 + (0.065 * a_CCMA69) + (-0.012 * (a_CCMA72 + a_CCMA69) / 2) + (0.504 * a_CCMA72) > 10 ~ 10,
      CCMA2_cond == "C" ~ 0.614 + (0.065 * a_CCMA69) + (-0.012 * (a_CCMA72 + a_CCMA69) / 2) + (0.504 * a_CCMA72),
      TRUE ~ as.double(CCMA2)
    ),
    CCMA2 = case_when(
      CCMA2_cond == "D" & 0.614 + (0.065 * a_CCMA69) + (-0.012 * a_CCMA70) + (0.504 * (a_CCMA70 + a_CCMA69) / 2) < 0 ~ 0,
      CCMA2_cond == "D" & 0.614 + (0.065 * a_CCMA69) + (-0.012 * a_CCMA70) + (0.504 * (a_CCMA70 + a_CCMA69) / 2) > 10 ~ 10,
      CCMA2_cond == "D" ~ 0.614 + (0.065 * a_CCMA69) + (-0.012 * a_CCMA70) + (0.504 * (a_CCMA70 + a_CCMA69) / 2),
      TRUE ~ as.double(CCMA2)
    )
  ) %>%
  select(-CCMA2_cond) %>%
  mutate(
    CCMA5_condA = if_else(
      (is.na(a_CCMA05) | a_CCMA05 < 0 | a_CCMA05 > 10), 1, 0
    ),
    CCMA5 = ifelse(CCMA5_condA == 1 & between(a_CCMA43, 0, 10) & between(a_CCMA44, 0, 10),
                   0.216 + (0.257 * a_CCMA43) + (0.828 * a_CCMA44), CCMA5),
    CCMA5 = ifelse(CCMA5_condA == 1 & between(a_CCMA43, 0, 10) & (is.na(a_CCMA44) | a_CCMA44 < 0 | a_CCMA44 > 10),
                   0.216 + (0.257 * a_CCMA43) + (0.828 *
[R] R code for if-then-do code blocks
Hello All,

Season's greetings!

I am trying to replicate some SAS code in R. The SAS code uses if-then-do code blocks. I've been trying to do likewise in R, as that seems to be the most reliable way to get the same result.

Below is some toy data and some code that does work. There are some things I don't necessarily like about the code though, so I was hoping some people could help make it better. One thing I don't like is that the within function reverses the order of the computed columns, such that test1:test5 becomes test5:test1. I've used a mutate to overcome that but would prefer not to have to do so. Another, perhaps very small, thing is the need to calculate an ID variable that becomes the basis for a grouping.

I did considerable Internet searching for R code that conditionally computes blocks of code. I didn't find much though, and so am wondering if my search terms were not sufficient or if there is some other reason. It occurred to me that maybe if-then-do code blocks like we often see in SAS are frowned upon and therefore not much implemented.

I'd be interested in seeing more R-compatible approaches if this is the case. I've learned that it's a mistake to try and make R be like SAS. It's better to let R be R. Trouble is I'm not always sure how to do that.

Thanks,

Paul

d1 <- data.frame(workshop=rep(1:2,4), gender=rep(c("f","m"),each=4))

library(tibble)
library(plyr)

d2 <- d1 %>%
  rownames_to_column("ID") %>%
  mutate(test1 = NA, test2 = NA, test4 = NA, test5 = NA) %>%
  ddply("ID", within, if (gender == "f" & workshop == 1) {
    test1 <- 1
    test1 <- 6 + test1
    test2 <- 2 + test1
    test4 <- 1
    test5 <- 1
  } else {
    test1 <- test2 <- test4 <- test5 <- 0
  })
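For comparison, a vectorised version of the same toy example might look like the sketch below. Instead of running a block once per row, the condition is computed once as a logical column and each new column picks its value with `if_else()`. This uses dplyr rather than plyr, which is a substitution and not the code from the post; the temporary column name `cond` is arbitrary.

```r
library(dplyr)

d1 <- data.frame(workshop = rep(1:2, 4), gender = rep(c("f", "m"), each = 4))

d2 <- d1 %>%
  mutate(
    cond  = gender == "f" & workshop == 1,  # the "if" condition, evaluated once
    test1 = if_else(cond, 6 + 1, 0),        # collapses test1 <- 1; test1 <- 6 + test1
    test2 = if_else(cond, 2 + test1, 0),    # may refer to test1 defined just above
    test4 = if_else(cond, 1, 0),
    test5 = if_else(cond, 1, 0)
  ) %>%
  select(-cond)

d2
```

Columns are created in the order written, so nothing needs reordering afterwards, and no ID column or grouping is required because every operation is already row-wise.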
Re: [R] Extracting sentences with combinations of target words/terms from cancer patient text medical records
Hi Robert, Thank you for your reply. An attempt to solve this via a regular expression query is particularly helpful. Unfortunately, I don't have much time to play around with this just now. Ultimately though, I think I would like to implement a solution something along the lines of what you have done. I have a book on regular expressions that I am now starting to read. In the meantime, the code I'm using is a good way to assess the feasibility of some ideas I'd like to implement. The advantage of your approach I think is that it makes fewer passes through the data. That should make it a lot faster and more efficient than what I've done. I'm currently working with a little more than 2.5 million text records and I think that number will only rise. So efficiency really should matter. I've pasted the latest version of my sample code below. This shows how I'd like to add the result of the text search as a column in a data frame. It also shows how I'd like to append the sentence number to each identified sentence. The single colon that appears where there is no match is not by design. It's something that I need to tidy. My sense is that if I used your regular expression as written, I'd lose the information about the sentence number when I added the result as a column in my data frame. Presumably, I'd need to collapse the information into a single text string, and then the numbering would be lost. If you were going to get the sentence numbers as well, without making several passes through the data like my code does, how would you go about it? 
Thanks,

Paul

library(tidyverse)
library(stringr)
library(lubridate)

sentence_match <- function(x){
  sentence_extract <- str_extract_all(x, boundary("sentence"), simplify = TRUE)
  sentence_number <- intersect(str_which(sentence_extract, "breast"),
                               str_which(sentence_extract, "metastatic|stage IV"))
  sentence_match <- str_c(sentence_number, ": ", sentence_extract[sentence_number], collapse = "")
  sentence_match
}

sampletxt <-
  structure(
    list(
      PTNO = c(1, 2, 2, 2),
      DATE = structure(c(16436, 16436, 16832, 16845), class = "Date"),
      TYPE = c("Progress note", "CAT scan", "Progress note", "Progress note"),
      TVAR = c(
        "This sentence contains the word metastatic. This sentence contains the term stage IV.",
        "This sentence contains no target words. This sentence also contains no target words.",
        "This sentence contains the word metastatic and the word breast. This sentence contains no target words.",
        "This sentence contains the words breast and the term metastatic. This sentence contains the word breast and the term stage IV."
      )
    ),
    .Names = c("PTNO", "DATE", "TYPE", "TVAR"),
    class = c("tbl_df", "tbl", "data.frame"),
    row.names = c(NA,-4L)
  )

sampletxt$EXTRACTED <- sapply(sampletxt$TVAR, sentence_match)
sampletxt$EXTRACTED

> sampletxt$EXTRACTED
[1] ": "
[2] ": "
[3] "1: This sentence contains the word metastatic and the word breast. "
[4] "1: This sentence contains the words breast and the term metastatic. 2: This sentence contains the word breast and the term stage IV."

From: Robert McGehee <rmcge...@walleyetrading.net>
Cc: "r-help@r-project.org" <r-help@r-project.org>
Sent: Wednesday, July 12, 2017 12:47 PM
Subject: RE: [R] Extracting sentences with combinations of target words/terms from cancer patient text medical records

Hi Paul,

Sounds like you have your answer, but for fun I thought I'd try solving your problem using only a regular expression query and base R. I believe this works:

> txt <- "Patient had stage IV breast cancer. Nothing matches this sentence. Metastatic and breast match this sentence. French bike champion takes stage IV victory in Tour de France."
> pattern <- "([^.?!]*(?=[^.?!]*\\bbreast\\b)(?=[^.?!]*\\b(metastatic|stage IV)\\b)(?=[\\s.?!])[^.?!]*[.?!])"
> regmatches(txt, gregexpr(pattern, txt, perl=TRUE, ignore.case=TRUE))[[1]]
[1] "Patient had stage IV breast cancer."
[2] " Metastatic and breast match this sentence."

Cheers,
Robert

-Original Message-
From: R-help [mailto:r-help-boun...@r-project.org] On Behalf Of Paul Miller via R-help
Sent: Wednesday, July 12, 2017 8:49 AM T
Re: [R] Extracting sentences with combinations of target words/terms from cancer patient text medical records
Hi Bert,

Thanks for your reply. It appears that I didn't replace the variable name "sampletxt" with the argument "x" in my function. I've corrected that and now my code seems to be working fine.

Paul

From: Bert Gunter <bgunter.4...@gmail.com>
Cc: R-help <r-help@r-project.org>
Sent: Tuesday, July 11, 2017 2:00 PM
Subject: Re: [R] Extracting sentences with combinations of target words/terms from cancer patient text medical records

Have you looked at the CRAN Natural Language Processing Task View? If not, why not? If so, why were the resources described there inadequate?

Bert
[R] Extracting sentences with combinations of target words/terms from cancer patient text medical records
Hello All,

I need some help figuring out how to extract combinations of target words/terms from cancer patient text medical records. I've provided some sample data and code below to illustrate what I'm trying to do. At the moment, I'm trying to extract sentences that contain the word "breast" plus either "metastatic" or "stage IV".

It's been some time since I used R and I feel a bit rusty. I wrote a function called "sentence_match" that seemed to work well when applied to a single piece of text. You can see that by running the section titled "Working code". I thought that it might be possible easily to apply my function to a data set (tibble or df) but that doesn't seem to be the case. My unsuccessful attempt to do this appears in the section titled "Non-working code".

If someone could help me get my code up and running, that would be greatly appreciated. I'm using a lot of functions from Hadley Wickham's packages, but that's not particularly necessary. Although I have only a few entries in my sample data, my actual data are pretty large. Currently, I'm working with over a million records. Some records contain only a single sentence, but many have several paragraphs. One concern I had was that, even if I could get my code working, it would be too inefficient to handle that volume of data.

Thanks,

Paul

library(tidyverse)
library(stringr)
library(lubridate)

sentence_match <- function(x){
  sentence_extract <- str_extract_all(sampletxt, boundary("sentence"), simplify = TRUE)
  sentence_number <- intersect(str_which(sentence_extract, "breast"),
                               str_which(sentence_extract, "metastatic|stage IV"))
  sentence_match <- str_c(sentence_number, ": ", sentence_extract[sentence_number], collapse = "")
  sentence_match
}

Working code

sampletxt <- "This sentence contains the word metastatic and the word breast. This sentence contains no target words."

sentence_match(sampletxt)

Non-working code

sampletxt <-
  structure(
    list(
      PTNO = c(1, 2, 2, 2),
      DATE = structure(c(16436, 16436, 16832, 16845), class = "Date"),
      TYPE = c("Progress note", "CAT scan", "Progress note", "Progress note"),
      TVAR = c(
        "This sentence contains the word metastatic. This sentence contains the term stage IV.",
        "This sentence contains no target words. This sentence also contains no target words.",
        "This sentence contains the word metastatic and the word breast. This sentence contains no target words.",
        "This sentence contains the words breast and the term metastatic. This sentence contains the word breast and the term stage IV."
      )
    ),
    .Names = c("PTNO", "DATE", "TYPE", "TVAR"),
    class = c("tbl_df", "tbl", "data.frame"),
    row.names = c(NA,-4L)
  )

sampletxt2 <- group_by_at(sampletxt, vars(PTNO, DATE, TYPE))
sampletxt2 <-
  sampletxt2 %>%
  mutate(
    EXTRACTED = sentence_match(TVAR)
  )
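As noted elsewhere in this thread, the function above reads the global `sampletxt` instead of its argument `x`. A minimal sketch of a corrected version follows, applied one record at a time with `vapply()`. Returning `NA` when nothing matches, trimming the trailing space from each sentence, and the object names `notes` and `extracted` are choices made for this sketch, not part of the original code.

```r
library(stringr)

sentence_match <- function(x) {
  # split one record into sentences (x must be a single string)
  sentences <- str_extract_all(x, boundary("sentence"))[[1]]
  # sentences that mention "breast" AND either "metastatic" or "stage IV"
  hits <- which(str_detect(sentences, "breast") &
                  str_detect(sentences, "metastatic|stage IV"))
  if (length(hits) == 0) return(NA_character_)  # no match: NA instead of ": "
  # prefix each matching sentence with its position within the record
  str_c(hits, ": ", str_trim(sentences[hits]), collapse = " ")
}

notes <- c(
  "This sentence contains the word metastatic. This sentence contains the term stage IV.",
  "This sentence contains no target words. This sentence also contains no target words.",
  "This sentence contains the word metastatic and the word breast. This sentence contains no target words.",
  "This sentence contains the words breast and the term metastatic. This sentence contains the word breast and the term stage IV."
)

# one call per record, so each record keeps its own sentence numbering
extracted <- vapply(notes, sentence_match, character(1), USE.NAMES = FALSE)
extracted
```

With a data frame, the same call would be `sampletxt$EXTRACTED <- vapply(sampletxt$TVAR, sentence_match, character(1), USE.NAMES = FALSE)`. The matching is case-sensitive as written, so "Metastatic" at the start of a sentence would be missed unless the patterns are wrapped in `regex(..., ignore_case = TRUE)`.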