[R] Dynamically defining dplyr across statements (was dplyr: summarise across using variable names and a condition)

2021-03-29 Thread Paul Miller via R-help
Hello All,

Thanks Rui for your response to my question. I agree that it is possible to use 
a workaround. Get most of what you want and then tidy it up afterwards. I too 
have a workaround that I have pasted below. I wanted to avoid that initially. I 
felt I was only using a workaround because I hadn't yet figured out how to use 
the dplyr software properly.

Over the weekend I worked out how to summarise across conditionally. That 
changed the nature of the problem, so I have renamed my question.

Below are my "have" and "need" data sets from before, with the columns 
reordered. They are followed by some vectors of variable names that are already 
defined in my code and that I thought might be helpful in producing a 
solution. Next comes dplyr code that summarizes across conditionally. If the 
data submitted to this code were always going to be the same, this would work 
perfectly. That's not the case though, so the across statements that are 
needed will be data dependent. Last, I've pasted my version of a workaround, 
which should work for any dataset.

Ideally, I'd like to get a solution that builds on the summarise across code 
below. It seems likely that would involve dynamically creating the various 
across statements though, and that might wind up being a lot more complicated 
and verbose than my workaround. Another possibility might be to do this in one 
pass using non-dplyr code. If neither of those options works out, it may be 
that the workaround is actually the way to go.

Thanks,

Paul

#### Have and need data ####

library(magrittr)
library(dplyr) 

have <- structure(list(
  ptno = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M",
   "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z"),
  age1 = c(74, 70, 78, 79, 72, 81, 76, 58, 53, 74, 72, 74, 75,
   73, 80, 62, 67, 65, 83, 67, 72, 90, 73, 84, 90, 51),
  age2 = c(71, 67, 72, 74, 65, 79, 70, 49, 45, 68, 70, 71, 74,
   71, 69, 58, 65, 59, 80, 60, 68, 87, 71, 82, 80, 49),
  gender_male = c(1L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 0L,
  1L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 1L, 0L, 1L, 1L, 0L),
  gender_female = c(0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 1L,
    0L, 1L, 1L, 0L, 0L, 0L, 1L, 1L, 0L, 1L, 0L, 0L, 1L),
  race_white = c(0L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L,
 1L, 1L, 0L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L),
  race_black = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
  race_other = c(1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 1L,
 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)),
  row.names = c(NA, -26L), class = c("tbl_df", "tbl", "data.frame"))

have <- have %>%
  select(ptno, age1, gender_male, gender_female, age2, everything())

need <-structure(list(
  age1_mean = 72.8076923076923, age1_std = 9.72838827666425,
  age2_mean = 68.2307692307692, age2_std = 10.2227498934785,
  gender_male_prop = 0.576923076923077, gender_female_prop = 0.423076923076923,
  race_white_prop = 0.769230769230769, race_black_prop = 0.0384615384615385,
  race_other_prop = 0.192307692307692),
  row.names = c(NA, -1L), class = c("tbl_df", "tbl", "data.frame"))

need <- need %>%
  select(age1_mean, age1_std, gender_male_prop, gender_female_prop, age2_mean, 
age2_std, everything())

#### Vectors of variable names ####

vars_num <-  c("age1", "age2")
vars_dmy <-  c("gender", "race")
vars_all <-  c("age1", "age2","gender", "race")

#### dplyr conditional summarize across ####

have %>%
  summarize(
 across(2:2, list(mean = mean, std = sd)),
 across(3:4, list(prop = mean)),
 across(5:5, list(mean = mean, std = sd)),
 across(6:8, list(prop = mean))
  ) %>%
  all.equal(need) 
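
One way the positional across statements might be made data dependent, sketched under assumptions: derive the two sets of column names from the data and vars_num first, then hand each set to its own across() through all_of(). The toy data frame and the names num_cols and prop_cols are made up for illustration, and this has not been run against the full "have" data. Unlike the positional version, it places the mean/std columns ahead of the proportions instead of interleaving them.

```r
library(dplyr)

# Toy stand-in for `have` (hypothetical values)
toy <- data.frame(
  ptno          = c("A", "B", "C", "D"),
  age1          = c(70, 72, 74, 76),
  gender_male   = c(1L, 0L, 1L, 1L),
  gender_female = c(0L, 1L, 0L, 0L)
)
vars_num <- c("age1", "age2")

# Work out the column sets up front, from whatever the data contain.
# Explicit names keep the second across() from selecting anything
# other than the original dummy columns.
num_cols  <- intersect(names(toy), vars_num)
prop_cols <- setdiff(names(toy), c("ptno", num_cols))

res <- toy %>%
  summarise(
    across(all_of(num_cols),  list(mean = mean, std = sd)),
    across(all_of(prop_cols), list(prop = mean))
  )

print(res)
```

Because all_of() takes a plain character vector, nothing like !! or enquo is needed here.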

#### Workaround ####

have %>%
  summarise(across(
    .cols = !contains("ptno"),
    .fns = list(mean = mean, std = sd),
    .names = "{col}_{fn}"
  )) %>%
  select(starts_with(vars_num) | ends_with("mean")) %>%
  rename_at(vars(!starts_with(vars_num)),
            list(~ stringr::str_replace(., "mean$", "prop"))) %>%
  all.equal(need)
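
For the "one pass using non-dplyr code" idea, a minimal base R sketch follows. It loops over the columns once, choosing the statistics by whether the name is in the numeric list, so the output keeps the column order of the input. The function name summarise_one_pass and the toy data are hypothetical, not part of the original code.

```r
# Summarise every non-id column in a single loop.
# Columns named in `num_cols` get mean and sd; all others get a proportion.
summarise_one_pass <- function(df, id_col, num_cols) {
  out <- list()
  for (v in setdiff(names(df), id_col)) {
    x <- df[[v]]
    if (v %in% num_cols) {
      out[[paste0(v, "_mean")]] <- mean(x)
      out[[paste0(v, "_std")]]  <- sd(x)
    } else {
      out[[paste0(v, "_prop")]] <- mean(x)
    }
  }
  as.data.frame(out)
}

# Toy stand-in for `have` (hypothetical values)
toy <- data.frame(
  ptno          = c("A", "B", "C", "D"),
  age1          = c(70, 72, 74, 76),
  gender_male   = c(1L, 0L, 1L, 1L),
  gender_female = c(0L, 1L, 0L, 0L)
)

res <- summarise_one_pass(toy, id_col = "ptno", num_cols = c("age1", "age2"))
print(res)
```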


On Friday, March 26, 2021, 1:08:58 p.m. EDT, Rui Barradas 
 wrote: 

Hello,

Here is a way of doing what the question asks for. There might be 
others, simpler, but this one works.

have %>%
  summarise(across(
    .cols = !contains("ptno"),
    .fns = list(mean = mean, s

[R] dplyr: summarise across using variable names and a condition

2021-03-26 Thread Paul Miller via R-help
Hello All,

Would like to be able to summarize across in dplyr using variable names and a 
condition. Below is an example "have" data set followed by an example "need" 
data set. After that, I've got a vector of numeric variable names. After that, 
I've got the very humble beginnings of a dplyr-based solution.

What I think I need to be able to do is to submit my variable names to dplyr 
and then to have a conditional function. If the variable is in my list of 
names, calculate the mean and the std. If not, then calculate the mean but 
label it as a proportion. The question is how to do that. It appears that using 
variable names might involve !!, or possibly enquo, or possibly quo, but I 
haven't had much success with these. I imagine I might have been very close but 
not quite have gotten it. The conditional part seems less difficult but I'm not 
quite sure how to do that either.

Help with this would be greatly appreciated.

Thanks,

Paul


have <- structure(list(
    ptno = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", 
"M",
 "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", 
"Z"),
    age1 = c(74, 70, 78, 79, 72, 81, 76, 58, 53, 74, 72, 74, 75,
 73, 80, 62, 67, 65, 83, 67, 72, 90, 73, 84, 90, 51),
    age2 = c(71, 67, 72, 74, 65, 79, 70, 49, 45, 68, 70, 71, 74,
 71, 69, 58, 65, 59, 80, 60, 68, 87, 71, 82, 80, 49),
    gender_male = c(1L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 0L,
    1L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 1L, 0L, 1L, 1L, 0L),
    gender_female = c(0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 1L,
  0L, 1L, 1L, 0L, 0L, 0L, 1L, 1L, 0L, 1L, 0L, 0L, 1L),
    race_white = c(0L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L,
   1L, 1L, 0L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L),
    race_black = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
   0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
    race_other = c(1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 1L,
   0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)),
    row.names = c(NA, -26L), class = c("tbl_df", "tbl", "data.frame"))
 

need <-structure(list(
   age1_mean = 72.8076923076923, age1_std = 9.72838827666425,
   age2_mean = 68.2307692307692, age2_std = 10.2227498934785,
   gender_male_prop = 0.576923076923077, gender_female_prop = 
0.423076923076923,
   race_white_prop = 0.769230769230769, race_black_prop = 
0.0384615384615385,
   race_other_prop = 0.192307692307692),
   row.names = c(NA, -1L), class = c("tbl_df", "tbl", "data.frame"))

vars_num <-  c("age1", "age2")

library(magrittr)
library(dplyr)

have %>%
  summarise(across(
  .cols = !contains("ptno"),
  .fns = list(mean = mean, std = sd),
  .names = "{col}_{fn}"
))

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Running list of drugs taken and dropped (via Reduce and accumulate = TRUE or by other means)

2019-01-10 Thread Paul Miller via R-help
Hello All,

Would like to keep a running total of what drugs cancer patients have taken and 
what drugs have been dropped. Searched the Internet and found a way to 
cumulatively paste a series of drug names. Am having trouble figuring out how 
to make the paste conditional though. 

Below is some sample data and code. I'd like to get the paste in the "taken" 
column to add a drug only when change = 1. I'd also like to get the paste in 
the "dropped" column to add a drug only when change = -1. 

Thanks,

Paul


library(dplyr)

sample_data <-
  structure(
    list(
      PTNO = c(82320L, 82320L, 82320L),
      change = c(1, 1, -1),
      drug = c("cetuximab", "docetaxel", "cetuximab")),
    class = c("tbl_df", "tbl", "data.frame"),
    row.names = c(NA, -3L)
  ) %>%
  mutate(
    taken = Reduce(function(x1, x2) paste(x1, x2, sep = ", "), drug,
                   accumulate = TRUE),
    dropped = Reduce(function(x1, x2) paste(x1, x2, sep = ", "), drug,
                     accumulate = TRUE)
  )
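
A possible way to make the running paste conditional, as a base R sketch: fold over row indices instead of over the drug names, and only append when a logical flag is TRUE for that row. The helper name running_paste is hypothetical. It assumes "taken" means every drug ever started (a later drop does not remove it) and that rows before the first qualifying event should show an empty string.

```r
# Cumulative, comma-separated list of drug[i] for rows where keep[i] is TRUE.
running_paste <- function(drug, keep) {
  Reduce(
    function(acc, i) {
      if (!keep[i]) return(acc)                 # row does not qualify: carry forward
      if (acc == "") drug[i] else paste(acc, drug[i], sep = ", ")
    },
    seq_along(drug),
    accumulate = TRUE,
    init = ""
  )[-1]                                         # drop the initial "" seed
}

drug   <- c("cetuximab", "docetaxel", "cetuximab")
change <- c(1, 1, -1)

taken   <- running_paste(drug, change == 1)
dropped <- running_paste(drug, change == -1)

print(data.frame(change, drug, taken, dropped))
```

With several patients, the same helper could be called inside a grouped mutate so that each PTNO gets its own running list.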



Re: [R] R code for if-then-do code blocks

2018-12-19 Thread Paul Miller via R-help
Hi Gabor, Richard, and Thierry, 

Thanks very much for your replies. Turns out I had already hit on Gabor's idea 
of "factor out" in writing an initial draft of the code converting from SAS to 
R. Below is the link Gabor sent describing this and other approaches. 

https://stackoverflow.com/questions/34096162/dplyr-mutate-replace-on-a-subset-of-rows/34096575#34096575

At the end of this email are some new test data plus a snippet of my initial R 
code. The R code I have replicates the result from SAS but is quite verbose. 
That should be obvious from the snippet. I know I can make the code less 
verbose with a subsequent draft but wonder if I can simplify to the point where 
the factor out approach gets a fair test. I'd appreciate it if people could 
share some ways to make the factor out approach less verbose. I'd also like to 
see how well some of the other approaches might work with these data. I spent 
considerable time looking at the link Gabor sent as well as the other responses 
I received. The mutate_cond function in the link seems promising but it wasn't 
clear to me how I could avoid having to repeat the various conditions using 
that approach. 

Thanks again.

Paul
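
The code that follows repeats the same range check and the same 0-to-10 truncation many times. A hedged sketch of how two small helpers might cut that down is shown here. The names in_range, clamp and ccma2_score are hypothetical; the coefficients are copied from the post. It deliberately leaves out the a_CCMA02 precondition and the ">= 10" variants used in cases B to D, so it is not a drop-in replacement.

```r
# TRUE where x is non-missing and inside [lo, hi]
in_range <- function(x, lo = 0, hi = 10) !is.na(x) & x >= lo & x <= hi

# Truncate to [lo, hi]; NA stays NA
clamp <- function(x, lo = 0, hi = 10) pmin(pmax(x, lo), hi)

# Truncated linear predictor for CCMA2
ccma2_score <- function(x69, x70, x72) {
  clamp(0.614 + 0.065 * x69 - 0.012 * x70 + 0.504 * x72)
}

# Toy values (hypothetical)
a69 <- c(7, NA, 0)
a70 <- c(7, 0, NA)
a72 <- c(7, 2, 3)

all_ok         <- in_range(a69) & in_range(a70) & in_range(a72)    # like case "A"
only69_missing <- !in_range(a69) & in_range(a70) & in_range(a72)   # roughly case "B"

ccma2 <- rep(NA_real_, length(a69))
ccma2[all_ok] <- ccma2_score(a69, a70, a72)[all_ok]
# Case B substitutes the mean of the other two items for the missing one
ccma2[only69_missing] <-
  ccma2_score((a70 + a72) / 2, a70, a72)[only69_missing]

print(ccma2)
```

Each condition is written once as a logical vector and each formula once as a function call, which is the part of the "factor out" idea that seems to save the most typing.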

library(magrittr)
library(dplyr)
 
test_data <-
  structure(
    list(
      intPatientId = c("3", "37", "48", "6", "6", "5"),
  intSurveySessionId = c(1L, 10996L, 19264L, 2841L, 28L, 34897L),
  a_CCMA02 = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_, 
NA_integer_, NA_integer_),
  a_CCMA69 = c(7, NA, 0, 2, NA, 0),
  a_CCMA70 = c(7, 0, NA, 10, NA, NA),
  a_CCMA72 = c(7, 2, 3, NA, NA, NA),
  CCMA2 = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_,NA_integer_, 
NA_integer_),
  a_CCMA05 = c(NA, NA, NA, NA, NA, 0),
  a_CCMA43 = c(5, 0, 6, 5, NA, NA),
  a_CCMA44 = c(5, 0, 0, 5, 0, NA),
  CCMA5 = c(NA, NA, NA, NA, NA, 0)
    ),
    class = "data.frame",
    row.names = c(NA,-6L)
  )

factor_out <- test_data %>%
  mutate(
    CCMA2_cond = case_when(
  (is.na(a_CCMA02) | a_CCMA02 < 0 | a_CCMA02 > 10) &
    (!is.na(a_CCMA69) & between(a_CCMA69, 0, 10) &
   !is.na(a_CCMA70) & between(a_CCMA70, 0, 10) &
   !is.na(a_CCMA72) & between(a_CCMA72, 0, 10)) ~ "A",
  (is.na(a_CCMA02) | a_CCMA02 < 0 | a_CCMA02 > 10) &
    (is.na(a_CCMA69) | a_CCMA69 < 0 | a_CCMA69 >= 10) &
    !is.na(a_CCMA70) & between(a_CCMA70, 0, 10) &
    !is.na(a_CCMA72) & between(a_CCMA72, 0, 10) ~ "B",
  (is.na(a_CCMA02) | a_CCMA02 < 0 | a_CCMA02 > 10) &
    (is.na(a_CCMA70) | a_CCMA70 < 0 | a_CCMA70 >= 10) &
    between(a_CCMA69, 0, 10) & between(a_CCMA72, 0, 10) ~ "C",
  (is.na(a_CCMA02) | a_CCMA02 < 0 | a_CCMA02 > 10) &
    (is.na(a_CCMA72) | a_CCMA72 < 0 | a_CCMA72 >= 10) &
    between(a_CCMA69, 0, 10) & between(a_CCMA70, 0, 10) ~ "D")
  ) %>%
  mutate(
    CCMA2 = case_when(
  CCMA2_cond == "A" & 0.614 + (0.065 * a_CCMA69) + (-0.012 * a_CCMA70) + 
(0.504 * a_CCMA72) < 0  ~ 0,
  CCMA2_cond == "A" & 0.614 + (0.065 * a_CCMA69) + (-0.012 * a_CCMA70) + 
(0.504 * a_CCMA72) > 10 ~ 10,
  CCMA2_cond == "A" ~ 0.614 + (0.065 * a_CCMA69) + (-0.012 * a_CCMA70) + 
(0.504 * a_CCMA72),
  TRUE ~ as.double(CCMA2)
    ),
    CCMA2 = case_when(
  CCMA2_cond == "B" & 0.614 + (0.065 * (a_CCMA70 + a_CCMA72) / 2) + (-0.012 
* a_CCMA70) + (0.504 * a_CCMA72) < 0  ~ 0,
  CCMA2_cond == "B" & 0.614 + (0.065 * (a_CCMA70 + a_CCMA72) / 2) + (-0.012 
* a_CCMA70) + (0.504 * a_CCMA72) > 10 ~ 10,
  CCMA2_cond == "B" ~ 0.614 + (0.065 * (a_CCMA70 + a_CCMA72) / 2) + (-0.012 
* a_CCMA70) + (0.504 * a_CCMA72),
  TRUE ~ as.double(CCMA2)
    ),
    CCMA2 = case_when(
  CCMA2_cond == "C" & 0.614 + (0.065 * a_CCMA69) + (-0.012 *(a_CCMA72 + 
a_CCMA69) / 2 ) + (0.504 * a_CCMA72) < 0  ~ 0,
  CCMA2_cond == "C" & 0.614 + (0.065 * a_CCMA69) + (-0.012 *(a_CCMA72 + 
a_CCMA69) / 2 ) + (0.504 * a_CCMA72) > 10 ~ 10,
  CCMA2_cond == "C" ~ 0.614 + (0.065 * a_CCMA69) + (-0.012 *(a_CCMA72 + 
a_CCMA69) / 2 ) + (0.504 * a_CCMA72),
  TRUE ~ as.double(CCMA2)
    ),
    CCMA2 = case_when(
  CCMA2_cond == "D" & 0.614 + (0.065 * a_CCMA69) + (-0.012 * a_CCMA70 ) + 
(0.504 *(a_CCMA70 + a_CCMA69) / 2) < 0  ~ 0,
  CCMA2_cond == "D" & 0.614 + (0.065 * a_CCMA69) + (-0.012 * a_CCMA70 ) + 
(0.504 *(a_CCMA70 + a_CCMA69) / 2) > 10 ~ 10,
  CCMA2_cond == "D" ~ 0.614 + (0.065 * a_CCMA69) + (-0.012 * a_CCMA70 ) + 
(0.504 *(a_CCMA70 + a_CCMA69) / 2),
  TRUE ~ as.double(CCMA2)
    )
  ) %>%
  select(-CCMA2_cond) %>%
  mutate(
    CCMA5_condA = if_else(
  (is.na(a_CCMA05) | a_CCMA05 < 0 | a_CCMA05 > 10),
  1, 0
    ),
    CCMA5 = ifelse(CCMA5_condA == 1 & between(a_CCMA43, 0, 10) & 
between(a_CCMA44, 0, 10),
   0.216 + (0.257 * a_CCMA43) + (0.828 * a_CCMA44),
   CCMA5),
    CCMA5 = ifelse(CCMA5_condA == 1 & between(a_CCMA43, 0, 10) & 
(is.na(a_CCMA44) | a_CCMA44 < 0 | a_CCMA44 > 10),
   0.216 + (0.257 * a_CCMA43) + (0.828 * 

[R] R code for if-then-do code blocks

2018-12-17 Thread Paul Miller via R-help
Hello All,

Season's greetings!

 Am trying to replicate some SAS code in R. The SAS code uses if-then-do code 
blocks. I've been trying to do likewise in R as that seems to be the most 
reliable way to get the same result. 

Below is some toy data and some code that does work. There are some things I 
don't necessarily like about the code though. So I was hoping some people could 
help make it better. One thing I don't like is that the within function 
reverses the order of the computed columns such that test1:test5 becomes 
test5:test1. I've used a mutate to overcome that but would prefer not to have 
to do so. 

 Another, perhaps very small thing, is the need to calculate an ID variable 
that becomes the basis for a grouping. 

I did considerable Internet searching for R code that conditionally computes 
blocks of code. I didn't find much though and so am wondering if my search 
terms were not sufficient or if there is some other reason. It occurred to me 
that maybe if-then-do code blocks like we often see in SAS are frowned upon 
and therefore not much implemented. 

I'd be interested in seeing more R-compatible approaches if this is the case. 
I've learned that it's a mistake to try and make R be like SAS. It's better to 
let R be R. Trouble is I'm not always sure how to do that. 

Thanks,

Paul


d1 <- data.frame(workshop=rep(1:2,4),
    gender=rep(c("f","m"),each=4))

library(tibble)
library(plyr)

d2 <- d1 %>%
  rownames_to_column("ID") %>%
  mutate(test1 = NA, test2 = NA, test4 = NA, test5 = NA) %>%
  ddply("ID",
    within,
    if (gender == "f" & workshop == 1) {
  test1 <- 1
  test1 <- 6 + test1
  test2 <- 2 + test1
  test4 <- 1
  test5 <- 1
    } else {
  test1 <- test2 <- test4 <- test5 <- 0
    })
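
A more vectorized take on the same block, offered as a sketch and not as the accepted R idiom: compute the condition once for all rows, create the new columns with their "else" values, run the "do" block on the qualifying rows only, and copy the results back by name. This avoids both the ID grouping and the column reordering, at the cost of an extra assignment. The names cond, new_cols and block are made up for illustration.

```r
d1 <- data.frame(workshop = rep(1:2, 4),
                 gender   = rep(c("f", "m"), each = 4))

# The "if" condition, evaluated for every row at once
cond <- d1$gender == "f" & d1$workshop == 1

# Create the new columns in the desired order with the "else" values
new_cols <- c("test1", "test2", "test4", "test5")
d2 <- d1
d2[new_cols] <- 0

# The "then do" block, applied only to the rows that satisfy the condition
block <- within(d2[cond, ], {
  test1 <- 1
  test1 <- 6 + test1
  test2 <- 2 + test1
  test4 <- 1
  test5 <- 1
})

# Copy results back by column name, so within()'s ordering does not matter
d2[cond, new_cols] <- block[new_cols]

print(d2)
```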



Re: [R] Extracting sentences with combinations of target words/terms from cancer patient text medical records

2017-07-13 Thread Paul Miller via R-help
Hi Robert,

Thank you for your reply. An attempt to solve this via a regular expression 
query is particularly helpful. Unfortunately, I don't have much time to play 
around with this just now. Ultimately though, I think I would like to implement 
a solution something along the lines of what you have done. I have a book on 
regular expressions that I am now starting to read. In the meantime, the code 
I'm using is a good way to assess the feasibility of some ideas I'd like to 
implement.

The advantage of your approach I think is that it makes fewer passes through 
the data. That should make it a lot faster and more efficient than what I've 
done. I'm currently working with a little more than 2.5 million text records 
and I think that number will only rise. So efficiency really should matter. 

I've pasted the latest version of my sample code below. This shows how I'd like 
to add the result of the text search as a column in a data frame. It also shows 
how I'd like to append the sentence number to each identified sentence. The 
single colon that appears where there is no match is not by design. It's 
something that I need to tidy.

My sense is that if I used your regular expression as written, I'd lose the 
information about the sentence number when I added the result as a column in my 
data frame. Presumably, I'd need to collapse the information into a single text 
string, and then the numbering would be lost. If you were going to get the 
sentence numbers as well, without making several passes through the data like 
my code does, how would you go about it?

Thanks,

Paul


library(tidyverse)
library(stringr)
library(lubridate)
 
sentence_match <- function(x){
  sentence_extract <- str_extract_all(x, boundary("sentence"), simplify = TRUE)
  sentence_number <- intersect(str_which(sentence_extract, "breast"), 
str_which(sentence_extract, "metastatic|stage IV"))
  sentence_match <- str_c(sentence_number, ": ", 
sentence_extract[sentence_number], collapse = "")
  sentence_match
}
 
sampletxt <-
  structure(
list(
  PTNO = c(1, 2, 2, 2),
  DATE = structure(c(16436, 16436, 16832, 16845), class = "Date"),
  TYPE = c("Progress note", "CAT scan", "Progress note", "Progress note"),
  TVAR = c(
    "This sentence contains the word metastatic. This sentence contains the term stage IV.",
    "This sentence contains no target words. This sentence also contains no target words.",
    "This sentence contains the word metastatic and the word breast. This sentence contains no target words.",
    "This sentence contains the words breast and the term metastatic. This sentence contains the word breast and the term stage IV."
  )
),
.Names = c("PTNO", "DATE", "TYPE", "TVAR"),
class = c("tbl_df",
  "tbl", "data.frame"),
row.names = c(NA,-4L)
  )
 
sampletxt$EXTRACTED <- sapply(sampletxt$TVAR, sentence_match)
sampletxt$EXTRACTED
 
> sampletxt$EXTRACTED
[1] ": "
[2] ": "
[3] "1: This sentence contains the word metastatic and the word breast. "
[4] "1: This sentence contains the words breast and the term metastatic. 2: This sentence contains the word breast and the term stage IV."
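
A sketch of one way to keep the sentence numbers while splitting each record only once and returning NA instead of a stray colon when nothing matches. It uses base R only. The function name sentence_match2 is hypothetical, the sentence splitter is deliberately naive (abbreviations such as "Dr." would split early), and case-insensitive matching is an assumption borrowed from Robert's regular expression, not from the original function.

```r
sentence_match2 <- function(x) {
  # Split into sentences: runs of non-terminators followed by terminators
  sentences <- trimws(regmatches(x, gregexpr("[^.?!]+[.?!]+", x))[[1]])

  # Both conditions are tested on the already-split sentences
  hit <- grepl("breast", sentences, ignore.case = TRUE) &
    grepl("metastatic|stage IV", sentences, ignore.case = TRUE)

  if (!any(hit)) return(NA_character_)
  paste0(which(hit), ": ", sentences[hit], collapse = " ")
}

tvar <- c(
  "This sentence contains the word metastatic. This sentence contains the term stage IV.",
  "This sentence contains no target words. This sentence also contains no target words.",
  "This sentence contains the word metastatic and the word breast. This sentence contains no target words.",
  "This sentence contains the words breast and the term metastatic. This sentence contains the word breast and the term stage IV."
)

# One result per record, suitable for assigning to a data frame column
extracted <- vapply(tvar, sentence_match2, character(1), USE.NAMES = FALSE)
print(extracted)
```

vapply() is used in place of sapply() so the result is always a plain character vector with one element per record.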


From: Robert McGehee <rmcge...@walleyetrading.net>

Cc: "r-help@r-project.org" <r-help@r-project.org>
Sent: Wednesday, July 12, 2017 12:47 PM
Subject: RE: [R] Extracting sentences with combinations of target words/terms 
from cancer patient text medical records



Hi Paul,
Sounds like you have your answer, but for fun I thought I'd try solving your 
problem using only a regular expression query and base R. I believe this works:

> txt <- "Patient had stage IV breast cancer. Nothing matches this sentence. Metastatic and breast match this sentence. French bike champion takes stage IV victory in Tour de France."

> pattern <- "([^.?!]*(?=[^.?!]*\\bbreast\\b)(?=[^.?!]*\\b(metastatic|stage IV)\\b)(?=[\\s.?!])[^.?!]*[.?!])"

> regmatches(txt, gregexpr(pattern, txt, perl=TRUE, ignore.case=TRUE))[[1]]
[1] "Patient had stage IV breast cancer."
[2] " Metastatic and breast match this sentence."

Cheers, Robert

-Original Message-
From: R-help [mailto:r-help-boun...@r-project.org] On Behalf Of Paul Miller via 
R-help
Sent: Wednesday, July 12, 2017 8:49 AM
T

Re: [R] Extracting sentences with combinations of target words/terms from cancer patient text medical records

2017-07-12 Thread Paul Miller via R-help
Hi Bert,

Thanks for your reply. It appears that I didn't replace the variable name 
"sampletxt" with the argument "x" in my function. I've corrected that and now 
my code seems to be working fine.

Paul


From: Bert Gunter <bgunter.4...@gmail.com>

Cc: R-help <r-help@r-project.org>
Sent: Tuesday, July 11, 2017 2:00 PM
Subject: Re: [R] Extracting sentences with combinations of target words/terms 
from cancer patient text medical records



Have you looked at the CRAN Natural Language Processing Task View? If not, why 
not? If so, why were the resources described there inadequate?

Bert


On Jul 11, 2017 10:49 AM, "Paul Miller via R-help" <r-help@r-project.org> wrote:

Hello All,
>
>I need some help figuring out how to extract combinations of target 
>words/terms from cancer patient text medical records. I've provided some 
>sample data and code below to illustrate what I'm trying to do. At the moment, 
>I'm trying to extract sentences that contain the word "breast" plus either 
>"metastatic" or "stage IV".
>
>It's been some time since I used R and I feel a bit rusty. I wrote a function 
>called "sentence_match" that seemed to work well when applied to a single 
>piece of text. You can see that by running the section titled "Working code". 
>I thought it might be possible to apply my function easily to a data set 
>(tibble or df) but that doesn't seem to be the case. My unsuccessful attempt 
>to do this appears in the section titled "Non-working code".
>
>If someone could help me get my code up and running, that would be greatly 
>appreciated. I'm using a lot of functions from Hadley Wickham's packages, but 
>that's not particularly necessary. Although I have only a few entries in my 
>sample data, my actual data are pretty large. Currently, I'm working with over 
>a million records. Some records contain only a single sentence, but many have 
>several paragraphs. One concern I had was that, even if I could get my code 
>working, it would be too inefficient to handle that volume of data.
>
>Thanks,
>
>Paul
>
>
>library(tidyverse)
>library(stringr)
>library(lubridate)
>
>sentence_match <- function(x){
>  sentence_extract <- str_extract_all(sampletxt, boundary("sentence"),
>    simplify = TRUE)
>  sentence_number <- intersect(str_which(sentence_extract, "breast"),
>    str_which(sentence_extract, "metastatic|stage IV"))
>  sentence_match <- str_c(sentence_number, ": ",
>    sentence_extract[sentence_number], collapse = "")
>  sentence_match
>}
>
>#### Working code ####
>
>sampletxt <- "This sentence contains the word metastatic and the word breast. 
>This sentence contains no target words."
>
>sentence_match(sampletxt)
>
>#### Non-working code ####
>
>sampletxt <-
>  structure(
>list(
>  PTNO = c(1, 2, 2, 2),
>  DATE = structure(c(16436, 16436, 16832, 16845), class = "Date"),
>  TYPE = c("Progress note", "CAT scan", "Progress note", "Progress note"),
>  TVAR = c(
>"This sentence contains the word metastatic. This sentence contains 
> the term stage IV.",
>"This sentence contains no target words. This sentence also contains 
> no target words.",
>"This sentence contains the word metastatic and the word breast. This 
> sentence contains no target words.",
>"This sentence contains the words breast and the term metastatic. This
>sentence contains the word breast and the term stage IV."
>  )
>),
>.Names = c("PTNO", "DATE", "TYPE", "TVAR"),
>class = c("tbl_df",
>  "tbl", "data.frame"),
>row.names = c(NA,-4L)
>  )
>
>sampletxt2 <- group_by_at(sampletxt, vars(PTNO, DATE, TYPE))
>sampletxt2 <-
>  sampletxt2 %>%
>  mutate(
>EXTRACTED = sentence_match(TVAR)
>  )
>



[R] Extracting sentences with combinations of target words/terms from cancer patient text medical records

2017-07-11 Thread Paul Miller via R-help
Hello All,

I need some help figuring out how to extract combinations of target words/terms 
from cancer patient text medical records. I've provided some sample data and 
code below to illustrate what I'm trying to do. At the moment, I'm trying to 
extract sentences that contain the word "breast" plus either "metastatic" or 
"stage IV". 

It's been some time since I used R and I feel a bit rusty. I wrote a function 
called "sentence_match" that seemed to work well when applied to a single piece 
of text. You can see that by running the section titled "Working code". I 
thought it might be possible to apply my function easily to a data set (tibble 
or df) but that doesn't seem to be the case. My unsuccessful attempt to do this 
appears in the section titled "Non-working code". 

If someone could help me get my code up and running, that would be greatly 
appreciated. I'm using a lot of functions from Hadley Wickham's packages, but 
that's not particularly necessary. Although I have only a few entries in my 
sample data, my actual data are pretty large. Currently, I'm working with over 
a million records. Some records contain only a single sentence, but many have 
several paragraphs. One concern I had was that, even if I could get my code 
working, it would be too inefficient to handle that volume of data. 

Thanks,

Paul


library(tidyverse)
library(stringr)
library(lubridate)
 
sentence_match <- function(x){
  sentence_extract <- str_extract_all(sampletxt, boundary("sentence"), simplify 
= TRUE)
  sentence_number <- intersect(str_which(sentence_extract, "breast"), 
str_which(sentence_extract, "metastatic|stage IV"))
  sentence_match <- str_c(sentence_number, ": ", 
sentence_extract[sentence_number], collapse = "")
  sentence_match
}
 
#### Working code ####
 
sampletxt <- "This sentence contains the word metastatic and the word breast. 
This sentence contains no target words."

sentence_match(sampletxt)
 
#### Non-working code ####
 
sampletxt <-
  structure(
list(
  PTNO = c(1, 2, 2, 2),
  DATE = structure(c(16436, 16436, 16832, 16845), class = "Date"),
  TYPE = c("Progress note", "CAT scan", "Progress note", "Progress note"),
  TVAR = c(
    "This sentence contains the word metastatic. This sentence contains the term stage IV.",
    "This sentence contains no target words. This sentence also contains no target words.",
    "This sentence contains the word metastatic and the word breast. This sentence contains no target words.",
    "This sentence contains the words breast and the term metastatic. This sentence contains the word breast and the term stage IV."
  )
),
.Names = c("PTNO", "DATE", "TYPE", "TVAR"),
class = c("tbl_df",
  "tbl", "data.frame"),
row.names = c(NA,-4L)
  )
  
sampletxt2 <- group_by_at(sampletxt, vars(PTNO, DATE, TYPE))
sampletxt2 <- 
  sampletxt2 %>%
  mutate(
EXTRACTED = sentence_match(TVAR)
  )
