I am working with utterances, i.e. statements spoken by children.  For each 
utterance, if one or more of its words match a predefined list of 'core' words 
(probably about 300 words), I want to enter '1' in a 'Core' column; if none 
match, I want to enter '0' in 'Core'.

If there are one or more words in the statement that are NOT core words, then I 
want to input '1' into 'Fringe' (and if there are only core words and nothing 
extra, then input '0' into 'Fringe').  I will not have a list of Fringe words.
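
To make the intended rule concrete, here is a minimal base R sketch of it. The names `core_words` and `flag_utterance` and the short word list are made up for illustration, and the sketch assumes every core entry is a single word, so a multi-word entry such as 'all done' would need separate handling.

```r
# Hypothetical short core list; the real list would have about 300 entries
core_words <- c("a", "yes", "is", "it", "that", "what", "on", "go", "you", "here")

# Flag one utterance by exact, whole-word matching:
#   Core   = 1 if at least one word is in the core list
#   Fringe = 1 if at least one word is NOT in the core list
flag_utterance <- function(utterance, core) {
  tokens <- strsplit(trimws(tolower(utterance)), "\\s+")[[1]]
  tokens <- tokens[nzchar(tokens)]          # drop empty tokens, if any
  is_core <- tokens %in% tolower(core)
  c(Core = as.integer(any(is_core)), Fringe = as.integer(any(!is_core)))
}

utterances <- c("a baby", "small", "yes", "what is that on his head")
flags <- t(sapply(utterances, flag_utterance, core = core_words))
data.frame(Utterance = utterances, flags, row.names = NULL)
```

Under this rule 'a baby' would get Core = 1 and Fringe = 1, and 'small' would get Core = 0 and Fringe = 1.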

Basically, right now I have a child ID and only the utterances.  Here is a 
snippet of my data.

ID      Utterance
1       a baby
2       small
3       yes
4       where's his bed
5       there's his bed
6       where's his pillow
7       what is that on his head
8       hey he has his arm stuck here
9       there there's it
10      now you're gonna go night-night
11      and that's the thing you can turn on
12      yeah where's the music box
13      what is this
14      small
15      there you go baby


The following code runs but does not do exactly what I need, which is: 1) 
detect words from the list and flag the utterance as core; 2) search the 
utterance and, if any of its words are NOT core, flag it with '1' for fringe, 
since I will not have a list of fringe words.

```

library(dplyr)
library(stringr)
library(tidyr)

coreWords <-c("I", "no", "yes", "my", "the", "want", "is", "it", "that", "a", 
"go", "mine", "you", "what", "on", "in", "here", "more", "out", "off", "some", 
"help", "all done", "finished")

dfplus <- df %>%
  mutate(id = row_number()) %>%
  separate_rows(Utterance, sep = ' ') %>%
  mutate(Core = + str_detect(Utterance, str_c(coreWords, collapse = '|')),
         Fringe = + !Core) %>%
  group_by(id) %>%
  mutate(Core = + (sum(Core) > 0),
         Fringe = + (sum(Fringe) > 0)) %>%
  slice(1) %>%
  select(-Utterance) %>%
  left_join(df) %>%
  ungroup() %>%
  select(Utterance, Core, Fringe, ID)

```

The dput() code is:

```
structure(list(Utterance = c("a baby", "small", "yes", "where's his bed",
"there's his bed", "where's his pillow", "what is that on his head",
"hey he has his arm stuck here", "there there's it",
"now you're gonna go night-night",
"and that's the thing you can turn on", "yeah where's the music box",
"what is this", "small", "there you go baby ", "what is this for ",
"a ", "and the go goodnight here ", "and what is this ", " what's that sound ",
"what does she say ", "what she say",
"should I turn the on so Laura doesn't cry ",
"what is this ", "what is that ", "where's clothes ",
" where's the baby's bedroom ",
"that might be in dad's bed+room ", "yes ", "there you go baby ",
"you're welcome "), Core = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L), Fringe = c(0L, 0L, 0L, 1L, 1L, 1L,
0L, 1L, 0L, 1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), ID = 1:31), row.names = c(NA,
-31L), class = c("tbl_df", "tbl", "data.frame"))

```

The first 10 rows of the output look like this:

    Utterance                         Core  Fringe  ID
1   a baby                            1     0       1
2   small                             1     0       2
3   yes                               1     0       3
4   where's his bed                   1     1       4
5   there's his bed                   1     1       5
6   where's his pillow                1     1       6
7   what is that on his head          1     0       7
8   hey he has his arm stuck here     1     1       8
9   there there's it                  1     0       9
10  now you're gonna go night-night   1     1       10

For example, in row 1 of the output, 'a' is a core word, so '1' for Core is 
correct.  However, 'baby' should be picked up as fringe, so Fringe should be 
'1', not '0'. Rows 7 and 9 also contain words that should be identified as 
fringe but are not.

Additionally, it seems that an utterance is counted as core if it merely 
contains part of a core word. For example, 'small' is identified as core even 
though it is not a core word (although 'all done' is). 'Where's his bed' is 
identified as both core and fringe, although none of its words are core.
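
One way to check the partial-match suspicion is the sketch below. It uses base R's `grepl` as a stand-in for `str_detect` (both do unanchored regular-expression matching here), and the names `loose`, `strict`, `words`, and `hits` are made up for illustration. Wrapping the alternation in `\\b` word boundaries is only one possible way to require whole-word matches; it is not a tested fix for the full data.

```r
# Core list copied from the code above
coreWords <- c("I", "no", "yes", "my", "the", "want", "is", "it", "that", "a",
               "go", "mine", "you", "what", "on", "in", "here", "more", "out",
               "off", "some", "help", "all done", "finished")

words <- c("small", "baby", "where's", "his")

# Unanchored alternation, as in the original pipeline:
# a core word occurring anywhere inside a longer word counts as a match
loose <- paste(coreWords, collapse = "|")
grepl(loose, words)

# Which core words occur as plain substrings of "small"?
hits <- coreWords[vapply(coreWords, grepl, logical(1), x = "small", fixed = TRUE)]
hits

# Same alternation wrapped in word boundaries:
# only whole words (or whole phrases such as "all done") match
strict <- paste0("\\b(", paste(coreWords, collapse = "|"), ")\\b")
grepl(strict, words, perl = TRUE)
grepl(strict, c("a baby", "we are all done"), perl = TRUE)
```

Since Fringe is computed as the negation of the per-word Core flag, any word that contains a core word as a substring would also be missed as fringe, which would be consistent with rows 1, 7, and 9.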

Any suggestions on what is happening and how to correct it are greatly 
appreciated.


