Interesting thoughts about the partial-word matches, and speed. On
another real data set, about 73,000 records and 6 columns to search
through for matches (one column of which contains very long character
strings--several paragraphs each), I ran both John's and Bert's
solutions. John's was
Note that John's solution probably includes incorrect partial matches
and that mine fails to match "red" in "this is red." If you change my
proposal to
sapply(strsplit(do.call(paste,zz[,2:3]),"\\W"), function(x)any(x %in%
alarm.words))
it should agree with Jeff's. Note, however, that you have missed
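To see why the split pattern matters, here is a minimal sketch on made-up data (the `txt` vector and the two `alarm.words` below are illustrative assumptions, not the poster's data). Splitting on whitespace leaves trailing punctuation attached to a word, while splitting on non-word characters drops it.

```r
# Hypothetical data, not the original poster's
alarm.words <- c("red", "green")
txt <- c("this is red.", "a barred owl", "green turtle")

# Split on whitespace: "red." keeps its period, so it is not found by %in%
by_space <- sapply(strsplit(txt, "[[:space:]]+"),
                   function(x) any(x %in% alarm.words))

# Split on non-word characters: the period is discarded, so "red" is found
by_nonword <- sapply(strsplit(txt, "\\W"),
                     function(x) any(x %in% alarm.words))

print(by_space)    # FALSE FALSE  TRUE
print(by_nonword)  #  TRUE FALSE  TRUE
```

In both cases "barred" is not a match, because `%in%` compares whole tokens.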
Thanks everyone. John's original solution worked great. And with
27,000 records, 65 alarm.words, and 6 columns to search, it takes only
about 15 seconds. That is certainly adequate for my needs. But I
will try out the other strategies too.
And thanks also for lots of new R things to
Yes. This is one of the fundamental challenges in text searching --
defining exactly what text defines a match and what doesn't. So,
continuing your example, one might imagine that heroin and heroine
might both be matches, but maybe heroines shouldn't be (e.g. if the
text contains movie reviews).
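One way to express that kind of rule is a regular expression with an optional final letter and word boundaries. This is only a sketch: the `reviews` strings and the `pattern` name are invented for illustration.

```r
# Illustrative strings only
reviews <- c("a heroin bust", "the heroine wins", "two heroines meet")

# \\b anchors the match at word boundaries; "e?" allows an optional trailing "e",
# so "heroin" and "heroine" match but "heroines" does not
pattern <- "\\bheroine?\\b"
hits <- grepl(pattern, reviews)

print(hits)  # TRUE  TRUE FALSE
```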
Here's a way to do it that uses %in% (i.e. match() ) and uses only a
single, not a double, loop. It should be more efficient.
sapply(strsplit(do.call(paste,zz[,2:3]),"[[:space:]]+"),
       function(x)any(x %in% alarm.words))
[1] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE TRUE
The
I think grep is better suited to this:
zz$v5 <- grepl( paste0( alarm.words, collapse="|" ), do.call( paste, zz[ , 2:3 ] ) )
---
Jeff Newmiller
Jeff:
Well, it would be much better (no loops!) except, I think, for one
issue: "red" would match "barred" and I don't think that this is what
is wanted: the matches should be on whole words not just string
patterns.
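The partial-match issue is easy to reproduce. In this sketch the `txt` vector and the `alarm.words` values are made up for illustration:

```r
# Hypothetical data for illustration
alarm.words <- c("red", "green")
txt <- c("barred owl", "red fox", "blue bird")

# Plain alternation matches the pattern anywhere inside the string
pattern <- paste0(alarm.words, collapse = "|")   # "red|green"
partial <- grepl(pattern, txt)

print(partial)  # TRUE  TRUE FALSE  ("barred" contains "red")
```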
So you would need to fix up the matching pattern to make this work,
but it may be a
Just add a word break marker before and after:
zz$v5 <- grepl( paste0( "\\b(", paste0( alarm.words, collapse="|" ), ")\\b" ),
do.call( paste, zz[ , 2:3 ] ) )
---
Jeff Newmiller
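A small end-to-end sketch of the word-boundary approach follows. The data frame here is a made-up stand-in for the poster's `zz` (the original is truncated in this archive), and `pattern` is just a helper name.

```r
# Hypothetical stand-in for the poster's data frame
zz <- data.frame(
  v1 = 1:4,
  v2 = c("white bird", "green turtle", "barred owl", "big black dog"),
  v3 = c("harry", "ronald", "hermione", "this is red."),
  stringsAsFactors = FALSE
)
alarm.words <- c("red", "green")

# Wrap the alternation in a group and require a word boundary on each side
pattern <- paste0("\\b(", paste0(alarm.words, collapse = "|"), ")\\b")

# Paste columns 2 and 3 together row by row, then search each pasted string once
zz$v5 <- grepl(pattern, do.call(paste, zz[, 2:3]))

print(zz$v5)  # FALSE  TRUE FALSE  TRUE
```

Here "barred" is no longer flagged, and "red." is still found because the period counts as a word boundary.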
Yup, that does it. Let grep figure out what's a word rather than doing
it manually. Forgot about \b
Cheers,
Bert
Bert Gunter
Data is not information. Information is not knowledge. And knowledge
is certainly not wisdom.
-- Clifford Stoll
On Thu, Jul 9, 2015 at 10:30 AM, Jeff Newmiller
Running R 3.1.1 on Windows 7
I want to identify as a case any record in a dataframe that contains any
of several keywords in any of several variables.
Example:
# create a dataframe with 4 variables and 10 records
v2 <- c("white bird", "blue bird", "green turtle", "quick brown fox",
"big black dog", "waffle
Dear Chris,
If I understand correctly what you want, how about the following?
rows <- apply(zz[, 2:3], 1, function(x) any(sapply(alarm.words, grepl, x=x)))
zz[rows, ]
         v1           v2     v3 v4
3 -1.022329 green turtle ronald
Dear Christopher,
My usual orientation to this kind of one-off problem is that I'm looking for a
simple correct solution. Computing time is usually much smaller than
programming time.
That said, Bert Gunter's solution was about 5 times faster in a simple check
that I ran with microbenchmark,