Hi Martin and everybody,
Sorry for the long delay, and thanks for all the suggestions. With my code
and my training data I see timings similar to the ones below.
Thanks
Cheers
Fabien
I did this to generate and search 40 million unique strings
> grams <- as.character(1:4e7) ## a long time passes...
> system.time(grep("^900001", grams)) ## similar times to grepl
user system elapsed
10.384 0.168 10.543
Is that the basic task you're trying to accomplish? grep() and grepl()
hand off to C almost immediately, so I don't think data.table or other
packages will be markedly faster if you're looking for an arbitrary
regular expression (use fixed=TRUE if you're looking for a literal
string rather than a pattern).
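
To make the fixed=TRUE remark concrete, here is a minimal sketch on a much
smaller stand-in vector (1e5 strings instead of 4e7; the size and the
pattern "9000" are my own choices for illustration, not from the timings
above). It shows that a regex anchored with "^", a literal substring search
with fixed=TRUE, and a whole-string comparison answer three different
questions, so which one is appropriate depends on what "match" means for
your data.

```r
# Small stand-in for the 40-million-element vector, so this runs quickly.
# (Assumption: 1e5 elements is enough to illustrate the behaviour.)
grams <- as.character(1:1e5)

# Regular expression: "^" anchors the match at the start of each string.
res_regex <- grepl("^9000", grams)

# fixed = TRUE treats the pattern as a literal substring, matched anywhere
# in the string; regex metacharacters lose their special meaning.
res_fixed <- grepl("9000", grams, fixed = TRUE)

# Whole-string equality needs neither grep nor grepl.
res_exact <- grams == "9000"

# Compare how many elements each approach selects.
c(regex = sum(res_regex), fixed = sum(res_fixed), exact = sum(res_exact))
```

Every string selected by the anchored regex is also selected by the
fixed=TRUE search, but not the other way round, since the literal search is
not anchored.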
If you're looking for strings that start with a pattern, then R 3.3.0
and later provide startsWith():
> system.time(res0 <- startsWith(grams, "900001"))
user system elapsed
0.658 0.012 0.669
which returns the same result as grepl
> identical(res0, res1 <- grepl("^900001", grams))
[1] TRUE
One can also parallelize the already vectorized grepl function with
parallel::pvec, with some opportunity for gain in elapsed time (compared
to grepl) on non-Windows platforms, since pvec relies on forking:
> system.time(res2 <- parallel::pvec(seq_along(grams), function(i)
+     grepl("^900001", grams[i]), mc.cores=8))
user system elapsed
24.996 1.709 3.974
> identical(res0, res2)
[1] TRUE
I think anything else would require pre-processing of some kind, and
then some more detail about what your data looks like is required.
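
As one possible reading of "pre-processing", here is a rough sketch that
builds a lookup table keyed on the first k characters once, then answers
prefix queries by scanning only the matching bucket. The names k, index and
lookup_prefix are hypothetical, the vector is a small stand-in, and whether
this pays off depends on how many queries you run and on what the real data
looks like. It assumes R >= 3.3.0 for startsWith().

```r
# Small stand-in for the real data (assumption: 1e5 strings).
grams <- as.character(1:1e5)

# Hypothetical bucket width: index every string by its first k characters.
k <- 3L

# One-off cost: a named list mapping each k-character prefix to the
# positions of the strings that begin with it.
index <- split(seq_along(grams), substr(grams, 1, k))

# Query: look up the bucket, then check the full prefix only inside it.
lookup_prefix <- function(prefix) {
  stopifnot(nchar(prefix) >= k)        # shorter prefixes span several buckets
  candidates <- index[[substr(prefix, 1, k)]]
  if (is.null(candidates)) return(integer(0))   # no string has this prefix
  candidates[startsWith(grams[candidates], prefix)]
}

hits <- lookup_prefix("9000")
# Same positions as a full scan of the vector.
identical(hits, which(startsWith(grams, "9000")))
```

The index costs time and memory to build, so it only makes sense when the
same vector is searched many times.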
--
Dr Fabien Tarrade
Quantitative Analyst/Developer - Data Scientist
Senior data analyst specialised in the modelling, processing and
statistical treatment of data.
PhD in Physics, 10 years of experience as researcher at the forefront of
international scientific research.
Fascinated by finance and data modelling.
Geneva, Switzerland
Email : cont...@fabien-tarrade.eu <mailto:cont...@fabien-tarrade.eu>
Web : www.fabien-tarrade.eu <http://www.fabien-tarrade.eu>
Phone : +33 (0)6 14 78 70 90
LinkedIn <http://ch.linkedin.com/in/fabientarrade/>
Twitter <https://twitter.com/fabtar>
Google <https://plus.google.com/+FabienTarradeProfile/posts>
Facebook <https://www.facebook.com/fabien.tarrade.eu>
Skype <skype:fabtarhiggs?call>
Xing <https://www.xing.com/profile/Fabien_Tarrade>