Thanks, everybody! Now I understand the need for more details: the patterns for the gsubs are of different kinds.

First, I have character strings that I need to replace: around 5000 stock ticker symbols (e.g. c('AAPL', 'EBAY', ...)) distributed across 10 vectors.

Second, I have four vectors with regular expressions, all similar to this one:

replace_url <- c("https?://.*\\s|www.*\\s")
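
A rough sketch of how the two kinds of patterns described above might be applied in fewer passes. It assumes the tickers are plain alphanumeric symbols held in a character vector, so they can be collapsed into one alternation and gsub scans each string once instead of once per symbol. The names tickers, text, ticker_pattern and url_pattern, and the sample data, are made up for illustration. The URL pattern is a tightened variant, not the original replace_url: the greedy `.*` in the original runs on to the last whitespace in the string, so it can swallow the text that follows the URL. Whether this is fast enough on 4 million rows is untested.

```r
# Hypothetical sample data; in the real case the tickers come from the 10 vectors
tickers <- c("AAPL", "EBAY", "GOOGL")
text <- c("GOOGL announced new partnership www.url.com. Stock price is up +5%",
          "EBAY and AAPL fell today")

# One alternation anchored at word boundaries: a single gsub call
# instead of one call (or one pattern) per ticker symbol.
# Assumes no regex metacharacters in the symbols (e.g. "BRK.B" would need escaping).
ticker_pattern <- paste0("\\b(?:", paste(tickers, collapse = "|"), ")\\b")

# Match a URL up to the next whitespace, but leave trailing punctuation alone
url_pattern <- "(?:https?://|www\\.)\\S*[^\\s.,;!?]"

# perl = TRUE selects the PCRE engine
out <- gsub(url_pattern, "[url]", text, perl = TRUE)
out <- gsub(ticker_pattern, "[sp500_ticker]", out, perl = TRUE)

print(out)
```

If the symbols do contain characters such as "." they would have to be escaped before being pasted into the alternation, and very long alternations may need to be split into a few chunks.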

The text strings I perform the gsub commands on look like this (no string is longer than 200 characters):

'GOOGL announced new partnership www.url.com. Stock price is up +5%'

After performing several gsubs in a row, like

gsub(replace_url, "[url]", dataframe$text_column)
gsub(replace_ticker_sp500, "[sp500_ticker]", dataframe$text_column)

etc., this string will look like this:

'[sp500_ticker] announced new partnership [url]. Stock price is up [positive_percentage]'

The dataset contains 4 million entries. The code works, but I cancelled the process after 1 day (my whole system was blocked while R was running). Running the code on a smaller chunk of data (1 million entries) took about 12 hours. As far as I can tell, replacing the ticker symbols takes the longest, while the regular expressions ran quite fast.

Thanks!

On 05.11.2013 at 11:31, Prof Brian Ripley <rip...@stats.ox.ac.uk> wrote:

> But note too what the help says:
>
> Performance considerations:
>
>      If you are doing a lot of regular expression matching, including
>      on very long strings, you will want to consider the options used.
>      Generally PCRE will be faster than the default regular expression
>      engine, and ‘fixed = TRUE’ faster still (especially when each
>      pattern is matched only a few times).
>
> (and there is more). I don't see perl=TRUE here.
>
> On 05/11/2013 09:06, Jim Holtman wrote:
>> what is missing is any idea of what the 'patterns' are that you are
>> searching for. Regular expressions are very sensitive to how you specify
>> the pattern. you indicated that you have up to 500 elements in the pattern,
>> so what does it look like? alternation and backtracking can be very
>> expensive. so a lot more specificity is required. there are whole books
>> written on how pattern matching works and what is hard and what is easy.
>> this is true for wherever regular expressions are used, not just in R. also
>> some idea of what the timing is; are you talking about 1-10-100
>> seconds/minutes/hours.
>>
>> Sent from my iPad
>>
>> On Nov 5, 2013, at 3:13, Simon Pickert <simon.pick...@t-online.de> wrote:
>>
>>> How's that not reproducible?
>>>
>>> 1. Data frame, one column with text strings
>>> 2. Size of data frame = 4 million observations
>>> 3. A bunch of gsubs in a row ( gsub(patternvector,
>>> "[token]", dataframe$text_column) )
>>> 4. General question: How to speed up string operations on 'large' data sets?
>>>
>>> Please let me know what more information you need in order to reproduce
>>> this example?
>>> It's more a general type of question, while I think the description above
>>> gives you a specific picture of what I'm doing right now.
>>>
>>> General question:
>>> On 05.11.2013 at 06:59, Jeff Newmiller <jdnew...@dcn.davis.ca.us> wrote:
>>>
>>>> Example not reproducible. Communication fail. Please refer to Posting
>>>> Guide.
>>>> ---------------------------------------------------------------------------
>>>> Jeff Newmiller                        The     .....       .....  Go Live...
>>>> DCN:<jdnew...@dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
>>>>                                       Live:   OO#.. Dead: OO#..  Playing
>>>> Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
>>>> /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
>>>> ---------------------------------------------------------------------------
>>>> Sent from my phone. Please excuse my brevity.
>>>>
>>>> Simon Pickert <simon.pick...@t-online.de> wrote:
>>>>> Hi R'lers,
>>>>>
>>>>> I'm running into speed issues, performing a bunch of
>>>>>
>>>>> "gsub(patternvector, [token], dataframe$text_column)"
>>>>>
>>>>> on a data frame containing >4 million entries.
>>>>>
>>>>> (The "patternvectors" contain up to 500 elements)
>>>>>
>>>>> Is there any better/faster way than performing like 20 gsub commands in
>>>>> a row?
>>>>>
>>>>>
>>>>> Thanks!
>>>>> Simon
>>>>>
>>>>> ______________________________________________
>>>>> R-help@r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> PLEASE do read the posting guide
>>>>> http://www.R-project.org/posting-guide.html
>>>>> and provide commented, minimal, self-contained, reproducible code.
>
> --
> Brian D. Ripley,                  rip...@stats.ox.ac.uk
> Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
> University of Oxford,             Tel:  +44 1865 272861 (self)
> 1 South Parks Road,                     +44 1865 272866 (PA)
> Oxford OX1 3TG, UK                Fax:  +44 1865 272595