Re: [R] speed issue: gsub on large data frame

Simon Pickert Tue, 05 Nov 2013 04:21:13 -0800

Thanks everybody! Now I understand the need for more details:

The patterns for the gsubs are of different kinds. First, I have character
strings I need to replace: around 5000 stock ticker symbols
(e.g. c('AAPL', 'EBAY', …)) distributed across 10 vectors.
Second, I have four vectors with regular expressions, all similar to this one:
replace_url <- c("https?://.*\\s|www.*\\s")


The text strings I run the gsub commands on look like this (no string is
longer than 200 characters):

'GOOGL announced new partnership www.url.com. Stock price is up +5%'

After performing several gsubs in a row, like

gsub(replace_url, "[url]", dataframe$text_column)
gsub(replace_ticker_sp500, "[sp500_ticker]", dataframe$text_column)
etc. 

this string will look like this:

'[sp500_ticker] announced new partnership [url]. Stock price is up
[positive_percentage]'
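
A minimal sketch of the chain above, assuming each pattern vector is first collapsed into a single alternation (gsub() uses only the first element of a pattern of length > 1 and warns about the rest). The three tickers, the tightened URL pattern and the percentage pattern are illustrative assumptions, not the original vectors:

```r
# Hypothetical stand-ins for the real vectors (assumptions for illustration).
tickers <- c("AAPL", "EBAY", "GOOGL")

# Collapse the vector into ONE pattern; \\b keeps "EBAY" from matching inside longer words.
replace_ticker <- paste0("\\b(?:", paste(tickers, collapse = "|"), ")\\b")

# Assumed URL pattern: stops at whitespace and leaves trailing punctuation alone.
replace_url <- "(?:https?://|www\\.)\\S*[^\\s.,;:!?]"

# Assumed pattern for a positive percentage such as "+5%" or "+1.25%".
replace_pos_pct <- "\\+[0-9]+(?:\\.[0-9]+)?%"

text_column <- "GOOGL announced new partnership www.url.com. Stock price is up +5%"

# Same order as in the message: URLs first, then tickers, then percentages.
out <- gsub(replace_url, "[url]", text_column, perl = TRUE)
out <- gsub(replace_ticker, "[sp500_ticker]", out, perl = TRUE)
out <- gsub(replace_pos_pct, "[positive_percentage]", out, perl = TRUE)
print(out)
```

Symbols containing regex metacharacters (a dot, for example) would need escaping before being pasted into the alternation; that step is left out here.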


The dataset contains 4 million entries. The code works, but I cancelled the
process after 1 day (my whole system was blocked while R was running).
Running the code on a smaller chunk of data (1 million entries) took about
12 hours. As far as I can tell, replacing the ticker symbols takes the
longest, while the regular expressions ran quite fast.
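
Since the ticker symbols are literal strings, one alternative to a very large regex alternation is to split each string into words and look each word up in the ticker set. This is a swapped-in technique, not what the code above does. The names replace_tokens, tickers and texts are hypothetical, and the sketch assumes tickers appear as space-separated words (attached punctuation such as "AAPL," would need extra handling):

```r
# Hypothetical helper: replace whole space-separated words that occur in `vocab`.
replace_tokens <- function(texts, vocab, token) {
  words <- strsplit(texts, " ", fixed = TRUE)   # literal split, no regex engine
  vapply(words, function(w) {
    w[w %in% vocab] <- token                    # set membership via match()
    paste(w, collapse = " ")
  }, character(1))
}

tickers <- c("AAPL", "EBAY", "GOOGL")           # stand-in for the ~5000 symbols
texts <- c("GOOGL announced new partnership",
           "Shares of EBAY and AAPL fell",
           "No symbols here")

result <- replace_tokens(texts, tickers, "[sp500_ticker]")
print(result)
```

Whether this beats gsub() on the real 4 million rows would have to be timed; it mainly avoids re-scanning every string once per ticker vector.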

Thanks!



On 05.11.2013 at 11:31, Prof Brian Ripley <rip...@stats.ox.ac.uk> wrote:

> But note too what the help says:
> 
> Performance considerations:
> 
>     If you are doing a lot of regular expression matching, including
>     on very long strings, you will want to consider the options used.
>     Generally PCRE will be faster than the default regular expression
>     engine, and ‘fixed = TRUE’ faster still (especially when each
>     pattern is matched only a few times).
> 
> (and there is more).  I don't see perl=TRUE here.
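
As a rough illustration of the options named in the help text, the sketch below runs the same literal replacement with the default engine, with perl = TRUE, and with fixed = TRUE, and records the elapsed time of each. The data is synthetic (an assumption, since the real data frame is not available) and timings vary by machine, so no numbers are quoted:

```r
# Synthetic data: one sample string repeated many times.
texts <- rep("AAPL announced new partnership www.url.com. Stock price is up +5%", 1e5)

# The same literal replacement under the three engines.
t_default <- system.time(r_default <- gsub("AAPL", "[sp500_ticker]", texts))
t_perl    <- system.time(r_perl    <- gsub("AAPL", "[sp500_ticker]", texts, perl = TRUE))
t_fixed   <- system.time(r_fixed   <- gsub("AAPL", "[sp500_ticker]", texts, fixed = TRUE))

# The results must agree before the timings are worth comparing.
stopifnot(identical(r_default, r_perl), identical(r_perl, r_fixed))

# Elapsed seconds per variant.
elapsed <- c(default = t_default[["elapsed"]],
             perl    = t_perl[["elapsed"]],
             fixed   = t_fixed[["elapsed"]])
print(elapsed)
```

Note that fixed = TRUE treats the pattern as one literal string, so it cannot express alternation or word boundaries.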
> 
> On 05/11/2013 09:06, Jim Holtman wrote:
>> What is missing is any idea of what the 'patterns' are that you are
>> searching for.  Regular expressions are very sensitive to how you specify
>> the pattern.  You indicated that you have up to 500 elements in the pattern,
>> so what does it look like?  Alternation and backtracking can be very
>> expensive, so a lot more specificity is required.  There are whole books
>> written on how pattern matching works and what is hard and what is easy.
>> This is true wherever regular expressions are used, not just in R.  Also,
>> give some idea of what the timing is: are you talking about 1-10-100
>> seconds/minutes/hours?
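
To illustrate how much the exact pattern matters, here is a small sketch using the URL pattern and the sample string posted in this thread. The bounded alternative is an assumption offered for comparison, not a pattern from the original code:

```r
x <- "GOOGL announced new partnership www.url.com. Stock price is up +5%"

# Pattern as posted: greedy ".*" runs on to the LAST whitespace in the string.
greedy <- gsub("https?://.*\\s|www.*\\s", "[url]", x, perl = TRUE)
print(greedy)
# greedy is "GOOGL announced new partnership [url]+5%"

# Assumed alternative: "\\S+" cannot cross whitespace, so it stops at the end of the URL.
bounded <- gsub("(?:https?://|www\\.)\\S+", "[url]", x, perl = TRUE)
print(bounded)
# bounded is "GOOGL announced new partnership [url] Stock price is up +5%"
```

The bounded form also swallows the full stop after the URL; excluding trailing punctuation would need a slightly longer pattern.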
>> 
>> On Nov 5, 2013, at 3:13, Simon Pickert <simon.pick...@t-online.de> wrote:
>> 
>>> How’s that not reproducible?
>>> 
>>> 1. Data frame, one column with text strings
>>> 2. Size of data frame = 4 million observations
>>> 3. A bunch of gsubs in a row (  gsub(patternvector,
>>> "[token]", dataframe$text_column)  )
>>> 4. General question: How to speed up string operations on 'large' data sets?
>>> 
>>> 
>>> Please let me know what more information you need in order to reproduce 
>>> this example?
>>> It’s more a general type of question, while I think the description above 
>>> gives you a specific picture of what I’m doing right now.
>>> 
>>> On 05.11.2013 at 06:59, Jeff Newmiller <jdnew...@dcn.davis.ca.us> wrote:
>>> 
>>>> Example not reproducible. Communication fail. Please refer to Posting 
>>>> Guide.
>>>> 
>>>> Simon Pickert <simon.pick...@t-online.de> wrote:
>>>>> Hi R’lers,
>>>>> 
>>>>> I’m running into speed issues when performing a bunch of
>>>>> 
>>>>> gsub(patternvector, "[token]", dataframe$text_column)
>>>>> 
>>>>> calls on a data frame containing more than 4 million entries.
>>>>> 
>>>>> (The "patternvectors" contain up to 500 elements)
>>>>> 
>>>>> Is there any better/faster way than performing like 20 gsub commands in
>>>>> a row?
>>>>> 
>>>>> 
>>>>> Thanks!
>>>>> Simon
>>>>> 
>>>>> ______________________________________________
>>>>> R-help@r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> PLEASE do read the posting guide
>>>>> http://www.R-project.org/posting-guide.html
>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>> 
> 
> -- 
> Brian D. Ripley,                  rip...@stats.ox.ac.uk
> Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
> University of Oxford,             Tel:  +44 1865 272861 (self)
> 1 South Parks Road,                     +44 1865 272866 (PA)
> Oxford OX1 3TG, UK                Fax:  +44 1865 272595