Re: [R] speed issue: gsub on large data frame

Prof Brian Ripley Tue, 05 Nov 2013 02:38:55 -0800

But note too what the help says:

Performance considerations:


     If you are doing a lot of regular expression matching, including
     on very long strings, you will want to consider the options used.
     Generally PCRE will be faster than the default regular expression
     engine, and ‘fixed = TRUE’ faster still (especially when each
     pattern is matched only a few times).

(and there is more).  I don't see perl=TRUE here.
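
For illustration, a minimal sketch of the three options the help text contrasts. The data and the literal pattern "fox" are made up for this example, and any timings will depend on the real patterns, the data and the R build:

```r
# Toy data standing in for the poster's text column (hypothetical values).
x <- rep(c("the quick brown fox", "a lazy dog sleeps", "no match here"), 1e4)

# The same literal replacement under the three matching modes of gsub().
r_default <- gsub("fox", "[token]", x)                # default regex engine
r_perl    <- gsub("fox", "[token]", x, perl = TRUE)   # PCRE
r_fixed   <- gsub("fox", "[token]", x, fixed = TRUE)  # literal match, no regex

# For a pattern without metacharacters all three give the same result.
stopifnot(identical(r_default, r_perl), identical(r_default, r_fixed))

# Rough timings only; measure on the real data before drawing conclusions.
system.time(gsub("fox", "[token]", x))
system.time(gsub("fox", "[token]", x, perl = TRUE))
system.time(gsub("fox", "[token]", x, fixed = TRUE))
```

Note that fixed = TRUE only applies when the pattern is a literal string; a pattern that relies on regex syntax needs one of the other two modes.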

On 05/11/2013 09:06, Jim Holtman wrote:

What is missing is any idea of what the 'patterns' are that you are searching 
for.  Regular expressions are very sensitive to how you specify the pattern.  
You indicated that you have up to 500 elements in the pattern, so what does it 
look like?  Alternation and backtracking can be very expensive, so a lot more 
specificity is required.  There are whole books written on how pattern matching 
works and what is hard and what is easy.  This is true wherever regular 
expressions are used, not just in R.  Also, some idea of the timing would help: 
are you talking about 1, 10 or 100 seconds, minutes or hours?
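
As a sketch of why the shape of the pattern matters, here are two ways of applying a vector of literal terms: one alternation applied in a single pass, and one gsub() call per term. The terms and strings are hypothetical, and \Q...\E is PCRE quoting, so perl = TRUE is required. Which variant is faster depends on the terms and the data, so both should be timed:

```r
# Hypothetical pattern vector; the real one reportedly has up to 500 elements.
patterns <- c("cat", "dog", "a.b")
x <- c("a cat and a dog", "concatenate", "a.b versus axb")

# Quote each term with \Q...\E so metacharacters such as "." match literally,
# then join the terms into one alternation bounded by word boundaries.
quoted <- paste0("\\Q", patterns, "\\E")
regex  <- paste0("\\b(?:", paste(quoted, collapse = "|"), ")\\b")

# Variant 1: a single pass over the data with one alternation.
one_pass <- gsub(regex, "[token]", x, perl = TRUE)

# Variant 2: one pass per term, each with a simple pattern.
looped <- x
for (p in quoted) {
  looped <- gsub(paste0("\\b", p, "\\b"), "[token]", looped, perl = TRUE)
}

# Both variants agree on this toy input.
stopifnot(identical(one_pass, looped))
print(one_pass)
```

The word boundaries keep "cat" from matching inside "concatenate"; without them the alternation would replace substrings of longer words as well.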

Sent from my iPad

On Nov 5, 2013, at 3:13, Simon Pickert <simon.pick...@t-online.de> wrote:

How’s that not reproducible?

1. Data frame, one column with text strings
2. Size of data frame: 4 million observations
3. A bunch of gsubs in a row (  gsub(patternvector, "[token]", dataframe$text_column)  )
4. General question: How to speed up string operations on 'large' data sets?
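
A self-contained version of the setup listed above might look like the following sketch. Everything in it is an assumption made for illustration: the words, the two pattern sets, the token names and the scaled-down row count. Note also that gsub() uses only the first element of a pattern vector of length 2 or more (with a warning), so each vector is collapsed into one alternation here:

```r
set.seed(1)

# Hypothetical vocabulary and a scaled-down data frame (1e5 rows instead of
# the 4 million mentioned above), each row a string of five random words.
words <- c("apple", "banana", "cherry", "delta", "echo", "foxtrot")
n <- 1e5
m <- matrix(sample(words, n * 5, replace = TRUE), ncol = 5)
df <- data.frame(
  text_column = paste(m[, 1], m[, 2], m[, 3], m[, 4], m[, 5]),
  stringsAsFactors = FALSE
)

# Hypothetical pattern vectors, one per replacement token.
pattern_sets <- list(
  "[fruit]" = c("apple", "banana", "cherry"),
  "[code]"  = c("delta", "echo", "foxtrot")
)

# One gsub() pass per token, timed as a whole.
elapsed <- system.time({
  for (token in names(pattern_sets)) {
    regex <- paste0("\\b(?:", paste(pattern_sets[[token]], collapse = "|"), ")\\b")
    df$text_column <- gsub(regex, token, df$text_column, perl = TRUE)
  }
})
print(elapsed)
head(df$text_column, 3)
```

An example of this kind, together with the measured time, lets others on the list run the same code and compare alternatives.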


Please let me know what more information you need in order to reproduce this 
example.
It’s more of a general question, though I think the description above gives 
you a specific picture of what I’m doing right now.






On 05.11.2013 at 06:59, Jeff Newmiller <jdnew...@dcn.davis.ca.us> wrote:

Example not reproducible. Communication fail. Please refer to Posting Guide.
---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnew...@dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                     Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
---------------------------------------------------------------------------
Sent from my phone. Please excuse my brevity.

Simon Pickert <simon.pick...@t-online.de> wrote:

Hi R’lers,

I’m running into speed issues when performing a bunch of

"gsub(patternvector, [token], dataframe$text_column)"

on a data frame containing > 4 million entries.

(The "patternvectors" contain up to 500 elements.)

Is there any better/faster way than performing like 20 gsub commands in
a row?


Thanks!
Simon

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


--
Brian D. Ripley,                  rip...@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
