[Rd] strsplit(perl=TRUE), gregexpr(perl=TRUE) very slow for long strings

2017-01-05 Thread William Dunlap via R-devel
While doing some speed testing I noticed that in R-3.2.3 the perl=TRUE
variants of strsplit() and gregexpr() took time proportional to the
square of the number of pattern matches in their input strings.  E.g.,
the attached test function times gsub, strsplit, and gregexpr, with
perl TRUE (PCRE) and FALSE (TRE), when the input string contains 'n'
matches to the given pattern.  Notice the quadratic (in n) time growth
for the StrSplitPCRE and RegExprPCRE columns.

> regex.perf.test(N=2^(11:20))

  N SubTRE SubPCRE StrSplitTRE StrSplitPCRE RegExprTRE RegExprPCRE

elapsed2048   0.000.000.00 0.00   0.000.00
elapsed4096   0.000.000.01 0.00   0.000.00
elapsed8192   0.000.000.00 0.01   0.000.01
elapsed   16384   0.020.000.00 0.05   0.020.08
elapsed   32768   0.000.000.01 0.16   0.000.29
elapsed   65536   0.020.010.04 0.59   0.010.96
elapsed  131072   0.030.020.08 2.37   0.052.43
elapsed  262144   0.060.040.17 9.58   0.109.61
elapsed  524288   0.140.050.3639.14   0.21   38.58
elapsed 1048576   0.300.080.52   155.50   0.40  155.43

I have not looked at R's code, but it is possible that the problem is
caused by PCRE repeatedly scanning (once per match) the entire input
string to make sure it is valid UTF-8.  If so, adding
PCRE_NO_UTF8_CHECK to the flags given to pcre_exec would solve the
problem.  Perhaps R is already doing that in gsub(perl=TRUE).

Here is the test function:

regex.perf.test <- function(N=c(1e4, 2e4, 4e4, 8e4)) {
  makeTestString <- function(n) paste(collapse="",  rep("ab", n))
  s <- lapply(N, makeTestString)
  fns <- list(SubTRE=function(si) gsub("a", "", si, perl=FALSE),
  SubPCRE=function(si) gsub("a", "", si, perl=TRUE),
  StrSplitTRE=function(si) strsplit(si, "a", perl=FALSE),
  StrSplitPCRE=function(si) strsplit(si, "a", perl=TRUE),
  RegExprTRE=function(si) gregexpr("a", si, perl=FALSE),
  RegExprPCRE=function(si) gregexpr("a", si, perl=TRUE))
  times <- lapply(fns, function(fn) sapply(s, function(si)
system.time(fn(si))["elapsed"]))
  do.call("cbind", c(list(N=N), times))
}

Bill Dunlap
TIBCO Software
wdunlap tibco.com

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] seq.int/seq.default

2017-01-05 Thread Martin Maechler
> Mick Jordan 
> on Wed, 4 Jan 2017 12:49:41 -0800 writes:

> On 1/4/17 8:15 AM, Mick Jordan wrote:
> Here is another difference that I am guessing is unintended.

>> y <- seq.int(1L, 3L, length.out=2)
>> typeof(y)
> [1] "double"
>> x <- seq.default(1L, 3L, length.out=2)
>> typeof(x)
> [1] "integer"

> The if (by == R_MissingArg) branch at line 842 doesn't contain a check 
> for "all INTSXP" unlike the if (to == R_MissingArg) branch.

> Mick

I'll look at this case, too,
thank you once more!

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] seq.int/seq.default

2017-01-05 Thread Martin Maechler
> Mick Jordan 
> on Wed, 4 Jan 2017 08:15:03 -0800 writes:

> On 1/4/17 1:26 AM, Martin Maechler wrote:
>>> Mick Jordan 
>>> on Tue, 3 Jan 2017 07:57:15 -0800 writes:
>> > This is a message for someone familiar with the implementation.
>> > Superficially the R code for seq.default and the C code for seq.int
>> > appear to be semantically very similar. My question is whether, in 
fact,
>> > it is intended that behave identically for all inputs.
>> 
>> Strictly speaking, "no":  As usual, RT?Manual (;-)
>> 
>> The help page says in the very first paragraph ('Description'):
>> 
>> ‘seq’ is a standard generic with a default method.
>> ‘seq.int’ is a primitive which can be much faster but has a few 
restrictions.
>> 
>> > I have found two cases so far where they differ, first
>> > that seq.int will coerce a character string to a real (via
>> > Rf_asReal) whereas seq.default appears to coerce it to NA
>> > and then throws an error:
>> 
>> >> seq.default("2", "5")
>> > Error in seq.default("2", "5") : 'from' cannot be NA, NaN or infinite
>> >> seq.int("2", "5")
>> > [1] 2 3 4 5
>> >>
>> 
>> this may be a bit surprising (if one does _not_ look at the code),
>> indeed, notably because seq.int() is mentioned to have more
>> restrictions than seq() which here calls seq.default().
>> "Surprising" also when considering
>> 
>> > "2":"5"
>> [1] 2 3 4 5
>> 
>> and the documentation of ':' claims 'from:to' to be the same as
>> rep(from,to)  apart from the case of factors.
>> 
>> --- I am considering a small change in  seq.default()
>> which would make it work for this case, compatibly with ":" and 
seq.int().
>> 
>> 
>> > and second, that the error messages for non-numeric arguments differ:
>> 
>> which I find fine... if the functions where meant to be
>> identical, we (the R developers) would be silly to have both,
>> notably as the ".int" suffix  has emerged as confusing the
>> majority of useRs (who don't read help pages).
>> 
>> Rather it has been meant as saying "internal" (including "fast") also 
for other
>> such R functions, but the suffix of course is a potential clash
>> with S3 method naming schemes _and_ the fact that 'int' is used
>> as type name for integer in other languages, notably C.
>> 
>> > seq.default(to=quote(b), by=2)
>> > Error in is.finite(to) : default method not implemented for type 
'symbol'
>> 
>> which I find a very appropriate and helpful message
>> 
>> > seq.int(to=quote(b), by=2)
>> > Error in seq.int(to = quote(b), by = 2) :
>> > 'to' cannot be NA, NaN or infinite
>> 
>> which is true, as well, and there's no "default method" to be
>> mentioned, but you are right that it would be nicer if the
>> message mentioned 'symbol' as well.

> Thanks for the clarifications. It was surprising that seq.int supported 
> more types than seq.default. I was expecting the reverse.

exactly, me too!

> BTW, There are a couple of, admittedly odd, cases, exposed by brute 
> force testing, where seq.int will actually return "missing", which I 
> presume is not intended, and seq.default behaves differently, vis:

>> seq.default(to=1,by=2)
> [1] 1
>> seq.int(to=1,by=2)

>> > x <- seq.int(to=1,by=2)
>> x
> Error: argument "x" is missing, with no default

> Lines 792 and 799 of seq.c return the incoming argument (as opposed to a 
> value based on its coercion to double via asReal) and this can, as in 
> the above example, be "missing".

> Thanks
> Mick Jordan

Thanks a lot, Mick -- you are right!

I'm fixing these  (the line numbers have greatly changed in the
mean time: Remember we work with "R-devel", i.e., the "trunk" :
always available at
https://svn.r-project.org/R/trunk/src/main/seq.c

Martin Maechler
ETH Zurich

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel